Here’s a photo from one of my exercise walks in Paramin and you can see the Caribbean Sea in the distance. This is perhaps my favourite view on the walk because it means that I’ve just got to the top of a particularly brutal hill (cars sometimes struggle to get to the top and on one occasion I watched a car fail miserably in four attempts) although you can’t always see the sea as clearly as in this photo.
Molecular Design
Controlling the behavior of compounds and materials by manipulation of molecular properties.
Sunday, 6 July 2025
Assembling data sets for training ML bioactivity models
Tuesday, 1 April 2025
Property Forecast Index Validated
I visited the War Memorial on Sunday and took selfies with the Shenyang J-6 (Chinese version of MiG-19) 'liberated' by Capt. Lee Woong-pyeong when he defected to South Korea on 25th February 1983, a 'liberated' T-34 (as Uncle Joe is said to have observed, quantity has a quality all of its own) and Great Leader's car (also 'liberated' although it was not clear exactly when).
So enough of the travel photos for now and let's get back to the science. Regular readers (both of them) of this blog will be well aware of my visceral dislike for drug design metrics. One reason for this visceral dislike is that I consider these metrics to trivialise the problems faced by medicinal chemists and I remain sceptical that one can make meaningful predictions of developability or likelihood of clinical success for compounds based only on their chemical structures without knowing anything about their biological activities. One metric that I have criticised harshly in the past is property forecast index (PFI) which was originally introduced as solubility forecast index (SFI). Specifically, I denounced SFI as a ‘draw pictures and wave arms’ data analysis strategy and privately I even considered the possibility that it had been created by a toddler armed with a box of colored crayons.
Let’s take a look at the HY2010 article in which SFI was introduced. Proprietary aqueous solubility measurements (continuous variable) were first processed to assign compounds to one of three aqueous solubility categories. Histograms showing the proportions of measurements in each aqueous solubility category were created by binning values of SFI and of c log DpH7.4 and the histograms were compared visually:
This graded bar graph (Figure 9) can be compared with that shown in Figure 6b to show an increase in resolution when considering binned SFI versus binned c log DpH7.4 alone.
Recently, I have been forced to revise my negative view of PFI and I have to admit that it pains me deeply to realise that I could have been so utterly wrong for so long in my assessment of what is actually an elegant and highly-predictive drug design metric. Indeed I have now come to the conclusion that the only reason that the Journal of Medicinal Chemistry did not include PFI in its nomination for the Nobel Prize in Physiology or Medicine was that the introduction of the Ro5, LipE and Fsp3 principles led directly to so many marketed drugs being approved.
What has caused such a fundamental shift in my views? First, PFI is highlighted in the European Federation of Medicinal Chemistry (EFMC) ‘Best Practices from Hits to Lead Generation’ webinar. Now it goes without saying that EFMC includes some of the sharpest minds in medicinal chemistry and, given that they consider PFI to be sufficiently important for inclusion in a best practices webinar, it became abundantly clear that I needed to revise my hopelessly naïve thinking. Let’s join the webinar at 27:53 and you’ll see in the webinar slide that SFI (as PFI was originally introduced) has been strongly endorsed by Practical Cheminformatics, a blog that many, including me, accept without question as the source of a number of fundamental ground truths in the AI field.
However, what convinced me of the sublime elegance and extreme predictivity of PFI is a seminal study by the world-renowned expert on tetrodotoxin pharmacology, Prof. Angelique Bouchard-Duvalier of the Port-au-Prince Institute of Biogerontology, working in collaboration with the Budapest Enthalpomics Group (BEG). The manuscript has not yet been made publicly available although I was able to access it with the help of my associate ‘Anastasia Nikolaeva’ (not sure exactly what she’s doing these days although she did post a photo from Pyongyang showing her and a burly chap with a toothy grin and a bizarre haircut). There is no doubt that this genuinely disruptive study will comprehensively reshape the predictive ADME landscape, enabling drug discovery scientists, for the very first time, to make accurate predictions for developability and probability of clinical trial success using only chemical structures as input.
Prof. Bouchard-Duvalier’s seminal study clearly demonstrates that graphical presentation of categorized continuous data outperforms regression analysis performed on the uncategorized continuous data. The math is truly formidable (my rudimentary understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the quadrupole-normalized polarizability tensor before applying the Barone-Samedi transformation, followed by hepatic eigenvalue extraction using an algorithm devised by E. V. Tooms (a reclusive Baltimore resident whose illustrious research career in analytic topology was abruptly halted almost 31 years ago by an unfortunate escalator accident). The incisive analysis of Prof. Bouchard-Duvalier shows without a shadow of doubt that the data visualization used to establish PFI as a fundamental drug design principle will reliably and robustly outperform all AI approaches to prediction of aqueous solubility. Furthermore, ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which the beaming BEG director Prof. Kígyó Olaj explains that, “Possibilities are limitless now that we can accurately and robustly predict the developability of a compound using only its chemical structure as input and we can now finally consign regression analysis to the dustbin of history. Surely the Editors of Journal of Medicinal Chemistry will recognize the impact of PFI on real world drug discovery when they make their Nobel Prize nominations later this year.”
Sunday, 9 March 2025
Thinking About Aqueous Solvation
Given that it was International Women's Day yesterday, I'll open the post (and blogging for 2025) with a photo of a gravestone at St James' Church in Bramley (Hampshire).
In the current post I’ll be taking a look at some aspects of aqueous solvation and Richard Wolfenden’s 1983 “Waterlogged Molecules” article (W1983) is still worth reading today (as an aside, Prof Wolfenden will turn ninety in May of this year and hopefully mentioning this won't put what is called "goat mouth" in my native Trinidad and Tobago on him as I did for Oscar Niemeyer with the words "ele vive ainda" while studying Portuguese in 2012). As noted in W1983 the formation of a target-ligand complex requires partial desolvation of both target and ligand:
When biological compounds combine, react with each other, or change shape in watery surroundings, solvent molecules tend to be reorganized in the neighborhood of the interacting groups.
Formation of a target-ligand complex can also be seen as an “exchange reaction” and this point is very well made in SGT2012:
Molecular binding in an aqueous solvent can be usefully viewed not as an association reaction, in which only new intermolecular interactions are introduced between receptor and ligand, but rather as an exchange reaction in which some receptor–solvent and ligand–solvent interactions present in the unbound state are lost to accommodate the gain of receptor–ligand interactions in the bound complex.
In HBD3 I briefly discuss ‘frustrated hydration’ as a phenomenon that could be exploited in drug design and I’ll quote from the Summary section of W1983:
When two or more functional groups are present within the same solute molecule, their combined effects on its free energy of solvation are commonly additive. Striking departures from additivity, observed in certain cases, indicate the existence of special interactions between different parts of a solute molecule and the water that surrounds it.
I’ll try to explain how this could work for ligand design and let’s suppose that we have two polar atoms that are close together in the binding site. The proximity of the polar atoms in the binding site means that water molecules forming ideal interactions with the polar atoms in the binding sites are also likely to be close together. However, the mutual proximity of the water molecules can lead to unfavourable interactions between the water molecules which ‘frustrate’ the (simultaneous) hydration of the two polar atoms in the binding site. Now if we design a ligand with two polar atoms positioned to form good interactions with polar atoms in the binding site it is likely that these will also be in close proximity and that their hydration will be similarly frustrated. I would generally anticipate that frustration of hydration will not be handled well by implicit solvent models (RT1999 | FB2004 | CBK2008 | KF2014) or computational tools such as WaterMap that calculate energetics for individual water molecules (especially in cases where the two hydration sites cannot be simultaneously occupied).
To illustrate frustration of hydration I’ve taken a graphic from a talk from 2023. The unfavorable interactions between solvating water molecules that frustrate hydration are shown as red double-headed arrows (in some cases these interactions will be repulsive to the extent that only one of the hydration sites can be occupied at a time). You’ll also notice two thick green lines in the right hand panel and these show secondary interactions that stabilize the bound complex. Secondary interactions of this nature were discussed in a molecular recognition context in the JP1990 study and the observation (see A1989) that pyridazine is a better hydrogen bond acceptor (HBA) than its pKa would have you believe can be seen in a similar light. Secondary interactions like these only enhance affinity when the proximal polar atoms are of the same ‘type’ (the proximal polar atoms in the 1,8-naphthyridine are both HBAs) and we should anticipate that the secondary interactions for the contact between pyrazole and the ‘hinge’ of a tyrosine kinase will be deleterious for affinity. In contrast to secondary interactions, frustration of hydration can be beneficial for affinity even when the proximal polar atoms are of opposite types, as would be the case for an HBA that is near to a hydrogen bond donor (HBD).
While it is clearly important to account for aqueous solvation when using physics-based approaches for prediction of binding affinity, passive permeability and aqueous solubility, the measurement of gas-to-water transfer free energy is not exactly routine (I’m not aware that any companies offer measurement of aqueous solvation energy as a service nor do I believe that this is an activity that would be readily funded). Measurements for aqueous solvation energy reported in the literature tend to be for relatively volatile compounds and I’ll direct readers to the C1981, W1981 and A1990 studies.
A view that I've held for many years is that a partition coefficient could be used as an alternative to gas-to-water transfer free energy for studying aqueous solvation. It's also worth noting that when we think about desolvation in drug design we're often considering the energetic cost of bringing polar atoms into contact with non-polar atoms (as opposed to transferring the polar atoms to gas phase). Partition coefficient measurement is a lot more routine than solvation free energy measurement and most drug discovery scientists are aware that the octanol/water partition coefficient (usually quoted as its base 10 logarithm, logP) is an important design parameter. However, the octanol/water partition coefficient is not useful for assessing aqueous solvation because the hydroxyl group of octanol can form hydrogen bonds with solutes and the water-saturated solvent is actually quite 'wet' (the DC1992 study reports that the room temperature solubility of water in octanol is 2.5 M). If we’re going to use partition coefficient measurements for studying aqueous solvation then I would argue that we should make these measurements with a saturated hydrocarbon such as cyclohexane or hexadecane that lacks hydrogen bonding capability.
Here’s another slide from that 2023 talk showing that pyridine is lipophilic for octanol/water but hydrophilic for hexadecane/water. The difference between the octanol/water and hexadecane/water logP values for a solute is sometimes referred to as ΔlogP (it is equivalent to the hexadecane/octanol logP value with both solvents water-saturated) and can be considered to quantify the solute’s ability to form hydrogen bonds (see Y1988 | A1994 | T2008). I'll mention in passing that ΔlogP measurements with toluene as the less polar organic solvent have been used to study intramolecular hydrogen bonding (see S2013 | C2016 | C2018).
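To make the arithmetic behind ΔlogP explicit, here's a minimal Python sketch. I'm assuming the sign convention of octanol/water minus hexadecane/water, and the concentrations are made-up numbers in arbitrary units rather than measurements for pyridine or any other compound. The function names log_p and delta_log_p are my own.

```python
import math

def log_p(conc_organic: float, conc_water: float) -> float:
    """Base-10 logarithm of the partition coefficient [organic]/[water]."""
    return math.log10(conc_organic / conc_water)

def delta_log_p(logp_octanol: float, logp_alkane: float) -> float:
    """Difference in logP (assumed convention: octanol/water minus alkane/water)."""
    return logp_octanol - logp_alkane

# Hypothetical equilibrium concentrations (arbitrary units), not measured data.
# Using one aqueous concentration for both systems is a simplification: only
# the concentration ratios matter for each partition coefficient.
c_water, c_octanol, c_hexadecane = 1.0, 4.0, 0.25

logp_oct = log_p(c_octanol, c_water)      # positive: lipophilic for octanol/water
logp_hxd = log_p(c_hexadecane, c_water)   # negative: hydrophilic for hexadecane/water
dlogp = delta_log_p(logp_oct, logp_hxd)

# The aqueous term cancels, leaving the hexadecane-to-octanol transfer.
direct = math.log10(c_octanol / c_hexadecane)

print(f"logP(oct/w) = {logp_oct:.2f}, logP(hxd/w) = {logp_hxd:.2f}, dlogP = {dlogp:.2f}")
```

The point of the sketch is simply that water drops out of the difference, which is why ΔlogP can be read as reporting on what octanol offers the solute that a saturated hydrocarbon does not.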
It has long been fashionable to worry about which organic solvent (and polarity) is the best model for the lipoidal region of a particular cell membrane (Collander, 1954). These solvents have ranged from isobutanol (the most polar) to olive oil (the least polar). I have never understood the point of this. If the lipoidal region of the plasma membrane is a lipid bilayer, then clearly the appropriate model solvent is hydrocarbon. For artificial bilayers this is obviously so. I chose n-hexadecane as the particular hydrocarbon, because its chain length is comparable to that of the fatty acid residues in most phospholipids, and it is conveniently available.
I also need to mention the B2016 study (Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge) since the cyclohexane/water distribution coefficient was used as a surrogate for gas-to-water transfer free energy in the challenge:
The inclusion of distribution coefficients replaces the previous focus on hydration free energies which was a fixture of the past five challenges (SAMPL0-4) [1 | 2 | 3 | 4 | 5 | 6 | 7]. Due to a lack of ongoing experimental work to generate new data, hydration free energies are no longer a practical property to include in blind challenges. It has become increasingly difficult to find unpublished or obscure hydration free energies and therefore impossible to design a challenge focusing on target compounds, functional groups or chemical classes.
I consider initiatives such as the SAMPL5 cyclohexane/water distribution challenge to be valuable for assessing model predictivity in an objective and transparent manner. Generally, I would avoid including logD measurements for compounds that are significantly ionized under experimental conditions because these require that account be taken of ionization when making predictions (better to measure logD at a pH at which ionizable functional groups are not significantly ionized). While challenges such as SAMPL5 are certainly valuable for assessment of predictivity of models, I consider them less useful in model development which requires measured data for structurally-related compounds.
The isosteric pairs 1/2 and 3/4 shown in the graphic below will give you an idea of what I'm getting at. The predicted pKBHX values taken from K2016 suggest that 1 is less polar than its isostere 2 and I'd expect 3 to be more polar than 4.
While the three N-butylated purines shown in the graphic below are not strictly isosteric I would consider it valid to interpret the cyclohexane/water logP values taken from S1998 as reflecting differences in hydrogen bond acceptor strength.
This is a good point at which to wrap up and, given the fundamental importance of aqueous solvation in biomolecular recognition and drug design, I see tangible advantages in having a large body of measured data in the public domain. My view is that to measure gas-to-water transfer free energy for significant numbers of compounds of interest to drug discovery scientists would be both technically demanding and unlikely to get funded although I would be delighted to be proven wrong on either point. This means that we need to learn to use other types of data in order to study aqueous solvation and my view is that an alkane/water partition coefficient would be the best option. Using alkane/water partition coefficients as an alternative to gas-to-water transfer free energies for studying aqueous solvation would also enable enthalpic (see RT1984) and volumetric aspects of aqueous solvation to be investigated more easily.
Tuesday, 31 December 2024
Natural Intelligence?
- I regard identification and biological characterisation of NPs as vital scientific activities that should be generously funded and Derek puts it very well in his recent post ("When you see specific and complex small molecules that living creatures are going to the metabolic trouble to prepare, there are surely survival-linked functions behind them."). In particular, I see it as important that NPs be screened in diverse phenotypic assays and here’s a link to the Chemical Probes Portal. While my criticisms of H2024 are certainly serious it would be grossly inaccurate to take these criticisms as indicative of an anti-NP position.
- Automation of workflows (N2017) and generation of datasets from databases such as ChEMBL are far from trivial and (33), which highlights some of the challenges faced by researchers in this area, was the subject of a recent post at Molecular Design. I consider method development in this area to be an important cheminformatic activity that should be adequately supported. It must also be stressed that the design, building and updating of databases such as ChEMBL (G2012 | B2014 | P2015 | G2017 | 23) are vital scientific activities that should be generously funded (had it not been for the vision and foresight of the creators of the PDB over half a century ago it is improbable that the 2024 Chemistry Nobel Prize would have been awarded for “computational protein design” and “protein structure prediction”). While my criticisms of H2024 are certainly serious it would be grossly inaccurate to take these as criticisms of the automated dataset generation described in the study (and recently published in H2024b) or of the contributions by a number of individuals that have made ChEMBL an invaluable resource for drug discovery scientists and chemical biologists.
9. Software. Software used as a part of computer-aided drug design should be readily available from reliable sources, and the authors should specify where the software can be obtained.
A theory has only the alternative of being right or wrong. A model has a third possibility: it may be right, but irrelevant.
Drugs are differentiated from target comparators by higher potency, ligand efficiency (LE), lipophilic ligand efficiency (LLE), and lower carboaromaticity.
The PNP concept has been validated by its appearance in the literature (16,17) and by the design of several new classes of biologically active compounds. (18,19) [As a reviewer I would have pressed the authors to clearly articulate the “PNP concept” (just as I would have pressed the authors of this Editorial to clearly articulate the new principles that their nominees for the Nobel Prize in Physiology or Medicine had introduced). My view is that it is verging on megalomania to claim that a concept “has been validated by its appearance in the literature” and I don’t consider (18) to support the claim for “design of several new classes of biologically active compounds”. To support such a claim, one would ideally need to demonstrate that screening of libraries of compounds designed as PNPs resulted in the discovery of viable lead series against a range of therapeutic targets. At absolute minimum, one would need to show that libraries of compounds designed as PNPs exhibited exploitable activity across a range of target-related assays (although interesting, the results from the “cell painting assay” would not by themselves support a claim for “design of several new classes of biologically active compounds”). I should also mention that some in the compound quality field (see B2023 and my review of that article) interpret activity against multiple targets for a set of compounds based on a particular scaffold as evidence for pan-assay interference even when the individual compounds don’t themselves exhibit frequent-hitter behaviour. I don't have access to (19) and am therefore unable to assess the degree to which that article supports the authors' claim for “design of several new classes of biologically active compounds”.]
PNP_Status. Compounds were assigned to one of four categories according to their NP fragment combination graphs. (16,17) The NP library fragments used for this purpose are Murcko scaffolds (26) [It would be actually more appropriate to refer to these as ‘Bemis scaffolds’ in order to properly recognize the corresponding author of this article.] (the core structures containing all rings without substituents except for double bonds, n = 1673) derived (16) from a representative set of 2000 NP fragment clusters. (15) [I see this approach as unlikely to capture all the relevant cyclic substructures present in NPs. My view is that it would have been better to first extract the relevant cyclic substructures from the chemical structures of all NPs for which this information is available, and then do the selection and filtering in one or more subsequent steps. The other advantage of doing things this way is that you’ll get a better assessment of the frequencies with which the different cyclic substructures occur in the chemical structures of NPs.] Because of their ubiquitous appearances in NPs, the phenyl ring and glucose moieties were specifically excluded as fragments. (16) [I would expect exclusion of the benzene ring (I consider ‘benzene ring’ more correct than ‘phenyl ring’ in this context) as a fragment to result in a significant reduction in the number of compounds that are considered to be PNPs (and, by implication, the ‘enrichment’ associated with membership of the PNP class). Even though the benzene ring has been excluded for the purpose of assigning PNP status it should still be considered to be one of Nature’s building blocks.]
This is further evidence that the three NP metrics can be considered as independent measures of clinical compound quality. [I would consider the claim that any of these “NP metrics” can be considered as a measure of “clinical compound quality” to be wildly extravagant (the authors haven't even stated how "clinical compound quality" is defined yet they claim to be able to measure it). I would argue that compound quality cannot be meaningfully compared for clinical compounds that have been developed for different diseases or disorders. Describing a compound as 'clinical' implies that a large body of measured data has actually been generated for it and the authors of H2024 might find it instructive to ask themselves why they think a simple metric calculated from the chemical structure of the compound would be of interest to a project team with access to this large body of measured data. One criticism that I make of drug discovery metrics is that they trivialize drug discovery and we noted in KM2013: “Given that drug discovery would appear to be anything but simple, the simplicity of a drug-likeness model could actually be taken as evidence for its irrelevance to drug discovery.”]
The overall results are supportive of the occurrence of “natural selection” being associated with many successful drug discovery campaigns. [My view is that the authors of H2024 have not clearly articulated what they mean by “natural selection” in the context of this study.] It has been proposed that NP-likeness assists drug distribution by membrane transporters, (21) [The author of (20c) asserts "Over the years, my colleagues and I have come to realise that the likelihood of pharmaceutical drugs being able to diffuse through whatever unhindered phospholipid bilayer may exist in intact biological membranes in vivo is vanishingly low" and, by implication, that entry of the vast majority of drugs into cells is transporter mediated. I keep an open mind on this issue although I note that what is touted by some as a universal phenomenon does seem to have been remarkably difficult to observe directly by experiment. The difficulties caused by active efflux are widely recognized by drug discovery scientists and it may be instructive for the authors of H2024 to consider how an experienced medicinal chemist working in the CNS area might view a suggestion that compounds should be made more like NPs to increase the likelihood of being transporter substrates.] and we further speculate that employing NP fragments may result in less attrition due to toxicity, a major cause of preclinical failure. (55) [This does seem to be grasping at straws. The focus of the cited article is actually clinical failure and not preclinical failure.]
There is untapped potential for further exploitation of currently used and unused NP fragments, especially in fragment combinations and the design of PNPs, without the need to resort to chemically diverse ring systems and scaffolds. [This exemplifies what can be called the ‘Ro5 mentality’ (‘experts’ advising medicinal chemists to not explore but to focus on regions of chemical space that have been blessed by the ‘experts’). As I note in this blog post Ro5 (as it is stated) is not actually supported by data and in NoLE, I advise drug designers not to “automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.” An equally plausible 'explanation' for the observation that a high fraction of clinical compounds are PNPs is simply that medicinal chemists are working with what they're most familiar with (in this case the advice would be to look beyond Nature's building blocks for inspiration).] To exploit these opportunities, “NP awareness” needs to be added to the repertoire of medicinal chemists. [My view is that it would be more important for critical thinking to be added to the repertoire of medicinal chemists so they are better equipped to assess the extent to which conclusions and recommendations of studies like H2024 are actually supported by data.]
In short, applying nature’s building blocks─natural intelligence─to drug design can enhance the opportunities now offered by artificial intelligence. [In my view "natural intelligence" appears to be arm-waving that is neither natural nor intelligent.]
Sunday, 20 October 2024
Assessment of AI-generated chemical structures using ML
In an earlier post I considered what it might mean to describe drug design as AI-based. In this post I’ll take a general look at using machine learning (ML) to predict biological activity (and other pharmaceutically-relevant properties) for AI-generated chemical structures. Whether or not ML models ultimately prove to be fit for this purpose it is worth pointing out that many visionaries and thought leaders who tout computation as a panacea for humanity’s ills fail to recognize the complexity of biology (take a look at In The Pipeline posts from 2007 | 2015 | 2024). One point worth emphasizing in connection with the complexity of biology is that it is not currently possible to measure the concentration of a drug at its site of action for intracellular targets in live humans (here's an article on intracellular and intraorgan drug concentration that I recommend to everybody working in drug discovery and chemical biology). While I won't actually be saying anything about AI (here's a recent post from In The Pipeline that takes a look at how things are going for early movers in the field of AI drug discovery) in the current post I'll reiterate the point with which I concluded the earlier post:
One error commonly made by people with an AI/ML focus is to consider drug design purely as an exercise in prediction while, in reality, drug design should be seen more in a Design of Experiments framework.
In that earlier post I noted that there’s a bit more to drug design than simply generating novel molecular structures and suggesting how the compounds should be synthesized. While I'm certainly not denying the challenges presented by the complexity of biology the current post will focus on some of the challenges associated with assessing chemical structures churned out by generative AI. One way of doing this is to build models for predicting biological activity and other pharmaceutically relevant properties such as aqueous solubility, permeability and metabolic stability. This is something that people have been trying to do for many years and the term ‘Quantitative Structure-Activity Relationship’ (QSAR) has been in use for over half a century (the inaugural EuroQSAR conference was held in Prague in 1973 a mere five years after Czechoslovakia had been invaded by the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). My view is that many of the ML models that get built with drug design in mind could accurately be described as QSAR models and I would not describe QSAR models as AI.
In the current post, I'll be discussing ML models for predicting quantities such as potency, aqueous solubility and permeability that are continuous variables which I refer to as 'regression-based ML models' (while some readers will not be happy with this label I do need to make it absolutely clear that the post is about one type of ML model and the label 'QSAR-like' could also have been used). I’ll leave classification models for another post although it’s worth mentioning that genuinely categorical data are actually rare in drug discovery (you should always be wary of gratuitous categorization of continuous data since this is a popular way to disguise the weakness of trends and KM2013 will give you some tips on what to look out for). It also needs to be stressed that ML is a very broad label and that utility in one area (prediction of protein-folding for example) doesn't mean that ML models will necessarily prove useful in other areas.
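The warning about categorization of continuous data can be illustrated with synthetic data. Here's a minimal Python sketch, assuming a weak linear trend buried in Gaussian noise. All numbers are made up, the helper names pearson_r and bin_means are my own, and nothing here is taken from KM2013 or any published data set.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def bin_means(xs, ys, n_bins):
    """Sort by x, split into equal-sized bins, return the mean x and mean y of each bin."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    mean_x, mean_y = [], []
    for i in range(n_bins):
        chunk = pairs[i * size:(i + 1) * size]
        mean_x.append(statistics.fmean([p[0] for p in chunk]))
        mean_y.append(statistics.fmean([p[1] for p in chunk]))
    return mean_x, mean_y

# Synthetic (hypothetical) data: a weak trend with a lot of scatter.
rng = random.Random(0)
x = [rng.uniform(0.0, 10.0) for _ in range(2000)]
y = [0.2 * xi + rng.gauss(0.0, 2.0) for xi in x]

r_raw = pearson_r(x, y)                            # individual data points
r_binned = pearson_r(*bin_means(x, y, n_bins=5))   # five bin averages

print(f"r for individual points: {r_raw:.2f}")
print(f"r for bin means:         {r_binned:.2f}")
```

Averaging within bins discards the scatter, so the bin means line up far more neatly than the individual measurements do, even though the ability to predict y for any single compound has not improved at all.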
To build a regression-based ML model you first need to assemble a training set of compounds for which the appropriate measurements have been made and pIC50 values are commonly used to quantify biological activity (I recommend reading the LR2024 study on combining results from different assays although, as discussed in this post, I don’t consider it meaningful to combine data from multiple pairs of assays when calculating correlation-based metrics for assay compatibility). Next, you calculate values of descriptors for the chemical structures of the compounds in your training set (descriptors are typically derived from the connectivity in the chemical structure although atom counts and predicted values of physicochemical properties are also used). Finally, you use the ML modelling tools to find a function of the descriptors that best predicts the biological activity (or a pharmaceutically-relevant property) for the compounds in the training set. Generally you should also validate your models and this is especially important for models with large numbers of adjustable parameters.
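The steps above can be sketched in a few lines of Python. This is a deliberately minimal illustration, assuming a single descriptor, ordinary least squares as the 'ML modelling tool' and leave-one-out cross-validation as the validation step. The descriptor and pIC50 values are hypothetical and the function names are my own. A real model would typically use many descriptors and a more flexible learning algorithm.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loo_predictions(xs, ys):
    """Leave-one-out: predict each compound from a model fitted without it."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5

# Step 1: training set (hypothetical measured pIC50 values).
pic50 = [5.1, 5.6, 5.8, 6.4, 6.5, 7.1, 7.2, 7.9]
# Step 2: one calculated descriptor per compound (hypothetical values).
descriptor = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]

# Step 3: find the function of the descriptor that best fits the training set.
intercept, slope = fit_line(descriptor, pic50)
fitted = [intercept + slope * x for x in descriptor]

# Step 4: validate with predictions for compounds left out of the fit.
rmse_fit = rmse(pic50, fitted)
rmse_loo = rmse(pic50, loo_predictions(descriptor, pic50))

print(f"slope = {slope:.2f}, RMSE (fit) = {rmse_fit:.3f}, RMSE (LOO) = {rmse_loo:.3f}")
```

The leave-one-out error is never smaller than the fitting error for a least squares model, and the gap between the two typically widens as the number of adjustable parameters increases.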
There appears to be a general consensus that you need plenty of data for building ML models and some will even say “quantity has a quality all of its own” (this is sometimes stated as Stalin’s view of the T-34 tank although I consider this unlikely and the T-34 was actually an excellent tank which also happened to get produced in large numbers). Most people building regression-based ML models are also aware that you need a sufficiently wide spread in the measured data used for training the model (the variance in the measured data should be large in comparison with the precision of the measurement). Lead optimization is typically done within structural series and building a regression-based ML model that is predictively useful is likely to require data that have been measured for compounds in the structural series of interest. These data requirements are quite stringent and I see this as one reason that QSAR approaches do not appear to have had much impact on the discovery of drugs despite the drug discovery literature being awash with QSAR articles. Back in 2009 (see K2009) I compared prediction-driven drug design with hypothesis-driven drug design, noting that the former is often not viable and that the latter is more commonly used in pharmaceutical and agrochemical discovery (former colleagues discussed hypothesis-driven molecular design in the context of the design-make-test-analyse cycle in the P2012 article).
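One way to put a number on the requirement for spread in the measured data is to ask how much of the observed variance could be explained even by a perfect model. Here's a minimal Python sketch, assuming that each measured value is the sum of a true value and an independent random error with a known standard deviation. The function name, the replicate standard deviation of 0.3 log units and the two sets of pIC50 values are all hypothetical.

```python
import statistics

def max_r_squared(measured, replicate_sd):
    """Upper bound on R-squared for a perfect model, given measurement noise.

    Assumes measured = true + error with independent errors, so that
    var(measured) = var(true) + replicate_sd**2.
    """
    var_observed = statistics.variance(measured)
    return max(0.0, 1.0 - replicate_sd ** 2 / var_observed)

# Hypothetical pIC50 values for two training sets and an assumed assay precision.
wide_series = [5.0, 6.0, 7.0, 8.0, 9.0]      # spans four log units
narrow_series = [6.8, 6.9, 7.0, 7.1, 7.2]    # spans less than half a log unit
assay_sd = 0.3

bound_wide = max_r_squared(wide_series, assay_sd)
bound_narrow = max_r_squared(narrow_series, assay_sd)

print(f"wide series:   best achievable R-squared = {bound_wide:.3f}")
print(f"narrow series: best achievable R-squared = {bound_narrow:.3f}")
```

For the narrow series the assumed measurement error exceeds the spread in the data, so there is nothing left for a model to explain.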
There are some other points that you need to pay attention to when building regression-based ML models. First, replicate measurements for the response variable (the quantity that you’re trying to predict) should be normally distributed and this is one reason why we model pIC50 rather than IC50. Second, the compounds in the training set should be uniformly distributed in the descriptor space (my view, expressed in B2009, is that many 'global' predictive models are actually ensembles of local models). Third, either the descriptors should not be strongly correlated or the method used to build the regression-based ML model must be able to account for relationships between descriptors (while it’s relatively straightforward to handle linear relationships between descriptors in simple regression analysis, it’s not clear how effectively this can be achieved with the more sophisticated algorithms used for building regression-based ML models).
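The first point can be illustrated with a small worked example. I’m assuming here that the error in an IC50 measurement is multiplicative (a replicate is as likely to be two-fold too high as two-fold too low), which is the usual justification for working on a logarithmic scale. The three replicate values are made up and the helper function is hypothetical.

```python
import math
import statistics

# Made-up replicate IC50 values (nM) that scatter two-fold either side of 100 nM
replicate_ic50_nm = [50.0, 100.0, 200.0]

# pIC50 = -log10(IC50 / M) = 9 - log10(IC50 / nM)
replicate_pic50 = [9.0 - math.log10(v) for v in replicate_ic50_nm]

def skew_about_median(values):
    """(max - median) - (median - min): zero when the spread is symmetric."""
    med = statistics.median(values)
    return (max(values) - med) - (med - min(values))

print("IC50 asymmetry (nM):", skew_about_median(replicate_ic50_nm))
print("pIC50 asymmetry:", round(skew_about_median(replicate_pic50), 6))
```

The replicates are lopsided on the IC50 scale but symmetric on the pIC50 scale, which is what you’d want for a response variable that is assumed to be normally distributed.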
I’ve created a graphic (Figure 1) to illustrate some of the modelling difficulties that result from uneven coverage in the descriptor space and it goes without saying that reality will be way more complex. The entities that occupy this chemical space are compounds and the coordinates of a point show the values of the descriptors X1 and X2 that have been calculated from the corresponding chemical structures (the terms ‘2D structure’ and ‘molecular graph’ are also used). I’ve depicted real compounds for which measured data are available as black circles and virtual compounds (for which predictions are to be made) as five-pointed stars. The clusters (colour-coded but also labelled A, B and C in case any readers are colour blind) are much more clearly defined than would be the case in a real chemical space. Proximity in chemical space implies similarity between compounds and the clusters might correspond to three different structural series.
Let’s suppose that we’ve been able to build a useful local model to predict pIC50 for each cluster even though we’ve not been able to build a predictively useful global model. Under this scenario you’d have a relatively high degree of confidence in the pIC50 values predicted for the virtual compounds (depicted as five-pointed stars) that lie within the clusters and a much lower degree of confidence in the virtual compound that is indicated by the arrow. If, however, we were to ignore the structure of the data and take a purely global view then we would conclude that the virtual compound indicated by the arrow occupied a central location in this region of chemical space and that the other three virtual compounds occupied peripheral locations. Put another way, the applicability domain of the model is not a single contiguous region of chemical space and what would appear to be an interpolation by a model is actually an extrapolation.
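The contrast between the global view and the local view can be made concrete with a few lines of code. The coordinates below are made up to mimic three well-separated clusters (they are not taken from Figure 1) and I’ve assumed that Euclidean distance in the X1-X2 plane is a reasonable measure of similarity.

```python
import math

# Made-up (X1, X2) coordinates for the training compounds in three clusters
training = {
    "A": [(1.0, 1.0), (1.5, 0.8), (0.8, 1.4), (1.2, 1.2)],
    "B": [(9.0, 1.0), (9.4, 1.3), (8.7, 0.8), (9.1, 1.5)],
    "C": [(5.0, 9.0), (5.3, 8.6), (4.8, 9.3), (5.1, 8.8)],
}
points = [p for cluster in training.values() for p in cluster]

def centroid(pts):
    """Mean position of a list of 2D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def nearest_distance(query, pts):
    """Distance from the query to its closest training compound."""
    return min(math.dist(query, p) for p in pts)

global_centre = centroid(points)
inside_a = (1.1, 1.0)    # virtual compound sitting within cluster A
between = global_centre  # virtual compound in the gap between the clusters

for label, query in (("inside A", inside_a), ("between clusters", between)):
    print(f"{label}: distance to global centre = "
          f"{math.dist(query, global_centre):.2f}, "
          f"distance to nearest training compound = "
          f"{nearest_distance(query, points):.2f}")
```

The compound between the clusters looks ideal when judged by its distance from the centre of the training set although it has no close neighbours at all, and that is why I’d regard nearest-neighbour distance as the more honest indicator of whether a prediction is an interpolation.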
It is important to take account of correlations between descriptors when building prediction models. A commonly employed tactic is to perform principal component analysis (PCA) which generates a new set of orthogonal descriptors and also provides an assessment of the dimensionality of the descriptor space. There are also ways to deal with correlations between descriptors in the model building process (PLS is the best known of these and the K1999 review might also be of interest). Correlations between descriptors also complicate interpretation of ML models and my stock response to any claim that an ML model is interpretable would be to ask how relationships between descriptors had been accounted for in the modelling of the data. An excellent illustrative example (see L2012) of a correlation between descriptors is the tendency of the presence of a basic nitrogen in a chemical structure to be associated with higher values of the Fsp3 descriptor (which, as pointed out in this post, should really be referred to as the I_ALI descriptor).
Let’s take another look at Figure 1. The axes of the ellipse representing Cluster A are aligned with the axes of the figure which tells us that X1 and X2 are uncorrelated for the compounds in this cluster. Cluster B is also represented by an ellipse although its axes are not aligned with the axes of the figure which implies a linear correlation between X1 and X2 for the compounds in this cluster (you can use PCA to create two new orthogonal descriptors by rotating the plot around an axis that is perpendicular to the X1-X2 plane). Cluster C is a bigger problem because the correlation between X1 and X2 is non-linear (the cluster is not represented as an ellipse) and it would be rather more difficult to generate two new orthogonal descriptors for the compounds in this cluster. My view is that PCA is less meaningful when there is a lot of clustering in data sets and I would also question the value of PLS and related methods in these situations.
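For two descriptors the PCA rotation described for Cluster B can be written down directly because the angle of the principal axis follows from the variances and the covariance. This is a sketch for the two-descriptor case only, using made-up values for a linearly correlated cluster, and the function names are my own.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pca_rotate(x1, x2):
    """Rotate mean-centred (X1, X2) onto the principal axes.

    Returns (angle in radians, PC1 scores, PC2 scores).
    """
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11, s22, s12 = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)
    # Rotation angle that zeroes the covariance between the new axes
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * (a - m1) + s * (b - m2) for a, b in zip(x1, x2)]
    pc2 = [-s * (a - m1) + c * (b - m2) for a, b in zip(x1, x2)]
    return theta, pc1, pc2

# Made-up descriptor values for a cluster with a linear X1-X2 correlation
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]

theta, pc1, pc2 = pca_rotate(x1, x2)
print(f"covariance before rotation: {covariance(x1, x2):.3f}")
print(f"rotation angle: {math.degrees(theta):.1f} degrees")
print(f"covariance after rotation: {covariance(pc1, pc2):.3e}")
```

A single rotation like this removes only the linear part of a relationship between descriptors, so it would not produce independent descriptors for a curved cluster such as Cluster C, and a rotation fitted to all three clusters together need not suit any one of them.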
Let’s consider another scenario by supposing that we’ve been unable to build a useful local model for predicting pIC50 within any of the three clusters in Figure 1. If, however, the average pIC50 values differ between the three clusters we can still extract some predictivity from the data by finding a function of X1 and X2 that correlates with the average pIC50 values for the clusters. This is one way that clustering of compounds in the descriptor space can trick you into thinking that a global model has a broader applicability domain than is actually the case. Under this scenario it would be very unwise to try to interpret the model or use it to make predictions for compounds that sit outside the clusters.
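This scenario is easy to reproduce with made-up numbers. In the sketch below a single descriptor X stands in for the function of X1 and X2, and I’ve constructed three clusters so that there is no trend at all within any cluster while the cluster averages differ. None of these values are real measurements.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up clusters: (descriptor values, pIC50 values) with no within-cluster trend
clusters = {
    "A": ([1.0, 1.2, 1.4, 1.6], [5.2, 4.8, 4.8, 5.2]),
    "B": ([5.0, 5.2, 5.4, 5.6], [6.7, 6.3, 6.3, 6.7]),
    "C": ([9.0, 9.2, 9.4, 9.6], [8.2, 7.8, 7.8, 8.2]),
}

all_x = [v for xs, _ in clusters.values() for v in xs]
all_y = [v for _, ys in clusters.values() for v in ys]

print(f"global r = {pearson(all_x, all_y):.2f}")
for label, (xs, ys) in clusters.items():
    print(f"cluster {label}: r = {pearson(xs, ys):.2f}")
```

The global correlation is strong even though the descriptor tells you nothing about which compound is more potent within a cluster, so the ‘model’ is really just a lookup of three cluster averages.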
This is a good point at which to wrap up my post on regression-based ML (or QSAR-like if you prefer) models for predicting biological activity and other properties relevant to drug design such as aqueous solubility, permeability and metabolic stability. There appears to be a general consensus that building these models requires a lot of data and, in my view, this means that models like these are actually of limited utility in real-world drug design. The basic difficulty is that a project team with enough data for building useful regression-based ML models is likely to be at a relatively advanced stage (the medicinal chemists will already understand the structure-activity relationships and be aware of project-specific issues such as poor aqueous solubility or high turnover by metabolic enzymes). Drug discovery scientists tend to be less aware of the problems that arise from clustering of compounds in descriptor space and, in my view, this is a factor that should be considered by those seeking to assemble data sets for benchmarking (see W2024). I'll leave you with a suggestion (it was considered a terrible idea at the time and probably still is by most ML thought leaders) I made over twenty years ago that each predicted value should be accompanied by chemical structures and measured values for the three closest neighbours in the descriptor space of the model.
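The suggestion about neighbours is simple to implement and a minimal sketch follows. The compound identifiers, descriptor values and pIC50 values are all hypothetical (identifiers stand in for the chemical structures) and I’ve assumed unscaled Euclidean distance in a two-descriptor space, whereas a real implementation would need to use whatever descriptors and scaling the model itself uses.

```python
import math
from typing import NamedTuple

class Compound(NamedTuple):
    name: str           # stands in for the chemical structure
    descriptors: tuple  # position in the descriptor space of the model
    pic50: float        # measured value

def nearest_neighbours(query, training, k=3):
    """Return the k training compounds closest to the query, with distances."""
    ranked = sorted(training, key=lambda c: math.dist(query, c.descriptors))
    return [(c, math.dist(query, c.descriptors)) for c in ranked[:k]]

# Hypothetical training set (illustration only)
training_set = [
    Compound("cpd-1", (1.0, 1.1), 6.2),
    Compound("cpd-2", (1.3, 0.9), 6.5),
    Compound("cpd-3", (0.8, 1.4), 5.9),
    Compound("cpd-4", (9.0, 1.0), 7.8),
    Compound("cpd-5", (5.0, 9.0), 4.9),
]

query = (1.0, 1.0)  # descriptors of the virtual compound being predicted
for compound, distance in nearest_neighbours(query, training_set):
    print(f"{compound.name}: measured pIC50 = {compound.pic50}, "
          f"distance = {distance:.2f}")
```

Reporting the neighbours alongside the prediction lets the medicinal chemist see at a glance whether the model has anything relevant to go on, which a bare number can never do.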