Tuesday, 31 December 2024

Natural Intelligence?

**********************************************************
My pulse will be quickenin'
With each drop of strychnine
We feed to a pigeon
It just takes a smidgin
To poison a pigeon in the park

Tom Lehrer, Poisoning Pigeons in the Park | video
*********************************************************

I’ll be reviewing the H2024 study (“Occurrence of ‘Natural Selection’ in Successful Small Molecule Drug Discovery”) in this post. Derek has already posted on the H2024 study which has been included in the BL2024 Virtual Special Issue on natural products (NPs) in medicinal chemistry. I'll also mention reviews here at Molecular Design of the related studies (4) (see post) and (24) (see post). As is usual for Molecular Design reviews of literature I have used the same reference numbers that were used in H2024 and quoted text is indented with any comments by me in square brackets and italicised in red. Given the serious concerns I have about H2024 this is going to be a long post and there are a couple of disclaimers that I need to make before starting the review:

  1. I regard identification and biological characterisation of NPs as vital scientific activities that should be generously funded and Derek puts it very well in his recent post ("When you see specific and complex small molecules that living creatures are going to the metabolic trouble to prepare, there are surely survival-linked functions behind them."). In particular, I see it as important that NPs be screened in diverse phenotypic assays and here’s a link to the Chemical Probes Portal. While my criticisms of H2024 are certainly serious it would be grossly inaccurate to take these criticisms as indicative of an anti-NP position.
  2. Automation of workflows (N2017) and generation of datasets from databases such as ChEMBL are far from trivial and (33), which highlights some of the challenges faced by researchers in this area, was the subject of a recent post at Molecular Design. I consider method development in this area to be an important cheminformatic activity that should be adequately supported. It must also be stressed that the design, building and updating of databases such as ChEMBL (G2012 | B2014 | P2015 | G2017 | 23) are vital scientific activities that should be generously funded (had it not been for the vision and foresight of the creators of the PDB over half a century ago it is improbable that the 2024 Chemistry Nobel Prize would have been awarded for “computational protein design” and “protein structure prediction”). While my criticisms of H2024 are certainly serious it would be grossly inaccurate to take these as criticisms of the automated dataset generation described in the study (and recently published in H2024b) or of the contributions by a number of individuals that have made ChEMBL an invaluable resource for drug discovery scientists and chemical biologists.

Hampi, November 2013

Having made the disclaimers, I’ll open my review of H2024 with some general observations. First, I do not consider that H2024 presents any insights of practical value to medicinal chemists nor do I consider the analyses presented in the study to support the assertion that “there is untapped potential awaiting exploitation, by applying nature’s building blocks─’natural intelligence’─to drug design” (in my view the use of the term “natural intelligence” does rather endow the study with what I’ll politely refer to as a distinctly pastoral odour). Second, the results of the analyses presented in H2024 do not demonstrate any tangible benefits from the drug design perspective of incorporating structural features that have been anointed as 'natural' by the authors (my view is that it would be extremely difficult to design data analyses to address the relevant questions in an objective manner). Third, the authors of H2024 present a ‘scaffold-centric’ view of NPs in which the naturalness of NPs is due to cyclic substructures present within their chemical (2D) structures (it is almost as if these 'natural' substructures are considered to be infused with 'vital force') and I would question whether this is a realistic view from the molecular recognition and physicochemical perspectives.  Fourth, the meaning of what the authors of H2024 are calling 'enrichment' of pseudo-NPs (PNPs) in clinical compounds is unclear and, in any case, the 'enrichment' values do seem rather low (never more than twofold) when you consider the numbers of compounds that successful discovery project teams typically have to synthesize in order to deliver a drug that gets to market.

It's not clear (at least to me) what the authors of H2024 mean by ‘natural selection’ and at times their view of natural selection appears to be closer to Lysenkoism than Darwinism. For example, they assert in the conclusions section of H2024 that “NP structural motifs are provided predesigned by nature, constructed for biological purposes as a result of 4 billion years of evolution.” Design actually has no place in natural selection and perhaps the authors are thinking of 'Intelligent Design' which is a doctrine with many adherents in the Creationist community. While I don’t dispute that the chemical structures of many clinical compounds contain substructures that are also found in the chemical structures of NPs, I think that it would be extremely difficult to objectively compare different explanations for the observations (it's worth remembering that correlation does not imply causation). The explanation favoured by the authors of H2024 is that compounds assembled from Nature’s building blocks are ‘better’ and a stated aim of the study is “to seek further support for the existence of ‘natural selection’ in drug discovery” (this video will give readers an idea of what the late great Dave Allen might have made of this). In my view the data analyses presented in H2024 are not actually based on statistics and are therefore unfit for the purpose of testing hypotheses. Put another way, if you're going to use data analysis to look for something then it would be a good idea to use methods capable of telling you that you haven't found what you were looking for.
 
The data analyses in H2024 are largely based on quantities (PNP_Status | Frag_coverage_Murcko | NP-likeness) that are calculated from the chemical (2D) structures of compounds.  However, the authors do not state which software was used to perform the calculations and, had I been a reviewer, I would have drawn their attention to the following directive in the Data Requirements section in the J Med Chem Author Guidelines (accessed 27-Dec-2024):

9. Software. Software used as a part of computer-aided drug design should be readily available from reliable sources, and the authors should specify where the software can be obtained.

As was the case for my review of (24) I see much of the analysis in H2024 as relatively harmless “stamp collecting” (in contrast, as discussed in KM2013, I consider presentations and analyses of data that exaggerate trend strength, such as those used in the HMO2006, LS2007, LBH2009, HY2010 and TY2020 studies, to be anything but harmless). The analyses that I’ll be examining in this post are of comparisons between clinical compounds and reference compounds although I'll comment in general terms on the analyses of time-dependencies of characteristics of clinical compounds. My general criticism of H2024 is not that the analyses presented by its authors are necessarily invalid but that they fail to provide any useful insight and I’ll share an insightful observation by Manfred Eigen (1927-2019):

A theory has only the alternative of being right or wrong. A model has a third possibility: it may be right, but irrelevant.

I first encountered analyses of time-dependencies of drug properties about two decades ago and rapidly came to the conclusion that some senior medicinal chemists where I worked had a bit too much time on their hands. The fundamental flaw in the interpretation of these analyses is that time-dependencies of the properties of drugs and other clinical compounds are presented as causes rather than effects and it has never been clear how medicinal chemists working on drug discovery projects in the real world should use the results from such analyses. The authors claim that “changes to drug properties over time are significant” and I would challenge them to present even a single example of such analysis being used to meaningfully inform decision-making in a drug discovery project. It must be stressed that my criticism of analyses of time-dependency of the properties of drugs and other clinical compounds is simply that they don't provide useful insights and not that the analyses are necessarily invalid. That said, I do have general concerns about how time-dependencies are compared when some of the properties are expressed as logarithms and some are not. As a reviewer I would have recommended that the vertical axis of the plot in the graphical abstract be drawn from 0% to 100% rather than from 30% to ~67%.

As is the case for analyses of time-dependency, my criticism of analyses of the differences between clinical compounds and reference compounds is that they don’t provide useful insight and there is no suggestion that the analyses are necessarily invalid. Before looking at the analyses presented in H2024 I’ll quote from the abstract of (24) because this will give you an idea of what I mean by analyses not providing useful insight:

Drugs are differentiated from target comparators by higher potency, ligand efficiency (LE), lipophilic ligand efficiency (LLE), and lower carboaromaticity.

As I noted in this post (this focused principally on the invalidity of the LE metric as discussed in NoLE) reporting that an analysis has shown drugs to be differentiated by potency from target comparators does seem to be stating the obvious and, given how LE and LLE are defined, it is perhaps not the most penetrating of insights to observe that values of these efficiency metrics tend to be greater for drugs than for comparator compounds. While the observation of lower carboaromaticity of drugs relative to comparator compounds is non-obvious, it does not constitute information that can be used for medicinal chemistry decision-making in specific discovery projects (as we noted in KM2013 carboaromaticity and lipophilicity can both be reduced simply by replacing a benzene ring with benzoquinone).

Let’s take a look at how this type of analysis is used in H2024. The authors of H2024 note that “comparing Figure 3a,b shows a clear ‘enrichment’ of PNPs in clinical compounds versus reference compounds in the post-2008 period” and two of these authors, writing in (17), assert that “PNPs have increasingly been explored in recent drug discovery programs, and are strongly enriched in clinical compounds”. What the authors of H2024 are calling 'enrichment' is rather different to the enrichment in structural features that results from high-throughput screening (HTS) and it’s important to understand the difference. Let’s suppose that we’ve screened a library of compounds of which 1% are pyrimidines and 1% are pyrazines and we find that 10% of the hits are pyrimidines and 0.1% are pyrazines (to simplify things you can assume there is no compound in the library with a pyrimidine and a pyrazine in its chemical structure). In this case we would conclude that the process of screening has resulted in a tenfold enrichment for pyrimidines and a tenfold impoverishment for pyrazines. Now let's create a 'selected azines' category by combining the pyrimidines and pyrazines which as a structural class comprise 2% of the screening library compounds but 10.1% of the hits. What I'm getting at here is that enrichment of a more inclusive structural class such as 'selected azines' (or PNPs) does not imply that each and every one of the structural classes covered by the inclusive structural class definition will also be enriched.
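The pyrimidine/pyrazine arithmetic can be sketched as follows (a minimal sketch using the hypothetical percentages from the example above):

```python
# Hypothetical screening library composition: 1% pyrimidines, 1% pyrazines
lib = {"pyrimidine": 0.01, "pyrazine": 0.01}
# Hypothetical hit list composition: 10% pyrimidines, 0.1% pyrazines
hits = {"pyrimidine": 0.10, "pyrazine": 0.001}

def enrichment(cls):
    """Fold-enrichment of a structural class in hits relative to the library."""
    return hits[cls] / lib[cls]

print(enrichment("pyrimidine"))  # ~10-fold enrichment
print(enrichment("pyrazine"))    # ~0.1, i.e. tenfold impoverishment

# Combine into the more inclusive 'selected azines' class
lib_azines = lib["pyrimidine"] + lib["pyrazine"]    # 2% of the library
hit_azines = hits["pyrimidine"] + hits["pyrazine"]  # 10.1% of the hits
print(hit_azines / lib_azines)   # ~5-fold 'enrichment' for the combined class
```

Note that the combined class shows a healthy-looking 'enrichment' even though one of its member classes (the pyrazines) is actually impoverished, which is exactly the point being made above.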

Now let’s take a look at how the 'enrichment' of PNPs in clinical compounds is assessed in H2024. First, a set of reference compounds is generated for each clinical compound (this is discussed in detail in H2024b) and the sets of reference compounds are combined. 'Enrichment' is then assessed by comparing the fraction of clinical compounds that are PNPs with the fraction of compounds in the combined reference sets that are PNPs. When we assess enrichment of chemotypes in HTS the hits are all selected (by the screening process) from the same reference pool of compounds. In contrast, each clinical compound in the H2024 analysis is associated with a different reference set of compounds (from the perspective of data analysis combining reference sets defined in this manner gratuitously throws information away). As a reviewer I would have pressed the authors to enlighten readers as to how they should interpret the proportions of PNPs in the reference sets for individual compounds.

It's worth thinking about what the reference compound set might look like for a clinical compound that is a PNP. The proportion of PNPs in the reference set will generally be influenced by factors such as availability of data, the ‘rarity’ of the structural features of the drug and the ‘tightness’ of the structure-activity relationship (SAR).  A more permissive definition of ‘activity’ would generally be expected to make SAR appear to be less ‘tight’ (or ‘looser’ if you prefer). Compounds were defined as ‘active’ for the analysis on the basis of a recorded pChEMBL value against one of the clinical compound’s targets (as a reviewer I’d have suggested that the authors define the term ‘pChEMBL’) which means that a compound might have been selected for inclusion in a reference set on the basis of an IC50 value of 100 μM.
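For readers unfamiliar with the term, pChEMBL is the negative base-10 logarithm of the molar activity value (IC50, EC50, Ki and so on) recorded in ChEMBL, so the 100 μM threshold mentioned above corresponds to a pChEMBL value of 4. A minimal sketch of the conversion:

```python
import math

def pchembl(value_nM):
    """pChEMBL: negative log10 of a molar activity value (IC50, Ki, EC50, ...).
    ChEMBL stores standardised values on a nanomolar scale, hence the 1e-9."""
    return -math.log10(value_nM * 1e-9)

# An IC50 of 100 uM (100,000 nM) corresponds to a pChEMBL value of 4
print(pchembl(100_000))  # ~4
# For comparison, a respectable 10 nM IC50 gives a pChEMBL value of 8
print(pchembl(10))       # ~8
```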

Let’s define 'enrichment' by dividing the fraction of the clinical compounds that are PNPs by the fraction of reference compounds that are PNPs. When we select a reference set for a clinical compound that is a PNP then it’s extremely unlikely that every single compound in the reference set will also be a PNP (especially if we’re accepting compounds with IC50 values of 100 μM as ‘active’) and it’s even less likely that every single compound in the combined reference sets will be a PNP. This means that we should generally expect the clinical compounds that are PNPs to be ‘enriched’ in PNPs when compared with their combined reference sets. We can apply exactly the same logic to conclude that the combined reference sets for the clinical compounds that are not PNPs will generally contain at least some PNPs (under this scenario we would conclude that the set of clinical compounds that are not PNPs is infinitely impoverished in PNPs when compared with their combined reference sets). This means that we should expect that the 'enrichment' of PNPs in the clinical compound set in comparison with their combined reference sets will increase with the fraction of clinical compounds that are PNPs.

Let’s take another look at the plot in the graphical abstract which shows the fractions of clinical compounds and reference compounds that are PNPs as a function of time. Notice how the lines tend to be furthest apart when the fraction of clinical compounds that are PNPs is relatively high. As a reviewer, I would have required that the authors examine the correlation between the logarithm of the fraction of clinical compounds and the logarithm of the enrichment (a relatively strong correlation would indicate that the information added by the combined reference sets is minimal). The 'enrichments' calculated from the plot in the graphical abstract are underwhelming (the highest degree of enrichment is the 2014 value of just over 1.5-fold and this value seems very low when you consider the numbers of compounds that successful discovery project teams typically need to synthesize in order to get drugs approved).  From 2011 the fraction of clinical compounds that are PNPs exceeds 50% but I wouldn't consider it accurate to use the term "strongly enriched" (17) because the fraction of reference compounds that are PNPs is 40% or greater for this time period (plotting the vertical axis in the graphical abstract from 30% to ~67%  creates the illusion that the 'enrichment' is greater than it actually is).
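Using the figures quoted above from the graphical abstract (a clinical PNP fraction just above 50% post-2011 and a reference PNP fraction of 40% or more), the 'enrichment' works out as follows (a back-of-the-envelope sketch; the fractions are read off the plot so treat the numbers as approximate):

```python
def enrichment(frac_clinical_pnp, frac_reference_pnp):
    """'Enrichment' as used in the text: the PNP fraction among clinical
    compounds divided by the PNP fraction among combined reference compounds."""
    return frac_clinical_pnp / frac_reference_pnp

# Approximate post-2011 figures read off the graphical abstract
print(enrichment(0.55, 0.40))  # ~1.4-fold: hardly "strongly enriched"

# With the reference fraction roughly constant, 'enrichment' simply
# tracks the clinical PNP fraction
for f_clin in (0.45, 0.55, 0.65):
    print(f_clin, enrichment(f_clin, 0.40))
```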

I do have a number of other gripes about the data analysis in H2024 but I do also need to take a look at PNPs and the following assertion by the authors is an appropriate point at which to start this discussion:

The PNP concept has been validated by its appearance in the literature (16,17) and by the design of several new classes of biologically active compounds. (18,19) [As a reviewer I would have pressed the authors to clearly articulate the “PNP concept” (just as I would have pressed the authors of this Editorial to clearly articulate the new principles that their nominees for the Nobel Prize in Physiology or Medicine had introduced). My view is that it is verging on megalomania to claim that a concept “has been validated by its appearance in the literature” and I don’t consider (18) to support the claim for “design of several new classes of biologically active compounds”. To support such a claim, one would ideally need to demonstrate that screening of libraries of compounds designed as PNPs resulted in the discovery of viable lead series against a range of therapeutic targets. At an absolute minimum, one would need to show that libraries of compounds designed as PNPs exhibited exploitable activity across a range of target-related assays (although interesting, the results from the “cell painting assay” would not by themselves support a claim for “design of several new classes of biologically active compounds”). I should also mention that some in the compound quality field (see B2023 and my review of that article) interpret activity against multiple targets for a set of compounds based on a particular scaffold as evidence for pan-assay interference even when the individual compounds don’t themselves exhibit frequent-hitter behaviour. I don't have access to (19) and am therefore unable to assess the degree to which that article supports the authors' claim for “design of several new classes of biologically active compounds”.]

The PNP status of a compound is determined by how “NP library fragments” (these are cyclic substructures extracted from the chemical structures of compounds in an NP-focussed screening library that had been generated over a decade ago for fragment-based drug discovery) are combined in its chemical structure.
 
PNP_Status. Compounds were assigned to one of four categories according to their NP fragment combination graphs. (16,17) The NP library fragments used for this purpose are Murcko scaffolds (26) [It would actually be more appropriate to refer to these as ‘Bemis scaffolds’ in order to properly recognize the corresponding author of this article.] (the core structures containing all rings without substituents except for double bonds, n = 1673) derived (16) from a representative set of 2000 NP fragment clusters. (15) [I see this approach as unlikely to capture all the relevant cyclic substructures present in NPs. My view is that it would have been better to first extract the relevant cyclic substructures from the chemical structures of all NPs for which this information is available, and then do the selection and filtering in one or more subsequent steps. The other advantage of doing things this way is that you’ll get a better assessment of the frequencies with which the different cyclic substructures occur in the chemical structures of NPs.] Because of their ubiquitous appearances in NPs, the phenyl ring and glucose moieties were specifically excluded as fragments. (16) [I would expect exclusion of the benzene ring (I consider ‘benzene ring’ more correct than ‘phenyl ring’ in this context) as a fragment to result in a significant reduction in the number of compounds that are considered to be PNPs (and, by implication, the ‘enrichment’ associated with membership of the PNP class). Even though the benzene ring has been excluded for the purpose of assigning PNP status it should still be considered to be one of Nature’s building blocks.]

As I mentioned earlier in the post, the view of NPs presented in H2024 is ‘scaffold-centric’ and I would question how realistic this view is given that non-scaffold atoms at the periphery of a molecular structure will generally be more exposed to targets (and anti-targets) than scaffold atoms at the core of the molecular structure. What I’m getting at here is that it is far from clear how much of a compound’s pharmacological activity can be attributed to the presence of individual substructural features in the chemical structure of the compound (modifying a point made in NoLE, I would argue that the contribution of a structural feature to the binding affinity of a compound is not actually an experimental observable). This is one reason that unless matched molecular pairs are available it would not generally be possible to demonstrate the superiority of one structural feature over another in an objective manner.

Something that you need to pay very close attention to when extracting substructures from chemical structures of compounds is the ‘environment’ of the substructure (I prefer to use the term ‘substructural context’). For example, two piperidine rings linked through nitrogen look very different from the perspective of a therapeutic target protein depending on whether the link is a carbonyl carbon or a tetrahedral carbon (most medicinal chemists will be aware that the protonation states differ but there are also subtle, although still significant, differences in the shape of the piperidine ring in the two substructures). You also need to be aware that fusing rings can have profound effects on physicochemical characteristics and I would consider it a bad idea to extract monocyclic substructures from fused or bicyclic ring systems.

There are some things that don't look quite right and I would have flagged these up if I’d been reviewing the manuscript. Let’s take a look at the first entry (Sotorasib) in Table 1 and you can see that the oxygen of the 2-pyrimidone substructure is coloured lilac indicating that this substructure can be found in the chemical structures of one or more NPs (I would still challenge the view that the result of fusing 2-pyrimidone with pyridine should be considered 'natural' on the basis that the heterocycles from which it is derived are both found in chemical structures of NPs). Now take a look at the second entry (Dolutegravir) in Table 1 and you'll notice that the oxygen in the 4-pyridone substructure is not coloured green. This implies that 4-pyridone does not occur in the chemical structure of any NP and, in the absence of information, I can only assume that it has been anointed as 'natural' because of its structural analogy with pyridine (while there is a nitrogen atom and five trigonal carbon atoms in each substructure the molecular recognition characteristics of the two substructures differ far too much for them to be regarded as equivalent from the perspective of assigning PNP status). Six of the substructures in Figure 5 appear to be in unstable tautomeric forms (first, fifth, ninth, twelfth entries in line 2 | seventh entry in line 3 | first entry in line 5).

I'll conclude my review of H2024 by commenting on claims made by the authors:

This is further evidence that the three NP metrics can be considered as independent measures of clinical compound quality. [I would consider the claim that any of these “NP metrics” can be considered as a measure of “clinical compound quality” to be wildly extravagant (the authors haven't even stated how "clinical compound quality" is defined yet they claim to be able to measure it). I would argue that compound quality cannot be meaningfully compared for clinical compounds that have been developed for different diseases or disorders. Describing a compound as 'clinical' implies that a large body of measured data has actually been generated for it and the authors of H2024 might find it instructive to ask themselves why they think a simple metric calculated from the chemical structure of the compound would be of interest to a project team with access to this large body of measured data. One criticism that I make of drug discovery metrics is that they trivialize drug discovery and we noted in KM2013: “Given that drug discovery would appear to be anything but simple, the simplicity of a drug-likeness model could actually be taken as evidence for its irrelevance to drug discovery.”]

The overall results are supportive of the occurrence of “natural selection” being associated with many successful drug discovery campaigns. [My view is that the authors of H2024 have not clearly articulated what they mean by “natural selection” in the context of this study.] It has been proposed that NP-likeness assists drug distribution by membrane transporters, (21) [The author of (20c) asserts "Over the years, my colleagues and I have come to realise that the likelihood of pharmaceutical drugs being able to diffuse through whatever unhindered phospholipid bilayer may exist in intact biological membranes in vivo is vanishingly low" and, by implication, that entry of the vast majority of drugs into cells is transporter mediated. I keep an open mind on this issue although I note that what is touted by some as a universal phenomenon does seem to have been remarkably difficult to observe directly by experiment. The difficulties caused by active efflux are widely recognized by drug discovery scientists and it may be instructive for the authors of H2024 to consider how an experienced medicinal chemist working in the CNS area might view a suggestion that compounds should be made more like NPs to increase the likelihood of being transporter substrates.] and we further speculate that employing NP fragments may result in less attrition due to toxicity, a major cause of preclinical failure. (55) [This does seem to be grasping at straws. The focus of the cited article is actually clinical failure and not preclinical failure.]

There is untapped potential for further exploitation of currently used and unused NP fragments, especially in fragment combinations and the design of PNPs, without the need to resort to chemically diverse ring systems and scaffolds. [This exemplifies what can be called the ‘Ro5 mentality’ (‘experts’ advising medicinal chemists to not explore but to focus on regions of chemical space that have been blessed by the ‘experts’). As I note in this blog post Ro5 (as it is stated) is not actually supported by data and in NoLE, I advise drug designers not to “automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.” An equally plausible 'explanation' for the observation that a high fraction of clinical compounds are PNPs is simply that medicinal chemists are working with what they're most familiar with (in this case the advice would be to look beyond Nature's building blocks for inspiration).] To exploit these opportunities, “NP awareness” needs to be added to the repertoire of medicinal chemists. [My view is that it would be more important for critical thinking to be added to the repertoire of medicinal chemists so they are better equipped to assess the extent to which conclusions and recommendations of studies like H2024 are actually supported by data.]

In short, applying nature’s building blocks─natural intelligence─to drug design can enhance the opportunities now offered by artificial intelligence. [In my view "natural intelligence" appears to be arm-waving that is neither natural nor intelligent.]  

This is a good point to wrap up and to also conclude blogging for the year. My new year wish is for a kinder, happier and more peaceful World in 2025 and I'll leave you with a photo of BB and Coco in the study here in Maraval. They had been helping me with this post before I unwisely decided to explain ligand efficiency to them. Let sleeping dogs lie I guess.


 

Sunday, 20 October 2024

Assessment of AI-generated chemical structures using ML


In an earlier post I considered what it might mean to describe drug design as AI-based. In this post I’ll take a general look at using machine learning (ML) to predict biological activity (and other pharmaceutically-relevant properties) for AI-generated chemical structures. Whether or not ML models ultimately prove to be fit for this purpose it is worth pointing out that many visionaries and thought leaders who tout computation as a panacea for humanity’s ills fail to recognize the complexity of biology (take a look at In The Pipeline posts from 2007 | 2015 | 2024). One point worth emphasizing in connection with the complexity of biology is that it is not currently possible to measure the concentration of a drug at its site of action for intracellular targets in live humans (here's an article on intracellular and intraorgan drug concentration that I recommend to everybody working in drug discovery and chemical biology). While I won't actually be saying anything about AI (here's a recent post from In The Pipeline that takes a look at how things are going for early movers in the field of AI drug discovery) in the current post I'll reiterate the point with which I concluded the earlier post:

One error commonly made by people with an AI/ML focus is to consider drug design purely as an exercise in prediction while, in reality, drug design should be seen more in a Design of Experiments framework.  

In that earlier post I noted that there’s a bit more to drug design than simply generating novel molecular structures and suggesting how the compounds should be synthesized. While I'm certainly not denying the challenges presented by the complexity of biology the current post will focus on some of the challenges associated with assessing chemical structures churned out by generative AI. One way of doing this is to build models for predicting biological activity and other pharmaceutically relevant properties such as aqueous solubility, permeability and metabolic stability. This is something that people have been trying to do for many years and the term ‘Quantitative Structure-Activity Relationship’ (QSAR) has been in use for over half a century (the inaugural EuroQSAR conference was held in Prague in 1973 a mere five years after Czechoslovakia had been invaded by the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). My view is that many of the ML models that get built with drug design in mind could accurately be described as QSAR models and I would not describe QSAR models as AI.

In the current post, I'll be discussing ML models for predicting quantities such as potency, aqueous solubility and permeability that are continuous variables which I refer to as 'regression-based ML models' (while some readers will not be happy with this label I do need to make it absolutely clear that the post is about one type of ML model and the label 'QSAR-like' could also have been used). I’ll leave classification models for another post although it’s worth mentioning that genuinely categorical data are actually rare in drug discovery (you should always be wary of gratuitous categorization of continuous data since this is a popular way to disguise the weakness of trends and KM2013 will give you some tips on what to look out for). It also needs to be stressed that ML is a very broad label and that utility in one area (prediction of protein-folding for example) doesn't mean that ML models will necessarily prove useful in other areas.

To build a regression-based ML model you first need to assemble a training set of compounds for which the appropriate measurements have been made and pIC50 values are commonly used to quantify biological activity (I recommend reading the LR2024 study on combining results from different assays although, as discussed in this post, I don’t consider it meaningful to combine data from multiple pairs of assays when calculating correlation-based metrics for assay compatibility). Next, you calculate values of descriptors for the chemical structures of the compounds in your training set (descriptors are typically derived from the connectivity in the chemical structure although atom counts and predicted values of physicochemical properties are also used). Finally, you use the ML modelling tools to find a function of the descriptors that best predicts the biological activity (or a pharmaceutically-relevant property) for the compounds in the training set. Generally you should also validate your models and this is especially important for models with large numbers of adjustable parameters.
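As a deliberately simplistic sketch of the train-then-predict workflow described above, here is an ordinary least-squares fit of pIC50 against a single descriptor (real regression-based ML models use many descriptors and more sophisticated algorithms, and all names and figures here are invented for illustration):

```python
# Hypothetical training set: one descriptor value (e.g. a calculated logP)
# and a measured pIC50 per compound -- all figures are invented
descriptor = [1.2, 2.0, 2.9, 3.5, 4.1]
pic50      = [5.0, 5.6, 6.3, 6.7, 7.2]

n = len(descriptor)
mean_x = sum(descriptor) / n
mean_y = sum(pic50) / n

# Closed-form ordinary least squares for a single descriptor
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, pic50))
sxx = sum((x - mean_x) ** 2 for x in descriptor)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict pIC50 for a new compound from its descriptor value."""
    return intercept + slope * x

print(slope, intercept)
print(predict(3.0))  # predicted pIC50 for a hypothetical new compound
```

Validation (held-out compounds, comparison of prediction error with assay precision) is where much of the real work lies, particularly for models with many adjustable parameters.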

There appears to be a general consensus that you need plenty of data for building ML models and some will even say “quantity has a quality all of its own” (this is sometimes stated as Stalin’s view of the T-34 tank although I consider this unlikely and the T-34 was actually an excellent tank which also happened to get produced in large numbers). Most people building regression-based ML models are also aware that you need a sufficiently wide spread in the measured data used for training the model (the variance in the measured data should be large in comparison with the precision of the measurement). Lead optimization is typically done within structural series and building a regression-based ML model that is predictively useful is likely to require data that have been measured for compounds in the structural series of interest.  These data requirements are quite stringent and I see this as one reason that QSAR approaches do not appear to have had much impact on the discovery of drugs despite the drug discovery literature being awash with QSAR articles. Back in 2009 (see K2009) I compared prediction-driven drug design with hypothesis-driven drug design, noting that the former is often not viable and that the latter is more commonly used in pharmaceutical and agrochemical discovery (former colleagues discussed hypothesis-driven molecular design in the context of the design-make-test-analyse cycle in the P2012 article).

With freshly painted T-34 at Brest Fortress, Belarus (June 2017)

There are some other points that you need to pay attention to when building regression-based ML models.  First, replicate measurements for the response variable (the quantity that you’re trying to predict) should be normally distributed and this is one reason why we model pIC50 rather than IC50. Second, the data values for the training set should be uniformly distributed in the descriptor space (my view, expressed in B2009, is that many 'global' predictive models are actually ensembles of local models). Third, the descriptors should not be strongly correlated or the method used to build the regression-based ML model must be able to account for relationships between descriptors (while it’s relatively straightforward to handle linear relationships between descriptors in simple regression analysis it’s not clear how effectively this can be achieved with more sophisticated algorithms used for building regression-based ML models).
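On the first of those points, here is a minimal sketch of the log transformation (assuming IC50 is reported in nanomolar): replicate IC50 measurements tend to be log-normally distributed, so modelling pIC50 gives an approximately normally distributed response variable.

```python
# pIC50 = -log10(IC50 in mol/L), equivalently 9 - log10(IC50 in nM)
import math

def pic50_from_ic50_nM(ic50_nM: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 (IC50 expressed in mol/L)."""
    return 9.0 - math.log10(ic50_nM)

print(pic50_from_ic50_nM(10.0))    # 10 nM -> 8.0
print(pic50_from_ic50_nM(1000.0))  # 1 uM  -> 6.0
```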

I’ve created a graphic (Figure 1) to illustrate some of the modelling difficulties that result from uneven coverage in the descriptor space and it goes without saying that reality will be way more complex. The entities that occupy this chemical space are compounds and the coordinates of a point show the values of the descriptors X1 and X2 that have been calculated from the corresponding chemical structures (the terms ‘2D structure’ and ‘molecular graph’ are also used). I’ve depicted real compounds for which measured data are available as black circles and virtual compounds (for which predictions are to be made) as five-pointed stars. The clusters (color-coded but also labelled A, B and C in case any readers are colour blind) are much more clearly defined than would be the case in a real chemical space. Proximity in chemical space implies similarity between compounds and the clusters might correspond to three different structural series.

Let’s suppose that we’ve been able to build a useful local model to predict pIC50 for each cluster even though we’ve not been able to build a predictively useful global model. Under this scenario you’d have a relatively high degree of confidence in the pIC50 values predicted for the virtual compounds (depicted as five-pointed stars) that lie within the clusters and a much lower degree of confidence in the virtual compound that is indicated by the arrow. If, however, we were to ignore the structure of the data and take a purely global view then we would conclude that the virtual compound indicated by the arrow occupied a central location in this region of chemical space and that the other three virtual compounds occupied peripheral locations. Put another way, the applicability domain of the model is not a single contiguous region of chemical space and what would appear to be an interpolation by a model is actually an extrapolation. 

It is important to take account of correlations between descriptors when building prediction models. A commonly employed tactic is to perform principal component analysis (PCA) which generates a new set of orthogonal descriptors and also provides an assessment of the dimensionality of the descriptor space. There are also ways to deal with correlations between descriptors in the model building process (PLS is the best known of these and the K1999 review might also be of interest). Correlations between descriptors also complicate interpretation of ML models and my stock response to any claim that an ML model is interpretable would be to ask how relationships between descriptors had been accounted for in the modelling of the data. An excellent illustrative example (see L2012) of a correlation between descriptors is the tendency of the presence of a basic nitrogen in a chemical structure to be associated with higher values of the Fsp3 descriptor (which, as pointed out in this post, should really be referred to as the I_ALI descriptor).
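Here's a minimal sketch of the PCA tactic, using two synthetic descriptors constructed to be strongly linearly correlated:

```python
# Generate two strongly correlated descriptors, then transform them into
# orthogonal principal components. The descriptors are simulated, not
# calculated from real chemical structures.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)  # X2 strongly correlated with X1
X = np.column_stack([x1, x2])

pcs = PCA(n_components=2).fit_transform(X)
r_raw = np.corrcoef(X.T)[0, 1]    # close to 1: raw descriptors correlated
r_pcs = np.corrcoef(pcs.T)[0, 1]  # zero (to numerical precision): orthogonal
print(r_raw, r_pcs)
```

Note that this only removes linear relationships between descriptors, which is exactly why Cluster C in Figure 1 (non-linear correlation) is the harder case.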

Let’s take another look at Figure 1. The axes of the ellipse representing Cluster A are aligned with the axes of the figure which tells us that X1 and X2 are uncorrelated for the compounds in this cluster.  Cluster B is also represented by an ellipse although its axes are not aligned with the axes of the figure which implies a linear correlation between X1 and X2 for the compounds in this cluster (you can use PCA to create two new orthogonal descriptors by rotating the plot around an axis that is perpendicular to the X1-X2 plane). Cluster C is a bigger problem because the correlation between X1 and X2 is non-linear (the cluster is not represented as an ellipse) and it would be rather more difficult to generate two new orthogonal descriptors for the compounds in this cluster. My view is that  PCA is less meaningful when there is a lot of clustering in data sets and I would also question the value of PLS and related methods in these situations. 

Let’s consider another scenario by supposing that we’ve been unable to build a useful local model for prediction of any of the three clusters in Figure 1.  If, however, the average pIC50 values differ for each of the three clusters we can still extract some predictivity from the data by finding a function of X1 and X2 that correlates with the average pIC50 values for the clusters. This is one way that clustering of compounds in the descriptor space can trick you into thinking that a global model has a broader applicability domain than is actually the case. Under this scenario it would be very unwise to try to interpret the model or use it to make predictions for compounds that sit outside the clusters. 

This is a good point at which to wrap up my post on regression-based ML (or QSAR-like if you prefer) models for predicting biological activity and other properties relevant to drug design such as aqueous solubility, permeability and metabolic stability. There appears to be a general consensus that building these models requires a lot of data and, in my view, this means that models like these are actually of limited utility in real world drug design. The basic difficulty is that a project team with enough data for building useful regression-based ML models is likely to be at a relatively advanced stage (the medicinal chemists will already understand the structure-activity relationships and be aware of project-specific issues such as poor aqueous solubility or high turnover by metabolic enzymes). Drug discovery scientists tend to be less aware of the problems that arise from clustering of compounds in descriptor space and, in my view, this is a factor that should be considered by those seeking to assemble data sets for benchmarking (see W2024). I'll leave you with a suggestion (it was considered a terrible idea at the time and probably still is by most ML thought leaders) I made over twenty years ago that each predicted value should be accompanied by chemical structures and measured values for the three closest neighbours in the descriptor space of the model.
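That closing suggestion can be sketched as follows; the compound identifiers, descriptor values and measured pIC50 values are all invented for illustration:

```python
# Alongside each prediction, report the training-set compounds closest in the
# descriptor space of the model, together with their measured values.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
measured_pIC50 = [6.2, 6.8, 5.9, 8.1]
ids = ["CPD-1", "CPD-2", "CPD-3", "CPD-4"]  # hypothetical identifiers

nn = NearestNeighbors(n_neighbors=3).fit(X_train)
dist, idx = nn.kneighbors([[0.2, 0.1]])  # descriptors of a virtual compound
for d, i in zip(dist[0], idx[0]):
    print(f"{ids[i]}: measured pIC50 {measured_pIC50[i]} (distance {d:.2f})")
```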

Wednesday, 18 September 2024

Variability in biological activity measurements reported in the drug discovery literature

I'll open the post with a panorama from the summit of Shutlingsloe, sometimes referred to as Cheshire's Matterhorn, which at 506 m above sea level, is the third highest point in the county. When in the UK, I usually come here to mark the solstices and there's usually a good crowd here for the occasion (the winter solstice tends to be less well attended). 

  

The LR2024 study (Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise) that I’ll be discussing in this post highlights one of the issues that you’re likely to encounter should you be using public domain databases such as ChEMBL to create datasets for building machine learning (ML) models for biological activity. The LR2024 study has already been reviewed in a Practical Fragments post (The limits of published data) and, using the same reference numbers as were used in the study, I’ll also mention 10 (The Experimental Uncertainty of Heterogeneous Public Ki Data) and 11 (Comparability of Mixed IC50 Data – A Statistical Analysis). The variability in biological activity data highlighted by LR2024 stems in part from the fact that the term IC50 may refer to different quantities even when measurements are performed for the same target and inhibitor/ligand (the issue doesn’t entirely disappear when you use Ki values). I have two general concerns with the analysis in the LR2024 study. First, it is unclear whether the ChEMBL curation process captures assay conditions in sufficient detail to enable the user to establish that two IC50 values can be regarded as replicates of the same experiment (I stress that this is not a criticism of the curation process). Second, combining data for different pairs of assays for calculation of correlation-based measures of assay compatibility can lead to correlation inflation. One minor gripe that I do have with the LR2024 study concerns the use of the term “noise” which, in my view, should only refer to variation in values measured under identical conditions.

I'll review LR2024 in the first part of the post before discussing points not covered by the study such as irreversible inhibition and assay interference (these can cause systematic differences in IC50 values to be observed for a particular combination of target and inhibitor even when the assays use the same substrate at the same concentration). There will be a follow up post covering how I would assemble data sets for building ML models for biological activity with some thoughts on assessment and curation of published biological activity data. As is usual for blog posts here at Molecular Design, quoted text is indented with my comments enclosed in square brackets in red italics.

In the Compatibility Issues section the authors state:

Looking beyond laboratory-to-laboratory variability of assays that are nominally the same, there are numerous reasons why literature results for different assays measured against the same “target” may not be comparable. These include the following:

  1. Different assay conditions: these can include different buffers, experimental pH, temperature, and duration. [Biochemical assays are usually run at human body temperature (37°C) although assay temperature is not always reported. The term 'duration' is pertinent to irreversible inhibition and one has to be very careful when comparing IC50 values for irreversible inhibitors. It's worth mentioning that a significant reduction in activity when an assay is run in the presence of detergent (see FS2006) is diagnostic of inhibition by colloidal aggregates (see McG2003). I categorized inhibition of this nature as “type 2 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial.] 
  2. Substrate identity and concentration: these are particularly relevant for IC50 values from competition assays, where the identity and concentration of the substrate being competed with play an important role in determining the results. Ki measures the binding affinity of a ligand to an enzyme and so its values are, in principle, not sensitive to the identity or concentration of the substrate. [My view is that one would generally need to establish that IC50 values had been determined using the same substrate and same substrate concentration if interpreting variation in the IC50 values as "noise" and it's not clear that the substrate-related information needed to establish the comparability of IC50 determinations is currently stored in ChEMBL. If concentrations and Km values are known it may be practical to use the Cheng Prusoff equation (see CP1973) to combine IC50 values measured that have been measured using different concentrations of substrate (or cofactor). It's worth noting that enzyme inhibition studies are commonly run with the substrate concentration at its Km value (see Assay Guidance Manual: Basics of Enzymatic Assays for HTS NBK92007) and there is a good chance that assays against a target using a particular substrate will have been run using very similar concentrations of the substrate. It is important to be specially careful when analysing kinase IC50 data because assays are sometimes run at high ATP concentration in order to simulate intracellular conditions (see GG2021).]
  3. Different assay technologies: since typical biochemical assays do not directly measure ligand–protein binding, the idiosyncrasies of different assay technologies can lead to different results for the same ligand–protein pair. (7) [Significant differences in IC50 (or Ki) values measured for a particular combination of target and compound using different assay read-outs are indicative of interference and I’ll discuss this point in more detail later in the post.]
  4. Mode of action for receptors: EC50 values can correspond to agonism, antagonism, inverse agonism, etc.  [The difficulty here stems from not being able to fully characterize the activity in terms of a concentration response (for example, agonists are characterised by both affinity and efficacy).]
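The Cheng Prusoff conversion mentioned in point 2 can be sketched as follows (for a competitive inhibitor, with substrate concentration and Km expressed in the same units):

```python
# Cheng-Prusoff: Ki = IC50 / (1 + [S]/Km), so IC50 values measured at
# different substrate concentrations can in principle be put on a common
# (Ki) scale when [S] and Km are known.
def ki_from_ic50(ic50: float, substrate_conc: float, km: float) -> float:
    """Cheng-Prusoff conversion for a competitive inhibitor (consistent units)."""
    return ic50 / (1.0 + substrate_conc / km)

# An assay run with [S] = Km (the common practice noted above) gives an IC50
# that is twice the Ki:
print(ki_from_ic50(100.0, substrate_conc=1.0, km=1.0))  # -> 50.0
```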

The situation is further complicated when working with databases like ChEMBL, which curate literature data sets:

  1. Different targets: different variants of the same parent protein are assigned the same target ID in ChEMBL [My view is that one needs to be absolutely certain that assays have been performed using identical (including with respect to post-translational modifications) targets before interpreting differences in IC50 or Ki values as noise or experimental error.] 
  2. Different assay organism or cell types: the target protein may be recombinantly expressed in different cell types (the target ID in ChEMBL is assigned based on the original source of the target), or the assays may be run using different cell types.  [There does appear to be some confusion here and it would not generally be valid to assign a ChEMBL target ID to a cell-based assay.]  
  3. Any data source can contain human errors like transcription errors or reporting incorrect units. These may be present in the original publication─when the authors report the wrong units or include results from other publications with the wrong units─or introduced during the data extraction process.

The authors describe a number of metrics for quantifying compatibility of pairs of assays in the Methods section of LR2024.  My view is that compatibility between assays should be quantified in terms of differences between pIC50 (or pKi) values and I consider correlation-based metrics to be less useful for this purpose. The degree to which pIC50 values for two assays run against a target are correlated reflects the (random) noise in each assay and the range (more accurately the variance) in the pIC50 values measured for all the compounds in each assay.  Let’s consider a couple of scenarios.  First, results from two assays are highly correlated but significantly offset from each other to a consistent extent (the assays might, for example, measure IC50 for a particular target using different substrates). Under this scenario it would be valid to include results from both assays in a single analysis (for example, by using the observed offset between pIC50 values as a correction factor) even though it would not be valid to treat the pIC50 values for compounds in the two assays as equivalent. In the second scenario, the correlation between the assays is limited by the narrowness of the range in the IC50 values measured for the compounds in the two assays. Under this scenario, differences between the pIC50 values measured for each compound can still be used to assess the compatibility of the two assays even though the range in the IC50 values is too narrow for a correlation-based metric to be useful. 
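The first scenario can be illustrated with simulated data: two assays that track each other closely but are offset to a consistent extent, with the offset estimated from the paired results and applied as a correction factor. The pIC50 values below are simulated, not taken from LR2024.

```python
# Two simulated assays with a consistent 0.8 log unit offset (e.g. IC50
# measured against the same target using different substrates).
import numpy as np

rng = np.random.default_rng(2)
true_pIC50 = rng.uniform(5.0, 9.0, size=50)
assay1 = true_pIC50 + rng.normal(scale=0.1, size=50)
assay2 = true_pIC50 - 0.8 + rng.normal(scale=0.1, size=50)

offset = np.median(assay1 - assay2)   # estimate the offset from paired results
corrected = assay2 + offset           # apply it as a correction factor
resid = np.median(np.abs(assay1 - corrected))
print(f"estimated offset: {offset:.2f}")
print(f"median |difference| after correction: {resid:.2f}")
```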

The compatibility between the two assays was measured by comparing pchembl values of overlapping compounds. [The term pchembl does need to be defined.] In addition to plotting the values, a number of metrics were used to quantify the degree of compatibility between assay pairs:

  • R2: the coefficient of determination provides a direct measure of how well the “duplicate” values in the two assays agree with each other. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessing the compatibility of assays in the preceding paragraph.] 
  • Kendall τ: nonparametric measure of how equivalent the rankings of the measurements in the two assays are. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessing the compatibility of assays in the preceding paragraph.]
  • f > 0.3: fraction of the pairs where the difference is above the estimated experimental error. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC50 values is greater than the uncertainty in either pIC50 value (an uncertainty of 0.3 in ΔpIC50 would correspond to an uncertainty of 0.2 in each of the pIC50 values from which the difference had been calculated).]
  • f > 1.0: fraction of the pairs where the difference is more than one log unit. This is an arbitrary limit for a truly meaningful activity difference. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC50 values is greater than the uncertainty in either pIC50 value (an uncertainty of 1.0 in ΔpIC50 would correspond to an uncertainty of 0.7 in each of the pIC50 values from which the difference had been calculated).]
  • κbin: Cohen’s κ calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessing the compatibility of assays in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because the operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
  • MCCbin: Matthews correlation coefficient calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessing the compatibility of assays in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because this operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
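For readers who want to see how metrics of this kind behave, here is a sketch computing them for a single simulated assay pair (the data are synthetic and the binning threshold of 6.5 is an arbitrary choice of mine):

```python
# ΔpIC50-based fractions alongside the correlation-based metrics, computed
# for two simulated assays measuring the same compounds with noise of 0.25
# log units each.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef, r2_score

rng = np.random.default_rng(3)
true_pIC50 = rng.uniform(5.0, 8.0, size=100)
a1 = true_pIC50 + rng.normal(scale=0.25, size=100)
a2 = true_pIC50 + rng.normal(scale=0.25, size=100)

diff = np.abs(a1 - a2)
f03, f10 = np.mean(diff > 0.3), np.mean(diff > 1.0)
r2 = r2_score(a1, a2)
tau = kendalltau(a1, a2)[0]
print(f"f>0.3: {f03:.2f}   f>1.0: {f10:.2f}   R2: {r2:.2f}   tau: {tau:.2f}")

active1, active2 = a1 > 6.5, a2 > 6.5  # binning at an arbitrary threshold
print(f"kappa: {cohen_kappa_score(active1, active2):.2f}   "
      f"MCC: {matthews_corrcoef(active1, active2):.2f}")
```

Note that with a noise of 0.25 log units in each assay the standard deviation of the difference is √2 × 0.25 ≈ 0.35, so a substantial f > 0.3 fraction is expected even for perfectly compatible assays.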

Let’s take a look at some of the results reported in the LR2024 study and it’s interesting that f > 0.3 and f > 1.0 values were comparable for IC50 and Ki measurements. This is an important result since Ki values do not depend on the concentration and Km of the substrate (or cofactor) and I would generally anticipate greater variation in IC50 values measured for each compound-target pair than for the corresponding Ki values. 

We first looked at the variation in the data sets when IC50 assays are combined using “only activity” curation (top panels in Figure 2). The noise level in this case is very high: 64% of the Δpchembl values are greater than 0.3, and 27% are greater than 1.0. The analogous plot for the Ki data sets is shown in Figure S1 in the Supporting Information. The noise level for Ki is comparable: 67% of the Δpchembl values are greater than 0.3, and 30% are greater than 1.0.

I consider it valid to combine data for different pairs of assays for analysis of ΔpIC50 or ΔpKi values. However, I have significant concerns about the validity of combining data for different pairs of assays for analysis of correlations between pIC50 or pKi values. The authors of LR2024 state:  

In Figure 2 and all similar plots in this study, the points are plotted such that the assay on the x-axis has a higher assay_id (this is the assay key in the SQL database, not the assay ChEMBL ID that is more familiar to users of the ChEMBL web interface) in ChEMBL32 than the assay on the y-axis. Given that assay_ids are assigned sequentially in the ChEMBL database, this means that the x-value of each point is most likely from a more recent publication than the y-value. We do not believe that this fact introduces any significant bias into our analysis.

I see two problems (one minor and one major) in preparing data in this manner for plotting and analysis of correlations over a number of assay pairs. The minor problem is that exchanging assay1 with assay2 for some of the assay pairs will generally result in different values for the correlation-based metrics for compatibility of assays. While I don’t anticipate that the differences would be large the value of a correlation-based metric for assay compatibility really shouldn’t depend on the ordering of the assays. Furthermore, the issue can be resolved by symmetrizing the dataset so that each of the pair of assay results for each compound is included both as the x-value and as the y-value. Symmetrizing the dataset in this manner doubles the number of data points and one would need to be careful if estimating confidence intervals for the correlation-based metrics for assay compatibility. I think that it would be appropriate to apply a weight of 0.5 to each data point for estimation of confidence intervals although I would certainly be consulting a statistician before doing this.
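The symmetrization can be sketched as follows for two hypothetical assay pairs pooled into a single correlation calculation (all pIC50 values are invented). The pooled correlation changes when one pair's orientation is swapped, while the symmetrized value does not:

```python
# Two assay pairs pooled for a single correlation calculation.
import numpy as np

p1_x, p1_y = np.array([5.0, 6.0, 7.0]), np.array([5.2, 6.1, 6.8])
p2_x, p2_y = np.array([7.5, 8.0, 8.6]), np.array([6.9, 7.7, 8.8])

def pooled_r(ax, ay, bx, by):
    return np.corrcoef(np.concatenate([ax, bx]), np.concatenate([ay, by]))[0, 1]

r_a = pooled_r(p1_x, p1_y, p2_x, p2_y)  # pair 2 oriented as (x, y)
r_b = pooled_r(p1_x, p1_y, p2_y, p2_x)  # pair 2 swapped: a different value
print(r_a, r_b)

# Symmetrize: include every result pair in both orientations.
sx = np.concatenate([p1_x, p1_y, p2_x, p2_y])
sy = np.concatenate([p1_y, p1_x, p2_y, p2_x])
r_sym = np.corrcoef(sx, sy)[0, 1]       # invariant to per-pair ordering
print(r_sym)
```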

However, there is also another problem (which I don't consider to be minor) with combining data for assay pairs when analysing correlations. The value of a correlation-based metric for assay compatibility reflects the variance in ΔpIC50 (or ΔpKi) values and the variance in the pIC50 (or pKi) values. The variance in pIC50 (or pKi) values when data for different pairs of assays have been combined would generally be expected to be greater than for the datasets corresponding to the individual assay pairs. Under this scenario I believe that it would be accurate to describe the correlation metrics calculated for the aggregated data as inflated (see KM2013 and the comments made therein on the HMO2016, LS2007 and LBH2009 studies) and as a reviewer of the manuscript I would have suggested that the distribution over all assay pairs be shown for each correlation-based assay compatibility metric. When considering correlations between assays it can also be helpful, although not strictly correct, to think in terms of ranges in pIC50 values. For example, the range in pIC50 values for “only activity curation” in Figure 2 appears to be about 7 log units (I’d be extremely surprised if the range in pIC50 values for any of the individual assays even approached this figure). My view is that correlation-based metrics are not meaningful when data for multiple pairs of assays have been combined although I don't think any real harm has been done given that the authors certainly weren't trying to 'talk up' strengths of trends on the basis of the values of the correlation-based metrics. However, there is a scenario under which this type of correlation inflation would be a much bigger problem and that would be when using measures of correlation to compare measured ΔG values with values that had been calculated by free energy perturbation using different reference compounds.
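The inflation effect can be simulated directly: two assay pairs, each with essentially no within-pair correlation, yield a strong pooled correlation purely because their mean pIC50 values differ. The data below are synthetic.

```python
# Two simulated assay pairs with narrow within-pair pIC50 ranges (pure noise)
# but well-separated means; pooling them inflates the correlation.
import numpy as np

rng = np.random.default_rng(4)
noise = lambda: rng.normal(scale=0.3, size=30)

pair1 = (5.5 + noise(), 5.5 + noise())  # centred near pIC50 = 5.5
pair2 = (8.5 + noise(), 8.5 + noise())  # centred near pIC50 = 8.5

r_within = [np.corrcoef(x, y)[0, 1] for x, y in (pair1, pair2)]
pooled_x = np.concatenate([pair1[0], pair2[0]])
pooled_y = np.concatenate([pair1[1], pair2[1]])
r_pooled = np.corrcoef(pooled_x, pooled_y)[0, 1]

print([round(r, 2) for r in r_within])  # weak within each pair
print(round(r_pooled, 2))               # inflated by the between-pair spread
```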

So far in the post the focus has been on the analysis presented in LR2024 and now I’ll change direction by discussing a couple of topics that were not covered in that study. I’ll start by looking at irreversible mechanisms of action and the (S2017 | McW2021 | H12024) articles cover irreversible covalent inhibition (this is the irreversible mechanism of action that ChEMBL users are most likely to encounter). You need two parameters to characterize irreversible covalent inhibition (Ki and kinact respectively quantify the affinity of the ligand for target and the rate at which the non-covalently bound ligand becomes covalently bound to target). While it is common to encounter IC50 values in the literature for irreversible covalent inhibitors these are not true concentration responses because the IC50 values also depend on factors such as pre-incubation time. Another difficulty is that articles reporting IC50 values for irreversible covalent inhibitors don’t always explicitly state that the inhibition is irreversible.
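The two-parameter characterization can be made concrete with the standard expression for the observed rate of inactivation, kobs = kinact × [I] / (KI + [I]); this is a sketch (with KI written for the inhibition constant), not code from any of the cited articles.

```python
# Observed inactivation rate constant for an irreversible covalent inhibitor.
def k_obs(inhibitor_conc: float, k_inact: float, K_I: float) -> float:
    """kobs = kinact * [I] / (KI + [I]), units consistent with the inputs."""
    return k_inact * inhibitor_conc / (K_I + inhibitor_conc)

# At [I] = KI, kobs is half of kinact:
print(k_obs(1.0, k_inact=0.01, K_I=1.0))  # -> 0.005
```

The time dependence implied by kobs is the reason that an IC50 for an irreversible inhibitor varies with pre-incubation time and is not a true concentration response.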

As the authors of LR2024 correctly note, differences between IC50 values may be the result of using different assay technologies. Interference with assay read-out (I categorized this as “type 1 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial) should always be considered as a potential explanation for significant differences between IC50 values measured for a given combination of target and inhibitor when different assay technologies are used. An article that I recommend for learning more about this problem is SWK2009 which explains how UV/Vis absorption and fluorescence by inhibitors can cause interference with assay read-outs (the study also shows how interference can be assessed and even corrected for). When examining differences between IC50 values for the same combination of target and inhibitor it's worth bearing in mind that interference with assay read-outs tends to be more of an issue at high concentration (this is why biophysical assays tend to be favored for screening fragments). From the data analysis perspective, it’s usually safe to assume that enzyme inhibition assays using the same substrate also use the same type of assay read-out.

Differences in the technology used to prepare the solutions for assays are another potential cause of variation in IC50 values. For example, a 2010 AstraZeneca patent (US7718653B2) disclosed significant differences in IC50 values depending on whether acoustic dispensing or serial dilution was used for preparation of solutions for assay. Compounds were observed to be more potent when acoustic dispensing was used and the differences in IC50 values point to an aqueous solubility issue. The data in US7718653B2 formed the basis for the EOW2013 study.

So that brings us to the end of my review of the LR2024 study and I’ll be doing a follow up post later in the year. One big difficulty in analysing differences between measured quantities is determining the extent to which measured values are directly comparable when IC50 can be influenced by factors such as the technology used to prepare assay solutions. Something that I think would have been worth investigating is the extent to which variability of measured values depends on potency (pIC50 measurements might be inherently more variable for less potent inhibitors than for highly potent inhibitors). The most serious criticism that I would make of LR2024 is that it is not meaningful to combine data for different pairs of assays when calculating correlation-based measures of assay compatibility.

Tuesday, 30 July 2024

A Nobel for property-based drug design?

[This post was updated on 04-Aug-2024. I thank Tim Ritchie (see RM2009 | RM2014) for bringing YG2003 (Prediction of Aqueous Solubility of Organic Compounds by Topological Descriptors) to my attention.]

"The problems of ADME are precisely those that determine success or failure of a drug in vivo. In vitro data can give a clearer picture of the receptor characteristics, but knowledge and control of ADME are also vital. A common trap in binding studies is that binding generally increases with lipophilicity, so that one may obtain extremely potent binding that is totally unattainable in vivo."

SH Unger (1987) Computer-Aided Drug Design in the Year 2000. 
Drug Information Journal 21:267-275 DOI
******************************************

In this post I’ll be reviewing an Editorial (Property-Based Drug Design Merits a Nobel Prize) that was recently published in J Med Chem. For me, the Editorial raises questions about the critical thinking skills of its authors and of the judgement of the J Med Chem Editors (I’m guessing that some of the courteous and cultured members of the Nobel Prize committee might regard it to be somewhat pushy, and possibly even uncouth, for journals to be publishing nominations for Nobel Prizes as editorials). My advice to anybody nominating individuals for a Nobel Prize is to be aware of an observation, usually attributed to Jocelyn Bell Burnell, that it’s better that people ask why you didn’t win a Nobel Prize than why you did. Where applicable, I've used the same reference numbers that were used in the Editorial and I’ll start by reproducing the Nobel Prize proposal (as is usual in posts at Molecular Design, I’ve inserted some comments, italicized in red and enclosed in square brackets, into the quoted text):
We propose that a Nobel Prize in Physiology or Medicine should be awarded for property-based drug design, with Christopher A. Lipinski, Paul D. Leeson, and Frank Lovering as the proposed recipients for their development of “important principles for drug design” [I would describe what the proposed Nobel laureates have introduced as a rule, a metric and a molecular descriptor rather than principles.], principles that have contributed to the development of numerous approved drugs. [The authors do need to provide convincing evidence to support what appear to be some wildly extravagant claims. Specifically, the authors need to demonstrate that the rule, metric and molecular descriptor (which they describe as “principles”) were actually critical to the decision-making in projects that led to the development of numerous drugs.] While drug design previously focused primarily on optimizing potency, they introduced a more holistic approach based on the consideration of how fundamental molecular and physicochemical properties affect pharmaceutical, pharmacodynamic, pharmacokinetic, and safety properties. [My view is that none of the proposed Nobel laureates even demonstrated a single convincing link between molecular and physicochemical properties, and pharmaceutical, pharmacodynamic, pharmacokinetic, and safety properties.] The development of the Rof5 by Christopher A. Lipinski in 1997 introduced a new principle for how molecular and physicochemical properties affect oral bioavailability. The development of LipE by Paul D. Leeson in 2007 introduced a new principle for how physicochemical properties impact potency, selectivity, and safety. Finally, the development of Fsp3 by Frank Lovering in 2009 introduced a new principle for how molecular shape affects pharmaceutical properties and developability.

Before examining the contributions of the three nominated individuals it's worth saying something about the objectives of drug design. First, a drug needs to be highly active against its target(s). Second, activity against anti-targets should be very low (ideally too low to even be measured). Third, as I note in 34, the exposure (concentration at the site of action) of the drug needs to be controllable (one challenge in drug design is that intracellular drug concentration can’t generally be measured in vivo and I recommend that all drug discovery scientists read SR2019). I see controlling exposure as the primary focus of property-based design and one fundamental challenge is that structural modifications that lead to increased engagement potential for the therapeutic target(s) frequently result in reduced controllability of exposure as well as increased engagement potential for anti-targets. I’ve tried to capture these points in the graphic shown below.


It's generally accepted that excessive lipophilicity and molecular size are risk factors in drug design and the “compound quality” (CQ) literature abounds with fire-and-brimstone sermons on the evils of "molecular obesity" (see H2011). Nevertheless, the relationships between these descriptors and properties such as binding affinity for anti-targets, permeability, aqueous solubility and metabolic lability are generally not quite as strong as is commonly believed (or claimed). When using trends in data to inform design it’s really important to know how strong the trends are because this tells you how much weight to give to the trends when making decisions. It’s not unknown in CQ studies for trends in data to be made to appear to be stronger than they actually are which endows the CQ field with what I’ll politely call a “whiff of the pasture” (the term “correlation inflation” has been used; see KM2013). Transformation of continuous data (IC50 values) to categorical data (high | medium | low) prior to analysis should trigger a deafening cacophony of alarm bells as should any averaging of groups of continuous data values without showing the spread in the data values. Some examples of studies in which I consider the strengths of trends to have been exaggerated include 29, 35, HMO2016 and HY2010.

I think that one thing that everybody who actually works (or has worked) on drug discovery projects agrees on is that drug discovery is really difficult. My view is that, by focusing on Rof5, LipE and Fsp3, the Editorial actually trivializes the challenges faced by drug discovery scientists. Most drug design (as opposed to ligand design) takes place during lead optimization and lead optimization teams are typically addressing specific problems (for example, structural changes that result in increased potency also result in reduced aqueous solubility).  Lead optimization teams typically work with a lot of measured data (a significant component of drug design is efficient generation of data to enable decision-making) and a weak correlation between logP and aqueous solubility reported in the literature would be of no practical relevance when the lead optimization team is using aqueous solubility measurements for compounds in the structural series that they’re optimizing. It is common (see M2001 | G2008) for the simplicity of rules, guidelines and metrics to be touted and we noted in KM2013 that:   

Given that drug discovery would appear to be anything but simple, the simplicity of a drug-likeness model could actually be taken as evidence for its irrelevance to drug discovery.

Guidelines for successful drug discovery are often presented in terms of something good (or bad) being more likely to happen when the value of a calculated property such as Fsp3 exceeds a threshold. When using guidelines like these be aware that it’s actually very difficult to set these threshold values objectively and that the guidelines would have been stated in an identical manner had different threshold values been chosen to specify them. One difficulty with using guidelines like these is that the creators of the guidelines don’t usually say what they mean by “more likely” (millions of people book flights knowing that one is “more likely” to die in a plane crash if one takes a flight than if one doesn’t take a flight). A number of published guidelines (some of which have been referenced in the Editorial) claim that compounds that comply with the guidelines are more likely to be developable. However, giving weight to these claims would require that developability be defined in an objective manner that enables compounds with arbitrary molecular structures and differing biological activity to be meaningfully compared.   

I’ll examine the contributions of the three proposed laureates for the Nobel Prize in Physiology or Medicine following the order in the Editorial. Let's start with the first:
  
The development of the Rof5 by Christopher A. Lipinski in 1997 introduced a new principle for how molecular and physicochemical properties affect oral bioavailability. [As a reviewer of the manuscript I would have pressed the authors to explicitly state the new principle that their first nominee for the Nobel Prize for Physiology or Medicine had introduced in 1997.]

My view is that the publication of the Rof5 (22) has certainly proven to be highly influential in that it made many drug discovery scientists aware of the need to take account of physicochemical properties, in particular lipophilicity, in drug design. What is less well-known, but possibly more important in my view, is that publication of the Rof5 sent a clear message to Pharma/Biotech management that high-throughput screening wasn’t going to be the panacea that many believed that it would be. However, I don't see the Rof5 as quite the epiphany that the authors of the Editorial would have us believe it to be. The quote with which I started this post was taken from an article that had been published ten years before 22 and the inverse nature of the relationship between aqueous solubility and lipophilicity was being discussed in the scientific literature (see YV1980) more than forty years ago. The NC1996 study is also worthy of mention because it was published more than a year before 22 and it makes the important point that optimal logP values are likely to vary with chemotype ("each congeneric series for a drug backbone usually demonstrates its own optimal log P").       

Questions can be raised about the data analysis presented in support of the Rof5 and readers may find it helpful to take a look at the S2019 study as well as my comments on the Rof5 in HBD3 and in this post. I would argue that the Rof5 does not have any practical value as a drug design tool and I would challenge the assertion made in the Editorial that the publication of 22 demonstrated how “molecular and physicochemical properties affect oral bioavailability”. One aspect of the analysis presented (22) in support of the Rof5 that isn't always fully appreciated is that the compounds for which the descriptors are calculated were all treated as having equivalent oral bioavailability (compounds were selected for the analysis on the basis of having been taken into phase 2 clinical trials at some point before the Rof5 had been published in 1997). This is one reason that it’s not credible to assert that the analysis demonstrates that these molecular and physicochemical properties are linked to bioavailability (it must be stressed that, like many, I do actually believe that excessive lipophilicity and molecular size are risk factors in drug design). I make the following point in a blog post (I’ve modified the original text very slightly for consistency with the Editorial):

The Rof5 is stated in terms of likelihood of poor absorption or permeation although no measured oral absorption or permeability data are given in 22 and the Rof5 should therefore be regarded as a statement of belief. I realise that to make such an assertion runs the risk of an appointment with the auto-da-fé and I stress that had the Rof5 been stated in terms of physicochemical and molecular property distributions I would not have made the assertion.

To see what I was getting at let’s take a look at how the Rof5 was stated in 22 (“The ‘rule of 5’ states that: poor absorption or permeation are more likely when…”). However, the analysis presented in support of the Rof5 was of the distribution of compounds in chemical space defined by molecular weight, logP and numbers of hydrogen bond donors and acceptors with no account being taken of variation in either absorption or permeation for the compounds. Analysis like this can be informative but you need to demonstrate that the chemical space is actually relevant to the phenomena of interest. One way that you can demonstrate that a chemical space is relevant is to build predictive models for the phenomena of interest using only the dimensions of the chemical space as descriptors. Alternatively you might observe meaningful differences between the distributions in the chemical space for compounds that have respectively passed and failed at a particular stage in clinical development.

So that’s all that I’ll be saying about Rof5 and it’s time to take a look at the contributions of the second proposed Nobel Laureate:

The development of LipE by Paul D. Leeson in 2007 introduced a new principle for how physicochemical properties impact potency, selectivity, and safety. [As a reviewer of the manuscript I would have pressed the authors to explicitly state the new principle that their second nominee for the Nobel Prize for Physiology or Medicine had introduced in 2007.]

I'll start by saying that LipE is a simple mathematical formula (potency, expressed as pIC50, offset by lipophilicity: LipE = pIC50 − logP) and I suggest that one shouldn't confuse simple mathematical formulae with principles when nominating people for Nobel Prizes. There are, however, other errors and these are not the kind of errors that you can afford to make when nominating people for Nobel Prizes. First, the term used in 29 is actually “ligand-lipophilicity efficiency” (LLE) although this appears to have mutated to “lipophilic ligand efficiency” (also LLE) by 2014 (see H2014). The term “LipE” was actually introduced by Pfizer scientists (see R2009) and it is significant that the more recent J2018 article defines LipE in terms of logD rather than logP (doing so means that you can make compounds more efficient simply by increasing extent of ionization and, as a drug design tactic, this is likely to end about as well as things did for the Sixth Army at Stalingrad).
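To underline just how simple the formula is, here’s a minimal sketch (the pIC50 and logP values are hypothetical, chosen purely for illustration):

```python
def lipe(pIC50: float, logP: float) -> float:
    """Lipophilic ligand efficiency: potency (pIC50) offset by lipophilicity (logP)."""
    return pIC50 - logP

# Hypothetical compound: 10 nM potency (pIC50 = 8) and logP = 3
print(lipe(8.0, 3.0))  # → 5.0
```

Note that a 10 nM compound with logP = 3 and a 1 μM compound with logP = 1 have identical LipE values, which is exactly the kind of trade-off the metric is intended to expose.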

The second (and more serious from the perspective of a Nobel nomination) error is that the metric had already been discussed, although not named, in the literature when 29 was published (I’m guessing that a suggestion that naming a metric merits a Nobel Prize for Physiology or Medicine might cause some members of the Nobel Prize committee to choke on their surströmming).  The L2006 book chapter, published fifteen months before 29, states:

Thus, to achieve compounds with a not too high log P while still retaining potency, the difference between the log potency and the log D can be utilised.

From the A2007 perspective, which was published three months before 29:

Lipophilicity is thought to be a driving force for binding to anti-targets such as the hERG ion channel and cytochrome p450 enzymes and potency can be scaled by lipophilicity by subtracting measured or calculated 1-octanol water partition coefficients from pIC50.

It might be helpful to say something about efficiency metrics since LipE (or LLE if you prefer) is an example of an efficiency metric. The idea behind efficiency metrics is to “normalize” a compound’s activity (typically quantified by potency or affinity) by the value of a risk factor such as lipophilicity or molecular size (for the masochists among you there’s an entire section in 34 on normalization of binding affinity). Ligand efficiency (LE) was introduced in 2004 (see H2004) and is generally regarded as the original efficiency metric although its creators do acknowledge the influence of the K1999 study. I’ve argued at length in 34 (Table 1 and Figure 1 in the article capture the essence of the argument) that LE is physically meaningless because perception of efficiency changes if you use a different concentration to define the standard state (by convention ΔGbinding values correspond to an arbitrary 1 M standard concentration) and there is no way to objectively select any particular value of the standard concentration for calculation of LE.  The problem doesn’t go away if you try to define ligand efficiency in terms of logarithmically expressed values of IC50, Ki or Kd instead of ΔGbinding because these quantities still have to be divided by an arbitrary concentration value in order to be expressed as logarithms (see M2011).  My view is that LE shouldn't even be described as a metric and I sometimes appropriate a quote ("it's not even wrong") that is usually attributed to Pauli because those who advocate the use of LE in drug design are unable (or unwilling) to say what it measures.
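The standard-state argument is easy to demonstrate numerically. The sketch below (hypothetical compounds and an arbitrary alternative standard concentration, chosen only to make the point) shows that the LE ranking of two compounds can flip simply by changing the concentration used to define the standard state:

```python
import math

RT = 0.593  # kcal/mol at 298 K

def ligand_efficiency(Kd_molar: float, n_heavy: int, C_std: float = 1.0) -> float:
    """LE = -ΔGbinding / N_heavy with ΔGbinding = RT·ln(Kd / C°).

    C_std is the standard concentration (mol/L); the conventional, but
    arbitrary, choice is 1 M.
    """
    dG = RT * math.log(Kd_molar / C_std)
    return -dG / n_heavy

# Hypothetical compounds: A (Kd = 1 nM, 20 heavy atoms), B (Kd = 1 μM, 10 heavy atoms)
for C_std in (1.0, 1e-10):
    le_A = ligand_efficiency(1e-9, 20, C_std)
    le_B = ligand_efficiency(1e-6, 10, C_std)
    print(f"C° = {C_std:g} M: A more efficient than B? {le_A > le_B}")
```

With the conventional 1 M standard state the smaller compound B looks more efficient, but with a 10⁻¹⁰ M standard state the ranking reverses; since neither choice of C° can be justified objectively, neither ranking can be.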

The meaninglessness of LE stems from it being defined by scaling ΔGbinding by the design risk factor (molecular size). In contrast, LipE is defined by offsetting pIC50 by the risk factor (logP) and can be interpreted (see 34) as the energetic cost of moving the ligand from octanol to its target binding site (this interpretation is only valid when the ligand binds in its neutral form and is predominantly neutral in the aqueous phase).  When considering lipophilicity in property-based design it is important to be aware that octanol is an arbitrary choice of solvent for measurement of partition coefficients and that the logP (or logD) calculated for a compound may differ significantly depending on the algorithm used for the calculations. That said, the hydrogen bond donors/acceptors and ionizable groups tend to be relatively conserved within structural series which means that the details of exactly how lipophilicity is quantified are likely to be less critical in lead optimization than for structurally-diverse sets of compounds.

When we use LipE we’re actually assuming that logP (or logD) is predictive of properties such as aqueous solubility, affinity for anti-targets and metabolic lability. That is why it’s not accurate to state that the introduction of LipE showed how “physicochemical properties impact potency, selectivity, and safety”.  In some published studies the focus is less on the LipE metric and more on what might be called the "lipophilic efficiency concept" (aim for the top left corner of a plot of potency against lipophilicity). It is common to add reference lines of constant LipE to plots of potency against lipophilicity in this type of analysis and if you're doing this you really should be citing R2009 rather than 29.

I'll finish the commentary on LipE (or LLE if you prefer) with this statement made in the Editorial:

Emerging from an analysis of approved drugs, this rubric predicts a compound is more likely to be clinically developable when LipE > 5. [I don’t know what the authors of the Editorial mean by “rubric” (I'm not even sure that they do) but as a reviewer of the manuscript I would have pressed them to justify their claim. Specifically I would have been looking for a literature reference (for me, the choice of the word “emerging” does rather conjure up an image of hot gases and stoned priestesses at Delphi) and a coherent explanation for why a value of 5 yields a better rubric than values of 4 or 6.]

That’s all that I’ll be saying about LipE (or LLE if you prefer) and it’s time to take a look at the contributions of the third nominee for the Nobel Prize in Physiology or Medicine:

Finally, the development of Fsp3 by Frank Lovering in 2009 introduced a new principle for how molecular shape affects pharmaceutical properties and developability. [As a reviewer of the manuscript I would have pressed the authors to explicitly state the new principle that their third nominee for the Nobel Prize for Physiology or Medicine had introduced in 2009. My view is that Fsp3 is a thoroughly unconvincing descriptor of molecular shape and I suggest readers consider the suggestion that cyclohexane (Fsp3 = 1) would have a better shape match with benzene (Fsp3 = 0) than with either methane (Fsp3 = 1) or adamantane (Fsp3 = 1).]

[04-Aug-2024 update: The Fsp3 descriptor had actually been used as i_ali in the YG2003 study (Prediction of Aqueous Solubility of Organic Compounds by Topological Descriptors) six years before the publication of 35:

The aliphatic indicator of a molecule (i_ali) is equal to the number of sp3 carbons divided by the total number of carbon atoms in the molecule.

The YG2003 study discussed prediction of aqueous solubility using i_ali (renamed as Fsp3 in 35) in conjunction with other topological descriptors. In contrast with the claims made in 35 for Fsp3, the YG2003 study made no suggestion that i_ali was a highly effective predictor of aqueous solubility when used by itself.]
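The i_ali/Fsp3 definition quoted above is simple enough to sketch directly. The function below takes an illustrative per-carbon hybridization list (in practice a cheminformatics toolkit would assign hybridizations from the molecular structure; the example molecules are my own choices):

```python
def fsp3(carbon_hybridizations: list) -> float:
    """i_ali / Fsp3: number of sp3 carbons divided by total number of carbons.

    Input is a list with one entry ('sp', 'sp2' or 'sp3') per carbon atom.
    """
    n_sp3 = sum(1 for h in carbon_hybridizations if h == "sp3")
    return n_sp3 / len(carbon_hybridizations)

print(fsp3(["sp3"] * 6))                 # cyclohexane → 1.0
print(fsp3(["sp2"] * 6))                 # benzene → 0.0
print(fsp3(["sp2"] * 6 + ["sp3"] * 4))   # n-butylbenzene → 0.4
```

The cyclohexane/benzene/methane comparison earlier in this post can be reproduced the same way, which makes the descriptor’s insensitivity to actual molecular shape rather obvious.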

Before discussing the contributions of the third nominee for the Nobel Prize for Physiology or Medicine I should stress that I certainly consider gratuitous use of aromatic rings to be a very bad thing in drug design (it was the data analysis in 35 that was criticized in KM2013 but not the eminently sensible suggestion that drug designers should look beyond what the authors referred to as ‘Flatland’). Having sp3 carbon atoms in a scaffold provides drug designers with a wider range of options for placement of substituents than would be the case for a fully aromatic scaffold and we stated in KM2013 that:   

One limitation of aromatic rings as components of drug molecules is that some regions above and below the plane defined by the atomic nuclear positions are not directly accessible to substituents. Molecular recognition considerations suggest a focus on achieving axial substitution in saturated rings with minimal steric footprint, for example by exploiting the anomeric effect or by substituting N-acylated cyclic amines at C2. 

My view is that deleterious effects of aromatic rings on aqueous solubility would be more plausibly explained by molecular interactions stabilizing the solid state than in terms of molecular shape (this point is discussed in more detail in HBD3). I also see saturated ring systems such as bicyclo[1.1.1]pentane and cubane as potentially more resistant to metabolism than benzene. 

There’s one point that I need to make before discussing 35 from the data analysis perspective which is that molecular structures with basic nitrogen atoms tend to have higher Fsp3 values than molecular structures that lack basic nitrogen atoms (see L2013). This means that you can’t tell whether the benefits of higher Fsp3 values are actually caused by the higher Fsp3 values or by the presence of basic nitrogen atoms.

The Editorial states:

Stemming from an analysis of discovery compounds, investigational drugs, and approved drugs, Fsp3 predicts a discovery compound is more likely to become a drug when Fsp3 > 0.40. 

It’s not clear (at least to me) where the figure of 0.40 comes from and I would argue that compound X (IC50 against therapeutic target = 50 μM; Fsp3 = 0.80) would actually be less likely to become a drug than compound Y (IC50 against therapeutic target = 10 nM; Fsp3 = 0.20). I’m assuming that what the Editorial refers to as “analysis of discovery compounds, investigational drugs, and approved drugs” is what is shown by Figure 3 in 35. Presenting data in this manner hides the variation in Fsp3 for the compounds at each stage of development and makes the trends look much stronger than they actually are (this is verboten according to current J Med Chem author guidelines). I would challenge the suggestion that what is shown in Figure 3 in 35 can be used to calculate the probability that an arbitrary compound will become a drug (my view is that it’s not feasible to even define the probability that a compound will become a drug in a meaningful manner). Analyses of success in clinical development are generally more convincing when comparisons are made between compounds that pass or fail in individual phases of clinical development than between compounds in different phases of clinical development.

The Editorial continues:        

This observation was ascribed to increased Fsp3 leading to increased aqueous solubility, a critical physiochemical property for successful drug discovery.

I’m assuming that what the Editorial refers to as “increased Fsp3 leading to increased aqueous solubility” is the trend shown by Figure 5 of 35 (this featured prominently in the KM2013 correlation inflation article) which claims to show the relationship between Fsp3 and log S (aqueous solubility expressed as a logarithm).  This claim is not accurate because the log S values have been binned and the relationship is actually between centre point of bin and mean log S value for bin. The authors of 35 used public domain aqueous solubility data for their analysis and we showed (KM2013; see Figure 5) that the Pearson correlation coefficient for the relationship between log S and Fsp3 is only 0.25 (the corresponding value for the binned data is 0.97).  I consider the suggestion that such a weak correlation could have any relevance whatsoever to the likelihood of success in clinical trials to be wild and uninformed conjecture.
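The inflationary effect of binning is easy to reproduce with synthetic data (the numbers below are simulated, not real solubility measurements, and the noise level was chosen only to illustrate the effect):

```python
import random
import statistics

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Weak underlying trend (slope 1) buried in large scatter
x = [random.random() for _ in range(2000)]
y = [xi + random.gauss(0.0, 2.0) for xi in x]
r_raw = pearson(x, y)

# Bin by x, then correlate bin centres with the bin means of y
n_bins = 10
bins = [[] for _ in range(n_bins)]
for xi, yi in zip(x, y):
    bins[min(int(xi * n_bins), n_bins - 1)].append(yi)
centres = [(i + 0.5) / n_bins for i in range(n_bins)]
means = [statistics.mean(b) for b in bins]
r_binned = pearson(centres, means)

print(f"raw r = {r_raw:.2f}, binned r = {r_binned:.2f}")
```

Averaging within bins strips out the scatter, so a correlation that is weak for the individual data points looks impressively strong for the bin means, which is precisely why data analysis of this sort should trigger the alarm bells mentioned earlier in this post.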

I'll finish my commentary on Fsp3 by reproducing this claim made in the Editorial:

Much like the Rof5 and LipE, Fsp3 has proven to be enduringly useful for the design of compounds with improved chances of clinical success. (37) [My view is there is insufficient evidence to justify this claim and I'm perplexed by the citation of 37. In any case, members of the Nobel committee are likely to focus more on whether or not Fsp3 is usefully predictive than on the endurance of this molecular descriptor.]  

It’s now time to summarise what has been a long and at times pedantic blog post, and I thank all readers who’ve stayed with me. I don’t consider any of the three studies (22 | 29 | 35) that form the basis of the Nobel Prize nomination to have reported significant scientific discoveries and I would also challenge the claim made in the Editorial that these studies introduced new principles. I’m aware that 22 is heavily cited and I certainly agree that it is common to see values of LipE and Fsp3 quoted in the drug discovery literature. Nevertheless, I would argue that the Editorial failed to provide even a single convincing example of the Rof5, LipE or Fsp3 making a critical contribution to the discovery of a marketed drug (this should be quite sufficient to rule out the award of a share in the Nobel Prize for Physiology or Medicine to any of these nominees). Furthermore, the Editorial doesn’t provide any convincing evidence that the Rof5, LipE or Fsp3 are usefully predictive in drug discovery projects.

Aside from the failure of the Editorial to demonstrate significant impact for the Rof5, LipE and Fsp3, I do have some scientific concerns about this Nobel Prize nomination. First, the Rof5 is not actually supported by data. Second, LipE had already been discussed, although not named, in the drug discovery literature when 29 was published. Third, Fsp3 had been used previously (as i_ali) for aqueous solubility prediction and the data analysis in 35 would fail to comply with current J Med Chem author guidelines.