To start the post I'll share a photo that I took in 2012 of incense sticks at the Truc Lam pagoda near Da Lat. Not long after taking this photo I lost a lens cap (although thankfully not the lens) riding a luge through a forest and would later visit a cricket farm (this was particularly welcome because I had developed a taste for fried crickets during a visit to Cambodia in 2005).
I’ll be reviewing A2025 (Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery) in this post. I consider the issues addressed by the authors to be extremely important and I think that the credibility of the Machine Learning (ML) field would be greatly enhanced if Editors declared words like 'outperform' to be verboten in manuscripts submitted to their journals. However, I will make a couple of criticisms of the study. First, ML modellers need to properly account for the number of adjustable parameters used to fit training data (the S2006 study goes further than this by arguing that one should also account for the size of the descriptor pool). Second, ML modellers need to recognize that cross-validation can make optimistic assessments of model quality when there is a high degree of clustering in training data. I’ll point you toward earlier Molecular Design blog posts (Sep2024 | Oct2024 | Jul2025) that may be relevant to the discussion. As is usual for posts here at Molecular Design, quoted text is indented with my comments italicised in red.
The ML models that form the focus of the A2025 study aim to predict properties (more generally behaviour) of compounds from their chemical structures. Although there is currently a lot of hype around ML models for drug discovery it’s worth bearing in mind that people have been building quantitative structure-activity/property (QSAR/QSPR) models for decades (the inaugural EuroQSAR conference was held in Prague a mere five years after Czechoslovakia had been invaded by forces from the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). As I see it, QSAR/QSPR approaches never really made much of a splash in real world drug discovery and my challenge to those who tout ML models as a panacea for the ills of Pharma/Biotech would be to ask why they think it’s going to be any different this time.
One of the difficulties that QSAR/QSPR practitioners faced when working within drug discovery project teams was that projects had often delivered (or had been put out of their misery) by the time there was enough data to build predictively useful models. It’s also worth pointing out that drug discovery teams have frequently delivered (and continue to deliver) clinical development candidates without ever having sufficient data for building usefully predictive QSAR/QSPR models. Something that many QSAR/QSPR practitioners never seemed to get is that much drug design is actually hypothesis-driven (I discussed this point 16 years ago in K2009 and I’ll point you to the P2012 article by former colleagues). A significant part of hypothesis-driven drug design is identification of exploitable features in structure activity/property relationships (SARs/SPRs) such as activity cliffs and instances of increased polarity not resulting in loss of potency. A simple plot of potency against lipophilicity might not be predictively useful but it can still be used to quantify the extent to which the potency of a compound beats the trend in the data (see ‘Alternatives to ligand efficiency for normalization of affinity’ section in NoLE). My view is that hypothesis-driven drug design actually fits very naturally into an AI framework and those who tout AI as a drug design panacea appear to be missing a trick by seeing drug design as essentially an exercise in prediction.
Many of the properties of compounds of interest to ML modellers in drug discovery can be modelled as if they are equilibrium constants or rate constants (continuous-valued, dimensioned quantities) and typically fall into three general categories:
- In vitro bioactivity is usually quantified in terms of potency (the concentration at which a compound exhibits a specified effect in a bioactivity assay) and, despite the views expressed in a rather bizarre JMC Editorial (a recent JMC Perspective provides a useful counterview and this blog post is also relevant), is the most important of the properties because you can’t compensate for inadequate potency by increasing the quality of compounds or by making them more beautiful (see B2012) and I touch on this point in a recent blog post. It is important that ML modellers be aware that for some ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation the effect of a compound on the target depends on time as well as concentration. I discuss some of the issues that you need to think about when combining potency and affinity data for ML modelling of bioactivity in this blog post.
- Properties considered to be relevant to ADME (absorption, distribution, metabolism, and excretion) include lipophilicity, aqueous solubility, permeability (both passive and active efflux) and plasma protein binding. While these properties are often described collectively as a compound's 'ADME profile', it's not actually accurate to do so because the ADME acronym refers to the behaviour of compounds in vivo. Lipophilicity is the single most fundamental physicochemical property in drug design and it’s very important that ML modellers be aware that it's log D, rather than log P, that is measured and that the choice of octanol/water for log D measurement is entirely arbitrary.
- Toxicity is typically assessed by measuring potency against anti-targets such as hERG and CYPs, and cell-based assays are also often used for assessment of toxicity. Generally it is more difficult to find suitable assay data for ML modelling of toxicity than is the case for modelling bioactivity against potential therapeutic targets. One reason for this is that responses in the cell-based assays commonly used to assess toxicity can't generally be linked to engagement of specific anti-targets (this is not to deny the value of the information provided by these assays for decision-making by drug discovery scientists). Furthermore, observations of potency in toxicity assays are likely to steer project teams away from the associated chemotypes and so it is very unlikely that ML modellers will encounter datasets for individual structural series with sufficient variance for building models.
When modelling properties of compounds that you believe to be relevant to small molecule drug discovery it’s important to bear in mind that even with a complete set of measured properties available it’s not generally feasible to predict what will happen when compounds are dosed in vivo. One reason for this is that the therapeutic (and adverse) effects of a drug are driven by its concentration at its site(s) of action, which is a time-dependent quantity that cannot generally be measured in live humans. I argue in NoLE that the objective of the ADME-based aspects of drug design is actually to achieve controllability of exposure and one article that I recommend to all drug discovery scientists and chemical biologists is SR2019 (Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise).
A number of assays are available for measuring properties of interest in drug discovery and management of the ‘assay budget’ for projects is an important activity in drug discovery (especially when running assays is an outsourced activity). Drug discovery scientists typically use assays to identify and address specific design issues such as low solubility or unacceptable binding affinity for anti-targets.
In vitro assays used in drug discovery are generally configured for decision-making, rather than for building ML models, and in some cases what some might refer to as the ‘quality’ of the assay might be traded off against throughput (this doesn’t mean that the assays are somehow ‘bad’). In vitro drug discovery assays generally have both lower and upper quantitation limits and an assay’s dynamic range (you can draw an analogy between assays and analytical instruments) is given by the difference between the two values. Needless to say, it is very important that ML modellers be fully aware of the lower and upper quantitation limits of the assays used to generate the data from which they will build models. This generally requires careful examination of assay details which might not have been captured by the curation processes used for databases such as ChEMBL (nor even been disclosed in the original publications). For example, the maximum potency that can be quantified in a conventional enzyme inhibition assay is limited by the concentration of enzyme in the assay (see WM1979) and you’ll still need a 5 nM concentration of a picomolar inhibitor to achieve 50% inhibition of enzyme that is present in the assay at a concentration of 10 nM. I generally advise ML modellers to carefully examine the distributions in the datasets that they are modelling for evidence of cut-offs that might indicate quantitation limits in the assays used to generate the data (a minimal example of this kind of check is sketched below).
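As an illustration of that last piece of advice, here’s a minimal sketch (my own, not something taken from A2025) of how one might flag suspicious pile-ups of values at the edges of a potency distribution; the synthetic data and the 5% threshold are arbitrary choices made purely for the example.

```python
import numpy as np
import pandas as pd

def flag_possible_quantitation_limits(values, frac_threshold=0.05):
    """Flag pile-ups of identical values at the edges of a distribution.

    A large fraction of measurements sitting exactly at the minimum or
    maximum observed value is often a sign that an assay quantitation
    limit has been reached (censored results reported at the limit)
    rather than a genuine feature of the data.
    """
    values = pd.Series(values).dropna()
    n = len(values)
    counts = values.value_counts()
    report = {}
    for label, edge in (("lower_edge", values.min()), ("upper_edge", values.max())):
        frac = counts.get(edge, 0) / n
        report[label] = {"value": float(edge), "fraction": round(float(frac), 3),
                         "suspicious": frac >= frac_threshold}
    return report

# Hypothetical pIC50 data in which results below the lower quantitation
# limit (top assay concentration of 10 uM, i.e. pIC50 = 5) are pinned at 5.0
rng = np.random.default_rng(0)
raw = rng.normal(6.0, 1.0, 500)
pIC50 = np.where(raw < 5.0, 5.0, raw)
print(flag_possible_quantitation_limits(pIC50))
```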
The effects of a drug in vivo are typically driven by its unbound concentration in plasma, and assays for properties of interest in drug discovery are generally run in buffered aqueous media. It is well-known that measured values for physicochemical properties such as log D and aqueous solubility generally vary with pH for compounds with ionizable groups in their chemical structures. However, values measured for these properties can, in some scenarios, also depend on both the nature and concentration of counter-ion(s). This becomes an issue for log D measurement in cases where significant proportions of compounds are present in the organic phase in ionized forms and for aqueous solubility measurement when the measured value is limited by the solubility of a salt form (as opposed to the neutral form). Dependence of measured property values on the nature and concentration of counter-ions is likely to be more of an issue when the degree of ionization (in aqueous media) is relatively high and my default advice is to consider pKa when models underpredict log D or overpredict aqueous solubility values.
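To make the pH dependence concrete, here’s a back-of-the-envelope sketch based on the textbook approximation that only the neutral form of a monoprotic acid or base partitions into the organic phase (my own illustration rather than anything from A2025); as noted above, real measurements can deviate from this, not least when ion-pairing becomes significant.

```python
import numpy as np

def logD_monoprotic(logP, pKa, pH, acid=True):
    """Approximate log D assuming only the neutral species partitions.

    Acid:  log D = log P - log10(1 + 10**(pH - pKa))
    Base:  log D = log P - log10(1 + 10**(pKa - pH))
    """
    delta = (pH - pKa) if acid else (pKa - pH)
    return logP - np.log10(1.0 + 10.0 ** delta)

# A hypothetical carboxylic acid with log P = 3.0 and pKa = 4.2
for pH in (1.0, 4.2, 7.4):
    print(f"pH {pH}: log D = {logD_monoprotic(3.0, 4.2, pH, acid=True):.2f}")
```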
Before addressing what I consider to be the main problems with A2025 I’ll make some specific comments on the study. While these comments might appear to be pedantic (some might even use the term ‘nit-picking’) I would argue that the authors have raised the bar for themselves by claiming that their proposed “guidelines, accompanied by annotated examples using open-source software tools, lay a foundation for robust ML benchmarking and thus the development of more impactful methods”. By way of an example, if you're trying to persuade an analytical chemist to modify an aqueous solubility assay to make it more suitable for generating data to build ML models then it's not such a great idea to describe aqueous solubility as a molecular property or to confuse the range in a data set with the dynamic range of the assay used to generate the data.
In the Introduction (Section 1) the Authors state:
In drug discovery, expensive and time-consuming experiments are used to profile molecules [While it is common for drugs to be described as ‘molecules’, especially in promotional material, I generally recommend that ‘molecule’ not be used as a synonym for ‘compound’ in articles with a cheminformatic (or indeed a chemical) focus.] and gain insights into their therapeutic potential. Such experimental assays are typically organized in a cascade, where subsequent experiments test fewer molecules at a higher cost per molecule. As in silico surrogates to such experiments, both regression and classification Machine Learning (ML) models can be trained to estimate molecular properties [These are properties of compounds, as opposed to molecules, and should neither be described as ‘molecular properties’ nor as ‘small molecule properties’.] (i.e., experimental results) from chemical structure. Such models could inform drug design and prioritize experiments by scoring a set of candidate molecules. [The term ‘candidate molecules’ is as clumsy as it is inaccurate, and its meaning will not be clear to some readers. I recommend that the term ‘chemical structures’ be used instead.] These ML models thus inform high-stakes decisions [The ML models that are the focus of this study inform decisions as to which compounds should be synthesized and these decisions would not automatically be considered to be high-stakes decisions in contemporary drug discovery given developments in automation and high-throughput synthetic chemistry. It’s also important to be aware that in real life drug discovery many decisions to synthesize compounds are made with the knowledge that structural analogs have already been synthesized and shown to be active against the targets of interest. I would argue that genuinely high-stakes decisions, such as prioritization of compounds for in vivo studies, are only made after compounds have actually been synthesized and evaluated in relevant in vitro assays.] and help drug discovery research progress more quickly and efficiently. Hence, it is important that models provide reliable forecasting of experimental results.
In Section 3.3.1.3 (Dynamic Range) the Authors state:
Both correlation and error metrics are influenced by the dynamic range of the data being modeled. [I consider this use of the term ‘dynamic range’ to be incorrect and, as a reviewer, I would have pressed the Authors to explain the difference between the range of a data set and its dynamic range. As noted earlier I see dynamic range as a characteristic of an analytical instrument or an assay (which can be considered to be a type of analytical instrument) and I would argue that the term should not be applied to data sets. That said, it may be possible to infer the dynamic range of an assay through careful examination of the data.] Achieving a high correlation on data sets with a broader range of experimental values is generally easier, whereas data sets with a smaller dynamic range can produce unrealistically small values for error metrics. [While the range of a data set certainly imposes limits on variance it’s important to remember that measures of correlation are defined in terms of variance (as opposed to range) of the data. For a data set to be useful for building ML models the variance for replicate measurements needs to be small in comparison with the overall variance for the data set.] This can lead to deceptive conclusions.
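As an aside on that last point, here’s a quick numerical sketch (mine, not the Authors’) of how replicate variability caps the correlation that even a perfect model could achieve against measured values; the noise levels are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
true_values = rng.normal(0.0, 1.0, 2000)          # 'true' property values (sd = 1)

for replicate_sd in (0.1, 0.5, 1.0):
    # Simulate single measurements with replicate (experimental) noise
    measured = true_values + rng.normal(0.0, replicate_sd, true_values.size)
    r = np.corrcoef(true_values, measured)[0, 1]  # a 'perfect' model scored against noisy data
    bound = 1.0 / (1.0 + replicate_sd ** 2)       # expected ceiling: var(true) / var(measured)
    print(f"replicate sd = {replicate_sd}: r^2 = {r ** 2:.2f} (expected ceiling ~ {bound:.2f})")
```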
With the pedantry (or nit-picking if you prefer) out of the way it’s time to take a look at what I consider to be the principal flaws of A2025. First, I consider it important to account for the number of adjustable parameters used to fit training data and, at the very least, the authors should have acknowledged this as an issue. Second, I have concerns that cross-validation can lead to optimistic assessment of model quality when there is a high degree of clustering in training data and the post from last July might be relevant.
It’s well known that you can achieve a better fit to your data by simply using more adjustable parameters (I recommend that all ML modellers take a look at H2004: DM Hawkins, The Problem of Overfitting, JCICS 2004 44:1-12) and my position is that it’s generally not meaningful to compare performance for models that differ in the number of adjustable parameters used to fit the training data without properly accounting for numbers of adjustable parameters. A criticism that I was making of the QSAR/QSPR field many years ago (long before ML modelling came to be touted as a panacea for the ills of Pharma/Biotech) was that many of those building models appeared to dismiss the accounting for numbers of adjustable parameters as a non-issue. It’s worth noting that building ML models typically involves selection of a subset of descriptors from a larger pool and the S2007 study argues that you also need to account for the number of descriptors in the pool when assessing model quality. Accounting for the number of adjustable parameters is not just an issue when you’re building ML models for small molecule drug discovery and this point is made in MHG2017 (Mardirossian and Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Molecular Physics 2017 115:2315-2372):
With semi-empirical density functionals, a measure that is commonly reported upon publication is the total number of parameters. Existing functionals based on the B97 concept have anywhere between 5 and 75 parameters. However, counting the number of parameters is often a confusing and unclear task.
The need to properly account for the number of adjustable parameters (the term 'degrees of freedom' is also used, especially in the older literature) when modelling data has actually been recognised for many years. The agrarian economist Mordecai Ezekiel (1899-1974), who shaped much of FDR’s agricultural policy, introduced adjusted R² (link1 | link2) in Methods of Correlation Analysis, which was published in 1930. The F-test (link1 | link2) can be used to assess whether the use of additional adjustable parameters is justified although I’m not aware of exactly when this particular use of the F-test was introduced. It’s also worth pointing out that the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) appeared in the statistics literature in 1974 and 1978 respectively. I certainly wouldn’t claim to have comprehensively reviewed the importance of accounting for the number of adjustable parameters when comparing ML model performance nor am I suggesting that this is something that would be easy to do. Nevertheless, I do hope that it's clear that this is not something that can simply be swept under the carpet (or even ejected from the window of an upper floor Moscow apartment).
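To show what this kind of accounting looks like in practice, here’s a minimal sketch (my own, using standard textbook formulae rather than anything prescribed in A2025 or the references above) comparing polynomial fits of increasing flexibility on synthetic data using adjusted R² and a least-squares form of AIC.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = 1.5 * x - 0.5 * x ** 2 + rng.normal(0.0, 0.5, x.size)   # quadratic 'truth' plus noise

n = x.size
for degree in (1, 2, 5, 8):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n_params = degree + 1                                    # fitted coefficients, including intercept
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params)     # adjusted R2 penalises extra parameters
    aic = n * np.log(rss / n) + 2 * n_params                 # least-squares form of AIC
    print(f"degree {degree}: R2 = {r2:.3f}  adj R2 = {adj_r2:.3f}  AIC = {aic:.1f}")
```

Raw R² can only increase as parameters are added, whereas adjusted R² and AIC will start to penalise the fit once the additional flexibility stops being justified by the data.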
This is a good point at which to say something about validation of ML models and I would argue that it is actually very difficult to demonstrate objectively that one protocol for validation is better than another. Two general approaches for validation of ML models are to use cross-validation and to split data into a training set and an external test set (that the model never sees). A view that I’ve held since the late 1990s is that many ‘global’ models for predicting properties of compounds relevant to drug discovery are actually ensembles of local models (this view was expressed publicly in the B2009 study). I would anticipate that clustering in data sets will cause cross-validation to give optimistic assessments of model quality, which in turn can lead to overfitting. I would also expect principal component analysis (PCA) to be less meaningful for highly clustered data (this is relevant because correlations between chemical structure descriptors need to be accounted for in order to calculate meaningful distances between chemical structures in the space). Something that I do need to make clear is that ‘clustering’ in the context of this post simply refers to the distribution of compounds within the chemical structure descriptor space of a model.
The Authors of A2025 recommend “using a 5 × 5 repeated cross-validation procedure to sample the performance distribution” and one point that I’ll make is that they haven’t demonstrated that this protocol is more effective than 4 × 4 repeated cross-validation or 6 × 6 repeated cross-validation. While this might appear to be nit-picking I will point out that it would not be valid to invoke A2025 if criticising a future ML modelling study for using 4 × 4 repeated cross-validation (bear in mind that a substructural match against even a single PAINS filter would be considered by some to constitute the basis for a valid criticism in medicinal chemistry and K2017 might be of interest in this context).
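For readers who want to see what the recommended protocol looks like in code, here’s a minimal sketch using scikit-learn’s RepeatedKFold; the random forest model and the synthetic descriptor matrix are placeholders of my own choosing and are not taken from the Authors’ annotated examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder data: in practice X would hold chemical structure descriptors
# and y the measured property values (e.g. pIC50 or log D)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)      # the recommended 5 x 5 protocol
scores = cross_val_score(RandomForestRegressor(n_estimators=200, random_state=1),
                         X, y, cv=cv, scoring="neg_mean_absolute_error")

print(f"{scores.size} folds: MAE = {-scores.mean():.2f} +/- {scores.std():.2f}")
```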
The general approach to cross-validation is to repeatedly split the data into training sets and test sets before assessing how well, on average, the test data are predicted (algorithms differ as to exactly how this is done). When there is a high degree of clustering the data splits are likely to retain some members of each cluster in the training sets, which can ‘anchor’ the models. Here’s what H2004 has to say:
If the collection of compounds consists of, or includes, families of close analogues of some smaller number of ‘lead’ compounds, then a sample reuse cross-validation will need to omit families and not individual compounds.
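Translating Hawkins’ advice into code, one way to omit families rather than individual compounds is to use scikit-learn’s GroupKFold with one group per structural series or cluster; the clustered synthetic data and series labels below are my own hypothetical illustration rather than anything prescribed by H2004 or A2025.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Placeholder clustered data: five tight 'series' in a 20-dimensional descriptor space
rng = np.random.default_rng(2)
centres = rng.normal(scale=3.0, size=(5, 20))
X = np.vstack([c + rng.normal(scale=0.2, size=(60, 20)) for c in centres])
y = X[:, 0] + rng.normal(scale=0.3, size=X.shape[0])
groups = np.repeat(np.arange(5), 60)                     # series / cluster membership

model = RandomForestRegressor(n_estimators=200, random_state=2)
random_folds = cross_val_score(model, X, y, scoring="neg_mean_absolute_error",
                               cv=KFold(n_splits=5, shuffle=True, random_state=2))
series_folds = cross_val_score(model, X, y, groups=groups, scoring="neg_mean_absolute_error",
                               cv=GroupKFold(n_splits=5))

print(f"random folds (analogues of each series seen in training): MAE = {-random_folds.mean():.2f}")
print(f"leave-series-out folds (whole series held out):           MAE = {-series_folds.mean():.2f}")
```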
Another approach to validating ML models is to use external test sets, although this can still lead to optimistic assessments of model quality when the available data are highly clustered. One advantage of this approach to validation is that external test sets can be ‘structured’ to provide a more detailed view of model performance (one criticism that I would make of cross-validation is that it gives a rather ‘one-dimensional’ assessment of model performance). One way to structure test sets is to characterize (by size and closeness) the neighbourhood within the training set for each object in the test set. The motivation for structuring the test sets in this manner is that it enables you to analyse relationships between prediction performance and the degree of coverage of the space around test set objects by training set data. There are, however, other ways to structure test sets and my view is that classifying test set compounds according to whether they are neutral, cationic or anionic would potentially be informative when assessing models for log D, aqueous solubility, permeability, plasma protein binding, volume of distribution and hERG blockade. Although it’s not directly relevant to this post, I would generally recommend that ML model predictions be presented to users along with training set data for the nearest neighbours in the model space and the most similar chemical structures in the training set.
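As a sketch of how such a neighbourhood characterization might be generated, the following uses RDKit Morgan fingerprints and Tanimoto similarity (my own choices; neither the descriptors nor the similarity measure is prescribed by A2025, and the SMILES are hypothetical placeholders).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def nearest_training_neighbour(train_smiles, test_smiles, radius=2, n_bits=2048):
    """For each test structure, report its most similar training set structure."""
    def fp(smi):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), radius, nBits=n_bits)
    train_fps = [fp(s) for s in train_smiles]
    report = []
    for smi in test_smiles:
        sims = DataStructs.BulkTanimotoSimilarity(fp(smi), train_fps)
        best = max(range(len(sims)), key=sims.__getitem__)
        report.append((smi, train_smiles[best], sims[best]))
    return report

# Hypothetical training and test structures
train = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N"]
test = ["CCOC", "c1ccccc1OC"]
for test_smi, nn_smi, sim in nearest_training_neighbour(train, test):
    print(f"{test_smi}: nearest training neighbour {nn_smi} (Tanimoto = {sim:.2f})")
```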
This is a good point at which to wrap up and I concede that it’s difficult to account for numbers of adjustable fitting parameters and to meaningfully validate models when distributions of objects within the relevant chemical spaces are very uneven. That said, I would argue that creators of ML models do at least need to acknowledge these issues given that many tout models like these as essential for AI-based drug design.
Anticipating a future blog post on chemical space coverage I'll finish the post by noting that coverage is also of historical relevance. The B-52 in the photo is not in the best state of repair and this shouldn't surprise you because I took the photo during a 2005 visit to Hanoi. In those days it was considered to be good form to show disrespect for the enemy's military hardware and so I gave the wreckage a good kick. I also paid my respects to Uncle Ho whom I’m told is in much better shape than Chairman Mao (owing to the then frosty Sino-Soviet relations the latter was pickled by inexperienced compatriots rather than by the Russian experts who had pickled the former and it is said that the embalming team arrived from Moscow before Uncle Ho had actually expired). A few days later in Dien Bien Phu I caused minor consternation by demonstrating that the barrel of an American-made 155 mm howitzer that had been captured from the French in 1954 could still be elevated (admittedly it was a little stiff). Apparently, the French had asked the Americans if they would be so kind as to drop lots of bombs (or perhaps one very big bomb) on the Viet Minh but President Eisenhower wisely denied the request. The B-52 in the photo was one of a number sent by President Nixon (who had been President Eisenhower’s VP) to bomb North Vietnam during Operation Linebacker II (aka the Christmas Bombings) and it's my understanding that all crew members survived their encounter with the SAM.

