Wednesday, 27 May 2026

Grand Challenges for Predictive Modeling in Small Molecule Drug Discovery

In this blog post I’ll be taking a look at C2026 (Grand Challenges for Predictive Modeling in Small Molecule Drug Discovery) which has been published as a ChemRxiv preprint. A well-organized collection of grand challenges can indeed help focus scientific research effort on the most important problems and I consider C2026 to be a welcome relief from the view that we can solve all problems with AI/ML. The authors put it well with their statement:

While there is substantial enthusiasm (particularly around AI) for revolutionizing drug discovery, this moment demands sharper problem definition.

In my view, however, C2026 could have been better organized (for example, I would question why covalent binding is in DOMAIN: CHEMISTRY while pKa is in DOMAIN: PHARMACOLOGY). Nevertheless, the article is still at the preprint stage and my feedback will hopefully be helpful for the authors.

I’ll direct readers to a recent blog post (The objectives of drug design) in which I suggest that it can be helpful to see design of drugs in terms of on-target bioactivity (good things that drugs do to the human body), off-target bioactivity (bad things that drugs do to the human body) and exposure (things that the human body does to drugs). Uncertainty pervades drug discovery and even if we knew the exact extent to which targets were engaged in vivo we still wouldn’t know what effects drugs will have on patients in the absence of other information (this is the uncertainty that results from the complexity of biology). One significant source of uncertainty is that we generally can’t currently measure the concentration of a drug at its site(s) of action and I recommend that everybody working in Drug Discovery (and Chemical Biology) take a look at SR2019 (Smith & Rowland, Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise DMD 2019 47:667-672).

Some years ago I suggested that drug design could be classified as prediction-driven or hypothesis-driven and I’ll direct readers to the P2012 article on hypothesis-driven drug design by former colleagues. Back in 2009 I stated that “in many situations, properties of compounds simply cannot be predicted with the accuracy required for meaningful design, especially when optimization is performed against multiple end points” and, despite some impressive advances in predictive chemistry since then, this is still my view. Put another way, drug discovery needs to be considered in a Design of Experiments framework and I consider it an error to perceive it as simply an exercise in prediction.

The value of a prediction made using chemical structure as the only input drops sharply once a sample of the compound has been prepared, and decisions as to whether further work on an existing compound is justified will invariably be based on measured data. For example, the PK/PD modelling used to set the dose will typically be based on measured bioactivity (often cell-based) and pharmacokinetics. Aside from speed, the great advantage of calculating ‘relative’ (see CAS2017), as opposed to ‘absolute’, free energy is that it enables project team scientists to use existing affinity and potency measurements for design. That said, the purpose of grand challenges like these is to articulate what we need to be able to predict rather than get distracted by feasibility issues.
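The arithmetic behind using a relative free energy with an existing measurement can be sketched as follows. This is a minimal illustration, assuming that a relative binding free energy (in kcal/mol) has been calculated for a proposed compound with respect to a reference compound whose pKd has been measured. The function name, the temperature of 298.15 K and the example numbers are all hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def predicted_pkd(measured_pkd_ref, ddg_kcal, temperature=298.15):
    """Estimate pKd of a new compound from a reference measurement.

    ddg_kcal is the calculated relative binding free energy
    (new minus reference); a negative value means tighter binding.
    """
    # dG = RT ln(Kd), so ddG = -RT ln(10) * (pKd_new - pKd_ref)
    return measured_pkd_ref - ddg_kcal / (R_KCAL * temperature * math.log(10))


# Hypothetical numbers: reference compound with a measured pKd of 7.0 and a
# calculated relative free energy of -1.36 kcal/mol for the proposed analog.
print(round(predicted_pkd(7.0, -1.36), 2))
```

The measured value anchors the estimate, so only the difference between the two compounds needs to be calculated.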

With the preamble out of the way I’ll focus on the grand challenges and for the remainder of the post my comments will follow the order of the manuscript. As noted in my review of A2025, 'molecule' should not be used as a synonym for either 'compound' or 'chemical structure'. 

DOMAIN: CHEMISTRY

I suggest covering Covalent Binding in DOMAIN: STRUCTURE and DOMAIN: ENERGY and would include reactivity in Challenge: Chemical Stability and Degradation Products (a quinone might be perfectly stable but it’s not something that you would want to have in an enzyme inhibition assay). My view is that physicochemical properties such as pKa, aqueous solubility, aggregation and passive permeability would be more appropriately covered in DOMAIN: CHEMISTRY than in DOMAIN: PHARMACOLOGY and I would also include alkane/water partition coefficient (this is more appropriate than its octanol/water equivalent for studying aqueous solvation and is also a better model for the core of a lipid bilayer). It might also be worth including UV-Vis absorption and fluorescence here given that both phenomena are widely exploited to assay bioactivity of compounds.

DOMAIN: STRUCTURE

Given significant interest in ‘new modalities’ I suggest referring to ‘targets’ rather than ‘proteins’ and it might be worth considering ternary structures (important in targeted protein degradation). Structures for target-ligand complexes are not directly relevant to design when association is irreversible although they are still useful starting points for building transition state models.

DOMAIN: ENERGY

Many of the quantities that form the basis of drug design fit naturally into DOMAIN: ENERGY given that they are effectively equilibrium constants or rate constants. Given significant interest in ‘new modalities’ I suggest referring to ‘targets’ rather than ‘proteins’. For irreversibly-bound ligands it's also necessary to calculate the transition state energy because target engagement occurs under kinetic control. My view is that  oral absorption and drug distribution as well as modelling of enzymatic reactions (for example, oxidative metabolism by CYPs) and active transport would be easily accommodated within DOMAIN: ENERGY.  One challenge that should be explicitly stated is prediction of plasma drug concentration profiles in humans because it is needed for meaningful PK/PD modelling.

DOMAIN: PHARMACOLOGY

A number of the challenges in DOMAIN: PHARMACOLOGY are not actually related to pharmacology and challenges such as Toxicity and PK/PD modelling could be accommodated within DOMAIN: ENERGY.

Wednesday, 20 May 2026

The objectives of drug design

I'll open the post on drug design objectives with photos from a most enjoyable and informative visit to the Australian Synchrotron early in 2010 when I was helping with fragment library design at CSIRO.



I’ve been meaning for ages to do a post like this and was finally goaded into action when I recently looked at two short videos from interviews with Sir Demis Hassabis, founder of Google DeepMind and Isomorphic Labs, and one of the 2024 Nobel Chemistry Prize laureates. Predicting the 3D structure of a protein from its amino acid sequence is a capability that has been eagerly sought for a long time and, as we celebrate the award, we need to also recognize the remarkable foresight of those who launched the Protein Data Bank in 1971 with just seven X-ray crystal structures. We also need to recognize that protein structures are inherently flexible and subject to post-translational modification such as glycosylation and phosphorylation. Furthermore, the crystal structure that has actually been determined might correspond to a relatively small portion (for example, a tyrosine kinase domain) of a much larger structure such as a dimeric growth factor receptor.

Let’s take a look at the two videos. In the first video, Sir Demis suggests that the end of disease is “within reach maybe in the next decade or so” and it’s worth pointing out that most of the cost of bringing a drug to market comes from clinical development rather than the actual discovery of the drug (nobody spends “ten years and billions of dollars to design just one drug” and it would be more accurate to say that we do so to see if what we've designed really is a drug). Furthermore, work in the late stage of drug discovery when project teams are assessing their best compounds should not really be regarded as drug design. In the second video, Sir Demis acknowledges that “knowing the structure of a protein is only one step in the drug discovery process” although it’s not clear exactly how “many adjacent AlphaFolds” are going to meaningfully address the issues of side effects.

Drug design is frequently asserted to be a multi-objective exercise and, in this post, I’ll be trying to discuss this in a way that I hope will be helpful to drug discovery scientists using artificial intelligence (AI) and machine learning (ML) in design. The ultimate aim of drug design is to identify compounds (and biological entities such as therapeutic antibodies) that can be used to treat diseases without harming patients and I suggest that this can be stated as three design objectives. My view is that the term 'multi-objective' is more appropriate than 'multi-parameter' in the context of drug design because even against a single objective design can involve optimization of multiple parameters. One characteristic of drug design is that the design process is over long before we get to find out how successfully the outputs of design perform their function (in design of materials it's possible to evaluate design outputs more directly). I recall a Head of Research and Development at Zeneca describing the process as "like steering an oil tanker".

I prefer to use the more general term ‘bioactivity’ to describe the effects of drugs on targets (and anti-targets) because in some cases these effects cannot be meaningfully described by a single parameter such as an IC50 value. As an aside this is a good point at which to celebrate the recent FDA approval of the PROTAC Vepdegestrant for treatment of ESR1m, ER+/HER2- advanced breast cancer and I'll direct readers to this most excellent and timely review on targeted protein degradation. The concentration of a drug in contact with a target (or anti-target), which varies with time, is determined by dose, and by the drug’s absorption, distribution, metabolism, and excretion (commonly referred to as ADME). While the therapeutic and adverse effects of a drug are what the drug does to the body, ADME is what the body does to the drug. Put another way, minimization of toxicity and optimizing ADME are entirely different objectives and I generally recommend that the acronym ADMET not be used.

Uncertainty is omnipresent in drug discovery and, despite what many appear to believe, AI/ML is not going to make this uncertainty vanish as if by magic. Derek was emphasizing the challenges presented by the complexity of biology long before AI came to be seen by some as a panacea for the ills of Pharma/Biotech (here’s a post from almost two decades ago and I also recommend reading his 2025 post on the “End of Disease” interview which also links relevant previous posts). The complexity of biology means that even if we knew the extent of target engagement in vivo (which varies with both dose and time) we wouldn’t generally be able to predict the in vivo effects of the drug with any confidence in the absence of other information. There is also uncertainty in exposure to consider and the concentration of a drug at its site(s) of action generally cannot be measured in vivo unless the target(s) are in direct contact with plasma. Uncertainty in exposure for intracellular targets is also a clinical development issue because failure in a Phase II trial may simply reflect inadequate exposure (we noted in KM2013 that “one can argue that a typical Phase I trial provides an incomplete description of distribution”). I recommend that everybody working in drug discovery and chemical biology read Smith & Rowland (2019) Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise DMD 47:667-672 DOI. I argue in NoLE that achieving controllability of exposure should be seen as an objective of drug design.

One way that pharmacokinetic/pharmacodynamic (PK/PD) modellers address the issue of intracellular exposure is to assume that the concentration of drug in contact with its target(s) (and anti-targets) equals its unbound concentration in plasma (which can be measured in real time) and this assumption is referred to as the ‘free drug hypothesis’ (‘principle’ and ‘theory’ are also used in this context although I personally prefer ‘hypothesis’ because it’s an assumption we’re making). There are two scenarios under which the approximation of the concentration of drug at its site(s) of action by its unbound concentration in plasma is known to be unreliable. The first scenario is that there is significant active transport at one or more points on the path between plasma and the drug’s site(s) of action (active efflux is a common problem, especially in CNS drug discovery, although active influx will also cause the assumption to break down). The second scenario is that the pH at the drug’s site(s) of action differs from plasma pH (as would be the case for a lysosomal target) and that there is an ionizable group such as a basic nitrogen in the chemical structure of the drug.
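The second scenario can be made concrete with the standard pH-partition argument. The sketch below assumes a monoprotic base, that only the neutral form crosses membranes, and that the unbound concentration of the neutral form is the same on both sides at steady state. The function name and the example pKa and pH values are illustrative assumptions rather than data for any particular drug.

```python
def base_accumulation_ratio(pka, ph_site, ph_plasma=7.4):
    """Ratio of total unbound concentration (site / plasma) for a monoprotic base.

    Assumes only the neutral form equilibrates across the membrane, so the
    ratio is set by the extent of ionization on each side
    (Henderson-Hasselbalch).
    """
    return (1.0 + 10.0 ** (pka - ph_site)) / (1.0 + 10.0 ** (pka - ph_plasma))


# Hypothetical base with pKa 9.0 and an acidic compartment at pH 5.0
ratio = base_accumulation_ratio(pka=9.0, ph_site=5.0)
print(f"site/plasma concentration ratio: {ratio:.0f}")
```

Under these assumptions the unbound plasma concentration would underestimate the concentration in the acidic compartment by a couple of orders of magnitude, which is why the free drug hypothesis can't be confidently invoked in this scenario.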

While drug design does indeed have multiple objectives it really shouldn’t need to be said that if the required level of bioactivity cannot be achieved then it becomes irrelevant whether the other objectives are achieved and I’ll direct readers to M2026 (The Affinity Advantage). I see M2026 as providing a much-needed cold shower for a 2024 JMC Editorial (Property-Based Drug Design Merits a Nobel Prize; see blog post) in which it is asserted that “a discovery compound is more likely to become a drug when Fsp3 > 0.40” and that “a compound is more likely to have good developability when PFI < 7”. Nevertheless, I don’t consider M2026 to be especially useful from the perspective of defining drug design objectives because bioactivity is typically quantified by potency rather than affinity in drug discovery projects (an assay for kinase inhibition might have been run at high ATP concentration to mimic the intracellular environment) and some bioactivity objectives are defined in terms of measurements made in cell-based assays. Furthermore, bioactivity for ‘new modalities’ such as irreversible covalent inhibition and targeted protein degradation cannot be adequately described by a single parameter such as an IC50 value.

I criticized the term ‘avoid-ome’ in a previous post and, with apologies for the dreadful pun, I would recommend that its use be avoided (at the risk of repetition ADME and toxicity are entirely separate issues that must be addressed separately). Furthermore, I would question whether drug designers actually need yet another ‘ome’ word and I consider the notion that embracing the avoid-ome will transform drug discovery to be fanciful. While inhibition of cytochrome P450 (CYP) enzymes is generally undesirable from a toxicity perspective a compound that was not cleared by these metabolic enzymes would greatly worry those responsible for drug safety (bear in mind why we worry about inhibition of CYPs in the first place). Furthermore, I would challenge the inclusion by M2026 of serum albumin in a list of anti-targets such as hERG (I’m not aware of anybody suffering cardiac arrest on account of their medication binding to serum albumin) and the excellent B2025 study notes that "most drugs are >95% plasma protein bound (58%), with a large fraction >99% bound (29%)". Binding to plasma proteins should actually be considered within the framework of distribution (it can be instructive to pose the question as to whether you could tell where a drug was simply from knowing the total quantity of it in the body and its unbound plasma concentration). It’s also worth mentioning that binding to plasma proteins will protect an orally-dosed drug from the metabolizing enzymes during its first pass through the liver (before it gets a chance to distribute into the tissues). Variation of the plasma concentration during the dosing interval for an orally-dosed drug is a necessary evil resulting from oral dosing and in many situations the ‘ideal’ pharmacokinetic profile would actually be that resulting from intravenous infusion (plasma concentration of the drug is maintained at a level required for therapeutically useful effects).

At this point I’ll attempt to articulate three general objectives of drug design (the only thing that I’m entirely confident about here is that I won’t get these exactly right). One of the great challenges that drug designers face is that it is usually difficult to identify compounds that simultaneously achieve all the design objectives. Specifying criteria for objectives too permissively increases the risk of choking in clinical development. However, overly stringent specification of criteria for objectives decreases the likelihood of achieving all of the objectives and will slow the discovery process. I state these objectives in terms of ‘bioactivity’ rather than ‘potency’ to accommodate ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation although, in many cases, it will be possible to quantify the bioactivity for a compound by a single IC50 or EC50 value. I use ‘maximize’ and ‘minimize’ (as opposed to ‘optimize’) to frame the objectives because there is generally no penalty for identifying better compounds than you think you need. Assessing how well objectives have been achieved involves running a diverse range of assays and, as noted in this blog post on the A2025 study, it is important to be fully aware of the quantitation limits for each and every assay that you use.

I'll conclude the post with what I would argue are the three objectives of drug design:

  1. Maximize on-target bioactivity.  This is the least difficult objective to specify because bioactivity characterized in the in vitro assays is likely to translate to target engagement in vivo provided that the compound can be presented to the target(s) at the required concentration. Design outputs are usually evaluated in animal models for the human disease before initiating studies in humans but the design itself is almost invariably done against in vitro end points. 
  2. Minimize off-target bioactivity. It is generally more difficult to specify objectives for off-target bioactivity than for on-target bioactivity on account of the numbers and diversity of the assays involved. Design outputs are always evaluated for toxicity in animals before initiating studies in humans (as mandated by regulatory authorities) but the design itself is almost invariably done against in vitro end points.    
  3. Maximize controllability of exposure. This objective, which might also be stated as 'Optimize ADME', is the most difficult of the three objectives to specify because, as noted earlier in this post, exposure generally can’t be measured for targets that are not in direct contact with plasma. At absolute minimum it is necessary to demonstrate that a pharmacokinetic profile can be achieved in animals that will maintain the (unbound) concentration of the compound at levels that we believe will result in beneficial therapeutic effects in humans. For targets not in contact with plasma the PK/PD modellers also need to be able to confidently invoke the free drug hypothesis (this is why I prefer to frame the objective in terms of exposure rather than ADME) and this requires that design outputs have good passive permeability and are not subject to active transport. In some cases it will also be necessary to demonstrate access to specific organs such as the CNS.  

 

Tuesday, 21 April 2026

Comparing ML models in small molecule drug discovery

To start the post I'll share a photo that I took in 2012 of incense sticks at the Truc Lam pagoda near Da Lat. Not long after taking this photo I lost a lens cap (although thankfully not the lens) riding a luge through a forest and would later visit a cricket farm (this was particularly welcome because I had developed a taste for fried crickets during a visit to Cambodia in 2005).

  

I’ll be reviewing A2025 (Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery) in this post. I consider the issues addressed by the authors to be extremely important and I think that the credibility of the Machine Learning (ML) field would be greatly enhanced if Editors declared words like 'outperform' to be verboten in manuscripts submitted to their journals. However, I will make a couple of criticisms of the study. First, ML modellers need to properly account for the number of adjustable parameters used to fit training data (the S2006 study goes further than this by arguing that one should also account for size of the descriptor pool). Second, ML modellers need to recognize that cross-validation can make optimistic assessments of model quality when there is a high degree of clustering in training data. I’ll point you toward earlier Molecular design blog posts (Sep2024 | Oct2024 | Jul2025) that may be relevant to the discussion. As is usual for posts here at Molecular Design quoted text is indented with my comments italicised in red.
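One simple, and by no means the only, way to account for adjustable parameters when assessing fit to training data is the adjusted R². The sketch below is offered purely as an illustration of the first criticism and is not the approach taken in either A2025 or S2006. The function name and the example numbers are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_params):
    """Adjusted R-squared for a model with n_params fitted parameters
    (excluding the intercept) and n_obs training observations."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("too few observations for this many parameters")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_params - 1)


# Same raw R2 of 0.80 on 50 compounds, with 3 or 30 fitted parameters
for n_params in (3, 30):
    print(n_params, round(adjusted_r2(0.80, 50, n_params), 3))
```

The penalty grows quickly as the number of fitted parameters approaches the number of observations, and the statistic is undefined for models with more parameters than training compounds (which is the situation for many ML models).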

The ML models that form the focus of the A2025 study aim to predict properties (more generally behaviour) of compounds from their chemical structures. Although there is currently a lot of hype around ML models for drug discovery it’s worth bearing in mind that people have been building quantitative structure-activity/property (QSAR/QSPR) models for decades (the inaugural EuroQSAR conference was held in Prague a mere five years after Czechoslovakia had been invaded by forces from the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). As I see it QSAR/QSPR approaches never really made much of a splash in real world drug discovery and my challenge to those who tout ML models as a panacea for the ills of Pharma/Biotech would be to ask why they think it’s going to be any different this time.

One of the difficulties that QSAR/QSPR practitioners faced when working within drug discovery project teams was that projects had often delivered (or had been put out of their misery) by the time there was enough data to build predictively useful models. It’s also worth pointing out that drug discovery teams have frequently delivered (and continue to deliver) clinical development candidates without ever having sufficient data for building usefully predictive QSAR/QSPR models. Something that many QSAR/QSPR practitioners never seemed to get is that much drug design is actually hypothesis-driven (I discussed this point 16 years ago in K2009 and I’ll point you to the P2012 article by former colleagues). A significant part of hypothesis-driven drug design is identification of exploitable features in structure activity/property relationships (SARs/SPRs) such as activity cliffs and instances of increased polarity not resulting in loss of potency. A simple plot of potency against lipophilicity might not be predictively useful but it can still be used to quantify the extent to which the potency of a compound beats the trend in the data (see ‘Alternatives to ligand efficiency for normalization of affinity’ section in NoLE). My view is that hypothesis-driven drug design actually fits very naturally into an AI framework and those who tout AI as a drug design panacea appear to be missing a trick by seeing drug design as essentially an exercise in prediction.
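Quantifying the extent to which a compound beats the trend amounts to calculating a residual from a fitted line. The sketch below is a minimal illustration that assumes an ordinary least-squares fit of pIC50 against log D. The function name and the data are hypothetical and are not taken from NoLE.

```python
def trend_residuals(logd, pic50):
    """Residuals of pIC50 from a least-squares line fitted against log D.

    A positive residual means the compound is more potent than the
    trend in the data would suggest for its lipophilicity.
    """
    n = len(logd)
    mean_x = sum(logd) / n
    mean_y = sum(pic50) / n
    sxx = sum((x - mean_x) ** 2 for x in logd)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(logd, pic50))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(logd, pic50)]


# Hypothetical series: the fourth compound is relatively polar but potent
logd = [1.0, 2.0, 3.0, 1.5, 4.0]
pic50 = [5.1, 6.0, 7.1, 7.0, 7.9]
for x, r in zip(logd, trend_residuals(logd, pic50)):
    print(f"logD {x:.1f}  residual {r:+.2f}")
```

The fitted line need not be predictively useful for the residuals to flag compounds that merit a closer look.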

Many of the properties of compounds of interest to ML modellers in drug discovery can be modelled as if they are equilibrium constants or rate constants (continuous-valued, dimensioned quantities) and typically fall into three general categories: 

  1. In vitro bioactivity is usually quantified in terms of potency (concentration at which a compound exhibits a specified effect in a bioactivity assay) and, despite the views expressed in a rather bizarre JMC Editorial (a recent JMC Perspective provides a useful counterview and this blog post is also relevant), is the most important of the properties because you can’t compensate for inadequate potency by increasing quality of compounds or by making them more beautiful (see B2012) and I touch on this point in a recent blog post. It is important that ML modellers be aware that for some ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation the effect of a compound on the target depends on time as well as concentration. I discuss some of the issues that you need to think about when combining potency and affinity data for ML modelling of bioactivity in this blog post.
  2. Properties considered to be relevant to ADME (absorption, distribution, metabolism, and excretion) include lipophilicity, aqueous solubility, permeability (both passive and active efflux) and plasma protein binding. While these properties are often described collectively as a compound's 'ADME profile' it's not actually accurate to do so because the ADME acronym refers to behaviour of compounds in vivo. Lipophilicity is the single most fundamental physicochemical property in drug design and it’s very important that ML modellers be aware that it's log D, rather than log P, that is measured and that the choice of octanol/water for log D measurement is entirely arbitrary.
  3. Toxicity is typically assessed by measuring potency against anti-targets such as hERG and CYPs and cell-based assays are often used for assessment of toxicity. Generally it is more difficult to find suitable assay data for ML modelling of toxicity than is the case for modelling bioactivity against potential therapeutic targets. One reason for this is that responses in the cell-based assays commonly used to assess toxicity can't generally be linked to engagement of specific anti-targets (this is not to deny the value of the information provided by the assays for decision-making by drug discovery scientists). Furthermore, observations of potency in toxicity assays are likely to steer project teams away from the associated chemotypes and so it is very unlikely that ML modellers will encounter datasets for individual structural series with sufficient variance for building models.      

When modelling properties of compounds that you believe to be relevant to small molecule drug discovery it’s important to bear in mind that even with a complete set of measured properties available it’s not generally feasible to predict what will happen when compounds are dosed in vivo. One reason for this is that the therapeutic (and adverse) effects of a drug are driven by its concentration at its site(s) of action which is a time-dependent quantity that cannot generally be measured in live humans. I argue in NoLE that the objective of the ADME-based aspects of drug design is actually to achieve controllability of exposure and one article that I recommend to all drug discovery scientists and chemical biologists is SR2019 (Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise).

A number of assays are available for measuring properties of interest in drug discovery and management of the ‘assay budget’ for projects is an important activity in drug discovery (especially when running assays is an outsourced activity). Drug discovery scientists typically use assays to identify and address specific design issues such as low solubility or unacceptable binding affinity for anti-targets.  

In vitro assays used in drug discovery are generally configured for decision-making, rather than for building ML models, and in some cases what some might refer to as the ‘quality’ of the assay might be traded off against throughput (this doesn’t mean that the assays are somehow ‘bad’). In vitro drug discovery assays generally have both lower and upper quantitation limits and an assay’s dynamic range (you can draw an analogy between assays and analytical instruments) is given by the difference between the two values. Needless to say it is very important that ML modellers be fully aware of the lower and upper quantitation limits in the assays used to generate the data from which they will build models. This generally requires careful examination of assay details which might not have been captured by the curation processes used for databases such as ChEMBL (nor even been disclosed in the original publications). For example, maximum potency that can be quantified in a conventional enzyme inhibition assay is limited by the concentration of enzyme in the assay (see WM1979) and you’ll still need a 5 nM concentration of a picomolar inhibitor to achieve 50% inhibition of enzyme that is present in the assay at a concentration of 10 nM. I generally advise ML modellers to carefully examine the distributions in the datasets that they are modelling for evidence of cut offs that might indicate quantitation limits in the assays used to generate the data. 
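The enzyme concentration limit can be illustrated numerically. The sketch below uses the relationship IC50 = Ki(app) + [E]/2 that follows from simple 1:1 binding when depletion of inhibitor by enzyme is not negligible, with any effect of substrate competition assumed to be folded into Ki(app). The function name and the Ki values are illustrative.

```python
def tight_binding_ic50(ki_app_nm, enzyme_nm):
    """Apparent IC50 (nM) when inhibitor depletion by enzyme is not negligible.

    Half of the enzyme must be occupied at 50% inhibition, so the total
    inhibitor concentration can never be less than half of the enzyme
    concentration, however small Ki is.
    """
    return ki_app_nm + enzyme_nm / 2.0


enzyme_nm = 10.0
for ki_nm in (100.0, 1.0, 0.001):  # 0.001 nM is 1 pM
    ic50 = tight_binding_ic50(ki_nm, enzyme_nm)
    print(f"Ki(app) {ki_nm:g} nM -> IC50 {ic50:g} nM")
```

A pile-up of measured IC50 values just above half the enzyme concentration is one form that a quantitation limit can take in a dataset.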

The effects of a drug in vivo are typically driven by its unbound concentration in plasma and assays for properties of interest in drug discovery are generally run in buffered aqueous media. It is well-known that measured values for physicochemical properties such as log D and aqueous solubility generally vary with pH for compounds with ionizable groups in their chemical structures. However, values measured for these properties can, in some scenarios, also depend on both the nature and concentration of counter-ion(s). This becomes an issue for log D measurement in cases where significant proportions of compounds are present in the organic phase in ionized forms and for aqueous solubility measurement when the measured value is limited by the solubility of a salt form (as opposed to the neutral form). Dependence of measured property values on the nature and concentration of counter-ions is likely to be more of an issue when the degree of ionization (in aqueous media) is relatively high and my default advice is to consider pKa when models underpredict log D or overpredict aqueous solubility values.
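The pH dependence of log D is commonly modelled by assuming that only the neutral form partitions into the organic phase. The sketch below implements that textbook relationship for a monoprotic compound. The function name and the example values are hypothetical and the model deliberately ignores partitioning of ionized forms and counter-ion effects.

```python
import math


def logd_neutral_only(logp, pka, ph=7.4, is_base=True):
    """log D for a monoprotic compound if only the neutral form partitions."""
    exponent = (pka - ph) if is_base else (ph - pka)
    return logp - math.log10(1.0 + 10.0 ** exponent)


# Hypothetical base: log P 3.0, pKa 9.4, measured at pH 7.4
print(round(logd_neutral_only(3.0, 9.4), 2))
```

A measured log D well above the value given by this model would be consistent with ionized forms being present in the organic phase, which is the situation in which the nature and concentration of counter-ions start to matter.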

Before addressing what I consider to be the main problems with A2025 I’ll make some specific comments on the study. While these comments might appear to be pedantic (some might even use the term ‘nit-picking’) I would argue that the authors have raised the bar for themselves by claiming that their proposed “guidelines, accompanied by annotated examples using open-source software tools, lay a foundation for robust ML benchmarking and thus the development of more impactful methods”.  By way of an example, if you're trying to persuade an analytical chemist to modify an aqueous solubility assay to make it more suitable for generating data to build ML models then it's not such a great idea to describe aqueous solubility as a molecular property or to confuse the range in a data set with the dynamic range of the assay used to generate the data.    

In the Introduction (Section 1) the Authors state:

In drug discovery, expensive and time-consuming experiments are used to profile molecules [While it is common for drugs to be described as ‘molecules’, especially in promotional material, I generally recommend that ‘molecule’ not be used as a synonym for ‘compound’ in articles with a cheminformatic (or indeed a chemical) focus.] and gain insights into their therapeutic potential. Such experimental assays are typically organized in a cascade, where subsequent experiments test fewer molecules at a higher cost per molecule. As in silico surrogates to such experiments, both regression and classification Machine Learning (ML) models can be trained to estimate molecular properties [These are properties of compounds, as opposed to molecules, and should neither be described as ‘molecular properties’ nor as ‘small molecule properties’.] (i.e., experimental results) from chemical structure. Such models could inform drug design and prioritize experiments by scoring a set of candidate molecules. [The term ‘candidate molecules’ is as clumsy as it is inaccurate, and its meaning will not be clear to some readers. I recommend that the term ‘chemical structures’ be used instead.] These ML models thus inform high-stakes decisions [The ML models that are the focus of this study inform decisions as to which compounds should be synthesized and these decisions would not automatically be considered to be high-stakes decisions in contemporary drug discovery given developments in automation and high-throughput synthetic chemistry. It’s also important to be aware that in real life drug discovery many decisions to synthesize compounds are made with the knowledge that structural analogs have already been synthesized and shown to be active against the targets of interest. I would argue that genuinely high-stakes decisions, such as prioritization of compounds for in vivo studies, are only made after compounds have actually been synthesized and evaluated in relevant in vitro assays.] and help drug discovery research progress more quickly and efficiently. Hence, it is important that models provide reliable forecasting of experimental results.

In Section 3.3.1.3 (Dynamic Range) the Authors state:

Both correlation and error metrics are influenced by the dynamic range of the data being modeled. [I consider this use of the term ‘dynamic range’ to be incorrect and, as a reviewer, I would have pressed the Authors to explain the difference between the range of a data set and its dynamic range. As noted earlier I see dynamic range as a characteristic of an analytical instrument or an assay (which can be considered to be a type of analytical instrument) and I would argue that the term should not be applied to data sets. That said, it may be possible to infer the dynamic range of an assay through careful examination of the data.]  Achieving a high correlation on data sets with a broader range of experimental values is generally easier, whereas data sets with a smaller dynamic range can produce unrealistically small values for error metrics. [While the range of a data set certainly imposes limits on variance it’s important to remember that measures of correlation are defined in terms of variance (as opposed to range) of the data. For a data set to be useful for building ML models the variance for replicate measurements needs to be small in comparison with the overall variance for the data set.] This can lead to deceptive conclusions.
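
To illustrate the distinction between range and variance, here is a minimal sketch using invented numbers (the two data sets and the error pattern are assumptions made purely for illustration and are not taken from A2025). Both data sets span exactly the same range and both are 'predicted' with exactly the same errors, yet the coefficient of determination differs because the variances differ.

```python
import statistics

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Two hypothetical data sets with the same range (4 to 8) but different variance.
spread_out = [4.0, 4.0, 4.0, 8.0, 8.0, 8.0]  # values sit at the extremes
bunched_up = [4.0, 6.0, 6.0, 6.0, 6.0, 8.0]  # values sit mostly in the middle

# The same errors are applied to both data sets, so the RMSE is 0.5 in each case.
errors = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

def predict(observed):
    """Hypothetical 'model': the observed value plus a fixed error."""
    return [o + e for o, e in zip(observed, errors)]

r2_spread = r_squared(spread_out, predict(spread_out))
r2_bunched = r_squared(bunched_up, predict(bunched_up))

print(f"range is 4.0 in both cases; R2 = {r2_spread:.4f} vs {r2_bunched:.4f}")
```

With identical range and identical RMSE, the data set with the larger variance gives the higher value, which is why I would frame the discussion in terms of variance.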

With the pedantry (or nit-picking if you prefer) out of the way it’s time to take a look at what I consider to be the principal flaws of A2025. First, I consider it important to account for the number of adjustable parameters used to fit training data and, at the very least, the authors should have acknowledged this as an issue. Second, I have concerns that cross-validation can lead to optimistic assessment of model quality when there is a high degree of clustering in training data and a post from July of last year might be relevant.

It’s well known that you can achieve a better fit to your data by simply using more adjustable parameters (I recommend that all ML modellers take a look at H2004: DM Hawkins, The Problem of Overfitting, JCICS 2004 44:1-12) and my position is that it’s generally not meaningful to compare performance for models that differ in the number of adjustable parameters used to fit the training data without properly accounting for numbers of adjustable parameters. A criticism that I was making of the QSAR/QSPR field many years ago (long before ML modelling came to be touted as a panacea for the ills of Pharma/Biotech) was that many of those building models appeared to dismiss the accounting for numbers of adjustable parameters as a non-issue. It’s worth noting that building ML models typically involves selection of a subset of descriptors from a larger pool and the S2007 study argues that you also need to account for the number of descriptors in the pool when assessing model quality. Accounting for the number of adjustable parameters is not just an issue when you’re building ML models for small molecule drug discovery and this point is made in MHG2017 (Mardirossian and Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Molecular Physics, 115 2315–2372):

With semi-empirical density functionals, a measure that is commonly reported upon publication is the total number of parameters. Existing functionals based on the B97 concept have anywhere between 5 and 75 parameters. However, counting the number of parameters is often a confusing and unclear task.

The need to properly account for the number of adjustable parameters (the term 'degrees of freedom' is also used, especially in the older literature) when modelling data has actually been recognised for many years. The agrarian economist Mordecai Ezekiel (1899-1974), who shaped much of FDR’s agricultural policy, introduced adjusted R² (link1 | link2) in Methods of Correlation Analysis which was published in 1930. The F-test (link1 | link2) can be used to assess whether the use of additional adjustable parameters is justified although I’m not aware of exactly when this particular use of the F-test was introduced. It’s also worth pointing out that the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) appeared in the statistics literature in 1974 and 1978 respectively. I certainly wouldn’t claim to have comprehensively reviewed the importance of accounting for number of adjustable parameters when comparing ML model performance nor am I suggesting that this is something that would be easy to do. Nevertheless, I do hope that it's clear that this is not something that can simply be swept under the carpet (or even ejected from the window of an upper floor Moscow apartment).
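
For readers who would like to see how these corrections behave, here is a minimal sketch that assumes ordinary least-squares fits with Gaussian errors (the residual sums of squares and the parameter counts are invented for illustration, and the function names are my own). It uses the textbook forms of adjusted R², AIC, BIC and the F statistic for nested models; counting the parameters of a modern ML model is much less clear-cut than this sketch might suggest.

```python
import math

def adjusted_r2(rss, tss, n_obs, n_params):
    """Adjusted R-squared; n_params counts every fitted coefficient, intercept included."""
    return 1.0 - (rss / tss) * (n_obs - 1) / (n_obs - n_params)

def aic(rss, n_obs, n_params):
    """AIC for a least-squares fit with Gaussian errors (additive constant dropped)."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

def bic(rss, n_obs, n_params):
    """BIC for a least-squares fit with Gaussian errors (additive constant dropped)."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)

def f_statistic(rss_small, k_small, rss_large, k_large, n_obs):
    """F statistic for two nested least-squares models (small model nested in large)."""
    numerator = (rss_small - rss_large) / (k_large - k_small)
    denominator = rss_large / (n_obs - k_large)
    return numerator / denominator

# Invented numbers: 50 observations, total sum of squares of 100.
n_obs, tss = 50, 100.0
rss_small, k_small = 40.0, 3    # 3 fitted parameters
rss_large, k_large = 38.0, 10   # 10 fitted parameters, slightly better fit

r2_small, r2_large = 1 - rss_small / tss, 1 - rss_large / tss
adj_small = adjusted_r2(rss_small, tss, n_obs, k_small)
adj_large = adjusted_r2(rss_large, tss, n_obs, k_large)
f_value = f_statistic(rss_small, k_small, rss_large, k_large, n_obs)

print(f"R2:          {r2_small:.3f} vs {r2_large:.3f}")
print(f"adjusted R2: {adj_small:.3f} vs {adj_large:.3f}")
print(f"AIC:         {aic(rss_small, n_obs, k_small):.2f} vs {aic(rss_large, n_obs, k_large):.2f}")
print(f"BIC:         {bic(rss_small, n_obs, k_small):.2f} vs {bic(rss_large, n_obs, k_large):.2f}")
print(f"F statistic: {f_value:.3f}")
```

In this invented example the larger model has the higher raw R² but loses on adjusted R², AIC and BIC, and the F statistic is below 1. The F statistic would need to be compared with the appropriate critical value of the F distribution, which the sketch does not compute.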

This is a good point at which to say something about validation of ML models and I would argue that it is actually very difficult to demonstrate objectively that one protocol for validation is better than another. Two general approaches for validation of ML models are to use cross-validation and to split data into a training set and an external test set (that the model never sees). A view that I’ve held since the late 1990s is that many ‘global’ models for predicting properties of compounds relevant to drug discovery are actually ensembles of local models (this view was expressed publicly in the B2009 study). I would anticipate that clustering in data sets will cause cross-validation to give optimistic assessments of model quality which in turn can lead to overfitting. I would also expect principal component analysis (PCA) to be less meaningful for highly clustered data (this is relevant because correlations between chemical structure descriptors need to be accounted for in order to calculate meaningful distances between chemical structures in the space). Something that I do need to make clear is that ‘clustering’ in the context of this post simply refers to distribution within the chemical structure descriptor space of a model.

The Authors of A2025 recommend “using a 5 × 5 repeated cross-validation procedure to sample the performance distribution” and one point that I’ll make is that they haven’t demonstrated that this protocol is more effective than 4 × 4 repeated cross-validation or 6 × 6 repeated cross-validation. While this might appear to be nit-picking I will point out that it would not be valid to invoke A2025 if criticising a future ML modelling study for using 4 × 4 repeated cross-validation (bear in mind that a substructural match against even a single PAINS filter would be considered by some to constitute the basis for a valid criticism in medicinal chemistry and K2017 might be of interest in this context).
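
A minimal sketch of repeated k-fold splitting might help here (this is a generic illustration written with the Python standard library and I am not suggesting that it reproduces the A2025 implementation; the function name and the data set size of 20 are assumptions). The number of folds and the number of repeats are simply arguments, and nothing in the procedure itself singles out 5 × 5.

```python
import random

def repeated_kfold(n_items, n_folds, n_repeats, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    for _ in range(n_repeats):
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for fold in range(n_folds):
            test = sorted(indices[fold::n_folds])  # every n_folds-th shuffled item
            test_set = set(test)
            train = [i for i in range(n_items) if i not in test_set]
            yield train, test

# The same 20 hypothetical compounds split in two different ways.
splits_5x5 = list(repeated_kfold(n_items=20, n_folds=5, n_repeats=5))
splits_4x4 = list(repeated_kfold(n_items=20, n_folds=4, n_repeats=4))

print(len(splits_5x5), "splits with test sets of size", len(splits_5x5[0][1]))
print(len(splits_4x4), "splits with test sets of size", len(splits_4x4[0][1]))
```

Each split would be used to fit a model on the training indices and score it on the test indices, giving 25 performance values for 5 × 5 and 16 for 4 × 4.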

The general approach to cross-validation is to repeatedly split the data into training sets and test sets before assessing how well on average the test data are predicted (algorithms differ as to exactly how this is done). When there is a high degree of clustering the data splits are likely to retain some members for each cluster in the training sets which can ‘anchor’ the models. Here’s what H2004 has to say: 

If the collection of compounds consists of, or includes, families of close analogues of some smaller number of ‘lead’ compounds, then a sample reuse cross-validation will need to omit families and not individual compounds.
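
The difference between omitting individual compounds and omitting families can be sketched as follows (the family labels are invented and the function names are my own; in practice the families would have to be assigned by clustering or by chemical series, which is not a trivial step).

```python
def leave_family_out(families):
    """Yield (held_out_family, train_indices, test_indices), omitting one whole family at a time."""
    for held_out in sorted(set(families)):
        test = [i for i, f in enumerate(families) if f == held_out]
        train = [i for i, f in enumerate(families) if f != held_out]
        yield held_out, train, test

def shared_families(families, train, test):
    """Families that have members on both sides of a split."""
    return {families[i] for i in train} & {families[i] for i in test}

# Hypothetical family labels for nine compounds (three series of close analogues).
families = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

# A split that omits individual compounds: one member of each family is held out.
individual_test = [0, 3, 6]
individual_train = [i for i in range(len(families)) if i not in individual_test]
leaked = shared_families(families, individual_train, individual_test)
print("families on both sides when omitting individuals:", sorted(leaked))

# Splits that omit whole families: no family ever appears on both sides.
family_splits = list(leave_family_out(families))
for held_out, train, test in family_splits:
    print(held_out, "held out; shared families:", sorted(shared_families(families, train, test)))
```

When individual compounds are omitted every test compound still has close analogues in the training set, which is the ‘anchoring’ effect described above.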

Another approach to validating ML models is to use external test sets although this can still lead to optimistic assessments of model quality when the available data are highly clustered. One advantage of this approach to validation is that external test sets can be ‘structured’ to provide a more detailed view of model performance (one criticism that I would make of cross-validation is that it gives a rather ‘one-dimensional’ assessment of model performance). One way to structure test sets is to characterize (by size and closeness) the neighbourhood within the training set for each object in the test set. The motivation for structuring the test sets in this manner is that it enables you to analyse relationships between prediction performance and the degree of coverage of space around test set objects by training set data. There are, however, other ways to structure test sets and my view is that classifying test set compounds according to whether they are neutral, cationic or anionic would potentially be informative when assessing models for log D, aqueous solubility, permeability, plasma protein binding, volume of distribution and hERG blockade. Although it’s not directly relevant to this post I would generally recommend that ML model predictions be presented to users along with training set data for the nearest neighbours in the model space and the most similar chemical structures in the training set.
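
Characterizing neighbourhoods by size and closeness could be sketched like this (the two-dimensional coordinates, the radius and the use of plain Euclidean distance are all assumptions made for illustration; a real model space would have many more dimensions and correlations between descriptors would need to be accounted for).

```python
import math

def neighbourhood(test_point, training_points, radius):
    """Return (number of training points within radius, distance to the nearest training point)."""
    distances = [math.dist(test_point, t) for t in training_points]
    return sum(d <= radius for d in distances), min(distances)

# Hypothetical training set: a tight cluster of three objects plus one outlier.
training = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]

# Two hypothetical test objects, one inside the cluster and one far from everything.
test = {"well_covered": (0.5, 0.5), "isolated": (9.0, 9.0)}

profile = {name: neighbourhood(point, training, radius=1.0) for name, point in test.items()}

for name, (size, closest) in profile.items():
    print(f"{name}: {size} training neighbours within radius, nearest at {closest:.2f}")
```

Test set objects can then be binned by neighbourhood size or nearest-neighbour distance and prediction errors reported separately for each bin.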

This is a good point at which to wrap up and I concede that it’s difficult to account for numbers of adjustable fitting parameters and to meaningfully validate models when distributions of objects within the relevant chemical spaces are very uneven. That said, I would argue that creators of ML models do at least need to acknowledge these issues given that many tout models like these as essential for AI-based drug design.

Anticipating a future blog post on chemical space coverage I'll finish the post by noting that coverage is also of historical relevance. The B-52 in the photo is not in the best state of repair and this shouldn't surprise you because I took the photo during a 2005 visit to Hanoi. In those days it was considered to be good form to show disrespect for the enemy's military hardware and so I gave the wreckage a good kick. I also paid my respects to Uncle Ho whom I’m told is in much better shape than Chairman Mao (owing to the then frosty Sino-Soviet relations the latter was pickled by inexperienced compatriots rather than by the Russian experts who had pickled the former and it is said that the embalming team arrived from Moscow before Uncle Ho had actually expired). A few days later in Dien Bien Phu I caused minor consternation by demonstrating that the barrel of an American-made 155 mm howitzer that had been captured from the French in 1954 could still be elevated (admittedly it was a little stiff). Apparently, the French had asked the Americans if they would be so kind as to drop lots of bombs (or perhaps one very big bomb) on the Viet Minh but President Eisenhower wisely denied the request. The B-52 in the photo was one of a number sent by President Nixon (who had been President Eisenhower’s VP) to bomb North Vietnam during Operation Linebacker II (aka the Christmas Bombings) and it's my understanding that all crew members survived their encounter with the SAM.

Wednesday, 1 April 2026

PAINS and Prejudice

PAINS (pan assay interference compounds) filters have exerted a hold over the drug discovery community ever since the BH2010 study appeared over 15 years ago. Initially I didn’t take much notice of PAINS filters and, in any case, I’d already moved on from analysis of high-throughput screening (HTS) output by that point (I might add ‘thankfully’ because looking at too much HTS output is a sure-fire route to the funny farm). I started analysing HTS output in about 1993 at what was then Zeneca. I used the Daylight toolkit to create the Struct_Anal SMARTS-based chemical structure profiler in 1995 and, at that time, we were already using in-house software named Flush (even at that stage it was clear that much of the HTS output being generated was going to disappear round the S-bend and our friends at what was then Rhône-Poulenc Rorer developed HARPick to ensure that nothing remained stuck to the porcelain).

Photo from 2011 at 'The Black Hole' (Los Alamos NM)

Something that had always worried me was that it was very easy to opine that a compound looked nasty but it was much more difficult to demonstrate objectively that the compound was indeed nasty. Late in 2014 a blog post, which fell well short of the standards that the drug discovery community has come to expect from Practical Fragments, prompted me to take a more forensic look at PAINS filters. What I found was that PAINS filters were based on the output from screening compounds in just six AlphaScreen assays (if a panel of six assays that all use the same read-out strikes you as suboptimal design of an experiment to detect pan-assay interference then you’re not alone). After blogging periodically about PAINS filters for a couple of years I wrote a Perspective on the topic (as noted in this blog post: from time to time, every blogger should write a journal article “pour encourager les autres”).

Nevertheless, doubts about the correctness of my position started to creep in when I was denounced for being insufficiently thoughtful in my published comments on PAINS by the authors, one of whom is a former colleague, of the seminal, insightful and Nobel-worthy ‘Seven Year Itch’ article (BN2017) which oozes wisdom and penetrating insight. Although stung by the criticism and wracked by self-doubt to the extent that I considered therapy, it was a recent study led by the world-renowned expert on tetrodotoxin pharmacology, Prof. Angelique Bouchard-Duvalier of the Port-au-Prince Institute of Biogerontology, working in collaboration with the Budapest Enthalpomics Group (BEG), that removed any lingering doubts about the sublime elegance and extreme predictivity of PAINS filters. The manuscript has not yet been made publicly available although I was able to access it with the help of my associate ‘Anastasia Nikolaeva’ (not sure exactly what she’s doing these days although I understand that she’s currently visiting Port-au-Prince for a medication review with Prof. Bouchard-Duvalier). There is no doubt that this genuinely disruptive study will comprehensively reshape the predictive biochemistry landscape, enabling drug discovery scientists to accurately, meaningfully and robustly predict assay interference using only chemical structures as input for the very first time.

Prof. Bouchard-Duvalier’s seminal study clearly demonstrates that singlet oxygen quenching is actually a conserved feature for all known and unknown mechanisms of interference with assay read-outs and that PAINS filters dramatically outperform all other methods for prediction of assay interference. The math is truly formidable (the rudimentary nature of my understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the quadrupole-normalized polarizability tensor before applying the Barron-Samedy transformation, followed by hepatic eigenvalue extraction using the elegant algorithm devised by E. V. Tooms (a reclusive Baltimore resident and connoisseur of liver pâté whose illustrious thought leadership of the analytic topology field unravelled almost 32 years ago after he failed to comply with the safety instructions for an escalator). The incisive analysis of Prof. Bouchard-Duvalier shows without a shadow of doubt that singlet oxygen quenching as quantified by the AlphaScreen assay read-out is a fundamental principle in biomolecular assay science. Furthermore, ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which the grinning BEG director Prof. Kígyó Olaj explains:

Possibilities are limitless now that we can accurately and robustly predict the assay interference that compounds will exhibit directly from their chemical structures and we can safely consign experimental biochemical assays to the dustbin of history. Surely the Journal of Medicinal Chemistry Editors will now finally recognize the colossal impact that PAINS filters have made on real world drug discovery and development when they make their FIFA Prize nominations later this year.