Molecular Design: 2026

Monday, 27 July 2026

The OpenADMET initiative

“Quantity has a quality all of its own”

Original source unknown

The quote with which I’ve opened the post on the OpenADMET initiative is often attributed to Joseph Stalin and some suggest that he might have been referring to the T-34. While the safety and comfort of crews were not priorities for Soviet tank designers, the T-34 was a better tank than the quote might be taken to imply. In particular, the distinctive sloped armour presented challenges for Wehrmacht anti-tank gunners (at least until the introduction of the formidable 88 mm PAK 43) while the wider tracks of the T-34 enabled it to operate effectively in conditions that the more refined (and heavier) Tiger could not. Here are some photos of T-34s taken at the Brest Fortress and the War Memorial of Korea:

The principal objective of the OpenADMET initiative appears to be generation of measured data to enable machine learning (ML) models to be built for prediction of toxicity and absorption, distribution, metabolism, and excretion (ADME). Just as the success of the T-34 on the Eastern Front was not just down to being available in huge numbers, there is a bit more to data generation than simply generating massive data sets. A significant challenge for initiatives such as OpenADMET is covering chemical space at a sufficiently fine resolution and with sufficient spread in the measured data to enable building of ML models that can predict accurately across a diverse range of chemotypes. As I’ve noted previously, drug discovery project teams have delivered (and continue to deliver) clinical development candidates without ever having sufficient data for building ML models that can accurately predict all the quantities of interest to the project teams. While I'll be criticising aspects of the OpenADMET initiative in this post, it must be stressed that I do see great value to drug discovery in making relevant data freely available in the public domain (and Open Science in general).

Previously I suggested that drug design can be thought of in terms of three objectives and the OpenADMET initiative addresses the second (minimize off-target bioactivity) and third (maximize controllability of exposure) of these. As I argued in that previous post Absorption, Distribution, Metabolism, Excretion (ADME) and Toxicity (T) should be seen as separate issues in the design context (my view is that using the term ADMET projects a lack of familiarity with the practical realities of drug design). Put another way toxicity is something that drugs do to the human body while ADME determines what the human body does to drugs (pharmacy students typically encounter the distinction between pharmacodynamics and pharmacokinetics early in their training). This is a good point at which to mention the Avoid-ome and here’s a post from a couple of years ago. While I agree that toxicity fits naturally into an Avoid-ome framework, I'm unconvinced that the introduction of the term is the Great Leap Forward that some believe it to be. However, ADME issues generally cannot be accommodated within an Avoid-ome framework because ADME-based design is ultimately about control of exposure (concentration of drug in contact with the target or anti-target) and not about avoidance.

This is a good point at which to take a look at the recent F2026 article (Mapping the avoid-ome: a systematic open-science approach to predictive ADMET). As is customary for posts at this blog quoted text is indented with my comments enclosed within square brackets in red italics, and I’ve used the F2026 reference numbers.

I’ll start with the abstract:

Drug discovery often fails due to unpredictable ADMET issues, which account for 30% of clinical setbacks. [While fully agreeing that toxicity and poor ADME are important issues that do need to be addressed more effectively I do not consider references (3) (4) (5) to support what the authors have stated here or their claim that “more than 90% of molecules created during discovery fail to meet basic ADME standards”. For example, reference (3) states that “the major causes of attrition in the clinic in 2000 were lack of efficacy (accounting for approximately 30% of failures) and safety (toxicology and clinical safety accounting for a further approximately 30%)”. All that said, decisions to take compounds into clinical development are based on measurements made in a range of assays and failure in clinical development reflects an inability of these assays to predict clinical outcomes. To more effectively address attrition we actually need new assays that are more predictive of outcomes in clinical development as opposed to new ML models that are more predictive of quantities that will need to be measured anyway.] Conventional methods lack the atomistic detail needed to navigate the “Avoid-ome”—a finite set of proteins acting as “anti-targets”. OpenADMET is an open-science initiative addressing this by creating pre-competitive, mechanistic datasets. [With respect to "atomistic detail" it's important to bear in mind that structures for transition states (relevant when the quantity of interest is rate of turnover by metabolic enzymes) cannot actually be observed in experimental protein structural studies.] Using high-throughput structural biology, active learning, and community challenges, it builds generalizable models grounded in structural “ground truth”.
[I would question the wisdom of invoking “ground truth” in scientific studies because it brings to mind stirring sermons on themes like "Pastor needs an additional private jet" and invocation of “ground truth” will endow your arguments with a distinctly pastoral odour. Building truly generalizable ML models for the quantities of interest to drug designers would certainly be of great value in drug discovery but it will take much more than anointing models with "ground truth" to achieve this.] By directly studying the Avoid-ome, OpenADMET facilitates an era of rational, multi-parameter drug design. [My view is that drug design should be described as ‘multi-objective’ rather than ‘multi-parameter’ because design against a single objective such as affinity maximisation can still be multi-parameter in nature, and I consider it tautological to describe drug design as ‘rational’.]

Here's the conclusion to F2026:

Understanding and navigating the Avoid-ome is the central universal challenge of modern drug discovery. [My view, which might be shared by a few others, is that the principal challenge for drug discovery is (and has always been) the uncertainty that results from the complexity of human biology and here’s a relevant blog post by "That Dude That Says That AI Drug Discovery Isn't So Amazing".] By creating open, structural, and mechanistic datasets and benchmarking predictive models through blind challenges, OpenADMET provides a practical foundation for a new era of rational drug design. [I am a big fan of Open Science and see a significant value in making data relevant to drug discovery freely available. At the risk of repetition the term “rational drug design” is tautological and the promise of (yet another) “new era” will trigger the eye-rolling reflex for many experienced drug hunters. Given the aspirational nature of the initiative at this stage I think it would be more accurate to state that "OpenADMET aims to provide" rather than "OpenADMET provides".] The best way to increase the effectiveness of drug discovery in the coming decade is to stop avoiding the Avoid-ome and instead study it directly. [While I certainly see benefits from a deeper understanding of the proteins that cause toxicity and influence ADME, I don’t see doing so as quite the panacea that the authors of F2026 would have us believe it to be. I argued over two years ago that the Avoid-ome is generally not a useful concept for consideration of ADME issues and my view is that shackling OpenADMET to the Avoid-ome has actually reduced the scientific credibility of the initiative. My advice to those leading the OpenADMET initiative is to unshackle it from the Avoid-ome before it's too late.]

To be fair, the authors of F2026 do concede that there is a lot more to ADME optimization than ensuring that anti-targets are not engaged, although I would still challenge their assertion that "understanding and navigating the Avoid-ome is the central universal challenge of modern drug discovery".

While our framework heavily emphasizes the specific protein anti-targets of the Avoid-ome, we recognize that fundamental physicochemical and integrative properties—such as aqueous solubility, membrane permeability (logD), and metabolic stability are major drivers of ADMET outcomes, particularly for absorption and excretion [absorption and excretion would be more accurately described as 'ADME outcomes' than 'ADMET outcomes']. Although these factors are not mediated by a single anti-target, [these factors are not mediated by anti-targets] they are critical bulk molecular properties [I consider the term "bulk molecular properties" to be an oxymoron] that often dictate whether a compound succeeds or fails.

Having got the general stuff out of the way I’ll examine the OpenADMET initiative from the perspectives of both ADME and toxicity, starting with the latter. My view is that any off-target bioactivity is undesirable given the complexity of human biology although bioactivity against known anti-targets such as hERG is clearly unacceptable. It’s also important to take account of the concentration at which off-target effects are observed and a weakness shared by many (most?) studies of pharmacological promiscuity is that bioactivity thresholds are set far too permissively to have any physiological relevance whatsoever (LS2007 classifies compounds that exhibit >30% inhibition at 10 µM as 'active' and, more recently, FOM2025 states that "Mestres et al. (p4) anticipated that the average number of proteins with which a drug interacts with potentially relevant bioactivity (<10 µm) was close to six").
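To put the LS2007 threshold in perspective, here is a rough sketch of the IC50 that a single-point inhibition measurement implies. It assumes a simple Hill model with a slope of 1 (my assumption, not something stated in LS2007), and the function name is purely illustrative.

```python
def implied_ic50_um(percent_inhibition: float, conc_um: float, hill: float = 1.0) -> float:
    """IC50 (in µM) implied by a single-point % inhibition value.

    Assumes the Hill equation, fraction = C^h / (C^h + IC50^h), which is an
    idealization (one site, no solubility or assay interference artefacts).
    """
    if not 0.0 < percent_inhibition < 100.0:
        raise ValueError("percent inhibition must lie strictly between 0 and 100")
    fraction = percent_inhibition / 100.0
    # Rearranging the Hill equation gives IC50 = C * ((1 - f) / f)^(1/h)
    return conc_um * ((1.0 - fraction) / fraction) ** (1.0 / hill)


# The 'active' threshold quoted above: >30% inhibition at 10 µM
threshold_ic50 = implied_ic50_um(30.0, 10.0)
print(f"30% inhibition at 10 µM implies an IC50 of roughly {threshold_ic50:.1f} µM")
```

Under these assumptions a compound only needs an IC50 in the region of 23 µM to be counted as 'active', which is one way of seeing how permissive such a threshold is.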

It’s perhaps appropriate to take a general look at QSAR/QSPR approaches given that the main focus of OpenADMET appears to be generation of data for training what could be referred to as 'QSAR-like' or 'QSPR-like' ML models. In my view, the impact of QSAR/QSPR modelling on real world drug discovery was limited and claims to the contrary are generally not verifiable. A difficulty faced by those advocating the use of QSAR/QSPR approaches was that projects had either delivered or been put out of their misery by the time there was sufficient data for building predictively useful models. My view is that modern ML models, like the QSAR and QSPR models that preceded them, generally can't extrapolate out of the chemical spaces in which they were trained. While I certainly wouldn’t claim ground truth for this view, I’m not aware of any studies in which a QSAR or QSPR model built using only data from one structural series was convincingly shown to be usefully predictive for compounds in a different structural series. Medicinal chemists typically perform their optimizations within specific structural series and this means that structure-activity/property relationships (SARs/SPRs) tend to be local in nature. For users of ML models of bioactivity and other properties of compounds it is important to know whether chemical structures for which predictions are being made lie within the applicability domains of the models. Put another way, medicinal chemists who use ML models are generally much more interested in how well the models will predict for the structural series that they're working on and much less interested in how well the models have fit the training data (anybody who has received financial advice will be familiar with the "past performance is not indicative of future results" disclaimer). The selection criteria for assays and compounds by the OpenADMET initiative are not currently clear.
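One simple (and far from the only) way to think about an applicability domain is in terms of similarity to the nearest training set compound. The sketch below assumes that structures have already been encoded as sets of fingerprint bits by some cheminformatics toolkit. The 0.5 Tanimoto cut-off, the function names and the toy fingerprints are all illustrative assumptions rather than anything proposed by OpenADMET.

```python
from typing import FrozenSet, Iterable

# A fingerprint is represented here as the set of 'on' bit indices;
# how the bits are generated is deliberately left open.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_similarity_to_training(query: Fingerprint, training: Iterable[Fingerprint]) -> float:
    """Similarity of the query to its nearest neighbour in the training set."""
    return max((tanimoto(query, fp) for fp in training), default=0.0)


def in_domain(query: Fingerprint, training: Iterable[Fingerprint], threshold: float = 0.5) -> bool:
    """Crude applicability-domain check; the threshold is an arbitrary choice."""
    return max_similarity_to_training(query, training) >= threshold


# Toy data: two training compounds, one close analogue and one new chemotype
training_set = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8})]
close_analogue = frozenset({1, 2, 3, 9})
new_series = frozenset({20, 21, 22, 23})

print(in_domain(close_analogue, training_set))  # nearest-neighbour similarity is 3/5
print(in_domain(new_series, training_set))      # shares no bits with the training set
```

A check like this says nothing about whether a prediction is correct. It only flags predictions being made for structures that the model has never seen anything like.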

I see a degree of overlap between the OpenADMET and OpenBind initiatives in that safety assessment will often require prediction of binding affinity of anti-targets for compounds being considered for synthesis. Indeed, there is no reason that structures for complexes of anti-targets with ligands should be excluded from data sets when the objective is to build universal ML models for prediction of binding affinity. Nevertheless, categorical models for prediction of off-target bioactivity still have value in hit-to-lead work and lead optimization whereas categorical models for predicting on-target bioactivity are generally only useful during hit identification.

While it will often prove feasible to build ML models for binding affinity of anti-targets for ligands, many users will want to know whether the chemical structures of interest to them lie within the applicability domains of the models. As discussed in my post on the OpenBind initiative, using geometric features in protein-ligand complexes as descriptors to train ML models can potentially enable accurate affinity predictions to be made for chemotypes that are not represented in the training data. My view is that those leading the OpenADMET initiative will need to be more explicit about how (or even whether) they propose to use protein structural information for affinity prediction. Not all off-target effects of drugs can be specified in terms of reversible binding (as is the case for PXR activation which forms the basis for the current OpenADMET challenge) and this is generally more likely to be an issue when attempting to build models for off-target bioactivity.

Let's now take a look at ADME from the perspective of ML modelling. As noted earlier in this post, ADME and toxicity are completely different issues in drug design and using the term ‘ADMET’ conveys (at least to me) an impression that some of the practical realities of drug discovery have not been properly understood. Drug action is driven by the concentration of the drug at its site(s) of action (the term exposure is commonly used) and one of the practical realities of drug discovery is that drug concentration at sites of action generally cannot be measured in vivo unless binding sites are directly in contact with plasma (see post on the objectives of drug design). I also suggest that readers take a look at SR2019 (Smith and Rowland, Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise | DMD 2019 47:665-672) that I recommend to everybody working in drug discovery and chemical biology. While maximization of affinity is a legitimate design objective, exposure is something that needs to be carefully controlled rather than simply maximized (I'm guessing that Paracelsus might have cautioned against maximization of exposure and he’s been dead for almost half a millennium).

Pharmacokinetic/pharmacodynamic (PK/PD) modelling (DM1999 | D2008 | R2008 | NCF2011 | W2015 | SS2017 | BL2023 | HR2024 | C2024) is used in the later stages of drug discovery to predict the (dose-dependent) effects of drugs on humans. The inputs for PK/PD modelling are predicted human pharmacokinetic profiles (typically generated from PK profiles observed in animal studies) and bioactivity measurements. In PK/PD modelling it is usual to invoke the free drug hypothesis (SYF2022 | W2025) by assuming that the concentration of a drug at its site of action is equal to the unbound concentration of the drug in plasma (the terms ‘free drug principle’ and ‘free drug theory’ are also used although I prefer ‘free drug hypothesis’ because it’s an assumption that is being made). There are two scenarios under which this assumption is known to break down. First, there is active transport at one or more points in the path between the drug’s site of administration and its site of action. Second, the drug is ionizable and its site of action is within a compartment where the pH differs from plasma pH (basic centres not required for binding to the target are generally not recommended when targeting lysosomal enzymes if you’re concerned about selectivity).
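The second scenario can be illustrated with a textbook pH-partition estimate. This is a minimal sketch that assumes a monoprotic base, that only the neutral form crosses the membrane, and that the unbound neutral form has reached the same concentration on both sides. The lysosomal pH of 5.0 and the pKa values are illustrative numbers that I've picked for the example.

```python
def base_accumulation_ratio(pka: float, ph_compartment: float, ph_plasma: float = 7.4) -> float:
    """Ratio of total unbound (neutral + protonated) concentration of a
    monoprotic base in a compartment to that in plasma.

    Henderson-Hasselbalch gives [BH+]/[B] = 10^(pKa - pH) on each side and the
    neutral form [B] is assumed to be equal on both sides of the membrane.
    """
    ionized_to_neutral_compartment = 10.0 ** (pka - ph_compartment)
    ionized_to_neutral_plasma = 10.0 ** (pka - ph_plasma)
    return (1.0 + ionized_to_neutral_compartment) / (1.0 + ionized_to_neutral_plasma)


# Illustrative comparison: a strongly basic amine versus a very weak base
strong_base = base_accumulation_ratio(pka=9.0, ph_compartment=5.0)
weak_base = base_accumulation_ratio(pka=2.0, ph_compartment=5.0)
print(f"pKa 9 base, lysosome/plasma ratio: {strong_base:.0f}")
print(f"pKa 2 base, lysosome/plasma ratio: {weak_base:.2f}")
```

Under these idealized assumptions the strongly basic compound accumulates in the acidic compartment by a couple of orders of magnitude whereas the weak base does not, which is why the free drug hypothesis cannot be expected to hold for the former.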

The ADME acronym refers to in vivo phenomena and physicochemical properties (lipophilicity, aqueous solubility, passive membrane permeability) and in vitro biochemical quantities (turnover by metabolic enzymes, active efflux) should be described as ‘ADME predictors’ and not ‘ADME properties’. My view (which I’ll be happy to change in the light of compelling evidence) is that it isn’t currently possible to predict in vivo plasma concentration to the level of accuracy required for PK/PD modelling if you’re only using in vitro ADME predictors. That said, I’m certainly not suggesting that genuinely predictive models for aqueous solubility, membrane permeability and turnover by metabolic enzymes are without value in drug discovery projects.

In a previous post I argued that drug designers should aim to maximize controllability of exposure and to do so requires a focus on pharmacokinetic profile rather than individual ADME predictors. In many cases the ideal pharmacokinetic profile would be that resulting from intravenous infusion whereby the plasma concentration of a drug is maintained at the minimum level required for therapeutically useful effects (the variation in plasma concentration over the dosing interval for an orally-dosed drug is actually undesirable from the perspective of simultaneously achieving efficacy and safety). While ML models for quantities such as aqueous solubility, membrane permeability and turnover by metabolic enzymes certainly have a place in drug design, decisions as to whether a compound should be evaluated in vivo will generally be based on measured values rather than predictions from ML models.

Let’s now take a look at ADME in the context of the Avoid-ome. To be fair, the authors of F2026 do concede that ADME doesn’t quite fit into the Avoid-ome framework although this does rather beg the question as to why they assert that “understanding and navigating the Avoid-ome is the central universal challenge of modern drug discovery”. One fundamental problem with the F2026 article is that its authors have got themselves in a bit of a tangle with respect to how they’ve defined anti-targets. In drug discovery, the term ‘anti-target’ generally refers to a protein which is associated with the risk of toxicity when engaged in vivo. The authors of F2026 appear to be broadening the 'anti-target’ definition to include proteins that they believe have detrimental effects on the ADME behaviour of compounds and, in my view, it is neither valid nor useful to do so.

Let’s take a look at Fig. 1 (The set of protein anti-targets that comprise the Avoid-ome) in F2026 and you’ll see that a number of transporters are included as anti-targets in the graphic. While I certainly agree that active efflux is generally undesirable from the perspective of achieving adequate exposure, transporters would not automatically be regarded as anti-targets from the safety perspective. That said, inhibition of bile salt export pump (BSEP) is considered a risk factor for drug-induced liver injury (DILI) and here's a link to a relevant article.

The pitfalls associated with broadening the definition of ‘anti-target’ are brought sharply into focus when metabolic enzymes enter the picture. Inhibition of CYPs is widely accepted as a safety issue because of the potential for drug-drug interactions (LL1998 | H2020 | L2024) but CYPs have been labelled in Fig. 1 of F2026 as metabolism (M) anti-targets. While I certainly agree that high clearance makes it difficult to maintain exposure at levels required for therapeutic benefits, a recommendation that clearance be entirely avoided would generally be regarded by drug safety experts as very bad advice indeed (remember why CYP inhibition is considered to be a safety issue). As a reviewer of F2026 I would have pressed its authors to explain why they had classified the aryl hydrocarbon receptor (AhR) in Fig. 1 as a metabolism (M) anti-target given that the toxicity of dioxin results from engagement of this receptor (M2005 | S2014).

The authors of F2026 include serum albumin (HSA) in the Avoid-ome and it’s fair to say that plasma protein binding (PPB) is a source of confusion even for medicinal chemists (there would probably be much less confusion if unbound plasma concentration was measured directly in PK studies and I’ll direct readers to the SDK2010 article). The recent B2025 study (What Do Oral Drugs Really Look Like? Dose Regimen, Pharmacokinetics, and Safety of Recently Approved Small-Molecule Oral Drugs) reports that "most drugs are >95% plasma protein bound (58%), with a large fraction >99% bound (29%)" which appears to contradict the view expressed in F2026 that HSA should be considered as an anti-target. I would argue that PPB is potentially beneficial in that it protects the drug to some extent during first pass metabolism (distribution from plasma also protects the drug from metabolism but only after the drug has first passed through the liver). In the design context a prediction of unbound fraction is unlikely to be of much interest before a compound has been synthesized while measured values will be used in conjunction with pharmacokinetic studies.

This is a good point at which to wrap up. While OpenADMET is likely to generate useful data and models, I don’t see it as a game-changer and consider the claim that the initiative “provides a practical foundation for a new era of rational drug design” to be fanciful. Something that comes across from my reading of F2026 is a general lack of expertise in the areas of drug safety, pharmacokinetics and PK/PD modelling, and I urge those leading OpenADMET to get some experts on board (or to listen more carefully to the experts who are already on board). I have argued that the Avoid-ome is neither valid nor even useful as a framework for ADME-based drug design and my advice to those leading OpenADMET is to quietly drop the Avoid-ome.

Monday, 22 June 2026

The OpenBind initiative

I’ll open the post on the OpenBind initiative with photos from my visit last year to Korea which was timed to coincide with the cherry blossoms (this meant that the customary April Fools post was from Seoul). Things did not start well on the day that I took these photos (having lined up the first shot for the day it became abundantly clear that the camera’s battery was still being charged at the hotel) and I wondered whether Great Leader’s grandson might have labelled me as a dotard. Fortunately, Seoul’s Metro is excellent and I was still able to get some photos at Huiujeong-ro Cherry Blossom Road and Yangjaecheon Stream.

In this post I’ll be taking a look at the OpenBind initiative and here's a summary of the concept. I certainly see great value in having large quantities of this type of data (affinity measurements with X-ray crystal structures for the corresponding protein-ligand complexes) available to the drug discovery and chemical biology communities. The grating-coupled interferometry (GCI) protocol used for affinity measurement enables association and dissociation to be observed in real time and presumably it is also possible to characterize stoichiometry using this technique. I would expect the GCI protocol to enable weaker binding affinities to be reliably quantified (likely to increase the dynamic range of the assay) as well as allowing measurement of binding affinity of glycoproteins for ligands. Given the focus on enabling affinity prediction, there is no reason for excluding anti-targets or non-human proteins.

Generation of data for training machine learning (ML) models, which are renowned for their voracious appetite for data, appears to be the principal aim of the initiative. However, the availability of large quantities of such data will also enable more extensive evaluation of physics-based methods for calculating binding affinity and can potentially inform hypothesis-driven design by identifying bioisosteric relationships between elements of substructure. One point worth making is that having affinity measurements linked to protein-ligand structures for structurally-related compounds of varying molecular complexity (see HLH2001) enables frustration of molecular interactions to be studied (this is particularly relevant to fragment-based design) and I discussed in HBD3 how frustration of hydration might be exploited in design. Given the importance of aqueous solvation in biomolecular recognition it may be beneficial to measure some alkane/water partition coefficient values and I'll point you to a post on this topic in case it's of interest. As discussed in KMP2013 and B2017 polarity parameters can be derived from alkane/water partition coefficient measurements for functional groups.

I've suggested that there are three objectives to drug design and the OpenBind initiative addresses the first of these which is to maximize on-target bioactivity. It's worth noting that proteins are not the only drug target class of interest (see CD2022) while bioactivity for ‘new modalities’ such as targeted protein degradation (see CC2026) and irreversible covalent bond formation between targets and ligands cannot be quantified in terms of affinity alone. My view is that OpenBind would be more accurately described as an initiative for ligand discovery than for drug discovery given its focus on enabling methods for affinity prediction. Modern ML models for affinity prediction are effectively quantitative structure-activity relationship (QSAR) models and I would question whether the use of the AI label is justified in either case. All that said, I would expect OpenBind to catalyse significant progress in the affinity prediction field which hopefully will translate to tangible benefits for drug discovery.

It’s perhaps appropriate to take a general look at QSAR approaches given that the main focus of OpenBind appears to be generation of data for training what could be referred to as 'QSAR-like' ML models. In my view, QSAR modelling never made much of a splash in real world drug discovery and claims that particular models have made significant impact on drug discovery projects are generally not verifiable. A difficulty faced by QSAR practitioners was that projects had delivered or been put out of their misery by the time there was sufficient data for building predictively useful models. Medicinal chemists typically perform their optimizations within specific structural series and this means that structure-activity relationships (SARs) tend to be local in nature (I’m not aware of any studies in which a QSAR model built using only data from one structural series was convincingly shown to be usefully predictive of bioactivity for compounds in a different structural series). For users of ML bioactivity models it is important to know whether chemical structures for which predictions are being made lie within the applicability domains of the models. Put another way, medicinal chemists who use ML models are generally more interested in how well the models predict for the structural series that they're working on and less interested in how well the models have fit the training data (anybody who has received financial advice will be familiar with the "past performance is not indicative of future results" disclaimer). The selection criteria for inclusion of targets and ligands by the OpenBind initiative are not currently clear and I'm guessing that large scale structural determination might prove challenging for membrane proteins.

The availability of affinity measurements that are linked to X-ray crystal structures for the corresponding protein-ligand complexes enables affinity to be modelled in terms of the molecular interactions between proteins and their ligands. This is the approach used to create the scoring functions used in virtual screening and it provides a means to address the local nature of SARs. While this might seem to be an obvious way to model affinity data it's important to be aware that the contribution to affinity of an individual contact, such as a hydrogen bond, between the protein and ligand is not an experimental observable (see NoLE). Put another way, there is no unique way of decomposing a value of ΔG° (standard Gibbs free energy of binding) into a sum of terms based on individual noncovalent contacts between the protein and ligand. One reason for this is that association of proteins with their ligands occurs in aqueous media and this point has been clearly articulated in the S2012 study:

Molecular binding in an aqueous solvent can be usefully viewed not as an association reaction, in which only new intermolecular interactions are introduced between receptor and ligand, but rather as an exchange reaction in which some receptor–solvent and ligand–solvent interactions present in the unbound state are lost to accommodate the gain of receptor–ligand interactions in the bound complex.

However, there’s another reason why there’s no unique way to decompose binding free energy into a sum of terms based on individual noncovalent contacts and here’s a well-known equation written a bit differently to how you normally see it written:
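The equation itself is not reproduced in this copy of the post. Assuming that it is the standard relationship between the dissociation constant and the standard free energy of binding, written with the standard concentration C° shown explicitly rather than left implicit, it would take the following form:

```latex
\Delta G^{\circ} = RT \ln\!\left(\frac{K_{\mathrm{d}}}{C^{\circ}}\right)
```

Here R is the gas constant, T is the absolute temperature and K_d is the dissociation constant expressed in the same concentration units as C°.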

This shows that the value of ΔG° varies with the concentration, C°, that defines the standard state. By convention C° is set to 1 M although this is arbitrary and has no physical basis (see G1997) and this means that the binding free energy values encountered by drug discovery scientists are always negative (consider the feasibility of measuring a K_d value of greater than 1 M). Writing ΔG° as a sum of terms based on individual non-covalent contacts is challenging because each term needs to depend on C° while the sum of terms needs to reproduce the dependence of ΔG° on C°. This is discussed in NoLE and the problems can be seen more easily if you think about how you might write K_d as a product of terms based on individual non-covalent contacts. The dependence of ΔG° on C° has implications for interpretability of ML models for binding affinity.
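The dependence on C° is easy to see numerically. This is a minimal sketch that assumes the relationship ΔG° = RT ln(K_d/C°) at 298.15 K with energies in kcal/mol. The function name and the example K_d of 1 nM are illustrative choices.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def standard_binding_free_energy(kd_molar: float,
                                 c_standard_molar: float = 1.0,
                                 temperature_k: float = 298.15) -> float:
    """Standard free energy of binding (kcal/mol) for a chosen standard concentration."""
    return R_KCAL * temperature_k * math.log(kd_molar / c_standard_molar)


kd = 1.0e-9  # a 1 nM binder
dg_1m = standard_binding_free_energy(kd, 1.0)       # conventional 1 M standard state
dg_1mm = standard_binding_free_energy(kd, 1.0e-3)   # 1 mM standard state
dg_1nm = standard_binding_free_energy(kd, 1.0e-9)   # standard state equal to K_d

for label, value in (("1 M", dg_1m), ("1 mM", dg_1mm), ("1 nM", dg_1nm)):
    print(f"C° = {label}: ΔG° = {value:.2f} kcal/mol")
```

The same measured K_d maps to a different ΔG° for each choice of standard state, and ΔG° is zero when C° is set equal to K_d. Any scheme that apportions ΔG° between individual contacts has to say which contacts absorb that shift.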

My understanding is that scoring functions (see GPD2018 | WBS2017 | A2015 | C2012 | S2012 | F2004 | SR2001 | GHK2000 | MM1999 | E1997 | MSK1992) used in virtual screening are generally not predictive of affinity to the extent that they can be routinely used in lead optimization. Perhaps it will be different for Boltz-2 (described in the P2025 preprint) although questions have been raised in BSR2026 as to whether Boltz-2 "truly relies on the physics of intermolecular interactions" and the term “absolute FEP” does ring some alarm bells for me. Various explanations have been offered for the typically underwhelming performance of scoring functions for affinity prediction including the usual suspects (protein flexibility, solvation and entropy). However, a much simpler explanation might be that scoring functions are trained to predict the difference in free energy between two states by only using the structure corresponding to one of the states.

I remain sceptical that it will prove feasible to build genuinely universal models for prediction of binding affinity from structures of protein-ligand complexes although I'll be very happy if my scepticism is shown to be unfounded. Describing energetics of target-ligand interactions in a general manner to enable ML modelling of affinity will be challenging because of the necessity to encode factors such as interaction potential, geometric dependence and solvent exposure (bear in mind that physics-based methods for prediction of affinity are already available and I'll direct readers to the Open Free Energy and open forcefield initiatives). While modelling affinity in terms of molecular interactions circumvents the need for training data to sample every conceivable combination of structural series with target, the need to meaningfully define applicability domains does not disappear. My view is that when affinity datasets for different targets are combined for ML modelling, data should be split at the target level for cross-validation. This would entail splitting data so that each test set consists of only (and all) the data for a single target. I have argued in a previous post that the complexity (for example, number of parameters used to fit the training data) of models should be properly accounted for when comparing performance for ML models.
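The target-level split suggested above can be sketched in a few lines. The example assumes that each affinity record carries a target label, and the labels and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_target_out(targets: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out target, training indices, test indices).

    Each test set contains all of the records for exactly one target and
    nothing else; every other record goes into the training set.
    """
    by_target: Dict[str, List[int]] = defaultdict(list)
    for index, target in enumerate(targets):
        by_target[target].append(index)
    for held_out, test_idx in by_target.items():
        train_idx = [i for t, idx in by_target.items() if t != held_out for i in idx]
        yield held_out, train_idx, test_idx


# Hypothetical target labels for five affinity records
record_targets = ["kinase_A", "kinase_A", "protease_B", "gpcr_C", "protease_B"]
splits = list(leave_one_target_out(record_targets))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

For real work the `LeaveOneGroupOut` splitter in scikit-learn does the same job when the target labels are passed as groups. A further refinement, not shown here, would be to also hold out close homologues of the test target.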

Datasets generated by OpenBind are likely to also prove valuable for testing and development of physics-based approaches to affinity prediction such as use of simulation to calculate ‘absolute’ (ΔG°) and ‘relative’ (ΔΔG) free energy of binding. Physics-based free energy calculations are typically more computationally demanding for ΔG° than for ΔΔG (a view expressed in B2009 is that it's generally easier to predict differences in property values for pairs of structurally-related compounds than it is to predict property values from chemical structures of compounds). Methods for calculating ΔΔG (here’s a helpful review) are especially relevant to drug design because medicinal chemists typically work within structural series, defining SARs in terms of ratios of affinity (or potency) for pairs of structurally-related compounds. Put another way, ΔΔG calculations enable project team scientists to exploit existing project data to predict affinity for potential synthetic candidates and I would argue that ML modellers really do need to be thinking more about prediction of differences in affinity (and other pharmaceutically relevant properties) between structurally related compounds. As an aside, free energy perturbation (FEP) was a major source of inspiration when I started to use the Leatherface (don't ask 😁😁😁) chemical structure editing software to do matched molecular pair analysis (MMPA) in the late 1990s, even though physics-based ΔΔG calculations were still largely seen as academic curiosities at that time.
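The link between ratios of affinity and ΔΔG can be made concrete with a short calculation. The sketch below assumes dissociation constants in molar units, a 1 M standard state and a temperature of 298.15 K; the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def delta_g_standard(kd_molar, temperature=298.15):
    """Standard free energy of binding (kcal/mol) from Kd, relative to a 1 M standard state."""
    return R_KCAL * temperature * math.log(kd_molar)


def delta_delta_g(kd_ref_molar, kd_new_molar, temperature=298.15):
    """Relative free energy of binding (kcal/mol); negative means the new compound binds more tightly."""
    return R_KCAL * temperature * math.log(kd_new_molar / kd_ref_molar)


# A tenfold gain in affinity (100 nM to 10 nM) corresponds to about -1.36 kcal/mol
ddg = delta_delta_g(100e-9, 10e-9)
```

Note that the standard state cancels in ΔΔG, so only the ratio of the two dissociation constants is needed.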

While I’m certainly enthusiastic about physics-based methods such as FEP for calculating ΔΔG it’s not clear how generally these can handle significant modifications to the core of a structure (this is the scaffold-hopping scenario) and I would anticipate difficulties when the main effect of the structural perturbation is to alter conformational preference (as is the case for N-methylation of the secondary amide that is conserved in a number of SARS-CoV-2 main protease inhibitors). That said, the data generation capability of the OpenBind initiative should enable perceived weaknesses in FEP methodology to be addressed. I'll highlight a couple of general ways in which the data sets that OpenBind will generate might be used to validate methods for predicting relative affinity. First, you can use the relative affinity values that correspond to specific structural transformations such as chloro substitution (a good way to study activity cliffs and focusing on specific structural transformations counters criticism that predictive models are just capturing lipophilicity or molecular size), chloro to bromo (a good way to see if you're modelling halogen bonding effectively), and aromatic nitrogen to CH (in design it is useful to determine where polarity can be introduced with minimal loss of affinity). Second, you can use relative affinity measurements to assess how well models predict non-additivity in SARs (non-additivity can also be considered in the activity cliff framework). I should point out that neither of these suggestions is novel (see L2012 and C2016) and ML modellers are already looking at activity cliffs (see vT2022).
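Non-additivity is usually quantified with a double-transformation cycle: a parent compound, the two singly modified compounds and the compound carrying both modifications. The sketch below assumes affinities expressed as pKd (or pKi) values; the function name and the example numbers are invented for illustration.

```python
def nonadditivity(p_parent, p_a, p_b, p_ab):
    """Non-additivity (log units) for a double-transformation cycle.

    Compares the effect of transformation A measured in the presence of B
    with the effect of A measured on the parent. Zero means that the two
    transformations are strictly additive.
    """
    effect_of_a_with_b = p_ab - p_b
    effect_of_a_alone = p_a - p_parent
    return effect_of_a_with_b - effect_of_a_alone


# Invented pKd values: A alone gives +1.0, but +2.0 when B is already present
excess = nonadditivity(p_parent=6.0, p_a=7.0, p_b=6.5, p_ab=8.5)
```

A method for predicting relative affinity can be checked against the same cycle by computing the quantity from predicted values and comparing it with the value from measurements. Because four measurements contribute, experimental uncertainty needs to be considered before interpreting small deviations from zero.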

This is a good point at which to wrap up and I'll be taking a look at the OpenADMET initiative in the next post.

Wednesday, 3 June 2026

Two new open data initiatives

I'll open the post with photos that I took in Seoul last year at the Dongdaemun Design Plaza (DDP) which was designed by Zaha Hadid (1950-2016) and I'm ashamed to admit to only having become aware of her ten years ago while wandering around the American University of Beirut (she studied mathematics at AUB and much later designed the building there that houses the Issam Fares Institute for Public Policy and International Affairs).

In the two posts that will follow the current post I’ll be taking a look at the OpenBind and OpenADMET initiatives. A key objective of each initiative is to generate large bodies of high-quality data that will be relevant to drug discovery and make these freely available in the public domain. I certainly see massive value in open data and consider it important that vital resources such as ChEMBL and BindingDB be funded generously. As we rightly celebrate the 2024 Nobel Chemistry Prize we also need to recognize the remarkable foresight of those who launched the Protein Data Bank in 1971 with just seven X-ray crystal structures. All that said, achieving a coverage of chemical space that enables usefully predictive models for diverse pharmaceutically relevant phenomena is likely to prove challenging and those leading the OpenBind and OpenADMET initiatives will need to make it clear as to how compounds are to be selected for assaying and synthesis.

Artificial Intelligence (AI) is currently touted as a panacea for the various difficulties faced by drug discovery scientists and sometimes it seems that drugs will condense out of the ether if only the experimentalists would generate enough data. It’s worth pointing out that many (most?) discovery projects that deliver candidates for clinical development do so without ever having sufficient data for building machine learning (ML) models for everything that needs to be measured. In my view this counters arguments that ML models are essential for drug discovery although I’m certainly not denying an important role for usefully predictive ML models. Many (most?) of the ML models built for drug design are essentially what we used to call quantitative structure-activity/property (QSAR/QSPR) models and I would not label them as AI even when they are used to assess chemical structures generated by AI. One challenge for ML modellers is that they need to demonstrate that their models are usefully predictive outside the chemical spaces in which they have been trained. I argued in this blog post that ML models cannot be compared simply on the basis of how well they fit the data on which they have been trained and that it is necessary to account for model complexity (typically quantified by the number of adjustable parameters used to fit the data) when asserting that one ML model is superior to another.
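One simple textbook way to see why goodness of fit alone is a poor basis for comparison is the adjusted R², which penalises each additional fitted parameter. It is offered here only as an illustration of the general point and is not necessarily the correction that would be most appropriate for a given ML model; the numbers are invented.

```python
def adjusted_r2(r2, n_samples, n_parameters):
    """Adjusted R² for a model with `n_parameters` fitted coefficients (excluding the intercept)."""
    dof = n_samples - n_parameters - 1
    if dof <= 0:
        raise ValueError("need more samples than fitted parameters")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Two hypothetical models fitted to the same 30 compounds
simple_model = adjusted_r2(r2=0.80, n_samples=30, n_parameters=2)
complex_model = adjusted_r2(r2=0.85, n_samples=30, n_parameters=15)
```

In this invented case the model with 15 parameters fits the training data better yet has the lower adjusted R² once the number of parameters is taken into account.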

Recently I posted on the objectives of drug design in a way that I hoped would be useful for drug discovery scientists using AI and ML in design. Drug action is driven by concentration (it’s more accurate to describe affinity as sensitivity to a driving force than as a driving force in its own right) and another way of stating this is that the effects of a drug on the human body are determined by the concentration of the drug that is ‘seen’ by its target(s) and anti-targets. A range of bioactivity assays are used by drug discovery scientists to assess the effects of compounds on targets and anti-targets (cell-based assays are also used to assess potential toxicity) and the QSAR field came into being to enable prediction of bioactivity from chemical structures.

It’s easy to understand the pharmacological objectives of drug design which can be stated as “hit the targets” and “don’t hit the anti-targets”. Thinking in terms of concentration is a bit more difficult and it’s important to be aware that the concentration of a drug at its site(s) of action (usually referred to as exposure) is generally not something that you can measure unless the target(s) are in direct contact with plasma. I recommend that everybody working in drug discovery and chemical biology read the SR2019 study. Uncertainty in exposure for intracellular targets is also an issue in clinical development because failure to meet end points in a Phase 2 trial might simply be the result of inadequate exposure. I have argued in NoLE, HBD3 and this blog post that controllability of exposure should be seen as one of the objectives of drug design.

Not being able to measure the concentration of a drug at its site of action complicates drug discovery but is an issue that can be addressed. For example, we can invoke the free drug hypothesis (‘principle’ and ‘theory’ are also used in this context although I personally prefer ‘hypothesis’) by assuming that the concentration of the drug at its site(s) of action is equal to its unbound plasma concentration (which can be measured). Some of this has been discussed in my post on the objectives of drug design and, in any case, I’ll be covering pharmacokinetic aspects of drug design in more detail when I post on the OpenADMET initiative.

This is a good point at which to conclude. I’ll examine OpenBind in the next post and OpenADMET in the post after that.

Wednesday, 27 May 2026

Grand Challenges for Predictive Modeling in Small Molecule Drug Discovery

In this blog post I’ll be taking a look at C2026 (Grand Challenges for Predictive Modeling in Small Molecule Drug Discovery) which has been published as a ChemRxiv preprint. A well-organized collection of grand challenges can indeed help focus scientific research effort on the most important challenges and I consider C2026 to be welcome relief from the view that we can solve all problems with AI/ML. The authors put it well with their statement:

While there is substantial enthusiasm (particularly around AI) for revolutionizing drug discovery, this moment demands sharper problem definition.

In my view, however, C2026 could have been better organized (for example, I would question why covalent binding is in DOMAIN: CHEMISTRY while pKa is in DOMAIN: PHARMACOLOGY). Nevertheless, the article is still at the preprint stage and my feedback will hopefully be helpful for the authors.

I’ll direct readers to a recent blog post (The objectives of drug design) in which I suggest that it can be helpful to see design of drugs in terms of on-target bioactivity (good things that drugs do to the human body), off-target bioactivity (bad things that drugs do to the human body) and exposure (things that the human body does to drugs). Uncertainty pervades drug discovery and even if we knew the exact extent to which targets were engaged in vivo we still wouldn’t know what effects drugs will have on patients in the absence of other information (this is the uncertainty that results from the complexity of biology). One significant source of uncertainty is that we generally can’t currently measure the concentration of a drug at its site(s) of action and I recommend that everybody working in Drug Discovery (and Chemical Biology) take a look at SR2019 (Smith & Rowland, Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise DMD 2019 47:667-672).

Some years ago I suggested that drug design could be classified as prediction-driven or hypothesis-driven and I’ll direct readers to the P2012 article on hypothesis-driven drug design by former colleagues. Back in 2009 I stated that “in many situations, properties of compounds simply cannot be predicted with the accuracy required for meaningful design, especially when optimization is performed against multiple end points” and, despite some impressive advances in predictive chemistry since then, this is still my view. Put another way, drug discovery needs to be considered in a Design of Experiments framework and I consider it an error to perceive it as simply an exercise in prediction.

The value of a prediction made using chemical structure as the only input drops sharply once a sample of the compound has been prepared and decisions as to whether further work on an existing compound is justified will invariably be based on measured data. For example, the PK/PD modelling used to set the dose will typically be based on measured bioactivity (often cell-based) and pharmacokinetics. Aside from speed, the great advantage of calculating ‘relative’ (see CAS2017), as opposed to ‘absolute’, free energy is that it enables project team scientists to use existing affinity and potency measurements for design. That said, the purpose of grand challenges like these is to articulate what we need to be able to predict rather than get distracted by feasibility issues.

With the preamble out of the way I’ll focus on the grand challenges and for the remainder of the post my comments will follow the order of the manuscript. As noted in my review of A2025, 'molecule' should not be used as a synonym for either 'compound' or 'chemical structure'.

DOMAIN: CHEMISTRY

I suggest covering Covalent Binding in DOMAIN: STRUCTURE and DOMAIN: ENERGY and would include reactivity in Challenge: Chemical Stability and Degradation Products (a quinone might be perfectly stable but it’s not something that you would want to have in an enzyme inhibition assay). My view is that physicochemical properties such as pKa, aqueous solubility, aggregation and passive permeability would be more appropriately covered in DOMAIN: CHEMISTRY than in DOMAIN: PHARMACOLOGY and I would also include alkane/water partition coefficient (this is more appropriate than its octanol/water equivalent for studying aqueous solvation and is also a better model for the core of a lipid bilayer). It might also be worth including UV-Vis absorption and fluorescence here given that both phenomena are widely exploited to assay bioactivity of compounds.

DOMAIN: STRUCTURE

Given significant interest in ‘new modalities’ I suggest referring to ‘targets’ rather than ‘proteins’ and it might be worth considering ternary structures (important in targeted protein degradation). Structures for target-ligand complexes are not directly relevant to design when association is irreversible although they are still useful starting points for building transition state models.

DOMAIN: ENERGY

Many of the quantities that form the basis of drug design fit naturally into DOMAIN: ENERGY given that they are effectively equilibrium constants or rate constants. Given significant interest in ‘new modalities’ I suggest referring to ‘targets’ rather than ‘proteins’. For irreversibly-bound ligands it's also necessary to calculate the transition state energy because target engagement occurs under kinetic control. My view is that oral absorption and drug distribution as well as modelling of enzymatic reactions (for example, oxidative metabolism by CYPs) and active transport would be easily accommodated within DOMAIN: ENERGY. One challenge that should be explicitly stated is prediction of plasma drug concentration profiles in humans because it is needed for meaningful PK/PD modelling.

DOMAIN: PHARMACOLOGY

A number of the challenges in DOMAIN: PHARMACOLOGY are not actually related to pharmacology and challenges such as Toxicity and PK/PD modelling could be accommodated within DOMAIN: ENERGY.

Wednesday, 20 May 2026

The objectives of drug design

I'll open the post on drug design objectives with photos from a most enjoyable and informative visit to the Australian Synchrotron early in 2010 when I was helping with fragment library design at CSIRO.

I’ve been meaning for ages to do a post like this and was finally goaded into action when I recently looked at two short videos from interviews with Sir Demis Hassabis, founder of Google DeepMind and Isomorphic Labs, and one of the 2024 Nobel Chemistry Prize laureates. Predicting the 3D structure of a protein from its amino acid sequence is a capability that has been eagerly sought for a long time and, as we celebrate the award, we need to also recognize the remarkable foresight of those who launched the Protein Data Bank in 1971 with just seven X-ray crystal structures. We also need to recognize that protein structures are inherently flexible and subject to post-translational modification such as glycosylation and phosphorylation. Furthermore, the crystal structure that has actually been determined might correspond to a relatively small portion (for example, a tyrosine kinase domain) of a much larger structure such as a dimeric growth factor receptor.

Let’s take a look at the two videos. In the first video, Sir Demis suggests that the end of disease is “within reach maybe in the next decade or so” and it’s worth pointing out that most of the cost of bringing a drug to market comes from clinical development rather than the actual discovery of the drug (nobody spends “ten years and billions of dollars to design just one drug” and it would be more accurate to say that we do so to see if what we've designed really is a drug). Furthermore, work in the late stage of drug discovery when project teams are assessing their best compounds should not really be regarded as drug design. In the second video, Sir Demis acknowledges that “knowing the structure of a protein is only one step in the drug discovery process” although it’s not clear exactly how “many adjacent AlphaFolds” are going to meaningfully address the issues of side effects.

Drug design is frequently asserted to be a multi-objective exercise and, in this post, I’ll be trying to discuss this in a way that I hope will be helpful to drug discovery scientists using artificial intelligence (AI) and machine learning (ML) in design. The ultimate aim of drug design is to identify compounds (and biological entities such as therapeutic antibodies) that can be used to treat diseases without harming patients and I suggest that this can be stated as three design objectives. My view is that the term 'multi-objective' is more appropriate than 'multi-parameter' in the context of drug design because even against a single objective design can involve optimization of multiple parameters. One characteristic of drug design is that the design process is over long before we get to find out how successfully the outputs of design perform their function (in design of materials it's possible to evaluate design outputs more directly). I recall a Head of Research and Development at Zeneca describing the process as "like steering an oil tanker".

I prefer to use the more general term ‘bioactivity’ to describe the effects of drugs on targets (and anti-targets) because in some cases these effects cannot be meaningfully described by a single parameter such as an IC₅₀ value. As an aside this is a good point at which to celebrate the recent FDA approval of the PROTAC Vepdegestrant for treatment of ESR1m, ER+/HER2- advanced breast cancer and I'll direct readers to this most excellent and timely review on targeted protein degradation. The concentration of a drug in contact with a target (or anti-target), which varies with time, is determined by dose, and by the drug’s absorption, distribution, metabolism, and excretion (commonly referred to as ADME). While the therapeutic and adverse effects of drugs are what the drug does to the body ADME is what the body does to the drug. Put another way, minimization of toxicity and optimizing ADME are entirely different objectives and I generally recommend that the acronym ADMET not be used.

Uncertainty is omnipresent in drug discovery and, despite what many appear to believe, AI/ML is not going to make this uncertainty vanish as if by magic. Derek was emphasizing the challenges presented by the complexity of biology long before AI came to be seen by some as a panacea for the ills of Pharma/Biotech (here’s a post from almost two decades ago and I also recommend reading his 2025 post on the “End of Disease” interview which also links relevant previous posts). The complexity of biology means that even if we knew the extent of target engagement in vivo (which varies with both dose and time) we wouldn’t generally be able to predict the in vivo effects of the drug with any confidence in the absence of other information. There is also uncertainty in exposure to consider and the concentration of a drug at its site(s) of action generally cannot be measured in vivo unless the target(s) are in direct contact with plasma. Uncertainty in exposure for intracellular targets is also a clinical development issue because failure in a Phase II trial may simply reflect inadequate exposure (we noted in KM2013 that “one can argue that a typical Phase I trial provides an incomplete description of distribution”). I recommend that everybody working in drug discovery and chemical biology read Smith & Rowland (2019) Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise DMD 47:667-672 DOI. I argue in NoLE that achieving controllability of exposure should be seen as an objective of drug design.

One way that pharmacokinetic/pharmacodynamic (PK/PD) modellers address the issue of intracellular exposure is to assume that the concentration of drug in contact with its target(s) (and anti-targets) equals its unbound concentration in plasma (which can be measured in real time) and this assumption is referred to as the ‘free drug hypothesis’ (‘principle’ and ‘theory’ are also used in this context although I personally prefer ‘hypothesis’ because it’s an assumption we’re making). There are two scenarios under which the approximation of the concentration of drug at its site(s) of action by its unbound concentration in plasma is known to be unreliable. The first scenario is that there is significant active transport at one or more points on the path between plasma and the drug’s site(s) of action (active efflux is a common problem, especially in CNS drug discovery, although active influx will still cause the assumption to break down). The second scenario is that the pH at the drug’s site(s) of action differs from plasma pH (as would be the case for a lysosomal target) and that there is an ionizable group such as a basic nitrogen in the chemical structure of the drug.

While drug design does indeed have multiple objectives it really shouldn’t need to be said that if the required level of bioactivity cannot be achieved then it becomes irrelevant whether the other objectives are achieved and I’ll direct readers to M2026 (The Affinity Advantage). I see M2026 as providing a much-needed cold shower for a 2024 JMC Editorial (Property-Based Drug Design Merits a Nobel Prize; see blog post) in which it is asserted that “a discovery compound is more likely to become a drug when Fsp3 > 0.40” and that “a compound is more likely to have good developability when PFI < 7”. Nevertheless, I don’t consider M2026 to be especially useful from the perspective of defining drug design objectives because bioactivity is typically quantified by potency rather than affinity in drug discovery projects (an assay for kinase inhibition might have been run at high ATP concentration to mimic the intracellular environment) and some bioactivity objectives are defined in terms of measurements made in cell-based assays. Furthermore, bioactivity for ‘new modalities’ such as irreversible covalent inhibition and targeted protein degradation cannot be adequately described by a single parameter such as an IC₅₀ value.
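To see why a single IC₅₀ value is inadequate for irreversible covalent inhibition, it helps to look at the standard two-step kinetic scheme in which reversible binding (characterized by K_I) is followed by bond formation (characterized by k_inact). The sketch below assumes inhibitor in large excess over target, no competing substrate and no resynthesis of target; the parameter values are invented.

```python
import math


def fraction_modified(conc, time, k_inact, k_i):
    """Fraction of target covalently modified after `time` at inhibitor concentration `conc`."""
    k_obs = k_inact * conc / (k_i + conc)
    return 1.0 - math.exp(-k_obs * time)


def conc_for_half_modification(time, k_inact, k_i):
    """Concentration giving 50% modification after `time` (an apparent IC50)."""
    k_needed = math.log(2) / time
    if k_needed >= k_inact:
        return math.inf  # 50% modification cannot be reached in this time
    return k_i * k_needed / (k_inact - k_needed)


# Invented parameters: k_inact in 1/s, K_I in micromolar, times in seconds
K_INACT, K_I = 0.01, 1.0
ic50_10min = conc_for_half_modification(600.0, K_INACT, K_I)
ic50_60min = conc_for_half_modification(3600.0, K_INACT, K_I)
```

With these parameters the apparent IC₅₀ falls several-fold between a 10 minute and a 60 minute incubation, so the value reported depends on assay conditions as well as on the compound. This is why k_inact and K_I (or their ratio) are generally preferred for describing irreversible inhibitors.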

I criticized the term ‘avoid-ome’ in a previous post and, with apologies for the dreadful pun, I would recommend that its use be avoided (at the risk of repetition ADME and toxicity are entirely separate issues that must be addressed separately). Furthermore, I would question whether drug designers actually need yet another ‘ome’ word and I consider the notion that embracing the avoid-ome will transform drug discovery to be fanciful. While inhibition of cytochrome P450 (CYP) enzymes is generally undesirable from a toxicity perspective a compound that was not cleared by these metabolic enzymes would greatly worry those responsible for drug safety (bear in mind why we worry about inhibition of CYPs in the first place). Furthermore, I would challenge the inclusion by M2026 of serum albumin in a list of anti-targets such as hERG (I’m not aware of anybody suffering cardiac arrest on account of their medication binding to serum albumin) and the excellent B2025 study notes that "most drugs are >95% plasma protein bound (58%), with a large fraction >99% bound (29%)". Binding to plasma proteins should actually be considered within the framework of distribution (it can be instructive to pose the question as to whether you could tell where a drug was simply from knowing the total quantity of it in the body and its unbound plasma concentration). It’s also worth mentioning that binding to plasma proteins will protect an orally-dosed drug from the metabolizing enzymes during its first pass through the liver (before it gets a chance to distribute into the tissues). Variation of the plasma concentration during the dosing interval for an orally-dosed drug is a necessary evil resulting from oral dosing and in many situations the ‘ideal’ pharmacokinetic profile would actually be that resulting from intravenous infusion (plasma concentration of the drug is maintained at a level required for therapeutically useful effects).

At this point I’ll attempt to articulate three general objectives of drug design (the only thing that I’m entirely confident about here is that I won’t get these exactly right). One of the great challenges that drug designers face is that it is usually difficult to identify compounds that simultaneously achieve all the design objectives. Specifying criteria for objectives too permissively increases the risk of choking in clinical development. However, overly stringent specification of criteria for objectives decreases the likelihood of achieving all of the objectives and will slow the discovery process. I state these objectives in terms of ‘bioactivity’ rather than ‘potency’ to accommodate ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation although, in many cases, it will be possible to quantify the bioactivity for a compound by a single IC₅₀ or EC₅₀ value. I use ‘maximize’ and ‘minimize’ (as opposed to ‘optimize’) to frame the objectives because there is generally no penalty for identifying better compounds than you think you need. Assessing how well objectives have been achieved involves running a diverse range of assays and, as noted in this blog post on the A2025 study, it is important to be fully aware of the quantitation limits for each and every assay that you use.

I'll conclude the post with what I would argue are the three objectives of drug design:

Maximize on-target bioactivity. This is the least difficult objective to specify because bioactivity characterized in the in vitro assays is likely to translate to target engagement in vivo provided that the compound can be presented to the target(s) at the required concentration. Design outputs are usually evaluated in animal models for the human disease before initiating studies in humans but the design itself is almost invariably done against in vitro end points.
Minimize off-target bioactivity. It is generally more difficult to specify objectives for off-target bioactivity than for on-target bioactivity on account of the numbers and diversity of the assays involved. Design outputs are always evaluated for toxicity in animals before initiating studies in humans (as mandated by regulatory authorities) but the design itself is almost invariably done against in vitro end points.
Maximize controllability of exposure. This objective, which might also be stated as 'Optimize ADME', is the most difficult of the three objectives to specify because, as noted earlier in this post, exposure generally can’t be measured for targets that are not in direct contact with plasma. At absolute minimum it is necessary to demonstrate that a pharmacokinetic profile can be achieved in animals that will maintain the (unbound) concentration of the compound at levels that we believe will result in beneficial therapeutic effects in humans. For targets not in contact with plasma the PK/PD modellers also need to be able to confidently invoke the free drug hypothesis (this is why I prefer to frame the objective in terms of exposure rather than ADME) and this requires that design outputs have good passive permeability and are not subject to active transport. In some cases it will also be necessary to demonstrate access to specific organs such as the CNS.

Tuesday, 21 April 2026

Comparing ML models in small molecule drug discovery

To start the post I'll share a photo that I took in 2012 of incense sticks at the Truc Lam pagoda near Da Lat. Not long after taking this photo I lost a lens cap (although thankfully not the lens) riding a luge through a forest and would later visit a cricket farm (this was particularly welcome because I had developed a taste for fried crickets during a visit to Cambodia in 2005).

I’ll be reviewing A2025 (Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery) in this post. I consider the issues addressed by the authors to be extremely important and I think that the credibility of the Machine Learning (ML) field would be greatly enhanced if Editors declared words like 'outperform' to be verboten in manuscripts submitted to their journals. However, I will make a couple of criticisms of the study. First, ML modellers need to properly account for the number of adjustable parameters used to fit training data (the S2006 study goes further than this by arguing that one should also account for size of the descriptor pool). Second, ML modellers need to recognize that cross-validation can make optimistic assessments of model quality when there is a high degree of clustering in training data. I’ll point you toward earlier Molecular design blog posts (Sep2024 | Oct2024 | Jul2025) that may be relevant to the discussion. As is usual for posts here at Molecular Design quoted text is indented with my comments italicised in red.

The ML models that form the focus of the A2025 study aim to predict properties (more generally behaviour) of compounds from their chemical structures. Although there is currently a lot of hype around ML models for drug discovery it’s worth bearing in mind that people have been building quantitative structure-activity/property (QSAR/QSPR) models for decades (the inaugural EuroQSAR conference was held in Prague a mere five years after Czechoslovakia had been invaded by forces from the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). As I see it QSAR/QSPR approaches never really made much of a splash in real world drug discovery and my challenge to those who tout ML models as a panacea for the ills of Pharma/Biotech would be to ask why they think it’s going to be any different this time.

One of the difficulties that QSAR/QSPR practitioners faced when working within drug discovery project teams was that projects had often delivered (or had been put out of their misery) by the time there was enough data to build predictively useful models. It’s also worth pointing out that drug discovery teams have frequently delivered (and continue to deliver) clinical development candidates without ever having sufficient data for building usefully predictive QSAR/QSPR models. Something that many QSAR/QSPR practitioners never seemed to get is that much drug design is actually hypothesis-driven (I discussed this point 16 years ago in K2009 and I’ll point you to the P2012 article by former colleagues). A significant part of hypothesis-driven drug design is identification of exploitable features in structure activity/property relationships (SARs/SPRs) such as activity cliffs and instances of increased polarity not resulting in loss of potency. A simple plot of potency against lipophilicity might not be predictively useful but it can still be used to quantify the extent to which the potency of the compound beats the trend in the data (see ‘Alternatives to ligand efficiency for normalization of affinity’ section in NoLE). My view is that hypothesis-driven drug design actually fits very naturally into an AI framework and those who tout AI as a drug design panacea appear to be missing a trick by seeing drug design as essentially an exercise in prediction.
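The idea of quantifying how much a compound beats the trend can be sketched as the residual from a straight-line fit of potency against lipophilicity. The example below uses ordinary least squares in plain Python; the function names and the data are invented, and a real analysis would need to consider whether a linear trend is an appropriate description of the data.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trend_residuals(x, y):
    """Residuals (observed minus fitted); positive means the compound beats the trend."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Invented log D and pIC50 values for four compounds
log_d = [1.0, 2.0, 3.0, 4.0]
pic50 = [5.0, 6.5, 6.0, 7.0]

residuals = trend_residuals(log_d, pic50)
best = max(range(len(residuals)), key=residuals.__getitem__)
```

Here the second compound has the largest positive residual, meaning that it is more potent than its lipophilicity alone would suggest. The fitted line is being used to normalize potency rather than to predict it.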

Many of the properties of compounds of interest to ML modellers in drug discovery can be modelled as if they are equilibrium constants or rate constants (continuous-valued, dimensioned quantities) and typically fall into three general categories:

In vitro bioactivity is usually quantified in terms of potency (concentration at which a compound exhibits a specified effect in a bioactivity assay) and, despite the views expressed in a rather bizarre JMC Editorial (a recent JMC Perspective provides a useful counterview and this blog post is also relevant), is the most important of the properties because you can’t compensate for inadequate potency by increasing quality of compounds or by making them more beautiful (see B2012). I touch on this point in a recent blog post. It is important that ML modellers be aware that for some ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation the effect of a compound on the target depends on time as well as concentration. I discuss some of the issues that you need to think about when combining potency and affinity data for ML modelling of bioactivity in this blog post.
Properties considered to be relevant to ADME (absorption, distribution, metabolism, and excretion) include lipophilicity, aqueous solubility, permeability (both passive and active efflux) and plasma protein binding. While these properties are often described collectively as a compound's 'ADME profile' it's not actually accurate to do so because the ADME acronym refers to behaviour of compounds in vivo. Lipophilicity is the single most fundamental physicochemical property in drug design and it’s very important that ML modellers be aware that it's log D, rather than log P, that is measured and that the choice of octanol/water for log D measurement is entirely arbitrary.
Toxicity is typically assessed by measuring potency against anti-targets such as hERG and CYPs and cell-based assays are often used for assessment of toxicity. Generally it is more difficult to find suitable assay data for ML modelling of toxicity than is the case for modelling bioactivity against potential therapeutic targets. One reason for this is that responses in the cell-based assays commonly used to assess toxicity can't generally be linked to engagement of specific anti-targets (this is not to deny the value of the information provided by the assays for decision-making by drug discovery scientists). Furthermore, observations of potency in toxicity assays are likely to steer project teams away from the associated chemotypes and so it is very unlikely that ML modellers will encounter datasets for individual structural series with sufficient variance for building models.

When modelling properties of compounds that you believe to be relevant to small molecule drug discovery it’s important to bear in mind that even with a complete set of measured properties available it’s not generally feasible to predict what will happen when compounds are dosed in vivo. One reason for this is that the therapeutic (and adverse) effects of a drug are driven by its concentration at its site(s) of action which is a time-dependent quantity that cannot generally be measured in live humans. I argue in NoLE that the objective of the ADME-based aspects of drug design is actually to achieve controllability of exposure and one article that I recommend to all drug discovery scientists and chemical biologists is SR2019 (Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise).

A number of assays are available for measuring properties of interest in drug discovery and management of the ‘assay budget’ for projects is an important activity in drug discovery (especially when running assays is an outsourced activity). Drug discovery scientists typically use assays to identify and address specific design issues such as low solubility or unacceptable binding affinity for anti-targets.

In vitro assays used in drug discovery are generally configured for decision-making, rather than for building ML models, and in some cases what some might refer to as the ‘quality’ of the assay might be traded off against throughput (this doesn’t mean that the assays are somehow ‘bad’). In vitro drug discovery assays generally have both lower and upper quantitation limits and an assay’s dynamic range (you can draw an analogy between assays and analytical instruments) is given by the difference between the two values. Needless to say it is very important that ML modellers be fully aware of the lower and upper quantitation limits in the assays used to generate the data from which they will build models. This generally requires careful examination of assay details which might not have been captured by the curation processes used for databases such as ChEMBL (nor even been disclosed in the original publications). For example, the maximum potency that can be quantified in a conventional enzyme inhibition assay is limited by the concentration of enzyme in the assay (see WM1979) and you’ll still need a 5 nM concentration of a picomolar inhibitor to achieve 50% inhibition of enzyme that is present in the assay at a concentration of 10 nM. I generally advise ML modellers to carefully examine the distributions in the datasets that they are modelling for evidence of cut-offs that might indicate quantitation limits in the assays used to generate the data.
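The 5 nM figure in the enzyme example can be checked with a short calculation. The sketch below assumes the commonly used tight-binding relationship IC50 = Ki(app) + [E]/2 and the Morrison equation for fractional activity; both function names are hypothetical and the choice of these particular expressions is an assumption made here rather than something taken from WM1979.

```python
import math


def ic50_tight_binding(ki_app_nm, enzyme_nm):
    """Apparent IC50 (nM), assuming IC50 = Ki_app + [E]_total / 2."""
    return ki_app_nm + enzyme_nm / 2.0


def fractional_activity(inhibitor_nm, enzyme_nm, ki_app_nm):
    """Morrison equation: fraction of enzyme activity remaining."""
    total = enzyme_nm + inhibitor_nm + ki_app_nm
    bound = (total - math.sqrt(total ** 2 - 4.0 * enzyme_nm * inhibitor_nm)) / 2.0
    return 1.0 - bound / enzyme_nm


enzyme_nm = 10.0  # enzyme concentration used in the example in the text
for ki_app_nm in (100.0, 1.0, 0.001):  # 0.001 nM corresponds to 1 pM
    ic50 = ic50_tight_binding(ki_app_nm, enzyme_nm)
    remaining = fractional_activity(ic50, enzyme_nm, ki_app_nm)
    print(f"Ki_app = {ki_app_nm} nM, IC50 = {ic50} nM, activity at IC50 = {remaining:.3f}")
```

Under these assumptions the measured IC50 cannot fall below half the enzyme concentration, so a pile-up of values near that floor in a data set is a hint that a quantitation limit has been reached.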

The effects of a drug in vivo are typically driven by its unbound concentration in plasma and assays for properties of interest in drug discovery are generally run in buffered aqueous media. It is well-known that measured values for physicochemical properties such as log D and aqueous solubility generally vary with pH for compounds with ionizable groups in their chemical structures. However, values measured for these properties can, in some scenarios, also depend on both the nature and concentration of counter-ion(s). This becomes an issue for log D measurement in cases where significant proportions of compounds are present in the organic phase in ionized forms and for aqueous solubility measurement when the measured value is limited by the solubility of a salt form (as opposed to the neutral form). Dependence of measured property values on the nature and concentration of counter-ions is likely to be more of an issue when the degree of ionization (in aqueous media) is relatively high and my default advice is to consider pK_a when models underpredict log D or overpredict aqueous solubility values.
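To illustrate why a model that ignores partitioning of ionized forms would tend to underpredict log D, here is a minimal sketch for a monoprotic compound. It assumes the textbook expression D = (P_neutral + P_ion × r) / (1 + r), where r is the aqueous ratio of ionized to neutral form; the function name, the example values and the apparent log P of the ionized form (which in reality depends on the counter-ion) are all hypothetical.

```python
import math


def log_d(log_p_neutral, pka, ph, is_acid=True, log_p_ion=None):
    """log D for a monoprotic compound.

    log_p_ion is an assumed apparent log P for the ionized form;
    None means the ionized form is taken to stay entirely in water.
    """
    # Ratio of ionized to neutral form in the aqueous phase.
    ratio = 10.0 ** (ph - pka if is_acid else pka - ph)
    p_neutral = 10.0 ** log_p_neutral
    p_ion = 0.0 if log_p_ion is None else 10.0 ** log_p_ion
    return math.log10((p_neutral + p_ion * ratio) / (1.0 + ratio))


# Hypothetical acid: log P = 3.0, pKa = 4.0, measurement at pH 7.4.
without_ion = log_d(3.0, 4.0, 7.4)
with_ion = log_d(3.0, 4.0, 7.4, log_p_ion=0.0)
print(f"neutral form only in organic phase: log D = {without_ion:.2f}")
print(f"ionized form also partitions:       log D = {with_ion:.2f}")
```

The gap between the two values grows with the degree of ionization, which is consistent with checking pK_a first when predicted log D values come out too low.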

Before addressing what I consider to be the main problems with A2025 I’ll make some specific comments on the study. While these comments might appear to be pedantic (some might even use the term ‘nit-picking’) I would argue that the authors have raised the bar for themselves by claiming that their proposed “guidelines, accompanied by annotated examples using open-source software tools, lay a foundation for robust ML benchmarking and thus the development of more impactful methods”. By way of an example, if you're trying to persuade an analytical chemist to modify an aqueous solubility assay to make it more suitable for generating data to build ML models then it's not such a great idea to describe aqueous solubility as a molecular property or to confuse the range in a data set with the dynamic range of the assay used to generate the data.

In the Introduction (Section 1) the Authors state:

In drug discovery, expensive and time-consuming experiments are used to profile molecules [While it is common for drugs to be described as ‘molecules’, especially in promotional material, I generally recommend that ‘molecule’ not be used as a synonym for ‘compound’ in articles with a cheminformatic (or indeed a chemical) focus.] and gain insights into their therapeutic potential. Such experimental assays are typically organized in a cascade, where subsequent experiments test fewer molecules at a higher cost per molecule. As in silico surrogates to such experiments, both regression and classification Machine Learning (ML) models can be trained to estimate molecular properties [These are properties of compounds, as opposed to molecules, and should neither be described as ‘molecular properties’ nor as ‘small molecule properties’.] (i.e., experimental results) from chemical structure. Such models could inform drug design and prioritize experiments by scoring a set of candidate molecules. [The term ‘candidate molecules’ is as clumsy as it is inaccurate, and its meaning will not be clear to some readers. I recommend that the term ‘chemical structures’ be used instead.] These ML models thus inform high-stakes decisions [The ML models that are the focus of this study inform decisions as to which compounds should be synthesized and these decisions would not automatically be considered to be high-stakes decisions in contemporary drug discovery given developments in automation and high-throughput synthetic chemistry. It’s also important to be aware that in real life drug discovery many decisions to synthesize compounds are made with the knowledge that structural analogs have already been synthesized and shown to be active against the targets of interest. I would argue that genuinely high-stakes decisions, such as prioritization of compounds for in vivo studies, are only made after compounds have actually been synthesized and evaluated in relevant in vitro assays.] and help drug discovery research progress more quickly and efficiently. Hence, it is important that models provide reliable forecasting of experimental results.

In Section 3.3.1.3 (Dynamic Range) the Authors state:

Both correlation and error metrics are influenced by the dynamic range of the data being modeled. [I consider this use of the term ‘dynamic range’ to be incorrect and, as a reviewer, I would have pressed the Authors to explain the difference between the range of a data set and its dynamic range. As noted earlier I see dynamic range as a characteristic of an analytical instrument or an assay (which can be considered to be a type of analytical instrument) and I would argue that the term should not be applied to data sets. That said, it may be possible to infer the dynamic range of an assay through careful examination of the data.] Achieving a high correlation on data sets with a broader range of experimental values is generally easier, whereas data sets with a smaller dynamic range can produce unrealistically small values for error metrics. [While the range of a data set certainly imposes limits on variance it’s important to remember that measures of correlation are defined in terms of variance (as opposed to range) of the data. For a data set to be useful for building ML models the variance for replicate measurements needs to be small in comparison with the overall variance for the data set.] This can lead to deceptive conclusions.
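The distinction between range and variance can be made concrete with a small simulation. The sketch below builds two synthetic data sets that share exactly the same range (4 to 9) but differ greatly in variance, adds the same level of noise to each, and compares the Pearson correlation and RMSE. All values are simulated and the noise level of 0.5 is an arbitrary assumption.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


rng = random.Random(0)
n = 2000
# Both sets contain the points 4.0 and 9.0, so their ranges are identical.
spread = [4.0, 9.0] + [rng.uniform(4.0, 9.0) for _ in range(n)]
clumped = [4.0, 9.0] + [rng.gauss(6.5, 0.2) for _ in range(n)]

for label, true_values in (("spread", spread), ("clumped", clumped)):
    # Simulated 'predictions': true value plus Gaussian noise (sd 0.5).
    predicted = [t + rng.gauss(0.0, 0.5) for t in true_values]
    print(f"{label}: range = {max(true_values) - min(true_values):.1f}, "
          f"r = {pearson_r(true_values, predicted):.2f}, "
          f"RMSE = {rmse(true_values, predicted):.2f}")
```

The error metric is essentially the same for both sets while the correlation differs markedly, even though the ranges are equal; it is the variance of the data that drives the difference.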

With the pedantry (or nit-picking if you prefer) out of the way it’s time to take a look at what I consider to be the principal flaws of A2025. First, I consider it important to account for the number of adjustable parameters used to fit training data and, at the very least, the authors should have acknowledged this as an issue. Second, I have concerns that cross-validation can lead to optimistic assessment of model quality when there is a high degree of clustering in training data and a post from July last year might be relevant.

It’s well known that you can achieve a better fit to your data by simply using more adjustable parameters (I recommend that all ML modellers take a look at H2004 (DM Hawkins, The Problem of Overfitting, JCICS 2004 44:1-12)) and my position is that it’s generally not meaningful to compare performance for models that differ in the number of adjustable parameters used to fit the training data without properly accounting for numbers of adjustable parameters. A criticism that I was making of the QSAR/QSPR field many years ago (long before ML modelling came to be touted as a panacea for the ills of Pharma/Biotech) was that many of those building models appeared to dismiss the accounting for numbers of adjustable parameters as a non-issue. It’s worth noting that building ML models typically involves selection of a subset of descriptors from a larger pool and the S2007 study argues that you also need to account for the number of descriptors in the pool when assessing model quality. Accounting for the number of adjustable parameters is not just an issue when you’re building ML models for small molecule drug discovery and this point is made in MHG2017 (Mardirossian and Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Molecular Physics, 115 2315–2372):

With semi-empirical density functionals, a measure that is commonly reported upon publication is the total number of parameters. Existing functionals based on the B97 concept have anywhere between 5 and 75 parameters. However, counting the number of parameters is often a confusing and unclear task.

The need to properly account for the number of adjustable parameters (the term 'degrees of freedom' is also used, especially in the older literature) when modelling data has actually been recognised for many years. The agrarian economist Mordecai Ezekiel (1899-1974), who shaped much of FDR’s agricultural policy, introduced adjusted R² (link1 | link2) in Methods of Correlation Analysis which was published in 1930. The F-test (link1 | link2) can be used to assess whether the use of additional adjustable parameters is justified although I’m not aware of exactly when this particular use of the F-test was introduced. It’s also worth pointing out that the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) appeared in the statistics literature in 1974 and 1978 respectively. I certainly wouldn’t claim to have comprehensively reviewed the importance of accounting for number of adjustable parameters when comparing ML model performance nor am I suggesting that this is something that would be easy to do. Nevertheless, I do hope that it's clear that this is not something that can simply be swept under the carpet (or even ejected from the window of an upper floor Moscow apartment).
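For readers who want to see how these corrections behave, the sketch below implements adjusted R² and the least-squares forms of AIC and BIC (each up to an additive constant) and applies them to two hypothetical models fitted to the same 30 observations. The R² values, parameter counts and total sum of squares are invented for illustration, and conventions differ as to exactly which parameters are counted, so treat the parameter counts as an assumption.

```python
import math


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic_least_squares(rss, n_obs, n_params):
    """AIC for Gaussian least squares, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2.0 * n_params


def bic_least_squares(rss, n_obs, n_params):
    """BIC for Gaussian least squares, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


n_obs = 30
total_ss = 100.0  # hypothetical total sum of squares
# Hypothetical models: (label, R-squared, number of descriptors).
models = [("A", 0.80, 3), ("B", 0.82, 10)]

results = {}
for label, r2, n_desc in models:
    rss = (1.0 - r2) * total_ss
    n_params = n_desc + 1  # descriptors plus intercept (a convention, assumed here)
    results[label] = (
        adjusted_r2(r2, n_obs, n_desc),
        aic_least_squares(rss, n_obs, n_params),
        bic_least_squares(rss, n_obs, n_params),
    )
    adj, aic, bic = results[label]
    print(f"model {label}: R2 = {r2:.2f}, adjusted R2 = {adj:.3f}, AIC = {aic:.2f}, BIC = {bic:.2f}")
```

In this made-up case model B has the higher raw R² but model A is preferred by all three measures. Note that none of these formulas accounts for descriptors having been selected from a larger pool, which is the additional concern raised by S2007.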

This is a good point at which to say something about validation of ML models and I would argue that it is actually very difficult to demonstrate objectively that one protocol for validation is better than another. Two general approaches for validation of ML models are to use cross-validation and to split data into a training set and an external test set (that the model never sees). A view that I’ve held since the late 1990s is that many ‘global’ models for predicting properties of compounds relevant to drug discovery are actually ensembles of local models (this view was expressed publicly in the B2009 study). I would anticipate that clustering in data sets will cause cross-validation to give optimistic assessments of model quality which in turn can lead to overfitting. I would also expect principal component analysis (PCA) to be less meaningful for highly clustered data (this is relevant because correlations between chemical structure descriptors need to be accounted for in order to calculate meaningful distances between chemical structures in the space). Something that I do need to make clear is that ‘clustering’ in the context of this post simply refers to distribution within the chemical structure descriptor space of a model.

The Authors of A2025 recommend “using a 5 × 5 repeated cross-validation procedure to sample the performance distribution” and one point that I’ll make is that they haven’t demonstrated that this protocol is more effective than 4 × 4 repeated cross-validation or 6 × 6 repeated cross-validation. While this might appear to be nit-picking I will point out that it would not be valid to invoke A2025 if criticising a future ML modelling study for using 4 × 4 repeated cross-validation (bear in mind that a substructural match against even a single PAINS filter would be considered by some to constitute the basis for a valid criticism in medicinal chemistry and K2017 might be of interest in this context).
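For readers unfamiliar with the terminology, an r × k repeated cross-validation simply means repeating a k-fold split r times with different shuffles. The generator below is a generic standard-library sketch of that idea, not a reproduction of the A2025 protocol; the function name and the default seed are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train, test


for n in (4, 5, 6):
    n_models = sum(1 for _ in repeated_kfold(100, n_splits=n, n_repeats=n))
    print(f"{n} x {n} repeated cross-validation: {n_models} train/test splits")
```

The numbers of folds and repeats are just parameters, which is why a recommendation for one particular pair of values would need supporting evidence.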

The general approach to cross-validation is to repeatedly split the data into training sets and test sets before assessing how well on average the test data are predicted (algorithms differ as to exactly how this is done). When there is a high degree of clustering the data splits are likely to retain some members of each cluster in the training sets which can ‘anchor’ the models. Here’s what H2004 has to say:

If the collection of compounds consists of, or includes, families of close analogues of some smaller number of ‘lead’ compounds, then a sample reuse cross-validation will need to omit families and not individual compounds.
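Omitting whole families rather than individual compounds can be sketched as follows. The code assumes that each compound already carries a family (structural series) label, which is itself a non-trivial assumption, and it assigns whole families to folds greedily so that fold sizes stay roughly balanced. The function name and the example labels are hypothetical.

```python
def family_kfold(families, n_splits):
    """Yield (train_indices, test_indices) with no family split across the two."""
    sizes = {}
    for label in families:
        sizes[label] = sizes.get(label, 0) + 1

    # Greedy assignment: largest families first, each to the emptiest fold.
    fold_of = {}
    fold_sizes = [0] * n_splits
    for label in sorted(sizes, key=lambda g: (-sizes[g], str(g))):
        k = fold_sizes.index(min(fold_sizes))
        fold_of[label] = k
        fold_sizes[k] += sizes[label]

    for k in range(n_splits):
        test = [i for i, g in enumerate(families) if fold_of[g] == k]
        train = [i for i, g in enumerate(families) if fold_of[g] != k]
        yield train, test


# Hypothetical family labels for ten compounds.
families = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
for train, test in family_kfold(families, n_splits=3):
    held_out = sorted({families[i] for i in test})
    print(f"held-out families: {held_out}, test size = {len(test)}")
```

Because no family contributes to both the training and the test side of a split, the model cannot be 'anchored' by close analogues of the compounds it is being asked to predict.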

Another approach to validating ML models is to use external test sets although this can still lead to optimistic assessments of model quality when the available data are highly clustered. One advantage of this approach to validation is that external test sets can be ‘structured’ to provide a more detailed view of model performance (one criticism that I would make of cross-validation is that it gives a rather ‘one-dimensional’ assessment of model performance). One way to structure test sets is to characterize (by size and closeness) the neighbourhood within the training set for each object in the test set. The motivation for structuring the test sets in this manner is that it enables you to analyse relationships between prediction performance and the degree of coverage of space around test set objects by training set data. There are, however, other ways to structure test sets and my view is that classifying test set compounds according to whether they are neutral, cationic or anionic would potentially be informative when assessing models for log D, aqueous solubility, permeability, plasma protein binding, volume of distribution and hERG blockade. Although it’s not directly relevant to this post I would generally recommend that ML model predictions be presented to users along with training set data for the nearest neighbours in the model space and the most similar chemical structures in the training set.
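One simple way to characterize the training set neighbourhood of a test set object is to record how many training set objects lie within a chosen radius and how far away the nearest one is. The sketch below does this with Euclidean distance in a toy two-dimensional descriptor space; the coordinates, the radius and the use of Euclidean distance (which assumes that the descriptors are uncorrelated and comparably scaled) are all assumptions made for illustration.

```python
import math


def neighbourhood(test_point, training_points, radius):
    """Return (number of training points within radius, nearest distance)."""
    distances = [math.dist(test_point, p) for p in training_points]
    n_within = sum(1 for d in distances if d <= radius)
    return n_within, min(distances)


# Hypothetical descriptor vectors: a tight cluster and one outlier.
training_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
test_points = {"well covered": (0.05, 0.05), "poorly covered": (3.0, 3.0)}

for label, point in test_points.items():
    n_within, nearest = neighbourhood(point, training_points, radius=0.5)
    print(f"{label}: {n_within} neighbours within 0.5, nearest at {nearest:.2f}")
```

Prediction errors can then be analysed as a function of these two quantities, which gives a more detailed picture of where a model can be trusted than a single pooled metric.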

This is a good point at which to wrap up and I concede that it’s difficult to account for numbers of adjustable fitting parameters and to meaningfully validate models when distributions of objects within the relevant chemical spaces are very uneven. That said, I would argue that creators of ML models do at least need to acknowledge these issues given that many tout models like these as essential for AI-based drug design.

Anticipating a future blog post on chemical space coverage I'll finish the post by noting that coverage is also of historical relevance. The B-52 in the photo is not in the best state of repair and this shouldn't surprise you because I took the photo during a 2005 visit to Hanoi. In those days it was considered to be good form to show disrespect for the enemy's military hardware and so I gave the wreckage a good kick. I also paid my respects to Uncle Ho who, I’m told, is in much better shape than Chairman Mao (owing to the then frosty Sino-Soviet relations the latter was pickled by inexperienced compatriots rather than by the Russian experts who had pickled the former and it is said that the embalming team arrived from Moscow before Uncle Ho had actually expired). A few days later in Dien Bien Phu I caused minor consternation by demonstrating that the barrel of an American-made 155 mm howitzer that had been captured from the French in 1954 could still be elevated (admittedly it was a little stiff). Apparently, the French had asked the Americans if they would be so kind as to drop lots of bombs (or perhaps one very big bomb) on the Viet Minh but President Eisenhower wisely denied the request. The B-52 in the photo was one of a number sent by President Nixon (who had been President Eisenhower’s VP) to bomb North Vietnam during Operation Linebacker II (aka the Christmas Bombings) and it's my understanding that all crew members survived their encounter with the SAM.