Molecular Design

The objectives of drug design

2026-05-20T06:40:24.284+01:00

I'll open the post on drug design objectives with photos from a most enjoyable and informative visit to the Australian Synchrotron early in 2010 when I was helping with fragment library design at CSIRO.

I’ve been meaning for ages to do a post like this and was finally goaded into action when I recently looked at two short videos from interviews with Sir Demis Hassabis, founder of Google DeepMind and Isomorphic Labs, and one of the 2024 Nobel Chemistry Prize laureates. Predicting the 3D structure of a protein from its amino acid sequence is a capability that has been eagerly sought for a long time and, as we celebrate the award, we need to also recognize the remarkable foresight of those who launched the Protein Data Bank in 1971 with just seven X-ray crystal structures. We also need to recognize that protein structures are inherently flexible and subject to post-translational modifications such as glycosylation and phosphorylation. Furthermore, the crystal structure that has actually been determined might correspond to a relatively small portion (for example, a tyrosine kinase domain) of a much larger structure such as a dimeric growth factor receptor.

Let’s take a look at the two videos. In the first video, Sir Demis suggests that the end of disease is “within reach maybe in the next decade or so” and it’s worth pointing out that most of the cost of bringing a drug to market comes from clinical development rather than the actual discovery of the drug (nobody spends “ten years and billions of dollars to design just one drug” and it would be more accurate to say that we do so to see if what we've designed really is a drug). Furthermore, work in the late stage of drug discovery when project teams are assessing their best compounds should not really be regarded as drug design. In the second video, Sir Demis acknowledges that “knowing the structure of a protein is only one step in the drug discovery process” although it’s not clear exactly how “many adjacent AlphaFolds” are going to meaningfully address the issues of side effects.

Drug design is frequently asserted to be a multi-objective exercise and, in this post, I’ll be trying to discuss this in a way that I hope will be helpful to drug discovery scientists using artificial intelligence (AI) and machine learning (ML) in design. The ultimate aim of drug design is to identify compounds (and biological entities such as therapeutic antibodies) that can be used to treat diseases without harming patients and I suggest that this can be stated as three design objectives. My view is that the term 'multi-objective' is more appropriate than 'multi-parameter' in the context of drug design because even against a single objective design can involve optimization of multiple parameters. One characteristic of drug design is that the design process is over long before we get to find out how successfully the outputs of design perform their function (in design of materials it's possible to evaluate design outputs more directly). I recall a Head of Research and Development at Zeneca describing the process as "like steering an oil tanker".

I prefer to use the more general term ‘bioactivity’ to describe the effects of drugs on targets (and anti-targets) because in some cases these effects cannot be meaningfully described by a single parameter such as an IC₅₀ value. As an aside, this is a good point at which to celebrate the recent FDA approval of the PROTAC Vepdegestrant for treatment of ESR1m, ER+/HER2- advanced breast cancer and I'll direct readers to this most excellent and timely review on targeted protein degradation. The concentration of a drug in contact with a target (or anti-target), which varies with time, is determined by dose, and by the drug’s absorption, distribution, metabolism, and excretion (commonly referred to as ADME). While the therapeutic and adverse effects of drugs are what the drug does to the body, ADME is what the body does to the drug. Put another way, minimizing toxicity and optimizing ADME are entirely different objectives and I generally recommend that the acronym ADMET not be used.

Uncertainty is omnipresent in drug discovery and, despite what many appear to believe, AI/ML is not going to make this uncertainty vanish as if by magic. Derek was emphasizing the challenges presented by the complexity of biology long before AI came to be seen by some as a panacea for the ills of Pharma/Biotech (here’s a post from almost two decades ago and I also recommend reading his 2025 post on the “End of Disease” interview which also links relevant previous posts). The complexity of biology means that even if we knew the extent of target engagement in vivo (which varies with both dose and time) we wouldn’t generally be able to predict the in vivo effects of the drug with any confidence in the absence of other information. There is also uncertainty in exposure to consider and the concentration of a drug at its site(s) of action generally cannot be measured in vivo unless the target(s) are in direct contact with plasma. Uncertainty in exposure for intracellular targets is also a clinical development issue because failure in a Phase II trial may simply reflect inadequate exposure (we noted in KM2013 that “one can argue that a typical Phase I trial provides an incomplete description of distribution”). I recommend that everybody working in drug discovery and chemical biology read Smith & Rowland (2019) Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise DMD 47:667-672 DOI. I argue in NoLE that achieving controllability of exposure should be seen as an objective of drug design.

One way that pharmacokinetic pharmacodynamic (PK/PD) modellers address the issue of intracellular exposure is to assume that the concentration of drug in contact with its target(s) (and anti-targets) equals its unbound concentration in plasma (which can be measured in real time) and this assumption is referred to as the ‘free drug hypothesis’ (‘principle’ and ‘theory’ are also used in this context although I personally prefer ‘hypothesis’ because it’s an assumption we’re making). There are two scenarios under which the approximation of the concentration of drug at its site(s) of action by its unbound concentration in plasma is known to be unreliable. The first scenario is that there is significant active transport at one or more points on the path between plasma and the drug’s site(s) of action (active efflux is a common problem, especially in CNS drug discovery, although active influx will still cause the assumption to break down). The second scenario is that the pH at the drug’s site(s) of action differs from plasma pH (as would be the case for a lysosomal target) and that there is an ionizable group such as a basic nitrogen in the chemical structure of the drug.
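The second scenario can be illustrated with the standard pH-partition argument. What follows is a minimal sketch rather than a PK/PD model: it assumes a monoprotic base, that only the neutral form crosses membranes, and that the unbound concentration of the neutral form is the same on both sides at steady state. The function names, the pKa of 9 and the site pH of 5 are illustrative assumptions and are not taken from any study cited here.

```python
def ionized_to_neutral_ratio_base(pka: float, ph: float) -> float:
    """Henderson-Hasselbalch ratio [BH+]/[B] for a monoprotic base."""
    return 10.0 ** (pka - ph)


def site_to_plasma_ratio_base(pka: float, ph_site: float, ph_plasma: float = 7.4) -> float:
    """Ratio of total unbound concentration (neutral + ionized) at the site
    of action to that in plasma, assuming the neutral form equilibrates
    freely and the ionized form does not cross membranes."""
    site = 1.0 + ionized_to_neutral_ratio_base(pka, ph_site)
    plasma = 1.0 + ionized_to_neutral_ratio_base(pka, ph_plasma)
    return site / plasma


if __name__ == "__main__":
    # Hypothetical base (pKa 9) and a nominal acidic compartment at pH 5
    print(round(site_to_plasma_ratio_base(9.0, 5.0), 1))
    # With no pH difference the ratio is 1 and the free drug hypothesis holds
    print(site_to_plasma_ratio_base(9.0, 7.4))
```

Under these assumptions a pH difference of a couple of units translates into a difference of a couple of orders of magnitude in total unbound concentration, which is why the unbound plasma concentration is a poor proxy in this scenario.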

While drug design does indeed have multiple objectives it really shouldn’t need to be said that if the required level of bioactivity cannot be achieved then it becomes irrelevant whether the other objectives are achieved and I’ll direct readers to M2026 (The Affinity Advantage). I see M2026 as providing a much-needed cold shower for a 2024 JMC Editorial (Property-Based Drug Design Merits a Nobel Prize; see blog post) in which it is asserted that “a discovery compound is more likely to become a drug when Fsp3 > 0.40” and that “a compound is more likely to have good developability when PFI < 7”. Nevertheless, I don’t consider M2026 to be especially useful from the perspective of defining drug design objectives because bioactivity is typically quantified by potency rather than affinity in drug discovery projects (an assay for kinase inhibition might have been run at high ATP concentration to mimic the intracellular environment) and some bioactivity objectives are defined in terms of measurements made in cell-based assays. Furthermore, bioactivity for ‘new modalities’ such as irreversible covalent inhibition and targeted protein degradation cannot be adequately described by a single parameter such as an IC₅₀ value.

I criticized the term ‘avoid-ome’ in a previous post and, with apologies for the dreadful pun, I would recommend that its use be avoided (at the risk of repetition ADME and toxicity are entirely separate issues that must be addressed separately). Furthermore, I would question whether drug designers actually need yet another ‘ome’ word and I consider the notion that embracing the avoid-ome will transform drug discovery to be fanciful. While inhibition of cytochrome P450 (CYP) enzymes is generally undesirable from a toxicity perspective a compound that was not cleared by these metabolic enzymes would greatly worry those responsible for drug safety (bear in mind why we worry about inhibition of CYPs in the first place). Furthermore, I would challenge the inclusion by M2026 of serum albumin in a list of anti-targets such as hERG (I’m not aware of anybody suffering cardiac arrest on account of their medication binding to serum albumin) and the excellent B2025 study notes that "most drugs are >95% plasma protein bound (58%), with a large fraction >99% bound (29%)". Binding to plasma proteins should actually be considered within the framework of distribution (it can be instructive to pose the question as to whether you could tell where a drug was simply from knowing the total quantity of it in the body and its unbound plasma concentration). It’s also worth mentioning that binding to plasma proteins will protect an orally-dosed drug from the metabolizing enzymes during its first pass through the liver (before it gets a chance to distribute into the tissues). Variation of the plasma concentration during the dosing interval for an orally-dosed drug is a necessary evil resulting from oral dosing and in many situations the ‘ideal’ pharmacokinetic profile would actually be that resulting from intravenous infusion (plasma concentration of the drug is maintained at a level required for therapeutically useful effects).

At this point I’ll attempt to articulate three general objectives of drug design (the only thing that I’m entirely confident about here is that I won’t get these exactly right). One of the great challenges that drug designers face is that it is usually difficult to identify compounds that simultaneously achieve all the design objectives. Specifying criteria for objectives too permissively increases the risk of choking in clinical development. However, overly stringent specification of criteria for objectives decreases the likelihood of achieving all of the objectives and will slow the discovery process. I state these objectives in terms of ‘bioactivity’ rather than ‘potency’ to accommodate ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation although, in many cases, it will be possible to quantify the bioactivity for a compound by a single IC₅₀ or EC₅₀ value. I use ‘maximize’ and ‘minimize’ (as opposed to ‘optimize’) to frame the objectives because there is generally no penalty for identifying better compounds than you think you need. Assessing how well objectives have been achieved involves running a diverse range of assays and, as noted in this blog post on the A2025 study, it is important to be fully aware of the quantitation limits for each and every assay that you use.

I'll conclude the post with what I would argue are the three objectives of drug design:

Maximize on-target bioactivity. This is the least difficult objective to specify because bioactivity characterized in the in vitro assays is likely to translate to target engagement in vivo provided that the compound can be presented to the target(s) at the required concentration. Design outputs are usually evaluated in animal models for the human disease before initiating studies in humans but the design itself is almost invariably done against in vitro end points.
Minimize off-target bioactivity. It is generally more difficult to specify objectives for off-target bioactivity than for on-target bioactivity on account of the numbers and diversity of the assays involved. Design outputs are always evaluated for toxicity in animals before initiating studies in humans (as mandated by regulatory authorities) but the design itself is almost invariably done against in vitro end points.
Maximize controllability of exposure. This objective, which might also be stated as 'Optimize ADME', is the most difficult of the three objectives to specify because, as noted earlier in this post, exposure generally can’t be measured for targets that are not in direct contact with plasma. At absolute minimum it is necessary to demonstrate that a pharmacokinetic profile can be achieved in animals that will maintain the (unbound) concentration of the compound at levels that we believe will result in beneficial therapeutic effects in humans. For targets not in contact with plasma the PK/PD modellers also need to be able to confidently invoke the free drug hypothesis (this is why I prefer to frame the objective in terms of exposure rather than ADME) and this requires that design outputs have good passive permeability and are not subject to active transport. In some cases it will also be necessary to demonstrate access to specific organs such as the CNS.

Comparing ML models in small molecule drug discovery

2026-04-21T06:51:00.020+01:00

To start the post I'll share a photo that I took in 2012 of incense sticks at the Truc Lam pagoda near Da Lat. Not long after taking this photo I lost a lens cap (although thankfully not the lens) riding a luge through a forest and would later visit a cricket farm (this was particularly welcome because I had developed a taste for fried crickets during a visit to Cambodia in 2005).

I’ll be reviewing A2025 (Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery) in this post. I consider the issues addressed by the authors to be extremely important and I think that the credibility of the Machine Learning (ML) field would be greatly enhanced if Editors declared words like 'outperform' to be verboten in manuscripts submitted to their journals. However, I will make a couple of criticisms of the study. First, ML modellers need to properly account for the number of adjustable parameters used to fit training data (the S2007 study goes further than this by arguing that one should also account for the size of the descriptor pool). Second, ML modellers need to recognize that cross-validation can make optimistic assessments of model quality when there is a high degree of clustering in training data. I’ll point you toward earlier Molecular Design blog posts (Sep2024 | Oct2024 | Jul2025) that may be relevant to the discussion. As is usual for posts here at Molecular Design, quoted text is indented with my comments italicised in red.

The ML models that form the focus of the A2025 study aim to predict properties (more generally behaviour) of compounds from their chemical structures. Although there is currently a lot of hype around ML models for drug discovery it’s worth bearing in mind that people have been building quantitative structure-activity/property (QSAR/QSPR) models for decades (the inaugural EuroQSAR conference was held in Prague a mere five years after Czechoslovakia had been invaded by forces from the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). As I see it QSAR/QSPR approaches never really made much of a splash in real world drug discovery and my challenge to those who tout ML models as a panacea for the ills of Pharma/Biotech would be to ask why they think it’s going to be any different this time.

One of the difficulties that QSAR/QSPR practitioners faced when working within drug discovery project teams was that projects had often delivered (or had been put out of their misery) by the time there was enough data to build predictively useful models. It’s also worth pointing out that drug discovery teams have frequently delivered (and continue to deliver) clinical development candidates without ever having sufficient data for building usefully predictive QSAR/QSPR models. Something that many QSAR/QSPR practitioners never seemed to get is that much drug design is actually hypothesis-driven (I discussed this point 16 years ago in K2009 and I’ll point you to the P2012 article by former colleagues). A significant part of hypothesis-driven drug design is identification of exploitable features in structure activity/property relationships (SARs/SPRs) such as activity cliffs and instances of increased polarity not resulting in loss of potency. A simple plot of potency against lipophilicity might not be predictively useful but it can still be used to quantify the extent to which the potency of a compound beats the trend in the data (see ‘Alternatives to ligand efficiency for normalization of affinity’ section in NoLE). My view is that hypothesis-driven drug design actually fits very naturally into an AI framework and those who tout AI as a drug design panacea appear to be missing a trick by seeing drug design as essentially an exercise in prediction.

Many of the properties of compounds of interest to ML modellers in drug discovery can be modelled as if they are equilibrium constants or rate constants (continuous-valued, dimensioned quantities) and typically fall into three general categories:

In vitro bioactivity is usually quantified in terms of potency (the concentration at which a compound exhibits a specified effect in a bioactivity assay). Despite the views expressed in a rather bizarre JMC Editorial (a recent JMC Perspective provides a useful counterview and this blog post is also relevant), it is the most important of the properties because you can’t compensate for inadequate potency by increasing quality of compounds or by making them more beautiful (see B2012); I touch on this point in a recent blog post. It is important that ML modellers be aware that for some ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation the effect of a compound on the target depends on time as well as concentration. I discuss some of the issues that you need to think about when combining potency and affinity data for ML modelling of bioactivity in this blog post.
Properties considered to be relevant to ADME (absorption, distribution, metabolism, and excretion) include lipophilicity, aqueous solubility, permeability (both passive permeability and active efflux) and plasma protein binding. While these properties are often described collectively as a compound's 'ADME profile' it's not actually accurate to do so because the ADME acronym refers to behaviour of compounds in vivo. Lipophilicity is the single most fundamental physicochemical property in drug design and it’s very important that ML modellers be aware that it's log D, rather than log P, that is measured and that the choice of octanol/water for log D measurement is entirely arbitrary.
Toxicity is typically assessed by measuring potency against anti-targets such as hERG and CYPs and cell-based assays are often used for assessment of toxicity. Generally it is more difficult to find suitable assay data for ML modelling of toxicity than is the case for modelling bioactivity against potential therapeutic targets. One reason for this is that responses in the cell-based assays commonly used to assess toxicity can't generally be linked to engagement of specific anti-targets (this is not to deny the value of the information provided by the assays for decision-making by drug discovery scientists). Furthermore, observations of potency in toxicity assays are likely to steer project teams away from the associated chemotypes and so it is very unlikely that ML modellers will encounter datasets for individual structural series with sufficient variance for building models.

When modelling properties of compounds that you believe to be relevant to small molecule drug discovery it’s important to bear in mind that even with a complete set of measured properties available it’s not generally feasible to predict what will happen when compounds are dosed in vivo. One reason for this is that the therapeutic (and adverse) effects of a drug are driven by its concentration at its site(s) of action which is a time-dependent quantity that cannot generally be measured in live humans. I argue in NoLE that the objective of the ADME-based aspects of drug design is actually to achieve controllability of exposure and one article that I recommend to all drug discovery scientists and chemical biologists is SR2019 (Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise).

A number of assays are available for measuring properties of interest in drug discovery and management of the ‘assay budget’ for projects is an important activity in drug discovery (especially when running assays is an outsourced activity). Drug discovery scientists typically use assays to identify and address specific design issues such as low solubility or unacceptable binding affinity for anti-targets.

In vitro assays used in drug discovery are generally configured for decision-making, rather than for building ML models, and in some cases what some might refer to as the ‘quality’ of the assay might be traded off against throughput (this doesn’t mean that the assays are somehow ‘bad’). In vitro drug discovery assays generally have both lower and upper quantitation limits and an assay’s dynamic range (you can draw an analogy between assays and analytical instruments) is given by the difference between the two values. Needless to say it is very important that ML modellers be fully aware of the lower and upper quantitation limits in the assays used to generate the data from which they will build models. This generally requires careful examination of assay details which might not have been captured by the curation processes used for databases such as ChEMBL (nor even been disclosed in the original publications). For example, the maximum potency that can be quantified in a conventional enzyme inhibition assay is limited by the concentration of enzyme in the assay (see WM1979) and you’ll still need a 5 nM concentration of a picomolar inhibitor to achieve 50% inhibition of enzyme that is present in the assay at a concentration of 10 nM. I generally advise ML modellers to carefully examine the distributions in the datasets that they are modelling for evidence of cut-offs that might indicate quantitation limits in the assays used to generate the data.
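The enzyme concentration example can be reproduced with the standard tight-binding relationship, IC₅₀ = Kᵢ(app) + [E]/2, and the advice about looking for cut-offs can be turned into a crude check. This is a minimal sketch: the function names and the pIC₅₀ values are hypothetical, and the pile-up check is a simple heuristic of my own choosing rather than anything proposed in A2025 or WM1979.

```python
import math


def tight_binding_ic50(ki_app_nm: float, enzyme_nm: float) -> float:
    """IC50 (nM) for a tight-binding inhibitor: IC50 = Ki_app + [E]_total / 2."""
    return ki_app_nm + enzyme_nm / 2.0


def pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50 (negative log10 of the molar IC50)."""
    return 9.0 - math.log10(ic50_nm)


def fraction_at_extremes(values):
    """Fractions of measurements sitting exactly at the minimum and maximum.

    A large fraction at either end hints at a quantitation limit (a heuristic
    only; it is no substitute for reading the assay protocol).
    """
    values = list(values)
    n = len(values)
    lowest, highest = min(values), max(values)
    return values.count(lowest) / n, values.count(highest) / n


if __name__ == "__main__":
    enzyme_nm = 10.0  # enzyme concentration from the example in the text
    for ki_app_nm in (100.0, 10.0, 1.0, 0.1, 0.001):
        ic50 = tight_binding_ic50(ki_app_nm, enzyme_nm)
        print(f"Ki_app = {ki_app_nm:g} nM -> IC50 = {ic50:g} nM, pIC50 = {pic50(ic50):.2f}")

    # Hypothetical pIC50 values with a pile-up at 4.0 (e.g. reported as '< 4')
    measured = [4.0, 4.0, 4.0, 5.2, 6.1, 6.8, 7.3, 4.0]
    print(fraction_at_extremes(measured))
```

The loop shows the measured IC₅₀ flattening out near half the enzyme concentration however potent the inhibitor becomes, which is exactly the kind of ceiling that can masquerade as SAR in a training set.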

The effects of a drug in vivo are typically driven by its unbound concentration in plasma and assays for properties of interest in drug discovery are generally run in buffered aqueous media. It is well-known that measured values for physicochemical properties such as log D and aqueous solubility generally vary with pH for compounds with ionizable groups in their chemical structures. However, values measured for these properties can, in some scenarios, also depend on both the nature and concentration of counter-ion(s). This becomes an issue for log D measurement in cases where significant proportions of compounds are present in the organic phase in ionized forms and for aqueous solubility measurement when the measured value is limited by the solubility of a salt form (as opposed to the neutral form). Dependence of measured property values on the nature and concentration of counter-ions is likely to be more of an issue when the degree of ionization (in aqueous media) is relatively high and my default advice is to consider pK_a when models underpredict log D or overpredict aqueous solubility values.
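The log D point can be made concrete with the usual textbook expression for a monoprotic base. This is a sketch under stated assumptions: `log_p_ion` is a hypothetical apparent partition coefficient for the ionized form (in reality it would depend on the nature and concentration of the counter-ion), and the numerical values are illustrative only.

```python
import math


def log_d_base(log_p_neutral: float, pka: float, ph: float, log_p_ion=None) -> float:
    """log D for a monoprotic base at a given pH.

    With log_p_ion=None only the neutral form partitions into the organic
    phase. Supplying log_p_ion lets the ionized form partition as well
    (for example as an ion pair), which raises log D.
    """
    ratio = 10.0 ** (pka - ph)  # [BH+]/[B] in the aqueous phase
    p_neutral = 10.0 ** log_p_neutral
    p_ion = 0.0 if log_p_ion is None else 10.0 ** log_p_ion
    return math.log10((p_neutral + p_ion * ratio) / (1.0 + ratio))


if __name__ == "__main__":
    # Hypothetical strong base: log P = 3, pKa = 10.4, measured at pH 7.4
    neutral_only = log_d_base(3.0, 10.4, 7.4)
    with_ion_pair = log_d_base(3.0, 10.4, 7.4, log_p_ion=0.0)
    print(f"neutral form only: {neutral_only:.2f}")
    print(f"with ion partitioning: {with_ion_pair:.2f}")
```

A model that implicitly assumes only the neutral form partitions will underpredict log D for compounds like this one, which is consistent with the advice to look at pK_a when log D is underpredicted.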

Before addressing what I consider to be the main problems with A2025 I’ll make some specific comments on the study. While these comments might appear to be pedantic (some might even use the term ‘nit-picking’) I would argue that the authors have raised the bar for themselves by claiming that their proposed “guidelines, accompanied by annotated examples using open-source software tools, lay a foundation for robust ML benchmarking and thus the development of more impactful methods”. By way of an example, if you're trying to persuade an analytical chemist to modify an aqueous solubility assay to make it more suitable for generating data to build ML models then it's not such a great idea to describe aqueous solubility as a molecular property or to confuse the range in a data set with the dynamic range of the assay used to generate the data.

In the Introduction (Section 1) the Authors state:

In drug discovery, expensive and time-consuming experiments are used to profile molecules [While it is common for drugs to be described as ‘molecules’, especially in promotional material, I generally recommend that ‘molecule’ not be used as a synonym for ‘compound’ in articles with a cheminformatic (or indeed a chemical) focus.] and gain insights into their therapeutic potential. Such experimental assays are typically organized in a cascade, where subsequent experiments test fewer molecules at a higher cost per molecule. As in silico surrogates to such experiments, both regression and classification Machine Learning (ML) models can be trained to estimate molecular properties [These are properties of compounds, as opposed to molecules, and should neither be described as ‘molecular properties’ nor as ‘small molecule properties’.] (i.e., experimental results) from chemical structure. Such models could inform drug design and prioritize experiments by scoring a set of candidate molecules. [The term ‘candidate molecules’ is as clumsy as it is inaccurate, and its meaning will not be clear to some readers. I recommend that the term ‘chemical structures’ be used instead.] These ML models thus inform high-stakes decisions [The ML models that are the focus of this study inform decisions as to which compounds should be synthesized and these decisions would not automatically be considered to be high-stakes decisions in contemporary drug discovery given developments in automation and high-throughput synthetic chemistry. It’s also important to be aware that in real life drug discovery many decisions to synthesize compounds are made with the knowledge that structural analogs have already been synthesized and shown to be active against the targets of interest. I would argue that genuinely high-stakes decisions, such as prioritization of compounds for in vivo studies, are only made after compounds have actually been synthesized and evaluated in relevant in vitro assays.] and help drug discovery research progress more quickly and efficiently. Hence, it is important that models provide reliable forecasting of experimental results.

In Section 3.3.1.3 (Dynamic Range) the Authors state:

Both correlation and error metrics are influenced by the dynamic range of the data being modeled. [I consider this use of the term ‘dynamic range’ to be incorrect and, as a reviewer, I would have pressed the Authors to explain the difference between the range of a data set and its dynamic range. As noted earlier I see dynamic range as a characteristic of an analytical instrument or an assay (which can be considered to be a type of analytical instrument) and I would argue that the term should not be applied to data sets. That said, it may be possible to infer the dynamic range of an assay through careful examination of the data.] Achieving a high correlation on data sets with a broader range of experimental values is generally easier, whereas data sets with a smaller dynamic range can produce unrealistically small values for error metrics. [While the range of a data set certainly imposes limits on variance it’s important to remember that measures of correlation are defined in terms of variance (as opposed to range) of the data. For a data set to be useful for building ML models the variance for replicate measurements needs to be small in comparison with the overall variance for the data set.] This can lead to deceptive conclusions.

With the pedantry (or nit-picking if you prefer) out of the way it’s time to take a look at what I consider to be the principal flaws of A2025. First, I consider it important to account for the number of adjustable parameters used to fit training data and, at very least, the authors should have acknowledged this as an issue. Second, I have concerns that cross-validation can lead to optimistic assessment of model quality when there is a high degree of clustering in training data and a post from July last year might be relevant.

It’s well known that you can achieve a better fit to your data by simply using more adjustable parameters (I recommend that all ML modellers take a look at H2004 (DM Hawkins, The Problem of Overfitting, JCICS 2004 44:1-12)) and my position is that it’s generally not meaningful to compare performance for models that differ in the number of adjustable parameters used to fit the training data without properly accounting for numbers of adjustable parameters. A criticism that I was making of the QSAR/QSPR field many years ago (long before ML modelling came to be touted as a panacea for the ills of Pharma/Biotech) was that many of those building models appeared to dismiss the accounting for numbers of adjustable parameters as a non-issue. It’s worth noting that building ML models typically involves selection of a subset of descriptors from a larger pool and the S2007 study argues that you also need to account for the number of descriptors in the pool when assessing model quality. Accounting for the number of adjustable parameters is not just an issue when you’re building ML models for small molecule drug discovery and this point is made in MHG2017 (Mardirossian and Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Molecular Physics, 115 2315–2372):

With semi-empirical density functionals, a measure that is commonly reported upon publication is the total number of parameters. Existing functionals based on the B97 concept have anywhere between 5 and 75 parameters. However, counting the number of parameters is often a confusing and unclear task.

The need to properly account for the number of adjustable parameters (the term 'degrees of freedom' is also used, especially in the older literature) when modelling data has actually been recognised for many years. The agrarian economist Mordecai Ezekiel (1899-1974), who shaped much of FDR’s agricultural policy, introduced adjusted R² (link1 | link2) in Methods of Correlation Analysis which was published in 1930. The F-test (link1 | link2) can be used to assess whether the use of additional adjustable parameters is justified although I’m not aware of exactly when this particular use of the F-test was introduced. It’s also worth pointing out that the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) appeared in the statistics literature in 1974 and 1978 respectively. I certainly wouldn’t claim to have comprehensively reviewed the importance of accounting for number of adjustable parameters when comparing ML model performance nor am I suggesting that this is something that would be easy to do. Nevertheless, I do hope that it's clear that this is not something that can simply be swept under the carpet (or even ejected from the window of an upper floor Moscow apartment).
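For readers who have not met these quantities, here is a minimal sketch of how each one penalizes additional adjustable parameters. It assumes ordinary least-squares fits with Gaussian errors (the AIC and BIC expressions are the usual least-squares forms with constant terms dropped), and the RSS values and parameter counts in the demonstration are invented for illustration. Counting 'parameters' for a modern ML model is itself non-trivial, so this is not a recipe for comparing such models.

```python
import math


def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared; n_predictors excludes the intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic_least_squares(rss: float, n_obs: int, n_params: int) -> float:
    """AIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def bic_least_squares(rss: float, n_obs: int, n_params: int) -> float:
    """BIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


def nested_f_statistic(rss_small: float, rss_large: float,
                       p_small: int, p_large: int, n_obs: int) -> float:
    """F statistic for comparing nested least-squares models.

    p_small and p_large count all fitted parameters (including intercepts).
    The value is compared with an F distribution with (p_large - p_small,
    n_obs - p_large) degrees of freedom.
    """
    numerator = (rss_small - rss_large) / (p_large - p_small)
    denominator = rss_large / (n_obs - p_large)
    return numerator / denominator


if __name__ == "__main__":
    n_obs = 50
    # Hypothetical fits: the larger model lowers RSS but uses 7 more parameters
    for label, rss, n_params in (("small", 12.0, 3), ("large", 10.5, 10)):
        print(label,
              round(aic_least_squares(rss, n_obs, n_params), 2),
              round(bic_least_squares(rss, n_obs, n_params), 2))
    print(round(nested_f_statistic(12.0, 10.5, 3, 10, n_obs), 2))
```

In this invented example the larger model fits the training data better yet is disfavoured by both AIC and BIC, which is the sense in which a raw comparison of fit quality can mislead.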

This is a good point at which to say something about validation of ML models and I would argue that it is actually very difficult to demonstrate objectively that one protocol for validation is better than another. Two general approaches for validation of ML models are to use cross-validation and to split data into a training set and an external test set (that the model never sees). A view that I’ve held since the late 1990s is that many ‘global’ models for predicting properties of compounds relevant to drug discovery are actually ensembles of local models (this view was expressed publicly in the B2009 study). I would anticipate that clustering in data sets will cause cross-validation to give optimistic assessments of model quality which in turn can lead to overfitting. I would also expect principal component analysis (PCA) to be less meaningful for highly clustered data (this is relevant because correlations between chemical structure descriptors need to be accounted for in order to calculate meaningful distances between chemical structures in the space). Something that I do need to make clear is that ‘clustering’ in the context of this post simply refers to distribution within the chemical structure descriptor space of a model.

The Authors of A2025 recommend "using a 5 × 5 repeated cross-validation procedure to sample the performance distribution” and one point that I’ll make is that they haven’t demonstrated that this protocol is more effective than 4 × 4 repeated cross-validation or 6 × 6 repeated cross-validation. While this might appear to be nit-picking I will point out that it would not be valid to invoke A2025 if criticising a future ML modelling study for using 4 × 4 repeated cross-validation (bear in mind that a substructural match against even a single PAINS filter would be considered by some to constitute the basis for a valid criticism in medicinal chemistry and K2017 might be of interest in this context).
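For readers unfamiliar with the terminology, a k × k repeated cross-validation simply reshuffles the data k times and performs a k-fold split after each shuffle. The sketch below is my own illustration rather than the A2025 implementation (the function name, seed and the 23-object example are assumptions) and it treats the number of folds and the number of repeats as ordinary parameters, so a 4 × 4 or 6 × 6 protocol is generated in exactly the same way as 5 × 5.

```python
import random


def repeated_kfold(n_objects, n_folds=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_objects))
        rng.shuffle(order)  # a fresh random ordering for every repeat
        for fold in range(n_folds):
            # Every n_folds-th object of the shuffled order forms the test fold.
            test = sorted(order[fold::n_folds])
            held_out = set(test)
            train = [i for i in range(n_objects) if i not in held_out]
            yield repeat, fold, train, test


splits = list(repeated_kfold(23, n_folds=5, n_repeats=5))
print(f"{len(splits)} train/test splits generated")
```

Each split would be used to fit and score a model, giving 25 performance values for a 5 × 5 protocol (16 for 4 × 4, 36 for 6 × 6).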

The general approach to cross-validation is to repeatedly split the data into training sets and test sets before assessing how well on average the test data are predicted (algorithms differ as to exactly how this is done). When there is a high degree of clustering the data splits are likely to retain some members for each cluster in the training sets which can ‘anchor’ the models. Here’s what H2004 has to say:

If the collection of compounds consists of, or includes, families of close analogues of some smaller number of ‘lead’ compounds, then a sample reuse cross-validation will need to omit families and not individual compounds.
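The H2004 recommendation can be sketched as a 'leave-family-out' split in which whole families, rather than individual compounds, are held out. This is a minimal illustration: the family labels are hypothetical and how compounds are assigned to families (for example by scaffold or by clustering in descriptor space) is deliberately left outside the sketch.

```python
def leave_family_out(family_labels):
    """Yield (held_out_family, train_indices, test_indices), one split per family."""
    for family in sorted(set(family_labels)):
        test = [i for i, label in enumerate(family_labels) if label == family]
        train = [i for i, label in enumerate(family_labels) if label != family]
        yield family, train, test


# Hypothetical family assignments for nine compounds.
labels = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
family_splits = list(leave_family_out(labels))

for family, train, test in family_splits:
    print(f"hold out family {family}: train on {train}, test on {test}")
```

Because no member of the held-out family remains in the training set, the model cannot be 'anchored' by close analogues of the compounds that it is being asked to predict.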

Another approach to validating ML models is to use external test sets although this can still lead to optimistic assessments of model quality when the available data are highly clustered. One advantage of this approach to validation is that external test sets can be ‘structured’ to provide a more detailed view of model performance (one criticism that I would make of cross-validation is that it gives a rather ‘one-dimensional’ assessment of model performance). One way to structure test sets is to characterize (by size and closeness) the neighbourhood within the training set for each object in the test set. The motivation for structuring the test sets in this manner is that it enables you to analyse relationships between prediction performance and the degree of coverage of space around test set objects by training set data. There are, however, other ways to structure test sets and my view is that classifying test set compounds according to whether they are neutral, cationic or anionic would potentially be informative when assessing models for log D, aqueous solubility, permeability, plasma protein binding, volume of distribution and hERG blockade. Although it’s not directly relevant to this post I would generally recommend that ML model predictions be presented to users along with training set data for the nearest neighbours in the model space and the most similar chemical structures in the training set.
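The neighbourhood characterization described above can be sketched as follows. I've assumed plain Euclidean distance in a two-dimensional descriptor space purely for illustration (as noted earlier, correlations between descriptors need to be accounted for before distances are meaningful) and the function name, the radius and the coordinates are all invented.

```python
import math


def neighbourhood(test_point, training_points, radius):
    """Describe training set coverage around one test set object."""
    distances = sorted(math.dist(test_point, p) for p in training_points)
    return {
        "nearest": distances[0],  # closeness of the nearest training neighbour
        "n_within_radius": sum(1 for d in distances if d <= radius),  # size
    }


# Hypothetical training set: a tight cluster plus one outlying compound.
training = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (3.0, 3.0)]

well_covered = neighbourhood((0.05, 0.05), training, radius=0.5)
isolated = neighbourhood((6.0, 6.0), training, radius=0.5)

print("well covered:", well_covered)
print("isolated:    ", isolated)
```

Prediction errors for test set objects can then be analysed as a function of these two quantities instead of being averaged into a single number.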

This is a good point at which to wrap up and I concede that it’s difficult to account for numbers of adjustable fitting parameters and to meaningfully validate models when distributions of objects within the relevant chemical spaces are very uneven. That said, I would argue that creators of ML models do at least need to acknowledge these issues given that many tout models like these as essential for AI-based drug design.

Anticipating a future blog post on chemical space coverage I'll finish the post by noting that coverage is also of historical relevance. The B-52 in the photo is not in the best state of repair and this shouldn't surprise you because I took the photo during a 2005 visit to Hanoi. In those days it was considered to be good form to show disrespect for the enemy's military hardware and so I gave the wreckage a good kick. I also paid my respects to Uncle Ho whom I’m told is in much better shape than Chairman Mao (owing to the then frosty Sino-Soviet relations the latter was pickled by inexperienced compatriots rather than by the Russian experts who had pickled the former and it is said that the embalming team arrived from Moscow before Uncle Ho had actually expired). A few days later in Dien Bien Phu I caused a minor consternation by demonstrating that the barrel of an American-made 155 mm howitzer that had been captured from the French in 1954 could still be elevated (admittedly it was a little stiff). Apparently, the French had asked the Americans if they would be so kind as to drop lots of bombs (or perhaps one very big bomb) on the Viet Minh but President Eisenhower wisely denied the request. The B-52 in the photo was one of a number sent by President Nixon (who had been President Eisenhower’s VP) to bomb North Vietnam during Operation Linebacker II (aka the Christmas Bombings) and it's my understanding that all crew members survived their encounter with the SAM.

PAINS and Prejudice

2026-04-01T07:04:00.005+01:00

PAINS (pan assay interference compounds) filters have exerted a hold over the drug discovery community ever since the BH2010 study appeared over 15 years ago. Initially I didn’t take much notice of PAINS filters and, in any case, I’d already moved on from analysis of high-throughput screening (HTS) output by that point (I might add ‘thankfully’ because looking at too much HTS output is a sure-fire route to the funny farm). I started analysing HTS output from about 1993 at what was then Zeneca. I used the Daylight toolkit to create the Struct_Anal SMARTS-based chemical structure profiler in 1995 and, at that time, we were already using in house software named Flush (even at that stage it was clear that much of the HTS output being generated was going to disappear round the S-bend and our friends at what was then Rhône-Poulenc Rorer developed HARPick to ensure that nothing remained stuck to the porcelain).

Photo from 2011 at 'The Black Hole' (Los Alamos NM)

Something that had always worried me was that it was very easy to opine that a compound looked nasty but it was much more difficult to demonstrate objectively that the compound was indeed nasty. Late in 2014 a blog post, which fell well short of the standards that the drug discovery community has come to expect from Practical Fragments, prompted me to take a more forensic look at PAINS filters. What I found was that PAINS filters were based on the output from screening compounds in just six AlphaScreen assays (if a panel of six assays that all use the same read-out strikes you as suboptimal design of an experiment to detect pan-assay interference then you’re not alone). After blogging periodically about PAINS filters for a couple of years I wrote a Perspective on the topic (as noted in this blog post: from time to time, every blogger should write a journal article “pour encourager les autres”).

Nevertheless, doubts about the correctness of my position started to creep in when I was denounced for being insufficiently thoughtful in my published comments on PAINS by the authors, one of whom is a former colleague, of the seminal, insightful and Nobel-worthy ‘Seven Year Itch’ article (BN2017) which oozes wisdom and penetrating insight. Although stung by the criticism and wracked by self-doubt to the extent that I considered therapy, it was a recent study led by the world-renowned expert on tetrodotoxin pharmacology, Prof. Angelique Bouchard-Duvalier of the Port-au-Prince Institute of Biogerontology, working in collaboration with the Budapest Enthalpomics Group (BEG), that removed any lingering doubts about the sublime elegance and extreme predictivity of PAINS filters. The manuscript has not yet been made publicly available although I was able to access it with the help of my associate ‘Anastasia Nikolaeva’ (not sure exactly what she’s doing these days although I understand that she’s currently visiting Port-au-Prince for a medication review with Prof. Bouchard-Duvalier). There is no doubt that this genuinely disruptive study will comprehensively reshape the predictive biochemistry landscape, enabling drug discovery scientists to accurately, meaningfully and robustly predict assay interference using only chemical structures as input for the very first time.

Prof. Bouchard-Duvalier’s seminal study clearly demonstrates that singlet oxygen quenching is actually a conserved feature for all known and unknown mechanisms of interference with assay read-outs and that PAINS filters dramatically outperform all other methods for prediction of assay interference. The math is truly formidable (the rudimentary nature of my understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the quadrupole-normalized polarizability tensor before applying the Barron-Samedy transformation, followed by hepatic eigenvalue extraction using the elegant algorithm devised by E. V. Tooms (a reclusive Baltimore resident and connoisseur of liver pâté whose illustrious thought leadership of the analytic topology field unravelled almost 32 years ago after he failed to comply with the safety instructions for an escalator). The incisive analysis of Prof. Bouchard-Duvalier shows without a shadow of doubt that singlet oxygen quenching as quantified by the AlphaScreen assay read-out is a fundamental principle in biomolecular assay science. Furthermore, ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which the grinning BEG director Prof. Kígyó Olaj explains:

Possibilities are limitless now that we can accurately and robustly predict the assay interference that compounds will exhibit directly from their chemical structures and we can safely consign experimental biochemical assays to the dustbin of history. Surely the Journal of Medicinal Chemistry Editors will now finally recognize the colossal impact that PAINS filters have made on real world drug discovery and development when they make their FIFA Prize nominations later this year.

Hit to Lead best practice?

2025-12-31T17:22:00.059+00:00

I'm now in Trinidad and I'll share a 180° panorama from Paramin where I walk for exercise. This district in Trinidad's Northern Range is renowned for its agriculture and the most excellent produce is grown in 'gardens' on steep hillsides. My walk would take about two and a quarter hours if I just walked but it usually takes rather longer because I like to take photos and often stop on the ridge to gaze at corbeaux 'surfing' the updrafts. Most of all I enjoy catching up with friends in Paramin and not so long ago one of them was telling me about the sound made by douens (which have terrified me since childhood because I was never baptised). Some years ago I was struggling along the ridge with a hacking cough that I'd brought with me from the UK three days previously when I heard a familiar voice (one of my friends was visiting his sister). The conversation turned to my cough and he instructed his sister to bring some medicine. She produced a bottle of a liquid that looked like fluorescein and, as she decanted some into a shot glass my friend exclaimed "dat too much yuh go kill he". The liquid appeared to have a puncheon base and my friend's sister also gave me some bush to make tea. My cough was history after three days.

I’ll be taking a look at The European Federation for Medicinal Chemistry and Chemical Biology (EFMC) Best Practice Initiative: Hit to Lead (Q2025) in this post. I have a number of criticisms of this work and it really shouldn’t need saying that you do raise the bar for yourself when you present your work as defining best practices. As is customary for blog posts here at Molecular Design I’ve used Q2025 reference numbers when referring to literature studies and quoted text is indented with my comments in red italics. This will be a long tedious post and strong coffee is recommended.

Best practices are, in essence, recommended ways of doing things and it’s actually very difficult to demonstrate objectively that one way of doing things is better (or worse) than another way. My general view of Q2025 is of a poorly organized article that at times lacks clarity and coherence. Some of the advice offered on how best to do Hit to Lead (H2L) work is unsound and the Authors also make a number of significant errors. Although the abstract refers to “contemporary drug discovery” the recommended best practices do, in my view, appear to be firmly rooted in the past given that fragment-based design (FBD) is not covered and there is no mention of important 'new' modalities such as irreversible covalent inhibition and targeted protein degradation. It’s worth mentioning that biological activity for some new modalities cannot be meaningfully quantified as a single parameter such as an IC₅₀ value and this complicates the use of ligand efficiency metrics (a post on covalent ligand efficiency will give you an idea of the tangles you can get yourself into) which the Authors seem to consider important in H2L work. I consider the quantity of literature cited in Q2025 to be excessive, especially given that some of the cited articles have minimal relevance to H2L work (the failure of the Authors to cite R2009 is also noteworthy). In some cases the cited literature does not support assertions made by the Authors. In my view Figures 1, 5 and 8 are redundant.

While I see plenty wrong with Q2025 it’s worth flagging up points on which the Authors and I appear to be in agreement. I think that they put it well with the following statement:

Leads have line of sight to a development candidate and bring an understanding of what priorities Lead Optimisation should address.

I used this football analogy in an earlier post:

The screening phase is followed by the hit-to-lead phase and it can be helpful to draw an analogy between drug discovery and what is called football outside the USA. It’s not generally possible to design a drug from screening output alone and to attempt to do so would be the equivalent of taking a shot at goal from the centre spot. Just as the midfielders try to move the ball closer to the opposition goal, the hit-to-lead team use the screening hits as starting points for design of higher affinity compounds. The main objective in the hit-to-lead phase is to generate information that can be used for design and mapping structure-activity relationships for the more interesting hits is a common activity in hit-to-lead work.

I certainly agree that it is important to establish structure-activity relationships (SARs) for structural series of interest although I have no idea what the Authors mean by “dynamic SAR”. I also agree that consideration of physicochemical properties, especially lipophilicity, is very important in H2L work (just as it is in optimisation of the leads) although the case for a Nobel Prize made in a 2024 JMC Editorial does, in my view, appear to have been overcooked.

I argue that drug discovery should be seen in a Design of Experiments framework (generate the information that you need as efficiently as possible) rather than as the prediction exercise that many who tout machine learning (ML) as a panacea for the ills of Pharma & Biotech would have you believe. Regardless of which view prevails it’s abundantly clear that generation and analysis of data are very important in contemporary drug discovery and are likely to become even more important in the future. However, if you’re going to base decisions on trends in data then it’s important that you know how strong the trends are because this tells you how much weight to give to the trends when making your decisions. Most drug discovery scientists will have encountered analyses of relationships between predictors of ADME (absorption, distribution, metabolism, and excretion) and physicochemical and chemical structure descriptors and we observed in the KM2013 perspective that:

The wide acceptance of Ro5 provided other researchers with an incentive to publish analyses of their own data and those who have followed the drug discovery literature over the last decade or so will have become aware of a publication genre that can be described as ‘retrospective data analysis of large proprietary data sets’ or, more succinctly, as ‘Ro5 envy’.

In some cases trends observed in data are presented in ways that make them appear to be stronger than they actually are (this is typically achieved by categorizing continuous-valued data prior to analysis) and [13a], [24] and [26] were criticised in this context in KM2013. When reading articles on drug-likeness and compound quality it is also important to be aware that correlation does not imply causation. One should be particularly wary of studies such as [20c] which present analyses of proprietary data as "facts" or claim that such analyses have revealed "principles". I see the weakness of these trends partly as a reflection of chemical structure diversity in datasets and would expect the corresponding trends to be stronger within structural series (I offer the following advice in NoLE):
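The effect of categorizing continuous-valued data prior to analysis is easy to demonstrate with simulated data. The sketch below is my own illustration (it is not taken from KM2013 and the slope, noise level, sample size and bin width are all invented): a weak trend is simulated and the correlation for the individual data points is compared with the correlation obtained after reducing each bin to its median value.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so that the sketch is repeatable
x_values = [rng.uniform(0.0, 5.0) for _ in range(2000)]
y_values = [0.3 * x + rng.gauss(0.0, 1.0) for x in x_values]  # weak trend, lots of scatter

r_raw = pearson_r(x_values, y_values)

# Categorize x into five unit-width bins and keep only the median y for each bin.
bins = {}
for x, y in zip(x_values, y_values):
    bins.setdefault(min(int(x), 4), []).append(y)
bin_centres = [b + 0.5 for b in sorted(bins)]
bin_medians = [statistics.median(bins[b]) for b in sorted(bins)]

r_binned = pearson_r(bin_centres, bin_medians)

print(f"r for individual data points: {r_raw:.2f}")
print(f"r for bin medians:            {r_binned:.2f}")
```

The underlying data are identical in both cases and the only thing that has changed is that the variation within each bin has been hidden from view.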

Drug designers should not automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.

I see erosion of critical thinking skills as a significant problem in contemporary drug discovery and some leaders in the field appear to have lost the ability to distinguish what they know from what they believe. As I observed in a review of a 2024 JMC Editorial (Property-Based Drug Design Merits a Nobel Prize) the Rule of 5 (Ro5) is not actually supported by data in the form that it was stated. The wide acceptance of Ro5 as a definition of drug-likeness propagates what I consider to be a misleading view that drugs occupy a contiguous and distinct region of chemical space. Some of the claims made in the JMC Editorial (“a compound is more likely to be clinically developable when LipE > 5”, “a discovery compound is more likely to become a drug when Fsp3 > 0.40” and “a compound is more likely to have good developability when PFI < 7”) do not appear to be based on data. I remain sceptical that developability and likelihood of clinical success of a compound can be meaningfully assessed even when one knows that the compound actually exhibits exploitable activity against the target(s) of interest. In my view the suggestion that simple drug discovery guidelines are worthy of a Nobel Prize does a huge disservice to drug discovery scientists by trivializing the very significant challenges that they face.

Like many in the drug discovery field, I consider lipophilicity to be the single most important physicochemical property in drug discovery and I would generally anticipate that a surfeit of lipophilicity will end in tears. That said, I don't consider lipophilicity to be usefully predictive of physicochemical properties such as permeability and aqueous solubility that are more relevant than lipophilicity from the perspective of oral absorption. When I assert that lipophilicity is not "usefully predictive" I'm certainly not denying that trends in data exist. However, I must stress that the trends are not so strong that having solubility values that have been predicted from lipophilicity means that you no longer need to measure aqueous solubility.

In drug discovery projects I generally recommend examination of the response of potency (expressed as a logarithm) to increased lipophilicity. In the ideal situation the correlation of potency with lipophilicity will be weak, indicating that potency is driven by factors other than lipophilicity. If the correlation of potency with lipophilicity is strong then you need the response (the slope for a linear correlation) to be relatively steep. I consider it to be generally helpful to plot potency against lipophilicity with reference lines corresponding to different LipE values (see R2009 which is a lot more relevant to H2L work than much of the literature cited in the Q2025 study) and I would also suggest modelling the response and using the residuals to quantify the extent that individual potency measurements beat (or are beaten by) the trend in the data (the approach is outlined in the "Alternatives to ligand efficiency for normalization of affinity" section of NoLE).
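A minimal sketch of this analysis is shown below. The five log P and pIC₅₀ values are invented for illustration and the function names are my own. LipE is calculated as pIC₅₀ minus log P (so a reference line for a given LipE value is simply pIC₅₀ = log P + LipE) and the residuals from a least-squares line show by how much each compound beats, or is beaten by, the trend.

```python
def lipe(pic50, logp):
    """Lipophilic efficiency: potency offset by lipophilicity."""
    return pic50 - logp


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical project data (invented values, not taken from any study).
logp = [1.0, 2.0, 3.0, 4.0, 5.0]
pic50 = [5.2, 5.9, 7.4, 7.0, 8.0]

slope, intercept = fit_line(logp, pic50)
residuals = [y - (intercept + slope * x) for x, y in zip(logp, pic50)]
lipe_values = [lipe(y, x) for x, y in zip(logp, pic50)]

for x, y, le, res in zip(logp, pic50, lipe_values, residuals):
    print(f"log P = {x:.1f}  pIC50 = {y:.1f}  LipE = {le:.1f}  residual = {res:+.2f}")
```

In this invented series the slope is less than unity, so LipE tends to fall as log P increases, and the third compound stands out as beating the trend by the largest margin.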

In drug discovery lipophilicity is usually quantified by the logarithm of the octanol/water partition coefficient (log P) or distribution coefficient (log D). The choice of octanol/water for quantification of lipophilicity is arbitrary and some, including me, consider saturated hydrocarbons such as cyclohexane or hexadecane to be physically more realistic than octanol as a model for the core of a lipid bilayer. It is the distribution coefficient (D) rather than the partition coefficient (P) that is measured for lipophilicity assessment although the two quantities are equivalent when ionization can be safely neglected. Values of log P for ionizable compounds can be derived from the response of log D to pH although this is not generally done routinely in drug discovery. Alternatively, you can make the assumption that only neutral forms of compounds partition into the organic phase and use (1) in the H2L best practices post graphic (see also K2013) to convert log D values to log P values (to do this you’ll also need a reliable estimate for pKa in order to calculate the neutral fraction). When log D (as opposed to log P) is used to assess the ‘quality’ of compounds you can make compounds better simply by increasing the extent to which they are ionized and I hope you can see that going down this path is likely to end as well as things did for the Sixth Army at Stalingrad.
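The conversion can be sketched as follows. I'm assuming a compound with a single ionizable group and that only the neutral form partitions into the organic phase, in which case log D = log P + log(neutral fraction). The function names are my own and the form used here is the standard textbook relationship, which may be written differently from equation (1) in the graphic.

```python
import math


def neutral_fraction(pka, ph, kind):
    """Fraction of a monoprotic acid or base present in its neutral form."""
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    raise ValueError("kind must be 'acid' or 'base'")


def logp_from_logd(logd, pka, ph=7.4, kind="acid"):
    """Convert log D to log P, assuming that only the neutral form partitions."""
    return logd - math.log10(neutral_fraction(pka, ph, kind))


# Hypothetical carboxylic acid: pKa of 4.4 with log D of 1.0 measured at pH 7.4.
acid_logp = logp_from_logd(1.0, pka=4.4, ph=7.4, kind="acid")
print(f"log P = {acid_logp:.2f}")
```

The hypothetical acid is about 99.9% ionized at pH 7.4 so its log P is about three units greater than its log D, which shows how much a log D-based 'quality' assessment can be flattered by ionization.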

In drug discovery log P values are typically calculated and it can often be quite difficult when reading the literature to know which method has been used for the calculations (sometimes the term ‘cLogP’ appears to have been used simply to denote that log P values have been calculated). For example, it is stated in [13a] that “Physical property data were obtained from AstraZeneca’s C-Lab tool, incorporating standard packages for LogP calculations (cLogP, ACDLogP), and an in-house algorithm for the distribution coefficient (1-octanol–water LogD at pH 7.4)”. In general, different prediction methods will give different log P values for the same compound (for example the Ro5 lipophilicity threshold is 5 when ClogP is used but 4.15 when MlogP is used). That said, choice of method for predicting log P and whether you use measured log D or predicted log P become less important issues when working within structural series because hydrogen bond donors and acceptors, and ionizable groups tend to be relatively conserved under this scenario.

That log D and log P are different quantities in the context of drug design is one of a number of things that the Authors of [34a] (Molecular Property Design: Does Everyone Get It?) just don’t seem to ‘get’ and I’ll point you toward a blog post in which this point is discussed in a bit more detail. Let’s examine Figure 2 (Impact of hydrophobicity on developability assays and the profile of marketed oral drugs) of [34a] and I’d like you to look at the upper panel (a). You’ll notice that the visualization for some of the ‘developability’ assays is based on PFI (derived from log D measured chromatographically at pH 7.4). However, the visualization for hERG (+1 charge) and promiscuity is based on iPFI (derived from ‘Chrom logP’ and it is not clear how this quantity was defined or generated). I would also argue that the activity criterion (pIC₅₀ > 5) used in the promiscuity analysis is too permissive to be physiologically relevant (this is a common issue in the promiscuity literature). As an aside, I am unconvinced that log D values were actually measured chromatographically at pH 7.4 for all the drugs that form the basis of the analysis shown in the lower panel (b) of Figure 2.

After a long preamble it’s time to start my review of Q2025 and comments will follow the order of the article. I see the citation of [2] and [3] as gratuitous while [4] does not appear to present evidence in support of the claim that “ensuring high quality of lead series is a large cost and time saver in the overall process of drug discovery” (it must be stressed that I certainly don’t deny the value of high quality lead series and am merely pointing out that the chosen reference does not actually demonstrate that higher quality of lead series results in cost and time savings in drug discovery).

In my view neither Figure 1 nor its caption (see below) makes any sense.

Figure 1. Illustration of the multi-objective characterisation necessary in the journey from a hit to a drug. All these necessary characteristics, described by illustrative principal components, are influenced by the physicochemical properties of the molecules.

You’ll frequently encounter graphics like Figure 1 that show low-dimensional chemical spaces in the drug discovery literature (for example, a 2-dimensional space might be specified in terms of lipophilicity and molecular size). While it’s very easy to generate graphics like these the relevance of the chemical spaces to drug design is often unclear. There are ways in which you can demonstrate the relevance of a chemical space to drug design and, for example, you might build usefully predictive models for quantities such as IC₅₀, aqueous solubility or permeability using only the dimensions of the particular chemical space as descriptors. Alternatively, you could show that compounds in mutually exclusive categories such as ‘progressed to phase 2’ and ‘failed to progress to phase 2’ occupy different regions of the chemical space (note that it’s not sufficient to show that a single class of compounds such as ‘approved drugs’ occupies a particular region within the chemical space and this is the essence of a general criticism that I make of Ro5 and QED). It is common to depict the different categories as ellipses that enclose a given fraction of the data points corresponding to each category and the orientation of each ellipse with respect to the axes indicates the degree to which the descriptors that define the chemical space are correlated for each category. One problem with Figure 1 is that the meaning of the ellipses is unclear and I would challenge the assertion made by the Authors that “the journey of a drug discovery campaign is characterized in Figure 1, showing how the active hit needs to be modified to address the requirements impacting the efficacy and safety of the molecule”.

Potency optimisation alone is not a viable strategy towards the discovery of efficacious and safe drugs, or even high-quality leads. Concurrent optimisation of the physicochemical properties of a molecule is the most important facet of drug discovery, as these properties influence its behaviours, disposition and efficacy [12a | 12b]. [While I certainly agree that there is a lot more to drug design than maximisation of potency I would argue that controlling exposure is a more important objective than optimization of physicochemical properties (on the subject of exposure I recommend that all drug discovery scientists take a look at the SM2019 article). It's also worth bearing in mind that you can't compensate for inadequate potency with increased compound quality. I don't consider either reference as evidence that "concurrent optimisation of the physicochemical properties of a molecule is the most important facet of drug discovery" and it is not accurate to describe metabolic stability, active efflux and affinity for anti-targets as "physicochemical properties". I think the Authors need to say more about which physicochemical properties they recommend to be optimized and be clearer about exactly what constitutes optimization. Lipophilicity alone is not usefully predictive of properties such as bioavailability, distribution and clearance that determine the effects of drugs in vivo.] Together these outcomes define the quality of the molecule, indicative of its chances of success in the clinic, as evidenced in numerous studies [13a | 13b]. [Neither of these articles appears to provide convincing evidence of a causal relationship between “the quality of a molecule” and probability of success in the clinic. Much of the 'analysis' in [13a] consists of plots of median values without any indication of the spreads in the corresponding distributions and to see it cited in connection with "evidenced" rings alarm bells for me. 
As explained in KM2013 presenting data in this manner exaggerates trends and I consider it unwise to base decisions on data that have been presented in this manner. Quite aside from the issue of hidden variation I do not consider the relationship between promiscuity and median cLogP reported (Figure 3a) in [13a] to be indicative of probability of success in the clinic, given that the criterion for 'activity' ( > 30% inhibition at 10 µM) is far too permissive to be physiologically relevant (this is a common issue in the promiscuity literature).]

While the optimal lipophilicity range has been suggested as a log D_7.4 between 1 and 3, [15] this is highly dependent on the chemical series. [The focus of the analysis was permeability and the range was actually defined in terms of AZlogD (calculated using proprietary in-house software) as opposed to log D measured at pH 7.4. The correlation between the logarithm of the A to B permeability and AZlogD is actually very weak (r² = 0.16) which would imply a high degree of uncertainty in threshold values used to specify the optimal lipophilicity range. While I remain sceptical about the feasibility of meaningfully defining optimal property ranges the assertion that the proposed range in AZlogD of 1 to 3 “is highly dependent on the chemical series” is pure speculation and is not based on data.] Best practice would be to generate data for a diverse set of compounds in a series, if measuring it for all analogues is not possible, and determine the lipophilicity range that leads to the most balanced properties and potency [3 | 16]. [It is not clear what the Authors mean by “most balanced properties and potency” nor is it clear how one is actually supposed to use lipophilicity measurements to objectively “determine the lipophilicity range that leads to the most balanced properties and potency”. My view is that to demonstrate "balanced properties and potency" would require measurements of properties such as aqueous solubility and permeability that are more predictive than lipophilicity of exposure in vivo. I do not consider either [3] or [16] to support the assertions being made by the Authors.] Lipophilicity and pKa prediction models can then guide further designs and synthesis of analogues along the optimisation pathway (Figure 3 [17]), but measurements are advised, particularly by chromatographic methods, such as Chrom log D_7.4, in contemporary practice [18].
[In general, it is very difficult to convincingly demonstrate that one measure of lipophilicity is superior to another. Chromatographic measurement of log D is higher in throughput than the shake flask method used traditionally but it is unclear as to which solvent system the measurement corresponds. Furthermore, the high surface area to volume ratio of the stationary phase means that an ionized species can interact to a significant extent with the non-polar stationary phase while keeping the ionized group in contact with the polar mobile phase and one should anticipate that the contribution of ionization to log D values might be lower in magnitude than for a shake flask measurement.]
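To put the r² value of 0.16 quoted above for the permeability analysis into perspective, here is a minimal sketch of how much scatter survives a linear fit. The only assumption is a simple least-squares line, for which the residual variance is (1 − r²) times the variance of the response (the function name is my own).

```python
import math


def residual_sd_fraction(r_squared):
    """Residual SD as a fraction of the SD of the response for a linear fit."""
    return math.sqrt(1.0 - r_squared)


fraction = residual_sd_fraction(0.16)
print(f"residual SD is {fraction:.1%} of the SD of the response")
```

In other words, knowing the predictor reduces the spread in the response by less than a tenth, which is why thresholds derived from such a trend carry a high degree of uncertainty.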

As noted earlier in the post I consider plotting potency against lipophilicity with reference lines corresponding to different LLE (LipE) values (as is done in Figure 3, which also serves as the graphical abstract) to be a good way for H2L project teams to visualize potency measurements for their project compounds (see R2009 which really should have been cited). That said, I consider the view of the discovery process implied by Figure 3 to be neither accurate nor of any practical value for scientists working on H2L projects. It is relatively easy to define optimization of potency and measurements in an in vitro assay are typically relevant to target engagement in vivo (uncertainty in the concentration of the drug in the target compartment, and of the species with which it competes, is likely to be the bigger issue when trying to understand why in vitro potency fails to translate to beneficial effects in vivo). One specific criticism that I will make of Figure 3 is that it appears to imply that it doesn't matter whether you use log P or log D (when you use log D you can reduce lipophilicity to acceptable levels simply by increasing the extent to which compounds are ionized).

However, there is quite a bit more to optimization of properties such as permeability, aqueous solubility, metabolic stability and pharmacological promiscuity that are believed to be predictive of ADME and toxicity, and my view is that defining optimization in terms of determining "the lipophilicity range that leads to the most balanced properties and potency" is hopelessly naive. The principal objective in H2L work (and in lead optimization) is to identify compounds for which potency and properties related to ADME and toxicity are all acceptable. Defining meaningful acceptability criteria is non-trivial and H2L teams also typically need to make decisions as to how criteria can be relaxed with a minimum of risk. It's important to be aware that you can't compensate for inadequate potency by making the other properties better and those who argue that drug discovery scientists should focus on lipophilic efficiency rather than potency are missing this point.

While plotting potency against lipophilicity with reference lines corresponding to different LLE (LipE) values is often a helpful way to visualise project data in H2L (and in lead optimization) I don't consider Figure 3 to provide an accurate or useful view of the typical H2L process. Figure 3 presents a view that a hit maps to a lead which in turn maps to a drug candidate. In reality the screening phase of a discovery project will identify multiple hits and the resulting leads are not single compounds but structural series. It is important to be aware that the practical (as opposed to conceptual) utility of a graphic such as Figure 3 is limited by the extent to which the chosen measure of lipophilicity is predictive of properties such as aqueous solubility, permeability and metabolic stability.

Although Q2025 claims to define H2L best practices the Authors don't appear to demonstrate much awareness of the nature of the H2L process. The first step in the H2L process is to follow up hits from the initial screen by assaying potential compounds of interest (summarised in Figure 2) although in some cases some follow-up might have already been done in the hit generation phase. Hits tend to group into structural families and the H2L chemists then synthesise compounds (in some organizations synthesis is outsourced) with a view to identifying compounds that are more potent than the hits. Decisions as to which compounds are to be made are typically hypothesis-based (see P2012) although in some cases genuinely predictive models might be available to the H2L team. Design hypotheses are typically based on information available to H2L teams, such as SARs derived from the hits or relevant target structures, and predictive models might be based on free energy calculations (see ASC2025). As the H2L teams generate more information design hypotheses become more specific and models based on project data become more predictive.

I would argue that establishing (and exploiting) SARs and structure-property relationships (SPRs) constitutes a basis for design in H2L work. Certain features of SARs are especially relevant to H2L work and an observation that a reduction in log P leads to increased potency (or at least a minimal decrease in potency) is information that project teams can make good use of. Other SAR features that I would advise H2L scientists to look for are activity cliffs (relatively small changes in structure result in relatively large changes in potency) and superadditivity (effect on potency of simultaneously making two structural modifications is greater than what would be expected from the effects of making each structural modification individually).

I see managing the 'assay budget' as a critical activity (especially when running assays is outsourced). For example, differences in lipophilicity between structurally related compounds are typically easy to predict and measuring large numbers of log D values is likely to be wasteful of resources. H2L teams need to use their assay budgets to identify and address issues efficiently and I don't consider the suggestion that H2L teams use a generic tiering approach such as the one shown in Figure 9 to be especially helpful. Something that I do suggest H2L teams consider is to try to assess responses of properties such as aqueous solubility and permeability to lipophilicity (this means making measurements for less potent compounds).

Figure 3. There are numerous routes to climb a mountain, as there are to discover a drug, but a measured approach to lipophilicity will guide an optimal path, [The Authors need to articulate what they mean by “a measured approach to lipophilicity” (which does come across as arm-waving) and provide evidence to support their claim that it “will guide an optimal path”.] where the outcome is usually driven by a balance of activity and lipophilicity [This appears to be a statement of belief and the Authors do need to provide evidence to support their claim. The Authors also need to say more about how the “balance of activity and lipophilicity” can be objectively assessed.] (The parallel lines represent LLE, i.e. pIC₅₀ - log P). [This way of visualizing data was introduced in the R2009 study which, in my view, should have been cited.]

Thus the Distribution Coefficient, (log D at a given pH) is a highly influential physical property governing ADMET profiles [20a | 20b | 20c] such as on- and off-target potency, solubility, permeability, metabolism and plasma protein binding (Figure 4) [14b]. [I recommend that the term ‘ADMET’ not be used in drug discovery because ADME (Absorption, Distribution, Metabolism, and Excretion) and T (Toxicity) are completely different issues that need to be addressed differently in design. I would argue that the ADME profile of a drug is actually defined by its in vivo characteristics such as fraction absorbed (which may vary with dose and formulation), volume of distribution and clearance (the Authors appear to be confusing ADME with in vitro predictors of ADME) and I would also argue that toxicity is an in vivo phenomenon. In order to support the claim that log D “is a highly influential physical property governing ADMET profiles” it would be necessary to show that log D is usefully predictive of what happens to drugs in vivo. My view is that the cited literature does not support the claim that log D “is a highly influential physical property governing ADMET profiles” given that [20a] does not even mention log D and neither [20b] nor [20c] provides any evidence that log D is usefully predictive of in vivo behaviour of drugs.]

Figure 4. The impact of increasing lipophilicity on various developability outcomes [14b] [It is unclear as to whether lipophilicity is defined for this graphic in terms of log P or log D. It would be necessary to show more than just the ‘sense’ of trends for the term “impact” to be appropriate in this context. I do not consider the use of the term “developability outcomes” to be either accurate or helpful.]

Aqueous solubility is certainly an important consideration in H2L work although I think that the Authors could have articulated the relevant physical chemistry rather more clearly than they have done. You can think of the process of dissolution as occurring in two steps (sublimation of the solid followed by transfer from the gas phase to water). Lipophilicity usually features in models for prediction of aqueous solubility although I consider wet octanol to be a thoroughly unconvincing model for the gas phase. We generally assume that aqueous solubility is limited by the solubility of the neutral form (which is why ionization tends to be beneficial) but when this assumption breaks down the solubility that you measure will depend on both the nature and concentration of the counter-ion. As I note in HBD3 optimization of intrinsic aqueous solubility (the solubility of the neutral form of the compound) is still a valid objective for ionizable compounds because we're typically assuming that only neutral species can cross the cell membrane by passive permeation.

Some general advice that I would offer to drug discovery scientists encountering solubility issues is that they should try to think about molecular structures from the perspectives of molecular interactions in the solid state and crystal packing. I would expect the left hand 'Reduce crystal packing' structure in Figure 6 to be able to easily adopt a conformation in which the planes corresponding to the aromatic rings and amide are all mutually coplanar (this is a scenario in which a non-aromatic replacement for an aromatic ring might be expected to have a relatively large impact). In HBD3 I suggest that deleterious effects of aromatic rings on aqueous solubility might be due to molecular interactions of the aromatic rings rather than their planarity. I also suggest in HBD3 that elimination of non-essential hydrogen bond donors be considered as a tactic for improving aqueous solubility because it tends to increase the imbalance between hydrogen bond donors and acceptors while minimizing the resulting increase in lipophilicity.

Rational [this use of "rational" is tautological] reasons for poor solubility were succinctly described by Bergstrom, who coined "Brick Dust and Greaseballs" as two limiting phenomena in drug discovery [22] which are in line with the empirical findings that led to General Solubility Equation [23] (Figure 5). [I don’t consider the General Solubility Equation to have any relevance to H2L work because it has not been shown to be usefully predictive of aqueous solubility for compounds of interest to medicinal chemists and the inclusion of Figure 5, which merely shows how predicted solubility values map on to an arbitrary categorisation scheme, appears to be gratuitous.] Succinctly, three factors influence solubility: lipophilicity, solid state interactions and ionisation. [It is solvation energy as opposed to lipophilicity that influences solubility and wet octanol is a poor model for the gas phase.] Determining which are the strongest drivers of low solubility will guide the optimisation (Figure 6). Using the analysis in Figure 5 the Solubility Forecast Index emerged, using the principle that an aromatic ring is detrimental to solubility, roughly equivalent to an extra log unit of lipophilicity for each aromatic ring (Thus SFI = clog D_7.4+ #Ar) [24]. [I consider the use of the term “principle” in this context to be inaccurate given that the basis for SFI is subjective interpretation of a graphic generated from proprietary aqueous solubility data and I direct readers to the criticism of SFI in KM2023.] Minimising aromatic ring count is an important and statistically significant metric to consider [25] [The importance of minimizing aromatic ring count is debatable and it is meaningless to describe metrics as “statistically significant”.] 
- consistent with the "escape from flatland" concept [26] that focusses on increasing the sp³ (versus sp²) ratio in molecules, [The focus in the “escape from flatland” study is actually on the fraction of carbon atoms that are sp3 (Fsp3) and not on “the sp³ (versus sp²) ratio”.] even though no significant trends are apparent in detailed analyses of sp³ fractions [27]. [The “analyses of sp³ fractions” in [27] consist of comparisons of drug - target medians for the periods 1939-1989, 1990-2009 and 2010-2020 and all appear to be statistically significant (although I don't consider these analyses to have any relevance to H2L work). I consider the citation of [27] in this context to be gratuitous and this blog post might be of interest.]

An important factor in hit selection is to prioritise compounds with higher ligand efficiency. Ligand efficiency, defined as activity [LE is actually defined in terms of Gibbs free energy of binding and not activity.] per heavy atom (LE=1.37 * pKi/Heavy Atom Count, Figure 7a), is commonly considered in discovery programmes as a quality metric [33]. [LE (Equation 3 in the H2L best practices post graphic) is actually defined as the Gibbs free energy of binding, ΔG° (Equation 2 in H2L best practices post graphic), divided by the number of non-hydrogen atoms, N_nH (this is identical to heavy atom count although I consider the term to be less confusing), but the quantity is physically (and thermodynamically) meaningless because perception of efficiency varies with the arbitrary concentration, C°, that defines the standard state (see Table 1 in NoLE). Using a standard concentration enables us to calculate changes in free energy that result from changes in composition and, while the convention of using C° = 1 M when reporting ΔG° values is certainly useful, it would be no less (or more) correct to report ΔG° values for C° = 1 µM. Put another way the widely held belief that 1 M is a 'privileged' standard concentration is thermodynamic nonsense (Equation 2 in the H2L best practices post graphic shows you how to interconvert ΔG° values between different standard concentrations). Given the serious deficiencies of LE as a drug design metric, I suggest modelling the response of affinity to molecular size and using the residuals to quantify the extent that individual potency measurements beat (or are beaten by) the trend in the data (the approach is outlined in the 'Alternatives to ligand efficiency for normalization of affinity' section of NoLE). There are two errors in the expression that the Authors have used for LE (the molar energy units are missing and the expression is written in terms of K_i rather than K_D). 
The factor of 1.37 in the expression for LE comes from the conversion of affinity (or potency) to ΔG° at a temperature of 300 K, as recommended in [35], although biochemical assays are typically run at human body temperature (310 K). My view is that it is pointless to include the factor of 1.37 given that this entails dropping the molar energy units and using a temperature other than that at which the assay was run. Dropping the factor of 1.37 would also bring LE into line with LLE (LipE).] Various analyses suggest that, on average, this value barely change over the course of an optimisation process [20b | 27 | 34a | 34b] - so it is important to consider maintenance of any figure during any early SAR studies. [I disagree with this recommendation. These analyses are completely meaningless because the variation of LE over the course of an optimization itself varies with the concentration unit in which affinity (or potency) is expressed (Table 1 of NoLE illustrates this for three ligands that differ in molecular size and potency). In [34a] the start and finish values of LE were averaged over the different optimizations without showing variance and it is therefore not accurate to state that the study supports the assertion that LE values "barely change over the course of an optimisation process".] Lipophilic Ligand Efficiency (activity minus lipophilicity typically pKi -log P, Figure 7b), which is widely recognised as the key principle in successful drug optimisation, comes into play both for hit prioritization and optimisation. [LLE is a simple mathematical expression and I don’t consider it accurate to describe it as a “principle” let alone “the key principle in successful drug optimisation”. LLE can be thought of as quantifying the energetic cost of transferring a ligand from octanol to its target binding site although this interpretation is only valid when the ligand is predominantly neutral at physiological pH and binds in its neutral form. 
LLE is just one of a number of ways to normalize potency with respect to lipophilicity and I don't think that anybody has actually demonstrated that (pIC₅₀ – log P) is any better (or worse) as a drug design principle than pIC₅₀ – 0.9 × log P. When drug discovery scientists report that they have used LLE it often means that they have plotted their project data in a similar manner to Figure 3 as opposed to staring at a table of LLE values for their compounds. As an alternative to LLE (LipE) for normalization of affinity (or potency) with respect to lipophilicity I suggest modelling the response and using the residuals to quantify the extent that individual potency measurements beat (or are beaten by) the trend in the data (the approach is outlined in the 'Alternatives to ligand efficiency for normalization of affinity' section of NoLE).] Improving this value reflects producing potent compounds without adding excessive lipophilicity. Taken together, it has been shown that for any given target, the drugs mostly lie towards the leading "nose" [?] where LE and LLE are both towards higher values [20b | 35]. [This is perhaps not the penetrating insight that the Authors consider it to be, given that drugs are usually more potent than the leads and hits from which they have been derived.] However, setting aspirational targets for either metric is unwise, as analysis of outcomes indicates that the values are target dependant [20b]. [I consider target dependency to be a complete red herring in this context and a more important issue is that you can’t compensate for inadequate potency by reducing molecular size or lipophilicity.] Focusing on increasing LLE to the maximum range possible and prioritizing series with higher average values is the recommended strategy [27 | 36]. 
[It is not clear what is meant by “increasing LLE to the maximum range possible” and I consider it very poor advice indeed to recommend “prioritizing series with higher average values” (my view is that you actually need to be comparing the compounds from different series that have a realistic chance of matching the desired lead profile). The Authors of Q2025 appear to be misrepresenting [36] given that the study does not actually recommend “prioritizing series with higher average values”. This blog post on [27] might be relevant.]
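The residual-based alternative to LE and LLE mentioned in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the response is modelled with an ordinary least-squares straight line (the choice of model is left open in the text), the helper names are hypothetical and the data are invented.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def trend_residuals(x, y):
    """Residuals (observed minus fitted) from the straight-line trend."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Invented project data: heavy atom counts and pIC50 (IC50 expressed in M)
heavy_atoms = [12, 18, 22, 25, 30, 34]
pic50_molar = [4.1, 5.0, 6.2, 5.9, 7.4, 7.1]

# The same potencies with IC50 expressed in micromolar (shifted by 6 log units)
pic50_micromolar = [p - 6.0 for p in pic50_molar]

res_m = trend_residuals(heavy_atoms, pic50_molar)
res_um = trend_residuals(heavy_atoms, pic50_micromolar)

# Positive residuals beat the trend; negative residuals are beaten by it
for n_atoms, r in zip(heavy_atoms, res_m):
    print(n_atoms, round(r, 2))
```

Changing the concentration unit shifts only the intercept of the fitted line, so the residuals (unlike LE values) are unaffected. The same functions can be applied with log P on the x axis when normalizing with respect to lipophilicity.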

One can summarize this section with a simple but critical best practice: potency and properties (physicochemical and ADMET) have to be optimized in parallel (Figure 8) [37] to get to quality leads and later drug candidates with higher chances of clinical success. Whilst seemingly trivial, this proposition is rendered challenging by an "addiction to potency" and a constant reminder of this critical concept remains useful for medicinal chemists [38]. [My view is that many medicinal chemists had already moved on from the addiction to potency when the molecular obesity article was published a decade and a half ago and I would question the article's relevance to contemporary H2L practice. The threshold values that define the GSK 4/400 rule actually come from an arbitrary scheme used to categorize the proprietary data analyzed in the G2008 study as opposed to being derived from objective analysis of the data. The study reproduces the promiscuity analysis from [13a] which I criticised earlier in this post for exaggerating the strength of the trend and using an excessively permissive threshold for ‘activity’.] With poor properties, even "good ligands" may not fully answer pharmacological questions [39a | 39b]. [These two articles focus on chemical probes and I don’t consider either article to have any relevance to H2L work. Chemical probes need to be highly selective (more so than drugs) and permeable although solubility requirements are likely to be less stringent when using chemical probes to study intracellular phenomena than in H2L work and you don't generally need to worry about achieving oral bioavailability.]

I agree that mapping SARs for structural series of interest is an important aspect of H2L work and activity cliffs (small modifications in structure resulting in large changes in activity) are of particular interest given the potential for beating trends and achieving greater selectivity. Instances of decreased lipophilicity resulting in increased potency (or at least minimal loss of potency) should also be of significant interest to H2L teams. When mapping SARs it is important that structural transformations should change a single pharmacophore feature at a time and one should always consider potential ‘collateral effects’, such as perturbed conformational preferences, that might confound the analysis. Some of the structural transformations shown in Figure 10 change more than one pharmacophore feature at a time which makes it impossible to determine which pharmacophore feature is required for activity.

Figure 10. Conceptual example of iterative SAR [The meaning of the term “iterative SAR” is unclear] to determine the pharmacophore. As each change may affect binding interactions, conformation and ionization state; complementary structural modification [The meaning of "complementary structural modification" is unclear] will be needed to understand the change in potency and determine the pharmacophore
Is Nitrogen needed (e.g. HBA)? [In addition to eliminating the quinoline N hydrogen bond acceptor this structural transformation eliminates a potential pharmacophore feature (the amide carbonyl oxygen can function as a hydrogen bond acceptor) while creating a cationic centre which will incur a significant desolvation penalty.]
Is NH needed? [This structural transformation eliminates the amide NH but it also is unlikely to address the question of whether the NH is needed because the amide carbonyl has also been eliminated.]
Is carbonyl needed? [The elimination of the amide carbonyl oxygen (hydrogen bond acceptor) creates a cationic centre which will incur a desolvation penalty.]

As a last proposition, [49a | 49b] we suggest that the progress in computational physicochemical and ADMET property predictions represents an opportunity to accelerate the optimisation of molecules with a "predict-first" mindset [4 | 50]. [I certainly agree that models should be used if they are available. However, the citation of literature does appear to be gratuitous and it is unclear why the Authors believe that scientists working on H2L projects will benefit from knowing that a proprietary system for automated molecular design has been developed at GSK.] The first step is to generate sufficient data for a series to build confidence in [51] any models, which can then be exploited in the prioritization of compounds for synthesis that fit with aspirational profiles [My view is that it would be very unwise for H2L project teams to blindly use models without assessing how well the models predict project data although I consider the citation of [51] to be gratuitous. Typically, H2L project teams use measured data to move their projects forward and generating data purely for the purpose of model evaluation is likely to be a distraction. One piece of advice that I will offer to H2L project teams is that they attempt to characterise responses of ADME predictors, such as aqueous solubility and permeability, to lipophilicity (likely to involve measurements for less potent compounds).] This ensures higher physicochemical quality [I consider “ensures” to be an exaggeration and I would argue that “physicochemical quality” is not something that can even be defined meaningfully or objectively (let alone quantified).], asks more pertinent questions and might reduce the total number of molecules made to get to the lead (Figure 11).

The Authors offer advice on how to ensure that optimisation is progressing in a satisfactory manner and how to know when to stop working on the series.

A Lead is not the perfect drug, but it gives reason to believe that the chemical series might be able to deliver one. An essential part of H2L (and later lead optimisation) is to ensure that the optimisation is progressing so that further investment is justified. Some essential questions can help achieve this: Does your series show dynamic SAR [The Authors need to say exactly what they mean by “dynamic SAR” if this is indeed the essential question that they assert it to be.] and achievable desired potency? Is the preliminary ADMET data encouraging? [The Authors need to define “encouraging” if this is indeed an essential question.] Do you have evidence of in vivo effect (PK/PD) at appropriate exposures? [I would question the necessity of PK/PD studies before starting lead optimisation and there are potential ethical concerns about doing in vivo work using compounds that lack the potency required for meaningful PK/PD assessment.] Do the remaining challenges show dynamic SAR and confidence they can be optimized? [The term “remaining challenges” is vague and it is not clear how H2L scientists are supposed to assess “dynamic SAR” for remaining challenges that are not defined in terms of activity.] To answer this, it's critical to monitor the trajectory [As I pointed out previously in the post it is not generally feasible to objectively map optimization paths and I consider the use of “trajectory” to be inappropriate in this context, given that it usually applies to a well-defined path that is determined at launch (for example, a molecular dynamics trajectory).] of the optimisation: e.g. by monitoring relevant properties over time. [Typically, H2L teams assess how closely the best compounds match the lead target profile (LTP) as opposed to monitoring time dependencies of properties such as log D that have limited predictivity.] 
In the absence of progress, discontinuing further work on a scaffold or series may be justified, with reason to focus on other promising structures or recommend termination on a data-driven basis. [Generally, the decision to terminate projects and series will be made on the basis of failure to satisfy the LTP.]

It's been a long post and I'll say a big thank you for staying with me until the end. I wrote this post primarily for early-career scientists as well as for drug discovery scientists in academia and students (although I hope the feedback will also be helpful for the EFMC). One piece of advice that I will offer to all scientists regardless of the stage of their careers is to not switch off your critical thinking skills just because a study is presented as defining best practices or has been highly-cited. In particular, I urge all scientists to be extremely wary of studies in which the conclusions don't follow from the data and I'll share a recent blog post that illustrates the problem. All that said, however, confused thinking amongst drug discovery scientists is not high on the list of the problems facing many of the world's inhabitants right now and my wish for 2026 is for a kinder, gentler, fairer and more peaceful world.

Covalent ligand efficiency

2025-11-30T22:54:00.011+00:00

It appears that whoever first described Economics as the ‘Dismal Science’ had never encountered a ligand efficiency metric. I’ll be taking a look at the FK2025 study (Covalent ligand efficiency) in this post and the study has already been reviewed by Dan. Something that I’ve observed repeatedly over the years is that authors of ligand efficiency studies exhibit a lack of understanding of units and dimensions associated with physicochemical quantities that would shame a first year undergraduate studying introductory physical chemistry (this is somewhat ironic given that creators of ligand efficiency metrics frequently tout their creations in physicochemical terms). I consider covalent ligand efficiency (CLE) as defined in the FK2025 study to have no value whatsoever for design of drugs that bind irreversibly to their targets through covalent bond formation given that the metric is time-dependent and based on an invalid measure of bioactivity. The formidable Lady Bracknell is clearly unimpressed and I should mention that the photo is from the wikipedia page for English actress Rose Leclercq (1843-1899). Given the serious deficiencies in the FK2025 study this is going to be a long and tedious post (even more so than usual 😁😁😁) so please ensure that you have strong coffee close to hand. As is usually the case in posts here I've used the same reference numbers as were used in FK2025 and quoted text is indented with my comments in red italics. I've organized some of the mathematical material into three tables and references to tables in the post are to these (and not to any of the tables in the FK2025 study).

Before starting the review of FK2025 it’s worth examining irreversible covalent inhibition from a molecular design perspective and I’ll direct readers to the informative S2016 and McW2021 reviews, and the recent L2025 study which presents COOKIE-Pro for covalent inhibitor binding kinetics profiling on the proteome scale. Covalent bond formation between RNA and ligands can also be exploited (S2025 | K2025 | L2015) and I generally use 'target' rather than 'protein' in blog posts and journal articles. An irreversible covalent inhibitor acts by first binding non-covalently to its target in the first step with the covalent bond forming in the second step between an electrophilic ligand atom (the term warhead is commonly used) and a nucleophilic target atom such as the sulfur atom of a cysteine. A commonly used measure of activity for irreversible covalent inhibitors is the k_inact/K_I ratio which can be thought of as the product of affinity (1/K_I) and reactivity (k_inact). In design of irreversible covalent inhibitors we try to place the electrophilic atom of the warhead within reacting distance of the nucleophilic atom of the target (this is relatively easy if you have a reliable structure of a complex of the target with a relevant ligand that lacks the electrophilic warhead). The non-covalent complex between target and ligand is stabilised by the non-covalent contacts between the target and ligand (the term ‘molecular interactions’ is also used although I prefer to think in terms of ‘non-covalent contacts’ since the latter can be observed experimentally). However, non-covalent contacts also determine reactivity of the non-covalently bound complex by stabilising the transition state (I consider it more correct to think in terms of reactivity of the complex than in terms of reactivity of either the electrophilic warhead or the target nucleophile). 
In the design context, this means attempting to tune non-covalent contacts to stabilise the transition state to a greater extent than the non-covalent complex.

The LE and CLE metrics share a very serious deficiency in that your perception of efficiency can be altered if you change the value of an arbitrary term in the formula for the metric and I'll start the review of FK2025 by critically examining LE. The meaninglessness of LE stems from a fundamental misunderstanding of how logarithms work and I'll point you toward M2011 (Can one take the logarithm or the sine of a dimensioned quantity or a unit? Dimensional analysis involving transcendental functions) that was published in the Journal of Chemical Education. In drug discovery we frequently need to calculate logarithms for quantities and you need to be aware that you can’t calculate the logarithm for a dimensioned quantity. Let’s take pIC₅₀ as an example and this quantity is commonly defined as the negative logarithm of the IC₅₀ in mole per litre (M). However, what you actually do when you calculate pIC₅₀ is that you take the negative logarithm of the numerical value of the IC₅₀ when expressed in mole per litre (this is a bit of a mouthful and it can be written more compactly as equation 1 below). While not denying that it is useful to have a convention such as this for expressing potency values logarithmically it should be remembered that the choice of mole per litre (M) is entirely arbitrary and it would be equally correct to use other valid concentration units such as μM or nM. One consequence of choosing mole per litre (M) for expressing IC₅₀ values is that pIC₅₀ values (or at least measured pIC₅₀ values) will generally be positive because of the extreme difficulty of measuring meaningful IC₅₀ values that are greater than 1 M.
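The point about units can be made concrete with a short sketch. The function name, unit labels and IC₅₀ values below are hypothetical and serve only to show that the logarithm is taken of a dimensionless ratio (the IC₅₀ divided by the chosen concentration unit).

```python
import math

# Size of each concentration unit expressed in mol per litre
UNIT_IN_MOLAR = {"M": 1.0, "uM": 1e-6, "nM": 1e-9}

def p_value(ic50_molar, unit="M"):
    """Negative log10 of the numerical value of the IC50 in the chosen unit."""
    return -math.log10(ic50_molar / UNIT_IN_MOLAR[unit])

ic50_a = 1e-8  # 10 nM, written in mol per litre
ic50_b = 1e-6  # 1 uM, written in mol per litre

for unit in ("M", "uM", "nM"):
    pa = p_value(ic50_a, unit)
    pb = p_value(ic50_b, unit)
    # The individual values (and their signs) change with the unit,
    # but the difference between the two compounds does not
    print(unit, round(pa, 2), round(pb, 2), round(pa - pb, 2))
```

Differences between logarithmic potencies are unit-independent, which is why quantities that depend on the absolute value of pIC₅₀ (rather than on differences) need to be treated with caution.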

Let’s take a look at the binding free energy ΔG° and you’ll notice that I’ve written it with a degree symbol which indicates that this quantity corresponds to a standard state defined by a concentration value C° (the standard concentration). Equation 2 shows how the binding free energy is defined as the difference in chemical potential between the associating species (target + ligand) and the target-ligand complex with each species at the standard concentration (the degree symbol indicates that both the binding free energy and chemical potential depend on the value of C° and I’ve also shown this explicitly in the equation although this is not actually necessary). Equation 3 shows the dependence of chemical potential on the concentration C of the species and the standard concentration C°. Taken together, Equation 2 and Equation 3 should clarify where the dependence of binding free energy on the standard concentration comes from (there are two associating species but only one complex). We can’t actually measure binding free energy directly but we can calculate it from the dissociation constant K_D using Equation 4 (which can be derived from Equation 2 and Equation 3). It’s important to be aware that if you use Equation 4 to convert ΔG° values between different values of the standard concentration C° you’ll be making the assumption that solutions are dilute (ΔH is independent of concentration) and this is indicated in Equation 5.

The standard concentration is a source of much confusion in the ligand efficiency field and I’ll direct readers to ‘The Nature of Ligand Efficiency’ (p9). While the standard concentration is integral to a valid thermodynamic treatment of target-ligand binding the value of C° is entirely arbitrary (to suggest otherwise would mean that you’ve abandoned thermodynamics). It is conventional in drug discovery (and biochemical) literature to use a C° value of 1 M when reporting ΔG° values. While this convention is certainly beneficial, it is no more (and no less) valid to use a value of 1 M than it is to use a value of 1 μM for this purpose. Furthermore a standard state defined by a C° value of 1 M is not biophysically realistic (consider the difficulty of accommodating a mole of target in a volume of 1 litre and the likelihood of a ligand exhibiting aqueous solubility of 1 M). I assume that most biochemists and biophysicists would agree that it is generally not feasible to measure K_D values of greater than 1 M (I would be happy to be proven wrong on this point) and this means that drug discovery scientists tend to assume that ΔG° values are necessarily negative.

Let’s now take a look at ligand efficiency (LE) and you can see from the photo above that some heretics regard the metric as physically nonsensical (if you're interested in how I came to be chatting with fellow blogger Ash then take a look at this post). The LE metric, which is regarded as an article of faith in the fragment-based design community, was introduced in the (p5) study with the symbol Δg (see Equation 6 below) and the authors of that study did not actually state that it had to be calculated using a C° value of 1 M (I consider it unlikely that any of the authors were even aware of the dependence of ΔG° on C°). In The Nature of Ligand Efficiency (p9) I defined the quantity η_bind (see Equation 7) by dividing Δg (LE) by RT (when LE values are quoted the molar energy units are usually discarded and T often does not correspond to the temperature at which the assay was run) and by the factor (2.303) used to convert between natural logarithms and base 10 logarithms. The quantity η_bind is directly proportional to Δg (LE) and using it makes it much easier to see how using a different standard concentration can alter your perception of efficiency. Take a look at Table 1 in (p9) and you’ll see that the three compounds (a fragment, a lead and a clinical candidate) bind with equal efficiency when C° is 1 M. Change C° to 0.1 M and the clinical candidate binds more efficiently than the fragment but change C° to 10 M and the fragment becomes more ligand-efficient than the clinical candidate. As noted in (p9) “In thermodynamic analysis, a change in perception resulting from a change in a standard state definition would generally be regarded as a serious error rather than a penetrating insight.”
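The change in perception can be reproduced with a few lines of Python. This is a minimal sketch that assumes η_bind takes the form −log₁₀(K_D/C°) divided by the number of non-hydrogen atoms; the three compounds, their K_D values and their heavy atom counts are hypothetical values chosen for illustration and should not be read as the entries in Table 1 of (p9).

```python
import math

def eta_bind(kd_molar, n_heavy, c_std_molar=1.0):
    """Assumed form: -log10(K_D / C°) divided by the heavy atom count."""
    return -math.log10(kd_molar / c_std_molar) / n_heavy

# Hypothetical compounds: (name, K_D in mol/L, number of heavy atoms)
compounds = [
    ("fragment", 1e-3, 10),
    ("lead", 1e-6, 20),
    ("candidate", 1e-9, 30),
]

def efficiencies(c_std_molar):
    """Efficiency of each hypothetical compound for a given C° (mol/L)."""
    return {name: eta_bind(kd, n, c_std_molar) for name, kd, n in compounds}

# The affinities never change; only the arbitrary standard concentration does.
for c_std in (0.1, 1.0, 10.0):
    values = efficiencies(c_std)
    print(f"C° = {c_std} M:", {k: round(v, 3) for k, v in values.items()})
```

With these invented values the three compounds are equally efficient at C° = 1 M, the candidate comes out ahead at 0.1 M and the fragment comes out ahead at 10 M, even though nothing about the binding has changed.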

Here's what the authors of FK2025 say about LE:

LE depends on the choice of the standard concentration (normally 1 M) (p8) (p9) and its maximal available value is size dependent.(p10) (p11) [It's true that LE depends on C° but it’s also true that ΔG° depends on C° and the difference in the two dependencies is that LE “depends upon the choice of standard concentration in a nontrivial fashion” (p8). The issue is not so much that LE depends on C° but that using a different unit to express K_D changes how we perceive efficiency. The ΔΔG values that determine perception of affinity don’t change when you use a different value of C° (equivalent to using a different unit to express K_D). However, if you use a different value of C° for calculating LE you can see from Table 1 in (p9) that even the ordering of LE values between two ligands can change. I consider the molecular size dependencies of LE observed by the authors of (p10) and (p11) to be artefactual and I’ll point you toward Fig. 1 in (p9) which shows that using a different value of C° can change how we perceive the molecular size dependency of LE.] Nevertheless, LE is an established tool to normalize potency and facilitate the comparison of ligands with a range of potencies and sizes. [It is not uncommon for adherents of religions to consider their beliefs to be established facts.] The usefulness of LE and other efficiency metrics in drug discovery has been extensively analyzed and reviewed elsewhere. (p6) (p12) (p13) (p14) (p15) (p16) (p17) [My view is that nobody has actually demonstrated the usefulness of LE and I’m unconvinced that it would even be possible to do so meaningfully in an objective manner (consider the feasibility of comparing success rates between a group of individuals using LE in discovery projects and a control group of individuals not using LE in discovery projects). 
Usefulness means that using something provides demonstrable benefits and ‘widely-used’ is not equivalent to ‘useful’ (I’m guessing that more people use homeopathic ‘medicines’ than use ligand efficiency metrics). One piece of advice that I’ll offer to anybody advocating the use of LE in drug design is to ensure that you fully understand the implications of changes in perception resulting from using different units to express quantities not least because you might find yourself lecturing to people who do understand.]

After a lengthy preamble it’s now time to take a look at how the FK2025 study addresses ligand efficiency in the context of irreversible covalent inhibition. One of the challenges in design of drugs that engage their targets irreversibly is that it’s not possible to meaningfully quantify activity with a single parameter. This is particularly relevant to definition of efficiency metrics which are typically derived by either scaling or offsetting a measured activity value by a risk factor such as molecular size or lipophilicity. While you can certainly measure an IC₅₀ value for an irreversible covalent inhibitor the value that you measure will be time-dependent and it’s not generally meaningful to compare two IC₅₀ values that have been measured using different incubation times. While the k_inact/K_I ratio is time-independent using it as a measure of activity necessarily entails a degree of information loss.

The authors of FK2025 state:

Our starting point is the LE introduced for noncovalent ligands as a useful metric for lead selection. (p5) [LE was claimed to be useful when it was introduced although no evidence was presented in support of the claim.]

Let’s take a look at Table 2 which shows two equations from FK2025. The first equation, which appears in the text of the article, illustrates two common errors in the efficiency metric field (taking logarithms of dimensioned quantities and discarding units). It should ring alarm bells for the reader when authors make either error (especially if the authors interpret values of the efficiency metrics).

The authors of FK2025 assert that “LE can be decomposed into contributions from the noncovalent recognition and the covalent reaction (Box 2, Equation III)” and this is reproduced in Table 2 as Equation 2. The first term is a commonly-used mathematical formula for LE when inhibition is reversible and it is important to be aware that K_I has been divided by an arbitrary concentration value (1 M) in order that it can be expressed as a logarithm (see Equation 1 in Table 1). The argument of the logarithm in the second term is dimensionless although its magnitude does vary with t. Each term in Equation III (Box 2) has a nontrivial dependence on the value of an arbitrary quantity (the 1 M concentration in the first term and t in the second term). This means that your perception of efficiency when calculated according to Equation III (Box 2) will be altered if you use either a different concentration unit or a different value of t. You can see this effect in Figure 2 (effect of varying t) and the appearance of Figure 1 will be altered if you use a value of t other than 1 h or a concentration unit other than M for the calculation of LE.

It's now time to examine CLE (defined as Equation II in Box 3 of the FK2025 study) and I’ll direct you to Table 3 below in which I’ve made some comments. Using CLE requires that the IC₅₀ values for the inhibitors of interest all correspond to the same time point (t) and it is not clear whether the authors are suggesting that the IC₅₀ values should all be measured using the same incubation time or need to be calculated from measured K_I and k_inact values using Equation III in Box 2. A quantity t is also explicitly present in the argument of the logarithm in Equation II in Box 3 and this is necessary for the argument of the logarithm to be dimensionless (see M2011). The argument of the logarithm in Equation II (Box 3) is clearly time-dependent and this means that your perception of efficiency will be altered if you use a different value of t when calculating CLE (just as your perception of efficiency will be altered if you use a different concentration unit to express IC₅₀ when you calculate LE for reversible inhibitors). It also means the molecular size dependency of CLE will vary with time just as the molecular size dependency of LE varies with the concentration unit used to express affinity as can be seen in Fig. 1 of (p9).

However, there is another difficulty which is that the argument of the logarithm in Equation II (Box 3) is not a valid measure of activity (the same criticism can also be made of the xLE metric introduced in the Z2025 study that Dan has already reviewed). This problem is a bit more subtle and it’s important to remember that knowing the IC₅₀ value for a reversible inhibitor enables you to generate a concentration response for inhibition. When you express an IC₅₀ value as a logarithm you need to scale it by a concentration value to ensure that the argument of the logarithm function is dimensionless (see M2011) but it’s important to remember that the concentration unit is still there even though it’s not shown (see Equation 1 in Table 1).

This is a good point at which to wrap up and I’ve argued that CLE has two deficiencies. First, perception of efficiency and its dependency on molecular size both vary with an arbitrary quantity (t) in the argument of the logarithm (this is analogous to the problems caused by the arbitrary nature of the concentration unit used for scaling affinity/potency in the definition of LE for reversible binders). Second, the argument of the logarithm is not a valid measure of activity because it cannot be used to generate a concentration response. Furthermore, I would question the value of aggregating results from multiple assays for analysis even for a valid metric without these deficiencies and I offered the following advice in (p9):

Drug designers should not automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.

I’ve criticized the FK2025 study at length and saying how I might use data like this in drug design projects is a good way to conclude the post. A general criticism that I have made of drug design efficiency metrics is that they are based on assumptions of relationships between activity and risk factors such as molecular size. I argued in (p9) that one should use the trend that is actually observed in the data to normalize activity with respect to risk factors and I’ll point you to the relevant section (Alternatives to ligand efficiency for normalization of affinity) in that article. I would start by attempting to model the relationship between k_inact and reactivity with glutathione. The objective of this exercise is to identify inhibitors that best exploit their intrinsic reactivity when forming covalent bonds with the target residue (you can quantify this by how far the point for an inhibitor lies above the trend line and the most interesting compounds have the largest positive residuals). I might also examine the relationship between k_inact and K_I for inhibitors with the same intrinsic reactivity (e.g., incorporating the same warhead) with a view to identifying the inhibitors for which non-covalent interactions with the target most effectively stabilise the transition state relative to the non-covalent complex. I should stress that there is no suggestion that these analyses would necessarily yield useful insight.
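The residual-based analysis suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inhibitor labels and the log-transformed rate constants are invented, a straight line is assumed for the trend (the trend actually observed in real data might call for something else) and `fit_line` is a hypothetical helper rather than anything taken from FK2025 or (p9).

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented data: log10 of glutathione reactivity and log10 of k_inact
names = ["A", "B", "C", "D", "E"]
log_k_gsh = [-3.0, -2.5, -2.0, -1.5, -1.0]
log_k_inact = [-3.2, -2.4, -2.6, -1.9, -1.6]

slope, intercept = fit_line(log_k_gsh, log_k_inact)

# Residual = observed minus trend; a positive residual means that the
# inhibitor inactivates faster than its intrinsic reactivity would suggest.
residuals = [y - (slope * x + intercept) for x, y in zip(log_k_gsh, log_k_inact)]

# Rank the inhibitors with the largest positive residuals first.
ranked = sorted(zip(names, residuals), key=lambda pair: pair[1], reverse=True)
for name, resid in ranked:
    print(f"{name}: residual = {resid:+.2f}")
```

The point of working with residuals is that the normalization comes from the trend in the data rather than from an assumed functional form with an arbitrary constant buried in it.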

It's been a long post so thanks for staying with me. This will be the last post until after Christmas and, as I extend best wishes to all for a happy and peaceful festive season, I'm keenly aware that Christmas will be neither happy nor peaceful for many of our fellow human beings.

Return to Flatland

2025-08-05T21:39:00.021+01:00

Whoever first referred to Economics as ‘The Dismal Science’ had clearly never read an article on ‘3Dness’ in drug discovery. My own experience reading articles on this topic is a sensation of having my life force slowly sucked out (I even suggested that reviewing the '3Dness' literature might be considered as an appropriate penance when I recently confessed my sins at St Gallen Cathedral) and the subject of Confession reminds me of a song that the late great Tom Lehrer sang about the Second Vatican Council.

In this post I review the CNM2025 study (Return to Flatland) which examines the heavily-cited LBH2009 study (Escape from Flatland: Increasing Saturation as an Approach to Improving Clinical Success). This is also a good point to mention a Journal of Medicinal Chemistry Editorial (Property-Based Drug Design Merits a Nobel Prize) that I reviewed in a 30-Jul-2024 post. The CNM2025 study, which has already been reviewed by Dan and Ash, opens with:

The year is 2009, Barack Obama has just been inaugurated and both Lady Gaga and The Black Eyed Peas are at the height of their popularity

This couldn’t help but remind me of the “WORLD WAR 2 BOMBER FOUND ON MOON” headline that appeared on the front page of the Sunday Sport twenty-one years before the publication of LBH2009 (it was accompanied by a photo of a B-17 in a lunar crater). A few weeks later the headline was “WORLD WAR 2 BOMBER FOUND ON MOON VANISHES” (this time accompanied by a photo of the now empty lunar crater).

I’ll start my review of CNM2025 by quoting from it and, as is usual for posts here at Molecular Design, quoted text is indented with any comments by me italicized in red and enclosed in square brackets.

The hypothesis was attractive, and the data clearly showed the relationship between Fsp³ and clinical progression with pairwise significance P < 0.001. [This statement is inaccurate and Figure 3 of the LBH2009 study shows statistically significant differences at this level between (a) discovery and phase 2 compounds (b) phase 1 and phase 3 compounds (c) phase 2 compounds & drugs. The authors of LBH2009 state: “The change in average Fsp³ was statistically significant between adjacent stages in only one case (phase 1 to phase 2)” but they neither show this in Figure 3 of their article nor do they report a P-value for the statistical significance of the mean difference in Fsp³ between phase 1 and phase 2 compounds.] The statistics seemed compelling, though the effect size was modest — an increase in average Fsp³ of 0.09 between sets of phase I and approved drugs equates to a difference of around two additional sp³ carbons per drug molecule only. [The authors of LBH2009 did not actually report this difference to be statistically significant so it is unclear why the authors of CNM2025 have stated that the “statistics seemed compelling”.]

The LBH2009 study is effectively a call to think beyond aromatic rings in drug design and my view is that there are considerable benefits in doing so even though I consider the data analysis in the study to be shaky. Almost three decades ago I included a quinuclidine in the Zeneca fragment library for NMR screening and later at AstraZeneca I would actively search (with minimal success) for amides and heteroaryls derived from bicyclic amines. I see the advantages in looking beyond aromatic rings as stemming primarily from increased molecular diversity and a more controllable coverage of chemical space, and in KM2013 we wrote:

Molecular recognition considerations suggest a focus on achieving axial substitution in saturated rings with minimal steric footprint, for example by exploiting the anomeric effect or by substituting N-acylated cyclic amines at C2.

Although data analyses (for example, see HY2010) presented in support of the belief that aromatic rings adversely affect aqueous solubility are typically underwhelming I consider the suggestion to be plausible and suggested in K2022 that deleterious effects of aromatic rings are more likely to be due to their potential for making molecular interactions than to their planarity. That said, I should also point out that the analysis of the relationship between aqueous solubility and Fsp³ presented in Figure 5 of LBH2009 is a textbook example of correlation inflation (see Fig. 5 in KM2013) and I suspect that if a team had submitted this analysis at Statistiques Sans Frontières the judges would have either awarded “nul points” or come to the conclusion that the team had played its joker. Given the Lady Gaga reference in CNM2025 I couldn't resist linking this Peter Gabriel song which includes the lyrics "Adolf builds a bonfire, Enrico plays with it" even though I have absolutely no idea what the lyrics actually mean.

While the analysis of the relationship between aqueous solubility and Fsp³ presented in Figure 5 of LBH2009 does endow the study with what I’ll politely call a whiff of the pasture it’s not directly related to the analysis of clinical progression presented in the study. Let’s take a look at Figure 3 in LBH2009 which shows mean Fsp³ values for compounds in discovery, at the three phases of clinical development, and approved drugs. As an aside this analysis would fall foul of current Journal of Medicinal Chemistry author guidelines (see link; accessed 05-Aug-2025) which clearly mandate that “If average values are reported from computational analysis, their variance must be documented”. As mentioned earlier in this post Figure 3 in LBH2009 shows statistically significant (P value < 0.001) differences between (a) discovery and phase 2 compounds (b) phase 1 and phase 3 compounds (c) phase 2 compounds & drugs. It’s also worth stressing that Figure 3 in LBH2009 does not show statistically significant differences in Fsp³ for any of the clinical development transitions (phase 1 to phase 2; phase 2 to phase 3; phase 3 to approved drug). Figure 3 in LBH2009 shows 591 phase 2 compounds but only 376 phase 1 compounds, raising questions about the numbers of compounds that have been in clinical development without being recorded in the database.

I think that there are some problems with how the authors of the LBH2009 study have analysed the relationship between Fsp³ and progression through the stages of clinical development. If charged with analysing this data I would focus on the three clinical development transitions (phase 1 to phase 2; phase 2 to phase 3; phase 3 to approved drug) and wouldn’t waste time on comparisons between discovery compounds and clinical compounds. If analysing the relationship between Fsp³ and the progression from phase 1 to phase 2, I would partition the set of phase 1 compounds into a ‘YES’ subset of compounds that had progressed to phase 2 and a ‘NO’ subset of compounds that had not progressed to phase 2. I would certainly be taking a close look at distributions of Fsp³ values (some approaches to assessing statistical significance are based on the assumption of Normally-distributed data values) and I’d also be thinking about assessing effect size in addition to statistical significance. However, the problems with the LBH2009 analysis are more fundamental than non-Normal distributions of Fsp³ values.

The authors of LBH2009 assess the progression from phase 1 to phase 2 by comparing the mean Fsp³ value for the phase 1 compounds with the mean Fsp³ value for phase 2 compounds. The problem is that the Fsp³ values for the YES compounds (that have progressed from phase 1 to phase 2) are present in both the data sets for which comparisons are being made. This means that the observed differences in mean Fsp³ values will reflect both the difference between YES and NO compounds (relevant to relationship between Fsp³ and progression from phase 1 to phase 2) and the relative numbers of YES and NO compounds in the phase 1 data (not relevant to relationship between Fsp³ and progression from phase 1 to phase 2). Analysing the data in the way that the authors of LBH2009 have done effectively adds noise to the signal and it’s possible that they would have observed more statistically significant differences in mean Fsp³ values had they analysed the data in a more appropriate manner.
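The dilution described above is easy to demonstrate with simple arithmetic. The sketch below makes the simplifying assumption that the phase 2 set consists exactly of the YES compounds, and the mean Fsp³ values and progression fractions are invented for illustration (they are not taken from LBH2009).

```python
def observed_difference(mean_yes, mean_no, fraction_yes):
    """Difference in mean Fsp3 between the phase 2 set and the phase 1 set.

    Assumes that the phase 2 set is exactly the YES subset of the phase 1
    set, so that the phase 1 mean is a weighted average of YES and NO means.
    """
    mean_phase1 = fraction_yes * mean_yes + (1 - fraction_yes) * mean_no
    mean_phase2 = mean_yes
    return mean_phase2 - mean_phase1

# Invented means: the difference between YES and NO compounds is 0.10
true_difference = 0.45 - 0.35

for fraction_yes in (0.2, 0.5, 0.8):
    diff = observed_difference(0.45, 0.35, fraction_yes)
    print(f"fraction progressing = {fraction_yes}: observed = {diff:.3f}, "
          f"YES minus NO = {true_difference:.3f}")
```

Under these assumptions the observed difference equals the difference between YES and NO compounds multiplied by the fraction of phase 1 compounds that failed to progress, so it shrinks as the progression rate rises even though the YES and NO compounds are no more alike than before.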

This is an appropriate point at which to discuss correlation in the context of studies such as LBH2009 and CNM2025. It’s actually well known (see L2013) that Fsp³ values for chemical structures tend to be greater when amine nitrogen atoms are present (this does not invalidate the observed trends in the data but has big implications for how you interpret these trends). There is, however, a much bigger issue which is that correlation does not imply causation. Let’s suppose that you’ve just joined a drug discovery team as they are preparing to select a clinical candidate (I concede that this is a most improbable scenario but it does illustrate a point). The team have an excellent understanding of the structure-activity relationship (SAR) and have successfully addressed a number of issues during the lead optimization process (the chemical structures of the compounds have been quite literally shaped by the problems that the team members have solved). Now consider the likely reaction of the team members to a suggestion that probability of success in the clinic would increase if the chemical structure of the best compound were modified so as to increase its Fsp³ value. My view is that the team might think that the person making such a suggestion had just stepped off the shuttle from Planet Tharg (an alien from this planet used to make occasional Sunday Sport appearances). I see the trends in data observed by the authors of LBH2009 as effects rather than causes (the vanishing B-17 was never there in the first place).

Let’s return to the CNM2025 study and its authors state:

Using data from the Cortellis Drug Discovery Intelligence database, we repeated an analysis similar to that of Lovering et al. to assess Fsp³ in drugs approved post-2009 and those in active clinical development as of mid-2024 (Fig. 1). [I would challenge the claim that the analysis presented in CNM2025 is similar to that presented in LBH2009. The supplementary material for CNM2025 indicates that the data summarised in Fig. 1b correspond to the period 2012 through 2024 (it is not clear whether the database has been updated to account for compounds that have fallen out of active development during this period). As is the case for Figure 3 in LBH2009, Fig. 1b in CNM2025 shows more phase 2 compounds (816) than phase 1 compounds (421), raising similar questions about the numbers of compounds that have been in clinical development without being recorded in the database. I thank fellow blogger Dan Erlanson for suggesting that I examine the supplemental information for CNM2025.] Although our methods used contemporary data sources different to Lovering et al., we obtained comparable Fsp³ data for approved drugs prior to 2009. More recently however, the picture appears to have changed with approvals shifting to lower Fsp³ drugs (Fig. 1a). Similarly, when looking at drugs currently in clinical development (Fig. 1b), there appeared to be no clear relationship between highest phase reached and Fsp³, suggesting the key conclusion noted by Lovering et al. has not persisted. In all data sets, exemplars with Fsp³ = 0 as well as Fsp³ = 1 are extensively seen. [It is necessary to account for the number of hypotheses that have been tested for statistical significance when quoting P-values (see R2016 and VM2018).]
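To give a feel for what accounting for the number of hypotheses involves, here is a minimal sketch using the Bonferroni correction. Bonferroni is chosen here only because it is the simplest such adjustment (I am not suggesting that it is the approach recommended in R2016 or VM2018) and the count of five groups is an illustrative assumption based on the groups described for Figure 3 of LBH2009.

```python
def n_pairwise_comparisons(n_groups):
    """Number of distinct pairs that can be formed from n_groups groups."""
    return n_groups * (n_groups - 1) // 2

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error
    rate at or below alpha when n_tests hypotheses are tested."""
    return alpha / n_tests

# Illustrative assumption: discovery, phase 1, phase 2, phase 3 and drugs
n_tests = n_pairwise_comparisons(5)
threshold = bonferroni_threshold(0.05, n_tests)
print(f"{n_tests} pairwise tests, per-test threshold = {threshold:.4f}")
```

Testing every pair among five groups means ten hypotheses, so a P-value that would pass at 0.05 for a single pre-specified comparison needs to fall below 0.005 under this correction.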

Fig. 1a in CNM2025 shows the time-dependence of Fsp³ distributions for approved drugs according to approval date and I remain unconvinced of the value of analysis like this (on first encountering analysis of time-dependence of drug properties a quarter of a century ago I recall being left with the distinct impression that some senior medicinal chemists where I worked had a bit too much time on their hands). However, it is immaterial whether or not you are as underwhelmed as I am by time-dependence of drug properties because no such analysis is actually reported in LBH2009 and this is one reason that I challenge the claim made by the authors of CNM2025 that they “repeated an analysis similar to that of Lovering et al. to assess Fsp³ in drugs approved post-2009 and those in active clinical development as of mid-2024”.

Now let’s take a look at Fig. 1b in CNM2025 and this should be compared with Figure 3 in LBH2009. In some ways the former is an improvement on the latter since the violin plots show the distributions of Fsp³ values for each group of compounds and, as mentioned earlier in the post, I don’t think that it makes any sense to include discovery compounds in analysis like this (as the authors of LBH2009 did). Although these two figures look superficially similar they are actually very different and, given that the authors of CNM2025 only included "compounds in clinical trials as of mid-2024" in their study, I would argue that their study does not properly examine the link between Fsp³ and clinical progression. I agree that the difference between mean Fsp³ values for drugs approved up to 2009 and for drugs approved after 2009 is statistically significant. What is not clear from the analysis summarized in Fig. 1b in CNM2025 is whether the lower Fsp³ values of drugs that were approved after 2009 reflect smaller increases in Fsp³ over the course of clinical development (the B-17 has disappeared from the lunar crater) or lower Fsp³ values for compounds entering clinical development (the B-17 is still in the lunar crater). I think it's possible to address this question but you would need to analyse the data a lot more carefully than the authors of CNM2025 appear to have done. For example, you might examine the time-dependencies of mean Fsp³ values for compounds evaluated in phase 1 and the corresponding mean Fsp³ values for compounds that progressed or failed to progress to phase 2. While I consider more careful analysis of progression to be feasible I see little or no value from the perspective of real world drug discovery in actually performing the analysis more carefully.

This is a good point at which to wrap up and, unless the trends in the data can be shown to reflect causation, the debate can be described as bald men fighting over a comb (as one who is follicly challenged I always find it painful to use this phrase). I see variation in drug properties with time as an effect rather than a cause and Forrest Gump would have been well aware of this fifteen years before the publication of LBH2009 when he famously observed that "shit happens". One point on which the CNM2025 authors and I do appear to agree is that there is not currently a B-17 in a lunar crater. Where we appear to differ is that they seem to be suggesting this was because it has vanished while I never believed that it was ever there in the first place. I’ll let the late great Dave Allen have the last word.

Assembling data sets for training ML bioactivity models

2025-07-06T21:02:00.015+01:00

Here’s a photo from one of my exercise walks in Paramin and you can see the Caribbean Sea in the distance. This is perhaps my favourite view on the walk because it means that I’ve just got to the top of a particularly brutal hill (cars sometimes struggle to get to the top and on one occasion I watched a car fail miserably in four attempts) although you can’t always see the sea as clearly as in this photo.

The current post follows up on my post on the LR2024 study (Combining IC₅₀ or K_i Values from Different Sources Is a Source of Significant Noise). In the current post, I’ll be discussing in general terms how I might use ChEMBL to assemble data sets for training what I refer to in another post as regression-based machine learning (ML) models. These models can reasonably be described as quantitative structure-activity relationships (QSARs) because 'activity' is a continuous (as opposed to categorical) variable. However, the term 'QSAR' does appear to be less used these days, possibly reflecting the limited impact that QSAR approaches have made on real world drug discovery, and it's also much easier to persuade people that you're doing artificial intelligence (AI) if you describe your QSAR models as ML models. In this post I shall refer to regression-based ML models for biological activity simply as 'QSAR-like ML models'.

Much of the focus of AI-based drug design appears to be generation of novel chemical structures and devising synthetic routes for the associated compounds. Many who tout AI as a panacea for the ills of drug discovery appear to be assuming that predictively useful QSAR-like ML models will be available or can readily be built even in the early stages of drug discovery projects. I remain skeptical and my view is that if sufficient data are available in ChEMBL for building useful QSAR-like ML models then it is likely that somebody else has already got to where you would like to be. Nevertheless, I do see value in automating the assembly of bioactivity data sets from ChEMBL even if it does not prove feasible to build useful QSAR-like ML models and I'll also be discussing some of the ways that you might use such data sets in the early stages of a drug discovery project.

My first step when assembling a data set (which I'll refer to as a 'bioactivity data set') for training QSAR-like ML models would be to extract from ChEMBL all (in-range) measured values for potency and affinity in assays that have been run against the target of interest. Potency and affinity should be expressed logarithmically for modelling as shown in the figure below and the relevant values are often referred to collectively as ‘pChEMBL’ values (as I note in posts here from September and December of 2024, the term is used in the literature without being defined properly). I would generally anticipate that there will be only a single pChEMBL value for most compounds and, for compounds with multiple pChEMBL values, I would use the mean value to quantify bioactivity. Where two or more pChEMBL values are available for a compound I would also calculate the standard deviation and this can be seen as another way to assess what is referred to as assay compatibility in the LR2024 study.
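The aggregation step might look something like the sketch below. It assumes that the extraction from ChEMBL has already been done and has yielded (compound identifier, pChEMBL value) pairs; the identifiers and values shown are invented, and `aggregate` is a hypothetical helper rather than part of any ChEMBL tooling.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented records standing in for values already extracted from ChEMBL
records = [
    ("CPD-1", 6.2), ("CPD-1", 6.8),
    ("CPD-2", 7.5),
    ("CPD-3", 5.1), ("CPD-3", 5.3), ("CPD-3", 5.0),
]

def aggregate(records):
    """Return {compound: (mean pChEMBL, standard deviation or None, count)}."""
    grouped = defaultdict(list)
    for compound_id, value in records:
        grouped[compound_id].append(value)
    summary = {}
    for compound_id, values in grouped.items():
        # A standard deviation is only defined for two or more values.
        sd = stdev(values) if len(values) >= 2 else None
        summary[compound_id] = (mean(values), sd, len(values))
    return summary

summary = aggregate(records)
for compound_id, (avg, sd, n) in summary.items():
    sd_text = "n/a" if sd is None else f"{sd:.2f}"
    print(f"{compound_id}: mean = {avg:.2f}, sd = {sd_text}, n = {n}")
```

Keeping the count and standard deviation alongside the mean makes it straightforward to flag compounds for which replicate values disagree by more than you would be comfortable with.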

A bioactivity data set assembled in this manner would have a single bioactivity data value for each compound and I would take a look at how many compounds that data is available for because it might be possible to use this information for deciding whether or not to build a QSAR-like ML model. However, you need to be careful about using the size of the data set for making decisions like this because you can get away with fewer data values if these are better distributed from the perspective of model-building (a view from Orwell's Animal Farm might have been: uniform good, polymodal bad) and the comment that Stalin is alleged to have made about the T-34 tank (quantity has a special quality all of its own) is perhaps not quite the ground truth that many ML modellers believe it to be. JFK's advice to ML modellers might have been: ask not whether you have enough data but whether the available data satisfy the requirements for modelling.

My next step would be to examine the distribution of data values in the bioactivity data set. I would take a look at the spread in bioactivity values (for modelling the spread in values should be large). If the distribution of the bioactivity data set is Gaussian then a standard deviation of 0.8 log units will place 80% of the data values in a range of 2.05 log units (I used this handy Normal percentile calculator) and I wouldn't attempt to build a QSAR-like ML model if the standard deviation was less than this (unless the person 'asking' me to build the model was also going to perform my annual performance review 😁). I would also visualise the distribution of bioactivity values because a noticeably polymodal distribution should ring a few alarm bells for me (clustering in training data may cause validation procedures to arrive at optimistic assessments of model quality).
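
The 2.05 log unit figure quoted above can be checked with the Python standard library instead of an online calculator. This is a minimal sketch that assumes a Gaussian distribution and a range centred on the mean; the function name `central_range_width` is my own.

```python
from statistics import NormalDist


def central_range_width(sd, coverage=0.80):
    """Width of the central interval holding `coverage` of a Gaussian with this SD."""
    lower_tail = (1.0 - coverage) / 2.0
    z = NormalDist().inv_cdf(1.0 - lower_tail)  # about 1.28 for 80% coverage
    return 2.0 * z * sd


print(round(central_range_width(0.8), 2))  # 2.05
```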

Having established an acceptable spread in the bioactivity data I would take a look at where the distribution of bioactivity values is centred. Specifically, I would not attempt to build a QSAR-like ML model unless at least 50% of the compounds in the bioactivity data set exhibited sub-micromolar activity and for a Gaussian distribution this would correspond to a mean bioactivity value of 6. If this seems a bit extreme it’s worth pointing out that to accurately measure an IC₅₀ value of 10 μM requires that the compound be soluble, while neither aggregating nor interfering with assay read-out, at a concentration of 100 μM. Problems with biochemical assays typically increase when you test compounds at higher concentrations and this is one reason that biophysical assays are generally preferred for screening fragments. With sufficient care you can run biochemical assays at high concentrations and the S2009 article by former colleagues shows how you can assess (and potentially correct for) assay interference. Inadequate aqueous solubility, however, is not something that you can generally deal with. One general difficulty when assembling bioactivity data sets from ChEMBL is that it can be very difficult to assess how carefully low affinity compounds have been assayed.
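
The arithmetic behind these criteria is simple enough to sketch. The function names are hypothetical and the pIC₅₀ values are invented for illustration. The tenfold factor for the top test concentration is an assumption inferred from the 10 μM and 100 μM figures above, not a general rule.

```python
import math


def pic50_from_molar(ic50_molar):
    """Convert an IC50 in mol/L to pIC50 (10 uM corresponds to 5, 1 uM to 6)."""
    return -math.log10(ic50_molar)


def fraction_submicromolar(pic50_values):
    """Fraction of compounds with IC50 below 1 uM (pIC50 above 6)."""
    return sum(1 for v in pic50_values if v > 6.0) / len(pic50_values)


def required_top_concentration(ic50_molar, fold=10.0):
    """Highest concentration the compound must tolerate in the assay.

    fold=10 is an assumption taken from the 10 uM / 100 uM example in the text.
    """
    return fold * ic50_molar


# Hypothetical pIC50 values for illustration only.
values = [5.0, 6.5, 7.2, 5.8]
print(fraction_submicromolar(values))      # 0.5
print(required_top_concentration(10e-6))   # roughly 1e-4 M, i.e. 100 uM
```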

Before starting to assemble a data set for training QSAR-like ML models I would also assess the target from an assay perspective (in a real world drug discovery scenario this assessment would be done in collaboration with bioscientists). In particular, I would be looking for indications, such as k_inact values being reported, of activity being due to irreversible mechanisms of action. The bioactivity of an irreversible covalent inhibitor can be considered to be 'two-dimensional' (affinity for formation of non-covalently bound target-ligand complex and rate constant for covalent bond formation) and I'll point you to S2016 and McW2021 for more information. It is important to have sufficient spread both in the k_inact and in K_i values when building QSAR-like ML models for irreversible inhibitors and you also need to be aware of any limits that the assays place on values that can be reliably quantified. It is common for IC₅₀ values to be reported in the literature for irreversible inhibitors although you can use such data in drug discovery if you run the assays carefully (see T2021). However, it's important to bear in mind that using a single data value to quantify the bioactivity of an irreversible inhibitor necessarily results in information loss and that the ChEMBL curation procedures do not generally capture assay protocols at the level of detail that would be required for combining IC₅₀ values from different studies even when inhibition is reversible. This should not be taken as a criticism of ChEMBL and I consider recording assay protocols in this level of detail to be well beyond the call of duty for those curating the bioactivity data.
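
To illustrate why a single number loses information for an irreversible inhibitor, here is a minimal sketch based on the standard two-step scheme (reversible binding followed by covalent bond formation). It assumes inhibitor in large excess and no competing substrate, it writes the inactivation constant as `K_I`, and the two parameter sets are hypothetical.

```python
import math


def k_obs(k_inact, K_I, inhibitor_conc):
    """Observed pseudo-first-order inactivation rate constant (per second)."""
    return k_inact * inhibitor_conc / (K_I + inhibitor_conc)


def fraction_inactivated(k_inact, K_I, inhibitor_conc, time_s):
    """Fraction of target covalently modified after incubating for time_s seconds."""
    return 1.0 - math.exp(-k_obs(k_inact, K_I, inhibitor_conc) * time_s)


# Two hypothetical inhibitors with the same k_inact/K_I ratio (1e4 per M per s).
a = {"k_inact": 0.01, "K_I": 1e-6}
b = {"k_inact": 0.001, "K_I": 1e-7}

for conc in (1e-8, 1e-5):  # mol/L
    print(conc, k_obs(conc and a["k_inact"], a["K_I"], conc), k_obs(b["k_inact"], b["K_I"], conc))
```

At the low concentration the two inhibitors are nearly indistinguishable, while at the high concentration the observed rates differ by almost an order of magnitude. The extent of inhibition also depends on incubation time, which is one reason why IC₅₀ values from different protocols are hard to combine.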

Now let’s take a look at a scenario in which the objective is to initiate a drug discovery project (as opposed to merely building QSAR-like ML models for the purpose of publication). One point that I really do need to stress is that you’re far from helpless if the data available in ChEMBL do not satisfy the requirements for building QSAR-like ML models. First, you can try to source structural analogs of bioactive compounds (there are many more options these days for doing this than when I worked in industry and you can also look beyond ChEMBL, in patents for example, when identifying bioactive compounds) and, in any case, you’re going to need to source pure samples for compounds to check that they are indeed bioactive. Second, you can use the structures of the active compounds to set up queries for pharmacophore matching and molecular shape matching (see GGP1996 | N2010). Third, if structural information is available for the target you can investigate how the active compounds might be interacting with the target and use this information to source potentially active compounds (these days it is feasible to use free energy calculations to predict affinity in addition to the scoring functions that have long been used for virtual screening and I’ll point you to C2021 | MH2023 | C2023). Fourth, you can look for structure-activity relationships (see SHC2005 for an early example of this and the more recent S2025 study which provides software) in the bioactivity data and one way of achieving this is to search for ‘activity cliffs’ (significant differences in bioactivity for pairs of structurally similar compounds; see M2006 | GvD2008 | SB2012 | SHB2019 | vT2022) or more generally by analysing bioactivity of neighbourhoods around bioactive compounds.
Fifth, you can look for instances of increased polarity (such as replacement of aromatic CH with aromatic N) being well-tolerated from the perspective of bioactivity (this can be thought of both in terms of lipophilic efficiency and as a variation on the activity cliff theme). I should point out that the approaches that I've mentioned in this paragraph can be accommodated within an AI framework if you're prepared to think beyond ML in your definition of AI.
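
A search for activity cliffs can be sketched as a pairwise scan. This is an illustration and not the method used in any of the cited studies: fingerprints are represented as plain sets of hypothetical feature identifiers (in practice they would come from a cheminformatics toolkit), and the similarity and activity thresholds are arbitrary assumptions.

```python
from itertools import combinations


def tanimoto(features_a, features_b):
    """Tanimoto similarity between two sets of structural features."""
    union = len(features_a | features_b)
    return len(features_a & features_b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.7, min_delta=2.0):
    """Return pairs that are structurally similar but differ greatly in activity.

    compounds: list of (compound_id, feature_set, bioactivity) tuples, with
    bioactivity on a logarithmic scale such as pIC50.
    """
    cliffs = []
    for (id1, fp1, act1), (id2, fp2, act2) in combinations(compounds, 2):
        similarity = tanimoto(fp1, fp2)
        delta = abs(act1 - act2)
        if similarity >= min_similarity and delta >= min_delta:
            cliffs.append((id1, id2, similarity, delta))
    return cliffs


# Hypothetical compounds for illustration only.
compounds = [
    ("CPD-1", frozenset({1, 2, 3, 4, 5, 6, 7, 8}), 8.1),
    ("CPD-2", frozenset({1, 2, 3, 4, 5, 6, 7, 9}), 5.6),
    ("CPD-3", frozenset({20, 21, 22, 23}), 7.9),
]
cliffs = find_activity_cliffs(compounds)
print(cliffs)
```

Only the CPD-1/CPD-2 pair is flagged here. The pairwise scan is quadratic in the number of compounds, which is acceptable for the data set sizes typically available for a single target.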

Let’s now suppose that you can satisfy the data requirements for building QSAR-like ML models for the target of interest with data in ChEMBL. Does this mean that you can whip up some QSAR-like ML models, fire up your generative AI and have clinical candidates condensing out of the ether? I think not and one implication of being able to satisfy the data requirements for building QSAR-like ML models is that others will have worked hard in the past trying to get to where you’d like to be in the future. Before you even start to build QSAR-like ML models you’ll need to assess the earlier work from the perspectives of both intellectual property and understanding why it didn't lead to clinical candidates. There are many rabbit holes that you can disappear down in drug discovery and here’s some advice from Otto von Bismarck (ironically it was a young, emotionally unstable, half-English Kaiser with a withered arm who brought down the Iron Chancellor):

Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.

If the available data do indeed satisfy the requirements for building QSAR-like ML models then it’s a pretty safe assumption that many of the data values will correspond to compounds from one or more structural series (see Figure 1 below which was taken from a previous post). Under this scenario the distribution of data points in the descriptor space is likely to be very uneven and you should anticipate that ‘global’ QSAR-like ML models built using such data will actually be ensembles of local models. One consequence of what I sometimes refer to as ‘clustering’ in the descriptor space is that what you might think is an interpolation is actually an extrapolation (take a look at the point highlighted by the arrow in Figure 1). Clustering in the descriptor space can also cause validation procedures to arrive at optimistic assessments of model quality because most data points have close neighbours and this can lead to overfitting (I discovered at EuroQSAR back in 2016 that some consider it rather uncouth to mention the H2003 study). Correlations between descriptors and related metrics such as Mahalanobis distance become less meaningful when there is a lot of clustering in the descriptor space. This in turn has implications for principal component analysis (commonly used to assess dimensionality of data sets and eliminate correlations between descriptors) and for methods such as PLS (see K1999) that aim to account for correlations between descriptors in regression analysis.
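
One commonly used way to probe whether clustering is flattering a model is to hold out an entire structural series at a time rather than sampling compounds at random. The sketch below shows only the splitting step; it is offered as an illustration and not as a procedure taken from the studies cited above, and the series labels are hypothetical.

```python
def leave_one_series_out(series_labels):
    """Yield (held_out_series, train_indices, test_indices) for each series.

    series_labels: one label per compound giving the structural series to
    which that compound has been assigned.
    """
    for held_out in sorted(set(series_labels)):
        test_idx = [i for i, s in enumerate(series_labels) if s == held_out]
        train_idx = [i for i, s in enumerate(series_labels) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical series assignments for six compounds.
labels = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_series_out(labels))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

If performance collapses when a series is held out, the 'global' model is probably behaving as an ensemble of local models.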

For reasons outlined in the previous paragraph I wouldn’t generally combine data from different structural series when building QSAR-like ML models. I would, however, look for relationships between different structural series by, for example, aligning their defining scaffolds (or structural prototypes if you prefer) because this may allow the SAR observed for one scaffold to be overlaid onto another scaffold. Before attempting to build a QSAR-like ML model I would plot pIC₅₀ against calculated logP for structural series of interest with a view to assessing response of bioactivity to increased lipophilicity (a weak correlation between bioactivity and lipophilicity is desirable but if this is not the case then the response should at least be relatively steep). I would also fit a straight line to the plot of pIC₅₀ versus calculated logP because this allows the steepness of the response to be quantified and the residuals can be used (as discussed in ‘Alternatives to ligand efficiency for normalization of affinity’ section of K2019) to quantify the extent to which individual pIC₅₀ values beat the trend in the data (this information can be useful to medicinal chemists who wish to think about SAR although I have to admit that "the most interesting SAR is likely to be associated with the most deviant values" actually refers to the youthful antics of the Honourable former Member for Witney). Having performed these simple analyses of the bioactivity data I would attempt to build QSAR-like ML models for each structural series of interest.
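
The straight-line fit and the residuals can be sketched with ordinary least squares. The function names and the four data points are hypothetical and chosen only to make the arithmetic easy to follow; a real analysis would be done for each structural series separately.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(x, y, slope, intercept):
    """Positive residuals mark compounds that beat the lipophilicity trend."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical series for illustration only.
clogp = [1.0, 2.0, 3.0, 4.0]
pic50 = [5.5, 6.0, 7.5, 7.0]

slope, intercept = fit_line(clogp, pic50)
resid = residuals(clogp, pic50, slope, intercept)
print(round(slope, 2), round(intercept, 2))  # 0.6 5.0
print([round(r, 2) for r in resid])          # [-0.1, -0.2, 0.7, -0.4]
```

Here the third compound beats the trend by 0.7 log units, which makes it the natural place to start looking for SAR that isn't simply a consequence of lipophilicity.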

This is a good point at which to wrap up and I'll share some thoughts on the use of QSAR-like ML models in drug design. Back in 2009 I discussed (see K2009) the difference between hypothesis-driven molecular design and prediction-driven molecular design and I suggest that the former can be accommodated within an AI design framework. Some who assert the value of QSAR-like ML models for drug design appear to treat drug design as an exercise in prediction and a point that I've been crapping on about for quite a few years (see this post from January 2015) is that it is more appropriately seen in a Design of Experiments framework (generate the necessary data as efficiently as possible). For many drug discovery projects the available data will not satisfy the requirements for building QSAR-like ML models until relatively late in the project and in some cases clinical candidates will be discovered without ever being able to satisfy the data requirements for building QSAR-like ML models (this is more likely to be the case when bioactivity cannot be represented by a single data value as is the case for modalities such as irreversible inhibition and targeted protein degradation). I consider it essential to account for numbers of adjustable parameters and for correlations between descriptors (or features if you prefer) when building QSAR-like ML models, and I’m also concerned that the challenges presented by clustering in descriptor spaces are not properly acknowledged. It also needs to be said that it is consideration of exposure that differentiates drug design from ligand design and I recommend that everybody working in drug discovery and chemical biology read the SR2019 article.

Property Forecast Index Validated

2025-04-01T00:29:00.012+01:00

I arrived in Korea on Friday night and am greatly enjoying it here. Photos below show the Jungbu Dried Seafoods Market near where I'm staying and dinner on Sunday (spicy beef noodles).

I visited the War Memorial on Sunday and took selfies with the Shenyang J-6 (Chinese version of MiG-19) 'liberated' by Capt. Lee Woong-pyeong when he defected to South Korea on 25th February 1983, a 'liberated' T-34 (as Uncle Joe is said to have observed, quantity has a quality all of its own) and Great Leader's car (also 'liberated' although it was not clear exactly when).

So enough of the travel photos for now and let's get back to the science. Regular readers (both of them) of this blog will be well aware of my visceral dislike for drug design metrics. One reason for this visceral dislike is that I consider these metrics to trivialise the problems faced by medicinal chemists and I remain sceptical that one can make meaningful predictions of developability or likelihood of clinical success for compounds based only on their chemical structures without knowing anything about their biological activities. One metric that I have criticised harshly in the past is property forecast index (PFI) which was originally introduced as solubility forecast index (SFI). Specifically, I denounced SFI as a ‘draw pictures and wave arms’ data analysis strategy and privately I even considered the possibility that it had been created by a toddler armed with a box of colored crayons.

Let’s take a look at the HY2010 article in which SFI was introduced. Proprietary aqueous solubility measurements (continuous variable) were first processed to assign compounds to one of three aqueous solubility categories. Histograms showing the proportions of measurements in each aqueous solubility category were created by binning values of SFI and of c log D_pH7.4 and the histograms were compared visually:

This graded bar graph (Figure 9) can be compared with that shown in Figure 6b to show an increase in resolution when considering binned SFI versus binned c log D_pH7.4 alone.

Recently, I have been forced to revise my negative view of PFI and I have to admit that it pains me deeply to realise that I could have been so utterly wrong for so long in my assessment of what is actually an elegant and highly-predictive drug design metric. Indeed I have now come to the conclusion that the only reason that the Journal of Medicinal Chemistry did not include PFI in its nomination for the Nobel Prize in Physiology or Medicine was that the introduction of the Ro5, LipE and Fsp3 principles led directly to so many marketed drugs being approved.

What has caused such a fundamental shift in my views? First, PFI is highlighted in the European Federation of Medicinal Chemistry (EFMC) ‘Best Practices from Hits to Lead Generation’ webinar. Now it goes without saying that EFMC includes some of the sharpest minds in medicinal chemistry and, given that they consider PFI to be sufficiently important for inclusion in a best practices webinar, it became abundantly clear that I needed to revise my hopelessly naïve thinking. Let’s join the webinar at 27:53 and you’ll see in the webinar slide that SFI (as PFI was originally introduced) has been strongly endorsed by Practical Cheminformatics, a blog that many, including me, accept without question as the source of a number of fundamental ground truths in the AI field.

However, what convinced me of the sublime elegance and extreme predictivity of PFI is a seminal study by the world-renowned expert on tetrodotoxin pharmacology, Prof. Angelique Bouchard-Duvalier of the Port-au-Prince Institute of Biogerontology, working in collaboration with the Budapest Enthalpomics Group (BEG). The manuscript has not yet been made publicly available although I was able to access it with the help of my associate ‘Anastasia Nikolaeva’ (not sure exactly what she’s doing these days although she did post a photo from Pyongyang showing her and a burly chap with a toothy grin and a bizarre haircut). There is no doubt that this genuinely disruptive study will comprehensively reshape the predictive ADME landscape, enabling drug discovery scientists, for the very first time, to make accurate predictions for developability and probability of clinical trial success using only chemical structures as input.

Prof. Bouchard-Duvalier’s seminal study clearly demonstrates that graphical presentation of categorized continuous data outperforms regression analysis performed on the uncategorized continuous data. The math is truly formidable (my rudimentary understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the quadrupole-normalized polarizability tensor before applying the Barone-Samedi transformation, followed by hepatic eigenvalue extraction using an algorithm devised by E. V. Tooms (a reclusive Baltimore resident whose illustrious research career in analytic topology was abruptly halted almost 31 years ago by an unfortunate escalator accident). The incisive analysis of Prof. Bouchard-Duvalier shows without a shadow of doubt that the data visualization used to establish PFI as a fundamental drug design principle will reliably and robustly outperform all AI approaches to prediction of aqueous solubility. Furthermore, ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which the beaming BEG director Prof. Kígyó Olaj explains that, “Possibilities are limitless now that we can accurately and robustly predict the developability of a compound using only its chemical structure as input and we can now finally consign regression analysis to the dustbin of history. Surely the Editors of Journal of Medicinal Chemistry will recognize the impact of PFI on real world drug discovery when they make their Nobel Prize nominations later this year.”

Thinking About Aqueous Solvation

2025-03-09T19:50:00.014+00:00

Given that it was International Women's Day yesterday, I'll open the post (and blogging for 2025) with a photo of a gravestone at St James' Church in Bramley (Hampshire).

In the current post I’ll be taking a look at some aspects of aqueous solvation and Richard Wolfenden’s 1983 “Waterlogged Molecules” article (W1983) is still worth reading today (as an aside, Prof Wolfenden will turn ninety in May of this year and hopefully mentioning this won't put what is called "goat mouth" in my native Trinidad and Tobago on him as I did for Oscar Niemeyer with the words "ele vive ainda" while studying Portuguese in 2012). As noted in W1983 the formation of a target-ligand complex requires partial desolvation of both target and ligand:

When biological compounds combine, react with each other, or change shape in watery surroundings, solvent molecules tend to be reorganized in the neighborhood of the interacting groups.

Formation of a target-ligand complex can also be seen as an “exchange reaction” and this point is very well made in SGT2012:

Molecular binding in an aqueous solvent can be usefully viewed not as an association reaction, in which only new intermolecular interactions are introduced between receptor and ligand, but rather as an exchange reaction in which some receptor–solvent and ligand–solvent interactions present in the unbound state are lost to accommodate the gain of receptor–ligand interactions in the bound complex.

In HBD3 I briefly discuss ‘frustrated hydration’ as a phenomenon that could be exploited in drug design and I’ll quote from the Summary section of W1983:

When two or more functional groups are present within the same solute molecule, their combined effects on its free energy of solvation are commonly additive. Striking departures from additivity, observed in certain cases, indicate the existence of special interactions between different parts of a solute molecule and the water that surrounds it.

I’ll try to explain how this could work for ligand design and let’s suppose that we have two polar atoms that are close together in the binding site. The proximity of the polar atoms in the binding site means that water molecules forming ideal interactions with the polar atoms in the binding site are also likely to be close together. However, the mutual proximity of the water molecules can lead to unfavourable interactions between the water molecules which ‘frustrate’ the (simultaneous) hydration of the two polar atoms in the binding site. Now if we design a ligand with two polar atoms positioned to form good interactions with polar atoms in the binding site it is likely that these will also be in close proximity and that their hydration will be similarly frustrated. I would generally anticipate that frustration of hydration will not be handled well by implicit solvent models (RT1999 | FB2004 | CBK2008 | KF2014) or computational tools such as WaterMap that calculate energetics for individual water molecules (especially in cases where the two hydration sites cannot be simultaneously occupied).

To illustrate frustration of hydration I’ve taken a graphic from a talk from 2023. The unfavorable interactions between solvating water molecules that frustrate hydration are shown as red double-headed arrows (in some cases these interactions will be repulsive to the extent that only one of the hydration sites can be occupied at a time). You’ll also notice two thick green lines in the right hand panel and these show secondary interactions that stabilize the bound complex. Secondary interactions of this nature were discussed in a molecular recognition context in the JP1990 study and the observation (see A1989) that pyridazine is a better hydrogen bond acceptor (HBA) than its pK_a would have you believe can be seen in a similar light. Secondary interactions like these only enhance affinity when the proximal polar atoms are of the same ‘type’ (the proximal polar atoms in the 1,8-naphthyridine are both HBAs) and we should anticipate that the secondary interactions for the contact between pyrazole and the ‘hinge’ of a tyrosine kinase will be deleterious for affinity. In contrast to secondary interactions, frustration of hydration can be beneficial for affinity even when the proximal polar atoms are of opposite types, as would be the case for an HBA that is near to a hydrogen bond donor (HBD).

While it is clearly important to account for aqueous solvation when using physics-based approaches for prediction of binding affinity, passive permeability and aqueous solubility, the measurement of gas-to-water transfer free energy is not exactly routine (I’m not aware that any companies offer measurement of aqueous solvation energy as a service nor do I believe that this is an activity that would be readily funded). Measurements for aqueous solvation energy reported in the literature tend to be for relatively volatile compounds and I’ll direct readers to the C1981, W1981 and A1990 studies.

A view that I've held for many years is that a partition coefficient could be used as an alternative to gas-to-water transfer free energy for studying aqueous solvation. It's also worth noting that when we think about desolvation in drug design we're often considering the energetic cost of bringing polar atoms into contact with non-polar atoms (as opposed to transferring the polar atoms to gas phase). Partition coefficient measurement is a lot more routine than solvation free energy measurement and most drug discovery scientists are aware that the octanol/water partition coefficient (usually quoted as its base 10 logarithm, logP) is an important design parameter. However, the octanol/water partition coefficient is not useful for assessing aqueous solvation because the hydroxyl group of octanol can form hydrogen bonds with solutes and the water-saturated solvent is actually quite 'wet' (the DC1992 study reports that the room temperature solubility of water in octanol is 2.5 M). If we’re going to use partition coefficient measurements for studying aqueous solvation then I would argue that we should make these measurements with a saturated hydrocarbon such as cyclohexane or hexadecane that lacks hydrogen bonding capability.
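
A partition coefficient can be read directly as a transfer free energy, which is what makes it a candidate surrogate here. The sketch below applies the standard relationship ΔG = −RT ln P for transfer from water to the organic phase; the function name is my own, the temperature of 298.15 K is an assumption, and the result refers to the concentration scale on which P was measured.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol K)


def transfer_free_energy_kj(log_p, temperature_k=298.15):
    """Free energy (kJ/mol) for transfer from water to the organic phase.

    Negative values mean that the solute prefers the organic phase.
    """
    return -GAS_CONSTANT * temperature_k * math.log(10.0) * log_p / 1000.0


print(round(transfer_free_energy_kj(1.0), 2))   # -5.71
print(round(transfer_free_energy_kj(-2.0), 2))  # 11.42
```

Each log unit is therefore worth about 5.7 kJ/mol (roughly 1.4 kcal/mol) at room temperature.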

Here’s another slide from that 2023 talk showing that pyridine is lipophilic for octanol/water but hydrophilic for hexadecane/water. The difference in the logP values for a solute is sometimes referred to as ΔlogP (it is equivalent to the octanol/hexadecane logP value with both solvents water-saturated) and can be considered to quantify the solute’s ability to form hydrogen bonds (see Y1988 | A1994 | T2008). I'll mention in passing that ΔlogP measurements with toluene as the less polar organic solvent have been used to study intramolecular hydrogen bonding (see S2013 | C2016 | C2018).
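
Written out, the water phase cancels when the two logP values are subtracted. The sign convention used here (octanol minus hexadecane) is an assumption on my part and the concentrations refer to a solute S at equilibrium in the water-saturated phases:

```latex
\Delta\log P
  = \log P_{\mathrm{oct/w}} - \log P_{\mathrm{hxd/w}}
  = \log_{10}\frac{[S]_{\mathrm{oct}}}{[S]_{\mathrm{w}}}
  - \log_{10}\frac{[S]_{\mathrm{hxd}}}{[S]_{\mathrm{w}}}
  = \log_{10}\frac{[S]_{\mathrm{oct}}}{[S]_{\mathrm{hxd}}}
```

With this convention a large positive ΔlogP indicates a solute that gains a lot from the hydrogen bonding available in wet octanol but not in the alkane.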

It should be stressed that people have been thinking about using different organic solvents for partition coefficient measurement for a lot longer than me. My view, expressed in K2013, is that the justification in H1963 for using octanol was partly based on a misinterpretation of Collander's C1951 study. I really like this quote from Alan Finkelstein's 1976 article (as an aside the partition coefficient literature is not exactly awash with alkane/water logP measurements for amides and the article reports measured values of the hexadecane/water partition coefficient for acetamide, formamide, urea, butyramide and isobutyramide):

It has long been fashionable to worry about which organic solvent (and polarity) is the best model for the lipoidal region of a particular cell membrane (Collander, 1954). These solvents have ranged from isobutanol (the most polar) to olive oil (the least polar). I have never understood the point of this. If the lipoidal region of the plasma membrane is a lipid bilayer, then clearly the appropriate model solvent is hydrocarbon. For artificial bilayers this is obviously so. I chose n-hexadecane as the particular hydrocarbon, because its chain length is comparable to that of the fatty acid residues in most phospholipids, and it is conveniently available.

I also need to mention the B2016 study (Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge) since the cyclohexane/water distribution coefficient was used as a surrogate for gas-to-water transfer free energy in the challenge:

The inclusion of distribution coefficients replaces the previous focus on hydration free energies which was a fixture of the past five challenges (SAMPL0-4) [1 | 2 | 3 | 4 | 5 | 6 | 7]. Due to a lack of ongoing experimental work to generate new data, hydration free energies are no longer a practical property to include in blind challenges. It has become increasingly difficult to find unpublished or obscure hydration free energies and therefore impossible to design a challenge focusing on target compounds, functional groups or chemical classes.

I consider initiatives such as the SAMPL5 cyclohexane/water distribution challenge to be valuable for assessing model predictivity in an objective and transparent manner. Generally, I would avoid including logD measurements for compounds that are significantly ionized under experimental conditions because these require that account be taken of ionization when making predictions (better to measure logD at a pH at which ionizable functional groups are not significantly ionized). While challenges such as SAMPL5 are certainly valuable for assessment of predictivity of models, I consider them less useful in model development which requires measured data for structurally-related compounds.

The isosteric pairs 1/2 and 3/4 shown in the graphic below will give you an idea of what I'm getting at. The predicted pK_BHX values taken from K2016 suggest that 1 is less polar than its isostere 2 and I'd expect 3 to be more polar than 4.

While the three N-butylated purines shown in the graphic below are not strictly isosteric I would consider it valid to interpret the cyclohexane/water logP values taken from S1998 as reflecting differences in hydrogen bond acceptor strength.

This is a good point at which to wrap up and, given the fundamental importance of aqueous solvation in biomolecular recognition and drug design, I see tangible advantages in having a large body of measured data in the public domain. My view is that to measure gas-to-water transfer free energy for significant numbers of compounds of interest to drug discovery scientists would be both technically demanding and unlikely to get funded although I would be delighted to be proven wrong on either point. This means that we need to learn to use other types of data in order to study aqueous solvation and my view is that an alkane/water partition coefficient would be the best option. Using alkane/water partition coefficients as an alternative to gas-to-water transfer free energies for studying aqueous solvation would also enable enthalpic (see RT1984) and volumetric aspects of aqueous solvation to be investigated more easily.

Natural Intelligence?

2024-12-31T22:55:00.037+00:00

**********************************************************

My pulse will be quickenin'

With each drop of strychnine

We feed to a pigeon

It just takes a smidgin

To poison a pigeon in the park

Tom Lehrer, Poisoning Pigeons in the Park | video

*********************************************************

I’ll be reviewing the H2024 study (Occurrence of “Natural Selection” in Successful Small Molecule Drug Discovery) in this post. Derek has already posted on the H2024 study which has been included in the BL2024 Virtual Special Issue on natural products (NPs) in medicinal chemistry. I'll also mention reviews here at Molecular Design of the related studies (4) (see post) and (24) (see post). As is usual for Molecular Design reviews of literature I have used the same reference numbers that were used in H2024 and quoted text is indented with any comments by me in square brackets and italicised in red. Given the serious concerns I have about H2024 this is going to be a long post and there are a couple of disclaimers that I need to make before starting the review:

I regard identification and biological characterisation of NPs as vital scientific activities that should be generously funded and Derek puts it very well in his recent post ("When you see specific and complex small molecules that living creatures are going to the metabolic trouble to prepare, there are surely survival-linked functions behind them."). In particular, I see it as important that NPs be screened in diverse phenotypic assays and here’s a link to the Chemical Probes Portal. While my criticisms of H2024 are certainly serious it would be grossly inaccurate to take these criticisms as indicative of an anti-NP position.
Automation of workflows (N2017) and generation of datasets from databases such as ChEMBL are far from trivial and (33), which highlights some of the challenges faced by researchers in this area, was the subject of a recent post at Molecular Design. I consider method development in this area to be an important cheminformatic activity that should be adequately supported. It must also be stressed that the design, building and updating of databases such as ChEMBL (G2012 | B2014 | P2015 | G2017 | 23) are vital scientific activities that should be generously funded (had it not been for the vision and foresight of the creators of the PDB over half a century ago it is improbable that the 2024 Chemistry Nobel Prize would have been awarded for “computational protein design” and “protein structure prediction”). While my criticisms of H2024 are certainly serious it would be grossly inaccurate to take these as criticisms of the automated dataset generation described in the study (and recently published in H2024b) or of the contributions by a number of individuals that have made ChEMBL an invaluable resource for drug discovery scientists and chemical biologists.

Hampi, November 2013

Having made the disclaimers, I’ll open my review of H2024 with some general observations. First, I do not consider that H2024 presents any insights of practical value to medicinal chemists nor do I consider the analyses presented in the study to support the assertion that “there is untapped potential awaiting exploitation, by applying nature’s building blocks─’natural intelligence’─to drug design” (in my view the use of the term “natural intelligence” does rather endow the study with what I’ll politely refer to as a distinctly pastoral odour). Second, the results of the analyses presented in H2024 do not demonstrate any tangible benefits from the drug design perspective of incorporating structural features that have been anointed as 'natural' by the authors (my view is that it would be extremely difficult to design data analyses to address the relevant questions in an objective manner). Third, the authors of H2024 present a ‘scaffold-centric’ view of NPs in which the naturalness of NPs is due to cyclic substructures present within their chemical (2D) structures (it is almost as if these 'natural' substructures are considered to be infused with 'vital force') and I would question whether this is a realistic view from the molecular recognition and physicochemical perspectives. Fourth, the meaning of what the authors of H2024 are calling 'enrichment' of pseudo-NPs (PNPs) in clinical compounds is unclear and, in any case, the 'enrichment' values do seem rather low (never more than twofold) when you consider the numbers of compounds that successful discovery project teams typically have to synthesize in order to deliver a drug that gets to market.

It's not clear (at least to me) what the authors of H2024 mean by ‘natural selection’ and at times their view of natural selection appears to be closer to Lysenkoism than Darwinism. For example, they assert in the conclusions section of H2024 that “NP structural motifs are provided predesigned by nature, constructed for biological purposes as a result of 4 billion years of evolution.” Design actually has no place in natural selection and perhaps the authors are thinking of 'Intelligent Design' which is a doctrine with many adherents in the Creationist community. While I don’t dispute that the chemical structures of many clinical compounds contain substructures that are also found in the chemical structures of NPs, I think that it would be extremely difficult to objectively compare different explanations for the observations (it's worth remembering that correlation does not imply causation). The explanation favoured by the authors of H2024 is that compounds assembled from Nature’s building blocks are ‘better’ and a stated aim of the study is “to seek further support for the existence of ‘natural selection’ in drug discovery” (this video will give readers an idea of what the late great Dave Allen might have made of this). In my view the data analyses presented in H2024 are not actually based on statistics and are therefore unfit for the purpose of testing hypotheses. Put another way, if you're going to use data analysis to look for something then it would be a good idea to use methods capable of telling you that you haven't found what you were looking for.

The data analyses in H2024 are largely based on quantities (PNP_Status | Frag_coverage_Murcko | NP-likeness) that are calculated from the chemical (2D) structures of compounds. However, the authors do not state which software was used to perform the calculations and, had I been a reviewer, I would have drawn their attention to the following directive in the Data Requirements section in the J Med Chem Author Guidelines (accessed 27-Dec-2024):

9. Software. Software used as a part of computer-aided drug design should be readily available from reliable sources, and the authors should specify where the software can be obtained.

As was the case for my review of (24) I see much of the analysis in H2024 as relatively harmless “stamp collecting” (in contrast, as discussed in KM2013, I consider presentations and analyses of data that exaggerate trend strength, such as those used in the HMO2006, LS2007, LBH2009, HY2010 and TY2020 studies to be anything but harmless). The analyses that I’ll be examining in this post are of comparisons between clinical compounds and reference compounds although I'll comment in general terms on the analyses of time-dependencies of characteristics of clinical compounds. My general criticism of H2024 is not that the analyses presented by its authors are necessarily invalid but that they fail to provide any useful insight and I’ll share an insightful observation by Manfred Eigen (1927-2019):

A theory has only the alternative of being right or wrong. A model has a third possibility: it may be right, but irrelevant.

I first encountered analyses of time-dependencies of drug properties about two decades ago and rapidly came to the conclusion that some senior medicinal chemists where I worked had a bit too much time on their hands. The fundamental flaw in the interpretation of these analyses is that time-dependencies of the properties of drugs and other clinical compounds are presented as causes rather than effects and it has never been clear how medicinal chemists working on drug discovery projects in the real world should use the results from such analyses. The authors claim that “changes to drug properties over time are significant” and I would challenge them to present even a single example of such analysis being used to meaningfully inform decision-making in a drug discovery project. It must be stressed that my criticism of analyses of time-dependency of the properties of drugs and other clinical compounds is simply that they don't provide useful insights and not that the analyses are necessarily invalid. That said, I do have general concerns about how time-dependencies are compared when some of the properties are expressed as logarithms and some are not. As reviewer I would have recommended that the vertical axis of the plot in the graphical abstract be drawn from 0% to 100% rather than from 30% to ~67%.

As is the case for analyses of time-dependency, my criticism of analyses of the differences between clinical compounds and reference compounds is that they don’t provide useful insight and there is no suggestion that the analyses are necessarily invalid. Before looking at the analyses presented in H2024 I’ll quote from the abstract of (24) because this will give you an idea of what I mean by analyses not providing useful insight:

Drugs are differentiated from target comparators by higher potency, ligand efficiency (LE), lipophilic ligand efficiency (LLE), and lower carboaromaticity.

As I noted in this post (this focused principally on the invalidity of the LE metric as discussed in NoLE) reporting that an analysis has shown drugs to be differentiated by potency from target comparators does seem to be stating the obvious and, given how LE and LLE are defined, it is perhaps not the most penetrating of insights to observe that values of these efficiency metrics tend to be greater for drugs than for comparator compounds. While the observation of lower carboaromaticity of drugs relative to comparator compounds is non-obvious, it does not constitute information that can be used for medicinal chemistry decision-making in specific discovery projects (as we noted in KM2013 carboaromaticity and lipophilicity can both be reduced simply by replacing a benzene ring with benzoquinone).

Let’s take a look at how this type of analysis is used in H2024. The authors of H2024 note that “comparing Figure 3a,b shows a clear ‘enrichment’ of PNPs in clinical compounds versus reference compounds in the post-2008 period” and two of these authors, writing in (17), assert that “PNPs have increasingly been explored in recent drug discovery programs, and are strongly enriched in clinical compounds”. What the authors of H2024 are calling 'enrichment' is rather different to the enrichment in structural features that results from high-throughput screening (HTS) and it’s important to understand the difference. Let’s suppose that we’ve screened a library of compounds of which 1% are pyrimidines and 1% are pyrazines and we find that 10% of the hits are pyrimidines and 0.1% are pyrazines (to simplify things you can assume there is no compound in the library with a pyrimidine and a pyrazine in its chemical structure). In this case we would conclude that the process of screening has resulted in a tenfold enrichment for pyrimidines and a tenfold impoverishment for pyrazines. Now let's create a 'selected azines' category by combining the pyrimidines and pyrazines which as a structural class comprise 2% of the screening library compounds but 10.1% of the hits. What I'm getting at here is that enrichment of a more inclusive structural class such as 'selected azines' (or PNPs) does not imply that each and every one of the structural classes covered by the inclusive structural class definition will also be enriched.
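The arithmetic behind the pyrimidine/pyrazine example can be checked with a few lines of Python. This is just a sketch of the HTS-style enrichment calculation using the percentages quoted above; the function and variable names are my own and are not taken from H2024.

```python
def enrichment(percent_of_hits, percent_of_library):
    """Ratio of a class's share of the hits to its share of the screened library."""
    return percent_of_hits / percent_of_library

# Percentages from the worked example in the text.
library = {"pyrimidines": 1.0, "pyrazines": 1.0}
hits = {"pyrimidines": 10.0, "pyrazines": 0.1}

per_class = {name: enrichment(hits[name], library[name]) for name in library}

# 'Selected azines' pools both classes (no compound belongs to both).
azines_library = sum(library.values())  # 2% of the library
azines_hits = sum(hits.values())        # 10.1% of the hits
azines_enrichment = enrichment(azines_hits, azines_library)

for name, value in per_class.items():
    print(f"{name}: {value:.2f}-fold")
print(f"selected azines: {azines_enrichment:.2f}-fold")
```

The pooled class comes out as roughly fivefold enriched even though one of its two members is tenfold impoverished.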

Now let’s take a look at how the 'enrichment' of PNPs in clinical compounds is assessed in H2024. First, a set of reference compounds is generated for each clinical compound (this is discussed in detail in H2024b) and the sets of reference compounds are combined. 'Enrichment' is then assessed by comparing the fraction of clinical compounds that are PNPs with the fraction of compounds in the combined reference sets that are PNPs. When we assess enrichment of chemotypes in HTS the hits are all selected (by the screening process) from the same reference pool of compounds. In contrast, each clinical compound in the H2024 analysis is associated with a different reference set of compounds (from the perspective of data analysis combining reference sets defined in this manner gratuitously throws information away). As a reviewer I would have pressed the authors to enlighten readers as to how they should interpret the proportions of PNPs in the reference sets for individual compounds.

It's worth thinking about what the reference compound set might look like for a clinical compound that is a PNP. The proportion of PNPs in the reference set will generally be influenced by factors such as availability of data, the ‘rarity’ of the structural features of the drug and the ‘tightness’ of the structure-activity relationship (SAR). A more permissive definition of ‘activity’ would generally be expected to make SAR appear to be less ‘tight’ (or ‘looser’ if you prefer). Compounds were defined as ‘active’ for the analysis on the basis of a recorded pChEMBL value against one of the clinical compound’s targets (as a reviewer I’d have suggested that the authors define the term ‘pChEMBL’) which means that a compound might have been selected for inclusion in a reference set on the basis of an IC₅₀ value of 100 μM.
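For readers unfamiliar with the term, my understanding is that pChEMBL is the negative base-10 logarithm of a molar IC₅₀, XC₅₀, EC₅₀, AC₅₀, Ki, Kd or potency value. The snippet below is a minimal sketch of that conversion (the function name is my own) and shows where a 100 μM compound would sit on this scale.

```python
import math

def negative_log_molar(concentration_molar):
    """Negative base-10 logarithm of a concentration expressed in mol/L."""
    return -math.log10(concentration_molar)

# Illustrative concentrations only; these are not values taken from H2024.
examples = {"100 uM": 100e-6, "1 uM": 1e-6, "10 nM": 10e-9}

for label, concentration in examples.items():
    print(f"{label}: {negative_log_molar(concentration):.1f}")
```

A compound with an IC₅₀ of 100 μM has a value of 4 on this scale, four log units below a 10 nM compound.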

Let’s define 'enrichment' by dividing the fraction of the clinical compounds that are PNPs by the fraction of reference compounds that are PNPs. When we select a reference set for a clinical compound that is a PNP then it’s extremely unlikely that every single compound in the reference set will also be a PNP (especially if we’re accepting compounds with IC₅₀ values of 100 μM as ‘active’) and it’s even less likely that every single compound in the combined reference sets will be a PNP. This means that we should generally expect the clinical compounds that are PNPs to be ‘enriched’ in PNPs when compared with their combined reference sets. We can apply exactly the same logic to conclude that we should expect the combined reference sets for the clinical compounds that are not PNPs to contain at least some PNPs (under this scenario we would conclude that the set of clinical compounds that are not PNPs is infinitely impoverished in PNPs when compared with their combined reference sets). This means that we should expect that the 'enrichment' of PNPs in the clinical compound set in comparison with their combined reference sets will increase with the fraction of clinical compounds that are PNPs.
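A toy calculation makes the dependence explicit. In the sketch below I assume that every clinical compound has a reference set of the same size, that 70% of the reference compounds for a PNP clinical compound are PNPs and that 30% of the reference compounds for a non-PNP clinical compound are PNPs. These two percentages are hypothetical values chosen for illustration and have nothing to do with the numbers in H2024.

```python
def combined_enrichment(f_clinical, ref_pnp_given_pnp, ref_pnp_given_non_pnp):
    """'Enrichment' of PNPs in clinical compounds versus combined reference sets.

    f_clinical: fraction of clinical compounds that are PNPs.
    ref_pnp_given_pnp: PNP fraction in reference sets of PNP clinical compounds.
    ref_pnp_given_non_pnp: PNP fraction in reference sets of non-PNP clinical compounds.
    Assumes (hypothetically) that all reference sets are the same size.
    """
    f_reference = (f_clinical * ref_pnp_given_pnp
                   + (1.0 - f_clinical) * ref_pnp_given_non_pnp)
    return f_clinical / f_reference

fractions = [0.2, 0.35, 0.5, 0.65, 0.8]
enrichments = [combined_enrichment(f, 0.7, 0.3) for f in fractions]

for f, e in zip(fractions, enrichments):
    print(f"clinical PNP fraction {f:.2f} -> 'enrichment' {e:.2f}")
```

The composition of the individual reference sets is held fixed throughout, yet the calculated 'enrichment' rises steadily with the fraction of clinical compounds that are PNPs.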

Let’s take another look at the plot in the graphical abstract which shows the fractions of clinical compounds and reference compounds that are PNPs as a function of time. Notice how the lines tend to be furthest apart when the fraction of clinical compounds that are PNPs is relatively high. As a reviewer, I would have required that the authors examine the correlation between the logarithm of the fraction of clinical compounds and the logarithm of the enrichment (a relatively strong correlation would indicate that the information added by the combined reference sets is minimal). The 'enrichments' calculated from the plot in the graphical abstract are underwhelming (the highest degree of enrichment is the 2014 value of just over 1.5-fold and this value seems very low when you consider the numbers of compounds that successful discovery project teams typically need to synthesize in order to get drugs approved). From 2011 the fraction of clinical compounds that are PNPs exceeds 50% but I wouldn't consider it accurate to use the term "strongly enriched" (17) because the fraction of reference compounds that are PNPs is 40% or greater for this time period (plotting the vertical axis in the graphical abstract from 30% to ~67% creates the illusion that the 'enrichment' is greater than it actually is).
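The correlation check that I would have asked for is easy to set up. The sketch below uses made-up yearly fractions (they are purely illustrative and have not been read from the plot in H2024) and a hand-rolled Pearson correlation coefficient so that it runs without any third-party packages.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical PNP fractions for six consecutive time points.
clinical_fraction = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
reference_fraction = [0.38, 0.40, 0.41, 0.42, 0.42, 0.43]

log_fraction = [math.log10(c) for c in clinical_fraction]
log_enrichment = [math.log10(c / r)
                  for c, r in zip(clinical_fraction, reference_fraction)]

r = pearson(log_fraction, log_enrichment)
print(f"r = {r:.3f}")
```

Because the logarithm of the enrichment is simply the logarithm of the clinical fraction minus the logarithm of the reference fraction, a value of r close to 1 tells you that the reference fraction is contributing very little to the variation in the 'enrichment'.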

I do have a number of other gripes about the data analysis in H2024 but I do also need to take a look at PNPs and the following assertion by the authors is an appropriate point at which to start this discussion:

The PNP concept has been validated by its appearance in the literature (16,17) and by the design of several new classes of biologically active compounds. (18,19) [As a reviewer I would have pressed the authors to clearly articulate the “PNP concept” (just as I would have pressed the authors of this Editorial to clearly articulate the new principles that their nominees for the Nobel Prize in Physiology or Medicine had introduced). My view is that it is verging on megalomania to claim that a concept “has been validated by its appearance in the literature” and I don’t consider (18) to support the claim for “design of several new classes of biologically active compounds”. To support such a claim, one would ideally need to demonstrate that screening of libraries of compounds designed as PNPs resulted in the discovery of viable lead series against a range of therapeutic targets. At absolute minimum, one would need to show that libraries of compounds designed as PNPs exhibited exploitable activity across a range of target-related assays (although interesting, the results from the “cell painting assay” would not by themselves support a claim for “design of several new classes of biologically active compounds”). I should also mention that some in the compound quality field (see B2023 and my review of that article) interpret activity against multiple targets for a set of compounds based on a particular scaffold as evidence for pan-assay interference even when the individual compounds don’t themselves exhibit frequent-hitter behaviour. I don't have access to (19) and am therefore unable to assess the degree to which that article supports the authors' claim for “design of several new classes of biologically active compounds”.]

The PNP status of a compound is determined by how “NP library fragments” (these are cyclic substructures extracted from the chemical structures of compounds in an NP-focussed screening library that had been generated over a decade ago for fragment-based drug discovery) are combined in its chemical structure.

PNP_Status. Compounds were assigned to one of four categories according to their NP fragment combination graphs. (16,17) The NP library fragments used for this purpose are Murcko scaffolds (26) [It would actually be more appropriate to refer to these as ‘Bemis scaffolds’ in order to properly recognize the corresponding author of this article.] (the core structures containing all rings without substituents except for double bonds, n = 1673) derived (16) from a representative set of 2000 NP fragment clusters. (15) [I see this approach as unlikely to capture all the relevant cyclic substructures present in NPs. My view is that it would have been better to first extract the relevant cyclic substructures from the chemical structures of all NPs for which this information is available, and then do the selection and filtering in one or more subsequent steps. The other advantage of doing things this way is that you’ll get a better assessment of the frequencies with which the different cyclic substructures occur in the chemical structures of NPs.] Because of their ubiquitous appearances in NPs, the phenyl ring and glucose moieties were specifically excluded as fragments. (16) [I would expect exclusion of the benzene ring (I consider ‘benzene ring’ more correct than ‘phenyl ring’ in this context) as a fragment to result in a significant reduction in the number of compounds that are considered to be PNPs (and, by implication, the ‘enrichment’ associated with membership of the PNP class). Even though the benzene ring has been excluded for the purpose of assigning PNP status it should still be considered to be one of Nature’s building blocks.]

As I mentioned earlier in the post, the view of NPs presented in H2024 is ‘scaffold-centric’ and I would question how realistic this view is given that non-scaffold atoms at the periphery of a molecular structure will generally be more exposed to targets (and anti-targets) than scaffold atoms at the core of the molecular structure. What I’m getting at here is that it is far from clear how much of a compound’s pharmacological activity can be attributed to the presence of individual substructural features in the chemical structure of the compound (modifying a point made in NoLE, I would argue that the contribution of a structural feature to the binding affinity of a compound is not actually an experimental observable). This is one reason that unless matched molecular pairs are available it would not generally be possible to demonstrate the superiority of one structural feature over another in an objective manner.

Something that you need to pay very close attention to when extracting substructures from chemical structures of compounds is the ‘environment’ of the substructure (I prefer to use the term ‘substructural context’). For example, two piperidine rings linked through nitrogen look very different from the perspective of a therapeutic target protein depending on whether the link is a carbonyl carbon or a tetrahedral carbon (most medicinal chemists will be aware that the protonation states differ but there are also subtle, although still significant, differences in the shape of the piperidine ring in the two substructures). You also need to be aware that fusing rings can have profound effects on physicochemical characteristics and I would consider it a bad idea to extract monocyclic substructures from fused or bicyclic ring systems.

There are some things that don't look quite right and I would have flagged these up if I’d been reviewing the manuscript. Let’s take a look at the first entry (Sotorasib) in Table 1 and you can see that the oxygen of the 2-pyrimidone substructure is coloured lilac indicating that this substructure can be found in the chemical structures of one or more NPs (I would still challenge the view that the result of fusing 2-pyrimidone with pyridine should be considered 'natural' on the basis that the heterocycles from which it is derived are both found in chemical structures of NPs). Now take a look at the second entry (Dolutegravir) in Table 1 and you'll notice that the oxygen in the 4-pyridone substructure is not coloured green. This implies that 4-pyridone does not occur in the chemical structure of any NP and, in the absence of information, I can only assume that it has been anointed as 'natural' because of its structural analogy with pyridine (while there is a nitrogen atom and five trigonal carbon atoms in each substructure the molecular recognition characteristics of the two substructures differ far too much for them to be regarded as equivalent from the perspective of assigning PNP status). Six of the substructures in Figure 5 appear to be in unstable tautomeric forms (first, fifth, ninth, twelfth entries in line 2 | seventh entry in line 3 | first entry in line 5).

I'll conclude my review of H2024 by commenting on claims made by the authors:

This is further evidence that the three NP metrics can be considered as independent measures of clinical compound quality. [I would consider the claim that any of these “NP metrics” can be considered as a measure of “clinical compound quality” to be wildly extravagant (the authors haven't even stated how "clinical compound quality" is defined yet they claim to be able to measure it). I would argue that compound quality cannot be meaningfully compared for clinical compounds that have been developed for different diseases or disorders. Describing a compound as 'clinical' implies that a large body of measured data has actually been generated for it and the authors of H2024 might find it instructive to ask themselves why they think a simple metric calculated from the chemical structure of the compound would be of interest to a project team with access to this large body of measured data. One criticism that I make of drug discovery metrics is that they trivialize drug discovery and we noted in KM2013: “Given that drug discovery would appear to be anything but simple, the simplicity of a drug-likeness model could actually be taken as evidence for its irrelevance to drug discovery.”]

The overall results are supportive of the occurrence of “natural selection” being associated with many successful drug discovery campaigns. [My view is that the authors of H2024 have not clearly articulated what they mean by “natural selection” in the context of this study.] It has been proposed that NP-likeness assists drug distribution by membrane transporters, (21) [The author of (20c) asserts "Over the years, my colleagues and I have come to realise that the likelihood of pharmaceutical drugs being able to diffuse through whatever unhindered phospholipid bilayer may exist in intact biological membranes in vivo is vanishingly low" and, by implication, that entry of the vast majority of drugs into cells is transporter mediated. I keep an open mind on this issue although I note that what is touted by some as a universal phenomenon does seem to have been remarkably difficult to observe directly by experiment. The difficulties caused by active efflux are widely recognized by drug discovery scientists and it may be instructive for the authors of H2024 to consider how an experienced medicinal chemist working in the CNS area might view a suggestion that compounds should be made more like NPs to increase the likelihood of being transporter substrates.] and we further speculate that employing NP fragments may result in less attrition due to toxicity, a major cause of preclinical failure. (55) [This does seem to be grasping at straws. The focus of the cited article is actually clinical failure and not preclinical failure.]

There is untapped potential for further exploitation of currently used and unused NP fragments, especially in fragment combinations and the design of PNPs, without the need to resort to chemically diverse ring systems and scaffolds. [This exemplifies what can be called the ‘Ro5 mentality’ (‘experts’ advising medicinal chemists to not explore but to focus on regions of chemical space that have been blessed by the ‘experts’). As I note in this blog post Ro5 (as it is stated) is not actually supported by data and in NoLE, I advise drug designers not to “automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.” An equally plausible 'explanation' for the observation that a high fraction of clinical compounds are PNPs is simply that medicinal chemists are working with what they're most familiar with (in this case the advice would be to look beyond Nature's building blocks for inspiration).] To exploit these opportunities, “NP awareness” needs to be added to the repertoire of medicinal chemists. [My view is that it would be more important for critical thinking to be added to the repertoire of medicinal chemists so they are better equipped to assess the extent to which conclusions and recommendations of studies like H2024 are actually supported by data.]

In short, applying nature’s building blocks─natural intelligence─to drug design can enhance the opportunities now offered by artificial intelligence. [In my view "natural intelligence" appears to be arm-waving that is neither natural nor intelligent.]

This is a good point to wrap up and to also conclude blogging for the year. My new year wish is for a kinder, happier and more peaceful World in 2025 and I'll leave you with a photo of BB and Coco in the study here in Maraval. They had been helping me with this post before I unwisely decided to explain ligand efficiency to them. Let sleeping dogs lie I guess.

Assessment of AI-generated chemical structures using ML

2024-10-20T18:39:00.024+01:00


In an earlier post I considered what it might mean to describe drug design as AI-based. In this post I’ll take a general look at using machine learning (ML) to predict biological activity (and other pharmaceutically-relevant properties) for AI-generated chemical structures. Whether or not ML models ultimately prove to be fit for this purpose it is worth pointing out that many visionaries and thought leaders who tout computation as a panacea for humanity’s ills fail to recognize the complexity of biology (take a look at In The Pipeline posts from 2007 | 2015 | 2024). One point worth emphasizing in connection with the complexity of biology is that it is not currently possible to measure the concentration of a drug at its site of action for intracellular targets in live humans (here's an article on intracellular and intraorgan drug concentration that I recommend to everybody working in drug discovery and chemical biology). While I won't actually be saying anything about AI (here's a recent post from In The Pipeline that takes a look at how things are going for early movers in the field of AI drug discovery) in the current post I'll reiterate the point with which I concluded the earlier post:

One error commonly made by people with an AI/ML focus is to consider drug design purely as an exercise in prediction while, in reality, drug design should be seen more in a Design of Experiments framework.

In that earlier post I noted that there’s a bit more to drug design than simply generating novel molecular structures and suggesting how the compounds should be synthesized. While I'm certainly not denying the challenges presented by the complexity of biology the current post will focus on some of the challenges associated with assessing chemical structures churned out by generative AI. One way of doing this is to build models for predicting biological activity and other pharmaceutically relevant properties such as aqueous solubility, permeability and metabolic stability. This is something that people have been trying to do for many years and the term ‘Quantitative Structure-Activity Relationship’ (QSAR) has been in use for over half a century (the inaugural EuroQSAR conference was held in Prague in 1973 a mere five years after Czechoslovakia had been invaded by the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). My view is that many of the ML models that get built with drug design in mind could accurately be described as QSAR models and I would not describe QSAR models as AI.

In the current post, I'll be discussing ML models for predicting quantities such as potency, aqueous solubility and permeability that are continuous variables which I refer to as 'regression-based ML models' (while some readers will not be happy with this label I do need to make it absolutely clear that the post is about one type of ML model and the label 'QSAR-like' could also have been used). I’ll leave classification models for another post although it’s worth mentioning that genuinely categorical data are actually rare in drug discovery (you should always be wary of gratuitous categorization of continuous data since this is a popular way to disguise the weakness of trends and KM2013 will give you some tips on what to look out for). It also needs to be stressed that ML is a very broad label and that utility in one area (prediction of protein-folding for example) doesn't mean that ML models will necessarily prove useful in other areas.

To build a regression-based ML model you first need to assemble a training set of compounds for which the appropriate measurements have been made and pIC₅₀ values are commonly used to quantify biological activity (I recommend reading the LR2024 study on combining results from different assays although, as discussed in this post, I don’t consider it meaningful to combine data from multiple pairs of assays when calculating correlation-based metrics for assay compatibility). Next, you calculate values of descriptors for the chemical structures of the compounds in your training set (descriptors are typically derived from the connectivity in the chemical structure although atom counts and predicted values of physicochemical properties are also used). Finally, you use the ML modelling tools to find a function of the descriptors that best predicts the biological activity (or a pharmaceutically-relevant property) for the compounds in the training set. Generally you should also validate your models and this is especially important for models with large numbers of adjustable parameters.
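The steps above can be sketched with a deliberately trivial model. The example below fits a straight line to a single hypothetical descriptor and validates it by leave-one-out cross-validation. The descriptor values and pIC₅₀ values are invented for illustration and a real model would use many descriptors and a more sophisticated learning algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_rmse(xs, ys):
    """Refit with each compound held out in turn and predict the held-out value."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        squared_errors.append((ys[i] - (slope * xs[i] + intercept)) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Step 1: training set (hypothetical measured pIC50 values).
pic50 = [5.1, 5.4, 6.0, 6.3, 6.9, 7.2, 7.8]
# Step 2: one calculated descriptor per compound (hypothetical values).
descriptor = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
# Step 3: find the function of the descriptor that best fits the training set.
slope, intercept = fit_line(descriptor, pic50)
# Step 4: validate the model.
loo_rmse = leave_one_out_rmse(descriptor, pic50)

print(f"pIC50 = {slope:.2f} * descriptor + {intercept:.2f}")
print(f"leave-one-out RMSE = {loo_rmse:.2f}")
```

Leave-one-out validation is used here only because it is easy to write down. For models with large numbers of adjustable parameters you would want a more demanding validation scheme.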

There appears to be a general consensus that you need plenty of data for building ML models and some will even say “quantity has a quality all of its own” (this is sometimes stated as Stalin’s view of the T-34 tank although I consider this unlikely and the T-34 was actually an excellent tank which also happened to get produced in large numbers). Most people building regression-based ML models are also aware that you need a sufficiently wide spread in the measured data used for training the model (the variance in the measured data should be large in comparison with the precision of the measurement). Lead optimization is typically done within structural series and building a regression-based ML model that is predictively useful is likely to require data that have been measured for compounds in the structural series of interest. These data requirements are quite stringent and I see this as one reason that QSAR approaches do not appear to have had much impact on the discovery of drugs despite the drug discovery literature being awash with QSAR articles. Back in 2009 (see K2009) I compared prediction-driven drug design with hypothesis-driven drug design, noting that the former is often not viable and that the latter is more commonly used in pharmaceutical and agrochemical discovery (former colleagues discussed hypothesis-driven molecular design in the context of the design-make-test-analyse cycle in the P2012 article).

With freshly painted T-34 at Brest Fortress, Belarus (June 2017)

There are some other points that you need to pay attention to when building regression-based ML models. First, replicate measurements for the response variable (the quantity that you’re trying to predict) should be normally distributed and this is one reason why we model pIC₅₀ rather than IC₅₀. Second, the data values for the training set should be uniformly distributed in the descriptor space (my view, expressed in B2009, is that many 'global' predictive models are actually ensembles of local models). Third, the descriptors should not be strongly correlated or the method used to build the regression-based ML model must be able to account for relationships between descriptors (while it’s relatively straightforward to handle linear relationships between descriptors in simple regression analysis it’s not clear how effectively this can be achieved with more sophisticated algorithms used for building regression-based ML models).

I’ve created a graphic (Figure 1) to illustrate some of the modelling difficulties that result from uneven coverage in the descriptor space and it goes without saying that reality will be way more complex. The entities that occupy this chemical space are compounds and the coordinates of a point show the values of the descriptors X₁ and X₂ that have been calculated from the corresponding chemical structures (the terms ‘2D structure’ and ‘molecular graph’ are also used). I’ve depicted real compounds for which measured data are available as black circles and virtual compounds (for which predictions are to be made) as five-pointed stars. The clusters (color-coded but also labelled A, B and C in case any readers are colour blind) are much more clearly defined than would be the case in a real chemical space. Proximity in chemical space implies similarity between compounds and the clusters might correspond to three different structural series.

Let’s suppose that we’ve been able to build a useful local model to predict pIC₅₀ for each cluster even though we’ve not been able to build a predictively useful global model. Under this scenario you’d have a relatively high degree of confidence in the pIC₅₀ values predicted for the virtual compounds (depicted as five-pointed stars) that lie within the clusters and a much lower degree of confidence in the virtual compound that is indicated by the arrow. If, however, we were to ignore the structure of the data and take a purely global view then we would conclude that the virtual compound indicated by the arrow occupied a central location in this region of chemical space and that the other three virtual compounds occupied peripheral locations. Put another way, the applicability domain of the model is not a single contiguous region of chemical space and what would appear to be an interpolation by a model is actually an extrapolation.
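The difference between the global and local views can be shown numerically. In the sketch below the coordinates are invented to mimic three well-separated clusters (they are not taken from Figure 1) and a virtual compound is scored in two ways: by its distance from the centroid of all the training compounds and by its distance from the nearest training compound.

```python
import math

# Hypothetical (X1, X2) coordinates for training compounds in three clusters.
clusters = {
    "A": [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)],
    "B": [(5.0, 1.0), (5.1, 1.2), (4.9, 0.8)],
    "C": [(3.0, 5.0), (3.1, 5.2), (2.9, 4.8)],
}
training = [point for points in clusters.values() for point in points]

def centroid(points):
    """Mean position of a set of points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def nearest_distance(query, points):
    """Distance from the query to the closest training compound."""
    return min(math.dist(query, p) for p in points)

global_centre = centroid(training)

# One virtual compound inside cluster A and one sitting between the clusters.
queries = {"inside cluster A": (1.05, 1.0), "between clusters": global_centre}

for label, query in queries.items():
    to_centre = math.dist(query, global_centre)
    to_neighbour = nearest_distance(query, training)
    print(f"{label}: distance to global centroid {to_centre:.2f}, "
          f"distance to nearest training compound {to_neighbour:.2f}")
```

The compound between the clusters looks perfectly central by the first measure and is further from any measured compound than the compound inside cluster A by a wide margin, which is why a nearest-neighbour check is a more sensible (if crude) guide to whether a prediction is an interpolation.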

It is important to take account of correlations between descriptors when building prediction models. A commonly employed tactic is to perform principal component analysis (PCA) which generates a new set of orthogonal descriptors and also provides an assessment of the dimensionality of the descriptor space. There are also ways to deal with correlations between descriptors in the model building process (PLS is the best known of these and the K1999 review might also be of interest). Correlations between descriptors also complicate interpretation of ML models and my stock response to any claim that an ML model is interpretable would be to ask how relationships between descriptors had been accounted for in the modelling of the data. An excellent illustrative example (see L2012) of a correlation between descriptors is the tendency of the presence of a basic nitrogen in a chemical structure to be associated with higher values of the Fsp³ descriptor (which, as pointed out in this post, should really be referred to as the I_ALI descriptor).

Let’s take another look at Figure 1. The axes of the ellipse representing Cluster A are aligned with the axes of the figure which tells us that X₁ and X₂ are uncorrelated for the compounds in this cluster. Cluster B is also represented by an ellipse although its axes are not aligned with the axes of the figure which implies a linear correlation between X₁ and X₂ for the compounds in this cluster (you can use PCA to create two new orthogonal descriptors by rotating the plot around an axis that is perpendicular to the X₁-X₂ plane). Cluster C is a bigger problem because the correlation between X₁ and X₂ is non-linear (the cluster is not represented as an ellipse) and it would be rather more difficult to generate two new orthogonal descriptors for the compounds in this cluster. My view is that PCA is less meaningful when there is a lot of clustering in data sets and I would also question the value of PLS and related methods in these situations.
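For the Cluster B case, the rotation that PCA performs in two dimensions can be written out directly. This is a minimal sketch using invented descriptor values (not the data behind Figure 1); the rotation angle is obtained from the 2 × 2 covariance matrix and the rotated coordinates are then checked for zero covariance.

```python
import math

# Hypothetical (X1, X2) values for a cluster in which the descriptors are
# linearly correlated (illustrative numbers, not the data behind Figure 1).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.4, 2.3, 2.4, 3.3, 3.4]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

# Angle of the first principal axis for a 2 x 2 covariance matrix.
theta = 0.5 * math.atan2(2.0 * covariance(x1, x2),
                         covariance(x1, x1) - covariance(x2, x2))

# Project the mean-centred data onto the rotated (orthogonal) axes.
m1, m2 = mean(x1), mean(x2)
pc1 = [(u - m1) * math.cos(theta) + (v - m2) * math.sin(theta) for u, v in zip(x1, x2)]
pc2 = [-(u - m1) * math.sin(theta) + (v - m2) * math.cos(theta) for u, v in zip(x1, x2)]

print("cov(X1, X2)  :", round(covariance(x1, x2), 3))
print("cov(PC1, PC2):", round(abs(covariance(pc1, pc2)), 3))
```

A single rotation of this kind removes a linear correlation but it cannot straighten out a curved cluster such as Cluster C.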

Let’s consider another scenario by supposing that we’ve been unable to build a useful local model for prediction of any of the three clusters in Figure 1. If, however, the average pIC₅₀ values differ for each of the three clusters we can still extract some predictivity from the data by finding a function of X₁ and X₂ that correlates with the average pIC₅₀ values for the clusters. This is one way that clustering of compounds in the descriptor space can trick you into thinking that a global model has a broader applicability domain than is actually the case. Under this scenario it would be very unwise to try to interpret the model or use it to make predictions for compounds that sit outside the clusters.
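This scenario is easy to reproduce with toy numbers. The sketch below assumes a single descriptor and three invented clusters whose mean pIC₅₀ values differ; within each cluster the descriptor and pIC₅₀ are constructed to be uncorrelated. The cluster centres and offsets are hypothetical and serve only to show how the global correlation arises entirely from differences between cluster means.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Offsets chosen so that descriptor and pIC50 are uncorrelated within a cluster.
x_offsets = [-0.2, -0.2, 0.2, 0.2]
y_offsets = [-0.3, 0.3, -0.3, 0.3]

# Hypothetical cluster centres: (mean X1, mean pIC50).
centres = {"A": (1.0, 5.0), "B": (5.0, 6.5), "C": (9.0, 8.0)}

clusters = {
    label: ([cx + dx for dx in x_offsets], [cy + dy for dy in y_offsets])
    for label, (cx, cy) in centres.items()
}

within_r = {label: pearson_r(x, y) for label, (x, y) in clusters.items()}
all_x = [v for x, _ in clusters.values() for v in x]
all_y = [v for _, y in clusters.values() for v in y]
global_r = pearson_r(all_x, all_y)

print("within-cluster r:", {k: round(abs(v), 3) for k, v in within_r.items()})
print("global r:", round(global_r, 3))
```

The global correlation is strong even though the descriptor tells you nothing about which compound in a given cluster is the more potent.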

This is a good point at which to wrap up my post on regression-based ML (or QSAR-like if you prefer) models for predicting biological activity and other properties relevant to drug design such as aqueous solubility, permeability and metabolic stability. There appears to be a general consensus that building these models requires a lot of data and, in my view, this means that models like these are actually of limited utility in real world drug design. The basic difficulty is that a project team with enough data for building useful regression-based ML models is likely to be at a relatively advanced stage (the medicinal chemists will already understand the structure-activity relationships and be aware of project-specific issues such as poor aqueous solubility or high turnover by metabolic enzymes). Drug discovery scientists tend to be less aware of the problems that arise from clustering of compounds in descriptor space and, in my view, this is a factor that should be considered by those seeking to assemble data sets for benchmarking (see W2024). I'll leave you with a suggestion (it was considered a terrible idea at the time and probably still is by most ML thought leaders) I made over twenty years ago that each predicted value should be accompanied by chemical structures and measured values for the three closest neighbours in the descriptor space of the model.
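The nearest-neighbour suggestion could be sketched along the following lines. Everything here is hypothetical: the compound identifiers, descriptor values and measured pIC₅₀ values are invented, the predicted value is a placeholder standing in for the output of whatever model is being used, and in practice each neighbour would also carry its chemical structure.

```python
import math

# Hypothetical training set: (compound identifier, descriptor vector, measured pIC50).
# Identifiers, descriptors and values are invented for illustration.
training_set = [
    ("cpd-01", (1.0, 1.0), 6.1),
    ("cpd-02", (1.2, 0.9), 6.4),
    ("cpd-03", (0.9, 1.1), 5.9),
    ("cpd-04", (9.0, 1.0), 7.8),
    ("cpd-05", (5.0, 8.0), 4.9),
]

def closest_neighbours(query, training, k=3):
    """Return the k training compounds closest to the query in descriptor space."""
    ranked = sorted(training, key=lambda row: math.dist(query, row[1]))
    return [(name, measured, round(math.dist(query, desc), 2))
            for name, desc, measured in ranked[:k]]

def report(query, predicted, training):
    """Bundle a predicted value with the measured data that support it."""
    return {"predicted_pIC50": predicted,
            "neighbours": closest_neighbours(query, training)}

# 6.2 is a placeholder for a model prediction, not a computed value.
result = report((1.1, 1.0), 6.2, training_set)
print(result)
```

Reporting distances alongside the measured values lets the reader judge whether the neighbours are close enough for the prediction to be trusted.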

Variability in biological activity measurements reported in the drug discovery literature

2024-09-18T22:24:00.013+01:00

I'll open the post with a panorama from the summit of Shutlingsloe, sometimes referred to as Cheshire's Matterhorn, which, at 506 m above sea level, is the third highest point in the county. When in the UK, I usually come here to mark the solstices and there's usually a good crowd here for the occasion (the winter solstice tends to be less well attended).

The LR2024 study (Combining IC₅₀ or K_i Values from Different Sources Is a Source of Significant Noise) that I’ll be discussing in this post highlights one of the issues that you’re likely to encounter should you be using public domain databases such as ChEMBL to create datasets for building machine learning (ML) models for biological activity. The LR2024 study has already been reviewed in a Practical Fragments post (The limits of published data) and, using the same reference numbers as were used in the study, I’ll also mention 10 (The Experimental Uncertainty of Heterogeneous Public K_i Data) and 11 (Comparability of Mixed IC₅₀ Data – A Statistical Analysis). The variability in biological activity data highlighted by LR2024 stems in part from the fact that the term IC₅₀ may refer to different quantities even when measurements are performed for the same target and inhibitor/ligand (the issue doesn’t entirely disappear when you use K_i values). I have two general concerns with the analysis in the LR2024 study. First, it is unclear whether the ChEMBL curation process captures assay conditions in sufficient detail to enable the user to establish that two IC₅₀ values can be regarded as replicates of the same experiment (I stress that this is not a criticism of the curation process). Second, combining data for different pairs of assays for calculation of correlation-based measures of assay compatibility can lead to correlation inflation. One minor gripe that I do have with the LR2024 study concerns the use of the term “noise” which, in my view, should only refer to variation in values measured under identical conditions.

I'll review LR2024 in the first part of the post before discussing points not covered by the study such as irreversible inhibition and assay interference (these can cause systematic differences in IC₅₀ values to be observed for a particular combination of target and inhibitor even when the assays use the same substrate at the same concentration). There will be a follow up post covering how I would assemble data sets for building ML models for biological activity with some thoughts on assessment and curation of published biological activity data. As is usual for blog posts here at Molecular Design, quoted text is indented with my comments enclosed in square brackets in red italics.

In the Compatibility Issues section the authors state:

Looking beyond laboratory-to-laboratory variability of assays that are nominally the same, there are numerous reasons why literature results for different assays measured against the same “target” may not be comparable. These include the following:

Different assay conditions: these can include different buffers, experimental pH, temperature, and duration. [Biochemical assays are usually run at human body temperature (37°C) although assay temperature is not always reported. The term 'duration' is pertinent to irreversible inhibition and one has to be very careful when comparing IC₅₀ values for irreversible inhibitors. It's worth mentioning that a significant reduction in activity when an assay is run in the presence of detergent (see FS2006) is diagnostic of inhibition by colloidal aggregates (see McG2003). I categorized inhibition of this nature as “type 2 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial.]
Substrate identity and concentration: these are particularly relevant for IC₅₀ values from competition assays, where the identity and concentration of the substrate being competed with play an important role in determining the results. K_i measures the binding affinity of a ligand to an enzyme and so its values are, in principle, not sensitive to the identity or concentration of the substrate. [My view is that one would generally need to establish that IC₅₀ values had been determined using the same substrate and same substrate concentration if interpreting variation in the IC₅₀ values as "noise" and it's not clear that the substrate-related information needed to establish the comparability of IC₅₀ determinations is currently stored in ChEMBL. If concentrations and K_m values are known it may be practical to use the Cheng-Prusoff equation (see CP1973) to combine IC₅₀ values that have been measured using different concentrations of substrate (or cofactor). It's worth noting that enzyme inhibition studies are commonly run with the substrate concentration at its K_m value (see Assay Guidance Manual: Basics of Enzymatic Assays for HTS NBK92007) and there is a good chance that assays against a target using a particular substrate will have been run using very similar concentrations of the substrate. It is important to be especially careful when analysing kinase IC₅₀ data because assays are sometimes run at high ATP concentration in order to simulate intracellular conditions (see GG2021).]
Different assay technologies: since typical biochemical assays do not directly measure ligand–protein binding, the idiosyncrasies of different assay technologies can lead to different results for the same ligand–protein pair. (7) [Significant differences in IC₅₀ (or K_i) values measured for a particular combination of target and compound using different assay read-outs are indicative of interference and I’ll discuss this point in more detail later in the post.]
Mode of action for receptors: EC₅₀ values can correspond to agonism, antagonism, inverse agonism, etc. [The difficulty here stems from not being able to fully characterize the activity in terms of a concentration response (for example, agonists are characterised by both affinity and efficacy).]

The situation is further complicated when working with databases like ChEMBL, which curate literature data sets:

Different targets: different variants of the same parent protein are assigned the same target ID in ChEMBL [My view is that one needs to be absolutely certain that assays have been performed using identical (including with respect to post-translational modifications) targets before interpreting differences in IC₅₀ or K_i values as noise or experimental error.]
Different assay organism or cell types: the target protein may be recombinantly expressed in different cell types (the target ID in ChEMBL is assigned based on the original source of the target), or the assays may be run using different cell types. [There does appear to be some confusion here and it would not generally be valid to assign a ChEMBL target ID to a cell-based assay.]
Any data source can contain human errors like transcription errors or reporting incorrect units. These may be present in the original publication─when the authors report the wrong units or include results from other publications with the wrong units─or introduced during the data extraction process.
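The Cheng-Prusoff conversion mentioned in the comment on substrate concentration can be illustrated with a short calculation. This sketch assumes simple competitive inhibition, for which K_i = IC₅₀ / (1 + [S]/K_m); the K_i and K_m values are hypothetical and the function names are my own.

```python
import math

def ki_from_ic50(ic50, substrate_conc, km):
    """Cheng-Prusoff conversion, assuming competitive inhibition.

    substrate_conc and km must share the same concentration units.
    """
    return ic50 / (1.0 + substrate_conc / km)

def ic50_from_ki(ki, substrate_conc, km):
    """Inverse of the conversion above (same assumption of competitive inhibition)."""
    return ki * (1.0 + substrate_conc / km)

def p_value(conc_molar):
    """Negative base-10 logarithm of a molar concentration (e.g. pIC50)."""
    return -math.log10(conc_molar)

ki = 50e-9  # hypothetical K_i of 50 nM
km = 10e-6  # hypothetical K_m of 10 uM

ic50_at_km = ic50_from_ki(ki, substrate_conc=km, km=km)         # [S] = K_m
ic50_at_10km = ic50_from_ki(ki, substrate_conc=10 * km, km=km)  # [S] = 10 x K_m

print("pIC50 at [S] = K_m     :", round(p_value(ic50_at_km), 2))
print("pIC50 at [S] = 10 x K_m:", round(p_value(ic50_at_10km), 2))
```

Under these assumptions the same inhibitor gives IC₅₀ values of 100 nM and 550 nM in the two assays, a difference of about 0.74 log units that has nothing to do with noise.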

The authors describe a number of metrics for quantifying compatibility of pairs of assays in the Methods section of LR2024. My view is that compatibility between assays should be quantified in terms of differences between pIC₅₀ (or pK_i) values and I consider correlation-based metrics to be less useful for this purpose. The degree to which pIC₅₀ values for two assays run against a target are correlated reflects the (random) noise in each assay and the range (more accurately the variance) in the pIC₅₀ values measured for all the compounds in each assay. Let’s consider a couple of scenarios. First, results from two assays are highly correlated but significantly offset from each other to a consistent extent (the assays might, for example, measure IC₅₀ for a particular target using different substrates). Under this scenario it would be valid to include results from both assays in a single analysis (for example, by using the observed offset between pIC₅₀ values as a correction factor) even though it would not be valid to treat the pIC₅₀ values for compounds in the two assays as equivalent. In the second scenario, the correlation between the assays is limited by the narrowness of the range in the IC₅₀ values measured for the compounds in the two assays. Under this scenario, differences between the pIC₅₀ values measured for each compound can still be used to assess the compatibility of the two assays even though the range in the IC₅₀ values is too narrow for a correlation-based metric to be useful.
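The two scenarios can be illustrated with toy data. In the sketch below the pIC₅₀ values are invented: the first pair of assays covers a wide potency range with a consistent offset of about 0.8 log units, while the second pair covers a narrow range with no systematic offset. The summary function is a hypothetical helper that reports the correlation alongside the mean and standard deviation of the differences.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def compare(assay1, assay2):
    """Summarise paired pIC50 values by correlation and by differences."""
    diffs = [a - b for a, b in zip(assay1, assay2)]
    return {
        "r": pearson_r(assay1, assay2),
        "mean_diff": statistics.mean(diffs),
        "sd_diff": statistics.stdev(diffs),
    }

# Scenario 1 (hypothetical values): wide potency range, consistent offset.
wide_1 = [5.0, 6.0, 7.0, 8.0, 9.0]
wide_2 = [4.25, 5.15, 6.2, 7.25, 8.15]

# Scenario 2 (hypothetical values): narrow potency range, no systematic offset.
narrow_1 = [7.0, 7.1, 7.2, 7.1, 7.0, 7.2]
narrow_2 = [7.1, 7.0, 7.1, 7.2, 7.1, 7.1]

offset_case = compare(wide_1, wide_2)
narrow_case = compare(narrow_1, narrow_2)

for label, result in [("offset", offset_case), ("narrow", narrow_case)]:
    print(label, {k: round(abs(v), 3) for k, v in result.items()})
```

The correlation flags the first pair as compatible and misses the offset, while it flags the second pair as incompatible even though no compound differs by much more than 0.1 log units between the assays.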

The compatibility between the two assays was measured by comparing pchembl values of overlapping compounds. [The term pchembl does need to be defined.] In addition to plotting the values, a number of metrics were used to quantify the degree of compatibility between assay pairs:

R²: the coefficient of determination provides a direct measure of how well the “duplicate” values in the two assays agree with each other. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of compatibility of assays in the preceding paragraph.]
Kendall τ: nonparametric measure of how equivalent the rankings of the measurements in the two assays are. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of compatibility of assays in the preceding paragraph.]
f > 0.3: fraction of the pairs where the difference is above the estimated experimental error. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC₅₀ values is greater than the uncertainty in either pIC₅₀ value (an uncertainty of 0.3 in ΔpIC₅₀ would correspond to an uncertainty of 0.2 in each of the pIC₅₀ values from which the difference had been calculated).]
f > 1.0: fraction of the pairs where the difference is more than one log unit. This is an arbitrary limit for a truly meaningful activity difference. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC₅₀ values is greater than the uncertainty in either pIC₅₀ value (an uncertainty of 1.0 in ΔpIC₅₀ would correspond to an uncertainty of 0.7 in each of the pIC₅₀ values from which the difference had been calculated).]
κbin: Cohen’s κ calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of compatibility of assays in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because the operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
MCCbin: Matthew’s correlation coefficient calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of compatibility of assays in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because this operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
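The arithmetic behind the comments on the f > 0.3 and f > 1.0 metrics can be made explicit. This is a sketch that assumes the errors in the two pIC₅₀ measurements are independent and share the same standard deviation σ (an assumption made for illustration rather than something established for any particular pair of assays):

```latex
\sigma_{\Delta}^{2} = \sigma_{1}^{2} + \sigma_{2}^{2} = 2\sigma^{2}
\quad\Longrightarrow\quad
\sigma = \frac{\sigma_{\Delta}}{\sqrt{2}}

\sigma_{\Delta} = 0.3 \;\Rightarrow\; \sigma \approx 0.21
\qquad
\sigma_{\Delta} = 1.0 \;\Rightarrow\; \sigma \approx 0.71
```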

Let’s take a look at some of the results reported in the LR2024 study and it’s interesting that f > 0.3 and f > 1.0 values were comparable for IC₅₀ and K_i measurements. This is an important result since K_i values do not depend on the concentration and K_m of the substrate (or cofactor) and I would generally anticipate greater variation in IC₅₀ values measured for each compound-target pair than for the corresponding K_i values.

We first looked at the variation in the data sets when IC₅₀ assays are combined using “only activity” curation (top panels in Figure 2). The noise level in this case is very high: 64% of the Δpchembl values are greater than 0.3, and 27% are greater than 1.0. The analogous plot for the K_i data sets is shown in Figure S1 in the Supporting Information. The noise level for K_i is comparable: 67% of the Δpchembl values are greater than 0.3, and 30% are greater than 1.0.

I consider it valid to combine data for different pairs of assays for analysis of ΔpIC₅₀ or ΔpK_i values. However, I have significant concerns about the validity of combining data for different pairs of assays for analysis of correlations between pIC₅₀ or pK_i values. The authors of LR2024 state:

In Figure 2 and all similar plots in this study, the points are plotted such that the assay on the x-axis has a higher assay_id (this is the assay key in the SQL database, not the assay ChEMBL ID that is more familiar to users of the ChEMBL web interface) in ChEMBL32 than the assay on the y-axis. Given that assay_ids are assigned sequentially in the ChEMBL database, this means that the x-value of each point is most likely from a more recent publication than the y-value. We do not believe that this fact introduces any significant bias into our analysis.

I see two problems (one minor and one major) in preparing data in this manner for plotting and analysis of correlations over a number of assay pairs. The minor problem is that exchanging assay1 with assay2 for some of the assay pairs will generally result in different values for the correlation-based metrics for compatibility of assays. While I don’t anticipate that the differences would be large, the value of a correlation-based metric for assay compatibility really shouldn’t depend on the ordering of the assays. Furthermore, the issue can be resolved by symmetrizing the dataset so that each of the pair of assay results for each compound is included both as the x-value and as the y-value. Symmetrizing the dataset in this manner doubles the number of data points and one would need to be careful if estimating confidence intervals for the correlation-based metrics for assay compatibility. I think that it would be appropriate to apply a weight of 0.5 to each data point for estimation of confidence intervals although I would certainly be consulting a statistician before doing this.
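The ordering dependence, and the way symmetrizing removes it, can be shown with a toy example. The pIC₅₀ values for the two assay pairs below are invented and the dataset is deliberately tiny, so the size of the effect is exaggerated relative to what one would expect for real data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pIC50 values for two assay pairs (first assay, second assay).
pair_1 = ([5.0, 6.0, 7.0], [5.5, 6.2, 7.4])
pair_2 = ([6.0, 7.0, 8.0], [5.0, 6.5, 7.1])

def pooled(pairs):
    """Pool assay pairs, using the first assay of each pair as x."""
    x = [v for first, _ in pairs for v in first]
    y = [v for _, second in pairs for v in second]
    return x, y

def symmetrized(pairs):
    """Include every measurement pair twice: once as (x, y) and once as (y, x)."""
    x, y = pooled(pairs)
    return x + y, y + x

swapped_pair_2 = (pair_2[1], pair_2[0])  # exchange the roles of the two assays

r_original = pearson_r(*pooled([pair_1, pair_2]))
r_swapped = pearson_r(*pooled([pair_1, swapped_pair_2]))
r_sym_original = pearson_r(*symmetrized([pair_1, pair_2]))
r_sym_swapped = pearson_r(*symmetrized([pair_1, swapped_pair_2]))

print("pooled r      :", round(r_original, 3), "vs", round(r_swapped, 3))
print("symmetrized r :", round(r_sym_original, 3), "vs", round(r_sym_swapped, 3))
```

Applying a constant weight of 0.5 to every point would leave the correlation coefficient itself unchanged; the weight only matters for the effective number of observations when confidence intervals are estimated.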

However, there is also another problem (which I don't consider to be minor) with combining data for assay pairs when analysing correlations. The value of a correlation-based metric for assay compatibility reflects the variance in ΔpIC₅₀ (or ΔpK_i) values and the variance in the pIC₅₀ (or pK_i) values. The variance in pIC₅₀ (or pK_i) values when different pairs of assays have been combined would generally be expected to be greater than for the datasets corresponding to the individual assay pairs. Under this scenario I believe that it would be accurate to describe the correlation metrics calculated for the aggregated data as inflated (see KM2013 and the comments made therein on the HMO2016, LS2007 and LBH2009 studies) and as a reviewer of the manuscript I would have suggested that the distribution over all assay pairs be shown for each correlation-based assay compatibility metric. When considering correlations between assays it can also be helpful, although not strictly correct, to think in terms of ranges in pIC₅₀ values. For example, the range in pIC₅₀ values for “only activity curation” in Figure 2 appears to be about 7 log units (I’d be extremely surprised if the range in pIC₅₀ values for any of the individual assays even approached this figure). My view is that correlation-based metrics are not meaningful when data for multiple pairs of assays have been combined although I don't think any real harm has been done given that the authors certainly weren't trying to 'talk up' strengths of trends on the basis of the values of the correlation-based metrics. However, there is a scenario under which this type of correlation inflation would be a much bigger problem and that would be when using measures of correlation to compare measured ΔG values with values that had been calculated by free energy perturbation using different reference compounds.
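The dependence of the correlation on both variances can be written down for a simplified case. The sketch below assumes that the pIC₅₀ values x and y from the two assays have the same variance s² (an assumption made to keep the algebra short) and uses r for the Pearson correlation coefficient; the numerical values are illustrative only.

```latex
\operatorname{Var}(\Delta) = \operatorname{Var}(x) + \operatorname{Var}(y) - 2\operatorname{Cov}(x, y)
= 2s^{2}(1 - r)
\quad\Longrightarrow\quad
r = 1 - \frac{\operatorname{Var}(\Delta)}{2s^{2}}

\sigma_{\Delta} = 0.5,\; s = 0.5 \;\Rightarrow\; r = 1 - \frac{0.25}{0.5} = 0.50
\qquad
\sigma_{\Delta} = 0.5,\; s = 1.5 \;\Rightarrow\; r = 1 - \frac{0.25}{4.5} \approx 0.94
```

With the agreement between assays held fixed, tripling the spread in the pIC₅₀ values (as pooling of assay pairs with different potency ranges might do) raises r from 0.50 to about 0.94.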

So far in the post the focus has been on the analysis presented in LR2024 and now I’ll change direction by discussing a couple of topics that were not covered in that study. I’ll start by looking at irreversible mechanisms of action and the (S2017 | McW2021 | H12024) articles cover irreversible covalent inhibition (this is the irreversible mechanism of action that ChEMBL users are most likely to encounter). You need two parameters to characterize irreversible covalent inhibition (K_i and k_inact respectively quantify the affinity of the ligand for target and the rate at which the non-covalently bound ligand becomes covalently bound to target). While it is common to encounter IC₅₀ values in the literature for irreversible covalent inhibitors these are not true concentration responses because the IC₅₀ values also depend on factors such as pre-incubation time. Another difficulty is that articles reporting IC₅₀ values for irreversible covalent inhibitors don’t always explicitly state that the inhibition is irreversible.

As the authors of LR2024 correctly note, differences between IC₅₀ values may be the result of using different assay technologies. Interference with assay read-out (I categorized this as “type 1 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial) should always be considered as a potential explanation for significant differences between IC₅₀ values measured for a given combination of target and inhibitor when different assay technologies are used. An article that I recommend for learning more about this problem is SWK2009 which explains how UV/Vis absorption and fluorescence by inhibitors can cause interference with assay read-outs (the study also shows how interference can be assessed and even corrected for). When examining differences between IC₅₀ values for the same combination of target and inhibitor it's worth bearing in mind that interference with assay read-outs tends to be more of an issue at high concentration (this is why biophysical assays tend to be favored for screening fragments). From the data analysis perspective, it’s usually safe to assume that enzyme inhibition assays using the same substrate also use the same type of assay read-out.

Differences in the technology used to prepare the solutions for assays are another potential cause of variation in IC₅₀ values. For example, a 2010 AstraZeneca patent (US7718653B2) disclosed significant differences in IC₅₀ values depending on whether acoustic dispensing or serial dilution was used for preparation of solutions for assay. Compounds were observed to be more potent when acoustic dispensing was used and the differences in IC₅₀ values point to an aqueous solubility issue. The data in US7718653B2 formed the basis for the EOW2013 study.

So that brings us to the end of my review of the LR2024 study and I’ll be doing a follow up post later in the year. One big difficulty in analysing differences between measured quantities is determining the extent to which measured values are directly comparable when IC₅₀ can be influenced by factors such as the technology used to prepare assay solutions. Something that I think would have been worth investigating is the extent to which variability of measured values depends on potency (pIC₅₀ measurements might be inherently more variable for less potent inhibitors than for highly potent inhibitors). The most serious criticism that I would make of LR2024 is that it is not meaningful to combine data for different pairs of assays when calculating correlation-based measures of assay compatibility.

A Nobel for property-based drug design?

2024-07-30T23:04:00.041+01:00

[This post was updated on 10-Aug-2025 to mention my review of CNM2025 (Return to Flatland) which critically examines (35) (Escape from Flatland: Increasing Saturation as an Approach to Improving Clinical Success)]

[This post was updated on 04-Aug-2024. I thank Tim Ritchie (see RM2009 | RM2014) for bringing YG2003 (Prediction of Aqueous Solubility of Organic Compounds by Topological Descriptors) to my attention.]

"The problems of ADME are precisely those that determine success or failure of a drug in vivo. In vitro data can give a clearer picture of the receptor characteristics, but knowledge and control of ADME are also vital. A common trap in binding studies is that binding generally increases with lipophilicity, so that one may obtain extremely potent binding that is totally unattainable in vivo."

SH Unger (1987) Computer-Aided Drug Design in the Year 2000.

Drug Information Journal 21:267-275 DOI

******************************************

In this post I’ll be reviewing an Editorial (Property-Based Drug Design Merits a Nobel Prize) that was recently published in J Med Chem. For me, the Editorial raises questions about the critical thinking skills of its authors and about the judgement of the J Med Chem Editors (I’m guessing that some of the courteous and cultured members of the Nobel Prize committee might regard it as somewhat pushy, and possibly even uncouth, for journals to be publishing nominations for Nobel Prizes as editorials). My advice to anybody nominating individuals for a Nobel Prize is to be aware of an observation, usually attributed to Jocelyn Bell Burnell, that it’s better that people ask why you didn’t win a Nobel Prize than why you did. Where applicable, I've used the same reference numbers that were used in the Editorial and I’ll start by reproducing the Nobel Prize proposal (as is usual in posts at Molecular Design, I’ve inserted some comments, italicized in red and enclosed in square brackets, into the quoted text):

We propose that a Nobel Prize in Physiology or Medicine should be awarded for property-based drug design, with Christopher A. Lipinski, Paul D. Leeson, and Frank Lovering as the proposed recipients for their development of “important principles for drug design” [I would describe what the proposed Nobel laureates have introduced as a rule, a metric and a molecular descriptor rather than principles.], principles that have contributed to the development of numerous approved drugs. [The authors do need to provide convincing evidence to support what appear to be some wildly extravagant claims. Specifically, the authors need to demonstrate that the rule, metric and molecular descriptor (which they describe as “principles”) were actually critical to the decision-making in projects that led to the development of numerous drugs.] While drug design previously focused primarily on optimizing potency, they introduced a more holistic approach based on the consideration of how fundamental molecular and physicochemical properties affect pharmaceutical, pharmacodynamic, pharmacokinetic, and safety properties. [My view is that none of the proposed Nobel laureates even demonstrated a single convincing link between molecular and physicochemical properties, and pharmaceutical, pharmacodynamic, pharmacokinetic, and safety properties.] The development of the Rof5 by Christopher A. Lipinski in 1997 introduced a new principle for how molecular and physicochemical properties affect oral bioavailability. The development of LipE by Paul D. Leeson in 2007 introduced a new principle for how physicochemical properties impact potency, selectivity, and safety. Finally, the development of Fsp3 by Frank Lovering in 2009 introduced a new principle for how molecular shape affects pharmaceutical properties and developability.

Before examining the contributions of the three nominated individuals it's worth saying something about the objectives of drug design. First, a drug needs to be highly active against its target(s). Second, activity against anti-targets should be very low (ideally too low to even be measured). Third, as I note in 34, the exposure (concentration at the site of action) of the drug needs to be controllable (one challenge in drug design is that intracellular drug concentration can’t generally be measured in vivo and I recommend that all drug discovery scientists read SR2019). I see controlling exposure as the primary focus of property-based design and one fundamental challenge is that structural modifications that lead to increased engagement potential for the therapeutic target(s) frequently result in reduced controllability of exposure as well as increased engagement potential for anti-targets. I’ve tried to capture these points in the graphic shown below.

It's generally accepted that excessive lipophilicity and molecular size are risk factors in drug design and the “compound quality” (CQ) literature abounds with fire-and-brimstone sermons on the evils of "molecular obesity" (see H2011). Nevertheless, the relationships between these descriptors and properties such as binding affinity for anti-targets, permeability, aqueous solubility and metabolic lability are generally not quite as strong as is commonly believed (or claimed). When using trends in data to inform design it’s really important to know how strong the trends are because this tells you how much weight to give to the trends when making decisions. It’s not unknown in CQ studies for trends in data to be made to appear to be stronger than they actually are which endows the CQ field with what I’ll politely call a “whiff of the pasture” (the term “correlation inflation” has been used; see KM2013). Transformation of continuous data (IC₅₀ values) to categorical data (high | medium | low) prior to analysis should trigger a deafening cacophony of alarm bells as should any averaging of groups of continuous data values without showing the spread in the data values. Some examples of studies in which I consider the strengths of trends to have been exaggerated include 29, 35, HMO2016 and HY2010.

I think that one thing that everybody who actually works (or has worked) on drug discovery projects agrees on is that drug discovery is really difficult. My view is that, by focusing on Rof5, LipE and Fsp3, the Editorial actually trivializes the challenges faced by drug discovery scientists. Most drug design (as opposed to ligand design) takes place during lead optimization and lead optimization teams are typically addressing specific problems (for example, structural changes that result in increased potency also result in reduced aqueous solubility). Lead optimization teams typically work with a lot of measured data (a significant component of drug design is efficient generation of data to enable decision-making) and a weak correlation between logP and aqueous solubility reported in the literature would be of no practical relevance when the lead optimization team is using aqueous solubility measurements for compounds in the structural series that they’re optimizing. It is common (see M2001 | G2008) for the simplicity of rules, guidelines and metrics to be touted and we noted in KM2013 that:

Given that drug discovery would appear to be anything but simple, the simplicity of a drug-likeness model could actually be taken as evidence for its irrelevance to drug discovery.

Guidelines for successful drug discovery are often presented in terms of something good (or bad) being more likely to happen when the value of a calculated property such as Fsp3 exceeds a threshold. When using guidelines like these be aware that it’s actually very difficult to set these threshold values objectively and that the guidelines would have been stated in an identical manner had different threshold values been chosen to specify them. One difficulty with using guidelines like these is that the creators of the guidelines don’t usually say what they mean by “more likely” (millions of people book flights knowing that one is “more likely” to die in a plane crash if one takes a flight than if one doesn’t take a flight). A number of published guidelines (some of which have been referenced in the Editorial) claim that compounds that comply with the guidelines are more likely to be developable. However, giving weight to these claims would require that developability be defined in an objective manner that enables compounds with arbitrary molecular structures and differing biological activity to be meaningfully compared.

I’ll examine the contributions of the three proposed laureates for the Nobel Prize in Physiology or Medicine following the order in the Editorial. Let's start with the first:

The development of the Rof5 by Christopher A. Lipinski in 1997 introduced a new principle for how molecular and physicochemical properties affect oral bioavailability. [As a reviewer of the manuscript I would have pressed the authors to explicitly state the new principle that their first nominee for the Nobel Prize for Physiology or Medicine had introduced in 1997.]

My view is that the publication of the Rof5 (22) has certainly proven to be highly influential in that it made many drug discovery scientists aware of the need to take account of physicochemical properties, in particular lipophilicity, in drug design. What is less well-known, but possibly more important in my view, is that publication of the Rof5 sent a clear message to Pharma/Biotech management that high-throughput screening wasn’t going to be the panacea that many believed that it would be. However, I don't see the Rof5 as quite the epiphany that the authors of the Editorial would have us believe it to be. The quote with which I started this post was taken from an article that had been published ten years before 22 and the inverse nature of the relationship between aqueous solubility and lipophilicity was being discussed in the scientific literature (see YV1980) more than forty years ago. The NC1996 study is also worthy of mention because it was published more than a year before 22 and it makes the important point that optimal logP values are likely to vary with chemotype ("each congeneric series for a drug backbone usually demonstrates its own optimal log P").

Questions can be raised about the data analysis presented in support of the Rof5 and readers may find it helpful to take a look at the S2019 study as well as my comments on the Rof5 in HBD3 and in this post. I would argue that the Rof5 does not have any practical value as a drug design tool and I would challenge the assertion made in the Editorial that the publication of 22 demonstrated how “molecular and physicochemical properties affect oral bioavailability”. One aspect of the analysis presented (22) in support of the Rof5 that isn't always fully appreciated is that the compounds for which the descriptors are calculated were all treated as having equivalent oral bioavailability (compounds were selected for the analysis on the basis of having been taken into phase 2 clinical trials at some point before the Rof5 had been published in 1997). This is one reason that it’s not credible to assert that the analysis demonstrates that these molecular and physicochemical properties are linked to bioavailability (it must be stressed that, like many, I do actually believe that excessive lipophilicity and molecular size are risk factors in drug design). I make the following point in a blog post (I’ve modified the original text very slightly for consistency with the Editorial):

The Rof5 is stated in terms of likelihood of poor absorption or permeation although no measured oral absorption or permeability data are given in 22 and the Rof5 should therefore be regarded as a statement of belief. I realise that to make such an assertion runs the risk of an appointment with the auto-da-fé and I stress that had the Rof5 been stated in terms of physicochemical and molecular property distributions I would not have made the assertion.

To see what I was getting at let’s take a look at how the Rof5 was stated in 22 (“The ‘rule of 5’ states that: poor absorption or permeation are more likely when…”). However, the analysis presented in support of the Rof5 was of the distribution of compounds in chemical space defined by molecular weight, logP and numbers of hydrogen bond donors and acceptors with no account being taken of variation in either absorption or permeation for the compounds. Analysis like this can be informative but you need to demonstrate that the chemical space is actually relevant to the phenomena of interest. One way that you can demonstrate that a chemical space is relevant is to build predictive models for the phenomena of interest using only the dimensions of the chemical space as descriptors. Alternatively, you might observe meaningful differences between the distributions in the chemical space for compounds that have respectively passed and failed at a particular stage in clinical development.

So that’s all that I’ll be saying about Rof5 and it’s time to take a look at the contributions of the second proposed Nobel Laureate:

The development of LipE by Paul D. Leeson in 2007 introduced a new principle for how physicochemical properties impact potency, selectivity, and safety. [As a reviewer of the manuscript I would have pressed the authors to explicitly state the new principle that their second nominee for the Nobel Prize for Physiology or Medicine had introduced in 2007.]

I'll start by saying that LipE is a simple mathematical formula and I suggest that one shouldn't be confusing simple mathematical formulae with principles when nominating people for Nobel Prizes. There are, however, other errors and these are not the kind of errors that you can afford to make when nominating people for Nobel Prizes. First, the term used in 29 is actually “ligand-lipophilicity efficiency” (LLE) although this appears to have mutated to “lipophilic ligand efficiency” (also LLE) by 2014 (see H2014). The term “LipE” was actually introduced by Pfizer scientists (see R2009) and it is significant that the more recent J2018 article defines LipE in terms of logD rather than logP (doing so means that you can make compounds more efficient simply by increasing extent of ionization and, as a drug design tactic, this is likely to end about as well as things did for the Sixth Army at Stalingrad).

The second (and more serious from the perspective of a Nobel nomination) error is that the metric had already been discussed, although not named, in the literature when 29 was published (I’m guessing that a suggestion that naming a metric merits a Nobel Prize for Physiology or Medicine might cause some members of the Nobel Prize committee to choke on their surströmming). The L2006 book chapter, published fifteen months before 29, states:

Thus, to achieve compounds with a not too high log P while still retaining potency, the difference between the log potency and the log D can be utilised.

From the A2007 perspective which was published three months before 29:

Lipophilicity is thought to be a driving force for binding to anti-targets such as the hERG ion channel and cytochrome p450 enzymes and potency can be scaled by lipophilicity by subtracting measured or calculated 1-octanol water partition coefficients from pIC₅₀.

It might be helpful to say something about efficiency metrics since LipE (or LLE if you prefer) is an example of an efficiency metric. The idea behind efficiency metrics is to “normalize” a compound’s activity (typically quantified by potency or affinity) by the value of a risk factor such as lipophilicity or molecular size (for the masochists among you there’s an entire section in 34 on normalization of binding affinity). Ligand efficiency (LE) was introduced in 2004 (see H2004) and is generally regarded as the original efficiency metric although its creators do acknowledge the influence of the K1999 study. I’ve argued at length in 34 (Table 1 and Figure 1 in the article capture the essence of the argument) that LE is physically meaningless because perception of efficiency changes if you use a different concentration to define the standard state (by convention ΔG_binding values correspond to an arbitrary 1 M standard concentration) and there is no way to objectively select any particular value of the standard concentration for calculation of LE. The problem doesn’t go away if you try to define ligand efficiency in terms of logarithmically expressed values of IC₅₀, K_i or K_d instead of ΔG_binding because these quantities still have to be divided by an arbitrary concentration value in order to be expressed as logarithms (see M2011). My view is that LE shouldn't even be described as a metric and I sometimes appropriate a quote ("it's not even wrong") that is usually attributed to Pauli because those who advocate the use of LE in drug design are unable (or unwilling) to say what it measures.
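The standard-state argument can be illustrated with a minimal sketch. The two compounds below are hypothetical (the K_d values and heavy atom counts are invented for illustration and are not taken from 34), and the scaled quantity is expressed as -log10(K_d/C°) per heavy atom; multiplying by RT ln(10) to convert to free energy units would not change the comparison.

```python
import math

def ligand_efficiency(kd_molar, n_heavy, c_standard=1.0):
    """Return -log10(Kd / C_standard) divided by the number of heavy atoms.

    c_standard is the (arbitrary) standard concentration in mol/L.
    """
    return -math.log10(kd_molar / c_standard) / n_heavy

# Hypothetical compounds, invented purely for illustration.
fragment = {"kd": 1e-3, "heavy": 10}  # Kd = 1 mM, 10 heavy atoms
lead = {"kd": 1e-9, "heavy": 30}      # Kd = 1 nM, 30 heavy atoms

for c0 in (0.1, 1.0, 10.0):
    le_frag = ligand_efficiency(fragment["kd"], fragment["heavy"], c0)
    le_lead = ligand_efficiency(lead["kd"], lead["heavy"], c0)
    print(f"C0 = {c0:>5} M  fragment LE = {le_frag:.3f}  lead LE = {le_lead:.3f}")
```

With the conventional 1 M standard concentration the two hypothetical compounds appear equally efficient. With 0.1 M the lead appears more efficient and with 10 M the fragment does, even though nothing about either compound has changed.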

The meaninglessness of LE stems from it being defined by scaling ΔG_binding by the design risk factor (molecular size). In contrast, LipE is defined by offsetting pIC₅₀ by the risk factor (logP) and can be interpreted (see 34) as the energetic cost of moving the ligand from octanol to its target binding site (this interpretation is only valid when the ligand binds in its neutral form and is predominantly neutral in the aqueous phase). When considering lipophilicity in property-based design it is important to be aware that octanol is an arbitrary choice of solvent for measurement of partition coefficients and that the logP (or logD) calculated for a compound may differ significantly depending on the algorithm used for the calculations. That said, the hydrogen bond donors/acceptors and ionizable groups tend to be relatively conserved within structural series which means that the details of exactly how lipophilicity is quantified are likely to be less critical in lead optimization than for structurally-diverse sets of compounds.

When we use LipE we’re actually assuming that logP (or logD) is predictive of properties such as aqueous solubility, affinity for anti-targets and metabolic lability. That is why it’s not accurate to state that the introduction of LipE showed how “physicochemical properties impact potency, selectivity, and safety”. In some published studies the focus is less on the LipE metric and more on what might be called the "lipophilic efficiency concept" (aim for top left corner of a plot of potency against lipophilicity). It is common to add reference lines of constant LipE to plots of potency against lipophilicity in this type of analysis and if you're doing this you really should be citing R2009 rather than 29.
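Since LipE is defined by offsetting pIC₅₀ by logP, a reference line of constant LipE in a plot of pIC₅₀ against logP is simply a line of unit slope. The sketch below makes this concrete; the function names and the two compounds are hypothetical and the numbers are invented for illustration.

```python
def lipe(pic50, logp):
    """LipE defined as potency offset by lipophilicity (pIC50 - logP)."""
    return pic50 - logp

def pic50_on_reference_line(logp, lipe_value):
    """pIC50 at which a line of constant LipE passes through a given logP."""
    return logp + lipe_value

# Hypothetical compounds as (pIC50, logP); values invented for illustration.
compounds = {"A": (7.0, 4.0), "B": (6.5, 1.5)}

for name, (pic50, logp) in compounds.items():
    print(f"{name}: pIC50 = {pic50}, logP = {logp}, LipE = {lipe(pic50, logp)}")

# Points on the LipE = 5 reference line at a few logP values.
for logp in (1.0, 2.0, 3.0):
    print(f"logP = {logp}: pIC50 on LipE = 5 line is {pic50_on_reference_line(logp, 5.0)}")
```

In this invented example compound B is less potent than compound A but sits closer to the top left corner of the plot. Whether that matters in a given project depends on how predictive logP actually is of the properties of interest within the structural series.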

I'll finish the commentary on LipE (or LLE if you prefer) with this statement made in the Editorial:

Emerging from an analysis of approved drugs, this rubric predicts a compound is more likely to be clinically developable when LipE > 5. [I don’t know what the authors of the Editorial mean by “rubric” (I'm not even sure that they do) but as a reviewer of the manuscript I would have pressed them to justify their claim. Specifically I would have been looking for a literature reference (for me, the choice of the word “emerging” does rather conjure up an image of hot gases and stoned priestesses at Delphi) and a coherent explanation for why a value of 5 yields a better rubric than values of 4 or 6.]

That’s all that I’ll be saying about LipE (or LLE if you prefer) and it’s time to take a look at the contributions of the third nominee for the Nobel Prize in Physiology or Medicine:

Finally, the development of Fsp3 by Frank Lovering in 2009 introduced a new principle for how molecular shape affects pharmaceutical properties and developability. [As a reviewer of the manuscript I would have pressed the authors to explicitly state the new principle that their third nominee for the Nobel Prize for Physiology or Medicine had introduced in 2009. My view is that Fsp3 is a thoroughly unconvincing descriptor of molecular shape and I suggest readers consider the suggestion that cyclohexane (Fsp3 = 1) would have a better shape match with benzene (Fsp3 = 0) than with either methane (Fsp3 = 1) or adamantane (Fsp3 = 1).]

[10-Aug-2025 update: The authors of the CNM2025 study claim "we repeated an analysis similar to that of Lovering et al. to assess Fsp3 in drugs approved post-2009 and those in active clinical development as of mid-2024" and conclude "there appeared to be no clear relationship between highest phase reached and Fsp3, suggesting the key conclusion noted by Lovering et al. has not persisted". My view expressed in this 05-Aug-2025 post is that the analysis is not sufficiently similar to support this conclusion.]

[04-Aug-2024 update: The Fsp3 descriptor had actually been used as i_ali in the YG2003 study (Prediction of Aqueous Solubility of Organic Compounds by Topological Descriptors) six years before the publication of 35:

The aliphatic indicator of a molecule (i_ali) is equal to the number of sp3 carbons divided by the total number of carbon atoms in the molecule.

The YG2003 study discussed prediction of aqueous solubility using i_ali (renamed as Fsp3 in 35) in conjunction with other topological descriptors. In contrast with the claims made in 35 for Fsp3 the YG2003 study made no suggestion that i_ali was a highly effective predictor of aqueous solvation when used by itself.]
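The definition quoted above is simple enough to apply by hand, which makes the shape comparison easy to check. This is a minimal sketch in which the carbon counts are entered manually for the four hydrocarbons mentioned earlier (the function name is mine and no cheminformatics toolkit is used).

```python
def fsp3(n_sp3_carbons, n_carbons):
    """Number of sp3 carbons divided by total number of carbons."""
    if n_carbons == 0:
        raise ValueError("Fsp3 is undefined for a structure with no carbon atoms")
    return n_sp3_carbons / n_carbons

# (sp3 carbons, total carbons), counted by hand from the molecular structures.
carbon_counts = {
    "methane": (1, 1),
    "cyclohexane": (6, 6),
    "adamantane": (10, 10),
    "benzene": (0, 6),
}

for name, (n_sp3, n_c) in carbon_counts.items():
    print(f"{name}: Fsp3 = {fsp3(n_sp3, n_c):.2f}")
```

Methane, cyclohexane and adamantane all have the same value of the descriptor despite their very different shapes, while cyclohexane and benzene sit at opposite extremes.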

Before discussing the contributions of the third nominee for the Nobel Prize for Physiology or Medicine I should stress that I certainly consider gratuitous use of aromatic rings to be a very bad thing in drug design (it was the data analysis in 35 that was criticized in KM2013 but not the eminently sensible suggestion that drug designers should look beyond what the authors referred to as ‘Flatland’). Having sp3 carbon atoms in a scaffold provides drug designers with a wider range of options for placement of substituents than would be the case for a fully aromatic scaffold and we stated in KM2013 that:

One limitation of aromatic rings as components of drug molecules is that some regions above and below the plane defined by the atomic nuclear positions are not directly accessible to substituents. Molecular recognition considerations suggest a focus on achieving axial substitution in saturated rings with minimal steric footprint, for example by exploiting the anomeric effect or by substituting N-acylated cyclic amines at C2.

My view is that deleterious effects of aromatic rings on aqueous solubility would be more plausibly explained by molecular interactions stabilizing the solid state than in terms of molecular shape (this point is discussed in more detail in HBD3). I also see saturated ring systems such as bicyclo[1.1.1]pentane and cubane as potentially more resistant to metabolism than benzene.

There’s one point that I need to make before discussing 35 from the data analysis perspective which is that molecular structures with basic nitrogen atoms tend to have higher Fsp3 values than molecular structures that lack basic nitrogen atoms (see L2013). This means that you can’t tell whether the benefits of higher Fsp3 values are actually caused by the higher Fsp3 values or by the presence of basic nitrogen atoms.

The Editorial states:

Stemming from an analysis of discovery compounds, investigational drugs, and approved drugs, Fsp3 predicts a discovery compound is more likely to become a drug when Fsp3 > 0.40. [Figure 3 in 35 does not actually depict a significant difference between mean Fsp3 values for discovery compounds and marketed drugs (the significant difference between mean Fsp3 values is for discovery and Phase 2 compounds).]

It’s not clear (at least to me) where the figure of 0.40 comes from and I would argue that compound X (IC₅₀ against therapeutic target = 50 μM; Fsp3 = 0.80) would actually be less likely to become a drug than compound Y (IC₅₀ against therapeutic target = 10 nM; Fsp3 = 0.20). I’m assuming that what the Editorial refers to as “analysis of discovery compounds, investigational drugs, and approved drugs” is what is shown by Figure 3 in 35. Presenting data in this manner hides the variation in Fsp3 for the compounds at each stage of development and makes the trends look much stronger than they actually are (this is verboten according to current J Med Chem author guidelines which state "If average values are reported from computational analysis, their variance must be documented."). I would challenge the suggestion that what is shown in Figure 3 in 35 can be used to calculate the probability that an arbitrary compound will become a drug (my view is that it’s not feasible to even define the probability that a compound will become a drug in a meaningful manner). Analyses of success in clinical development are generally more convincing when comparisons are made between compounds that pass or fail in individual phases of clinical development than between compounds in different phases of clinical development.

The Editorial continues:

This observation was ascribed to increased Fsp3 leading to increased aqueous solubility, a critical physiochemical property for successful drug discovery.

I’m assuming that what the Editorial refers to as “increased Fsp3 leading to increased aqueous solubility” is the trend shown by Figure 5 of 35 (this featured prominently in the KM2013 correlation inflation article) which claims to show the relationship between Fsp3 and log S (aqueous solubility expressed as a logarithm). This claim is not accurate because the log S values have been binned and the relationship is actually between centre point of bin and mean log S value for bin. The authors of 35 used public domain aqueous solubility data for their analysis and we showed (KM2013; see Figure 5) that the Pearson correlation coefficient for the relationship between log S and Fsp3 is only 0.25 (the corresponding value for the binned data is 0.97). I consider the suggestion that such a weak correlation could have any relevance whatsoever to the likelihood of success in clinical trials to be wild and uninformed conjecture.
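The effect of binning on apparent correlation strength can be reproduced with synthetic data. The sketch below does not use the solubility data from 35 or KM2013; it generates random points with a deliberately weak underlying trend (the noise level, sample size and number of bins are my assumptions) and compares the Pearson correlation coefficient for the raw points with that for bin centres against bin means.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def binned_means(xs, ys, n_bins=5):
    """Bin x values in [0, 1) into equal-width bins; return bin centres and mean y per bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        i = min(int(x * n_bins), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    centres = [(i + 0.5) / n_bins for i in range(n_bins) if counts[i]]
    means = [sums[i] / counts[i] for i in range(n_bins) if counts[i]]
    return centres, means

# Synthetic data: weak linear trend buried in noise (not real solubility data).
rng = random.Random(0)
xs = [rng.random() for _ in range(5000)]
ys = [x + rng.gauss(0.0, 1.1) for x in xs]

r_raw = pearson(xs, ys)
centres, means = binned_means(xs, ys)
r_binned = pearson(centres, means)

print(f"Pearson r for raw data:     {r_raw:.2f}")
print(f"Pearson r for binned means: {r_binned:.2f}")
```

Averaging within bins removes most of the scatter, so the correlation between bin centres and bin means says very little about how well x predicts y for an individual compound.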

I'll finish my commentary on Fsp3 by reproducing this claim made in the Editorial:

Much like the Rof5 and LipE, Fsp3 has proven to be enduringly useful for the design of compounds with improved chances of clinical success. (37) [My view is there is insufficient evidence to justify this claim and I'm perplexed by the citation of 37. In any case, members of the Nobel committee are likely to focus more on whether or not Fsp3 is usefully predictive than on the endurance of this molecular descriptor.]

It’s now time to summarise what has been a long and at times pedantic blog post, and I thank all readers who’ve stayed with me. I don’t consider any of the three studies (22 | 29 | 35) that form the basis of the Nobel Prize nomination to have reported significant scientific discoveries and I would also challenge the claim made in the Editorial that these studies introduced new principles. I’m aware that 22 is heavily cited and I certainly agree that it is common to see values of LipE and Fsp3 quoted in the drug discovery literature. Nevertheless, I would argue that the Editorial failed to provide even a single convincing example of the Rof5, LipE or Fsp3 making a critical contribution to the discovery of a marketed drug (this should be quite sufficient to rule out the award of a share in the Nobel Prize for Physiology or Medicine to any of these nominees). Furthermore, the Editorial doesn’t provide any convincing evidence that the Rof5, LipE or Fsp3 are usefully predictive in drug discovery projects.

Aside from the failure of the Editorial to demonstrate significant impact for the Rof5, LipE and Fsp3, I do have some scientific concerns about this Nobel Prize nomination. First, the Rof5 is not actually supported by data in the form that it is stated. Second, LipE had already been discussed, although not named, in the drug discovery literature when 29 was published. Third, Fsp3 had already been introduced (as i_ali) for aqueous solubility prediction and the data analysis in 35 would fail to comply with current J Med Chem author guidelines.

A time and place for Nature in drug discovery?

2024-05-20T00:46:00.021+01:00

I’ll be reviewing Y2022 (The Time and Place for Nature in Drug Discovery) in this post and stating my position on natural products in modern drug discovery is a good place to start. I certainly see value in screening natural products and natural product-like compounds (especially in phenotypic assays) and there is currently a great deal of interest in chemical probes (I’ll point you toward an article on the Target 2035 initiative and a link to the Chemical Probes Portal). In general, a natural product or natural product-like active identified by screening would either need to exhibit novel phenotypic effects or be significantly more potent than other known actives for me to be enthusiastic about following it up. I would certainly consider screening fragments that are only present in natural product structures although these would still need to comply with the criteria (typically defined in terms of properties such as molecular size, molecular complexity and lipophilicity) used to select fragments. I see significant benefits coming from the increased use of biocatalysis, both in drug discovery and for manufacturing drugs, but I don’t see these benefits as being restricted to synthesis of natural products or natural product-like compounds.

This will be a very long post (for which I make no apology) and it's a good point to say something about how the review is presented. I've used section headings (in bold text) used in Y2022 for my commentary and quoted text has been indented (my comments on the quoted text enclosed with square brackets and italicized in red). I'd like to raise four general points before starting my review:

Proprietary data cannot accurately be described as “facts” or “evidence” and it’s not valid to claim that you’ve proven or demonstrated something on the basis of analysis of proprietary data.
If continuous data such as oral bioavailability measurements have been made categorical (e.g., high | medium | low) prior to analysis then it’s generally a safe assumption that any trends "revealed" by the analysis are weak.
If basing claims on analysis of locations or distributions within a particular chemical space it is necessary to demonstrate the chemical space is actually relevant to the claims being made. One way to do this is to build usefully predictive models of relevant quantities such as aqueous solubility or permeability using only the dimensions of the chemical space as descriptors.
There are generally many ways to partition a region of chemical space into subregions with different average values for a measured quantity. Although the boundaries resulting from these analyses typically appear to be well-defined (for example, as a line or curve in a 2-dimensional chemical space) it is a serious error to automatically interpret such boundaries as meaningful from a physicochemical perspective.

I have a number of concerns about the Y2022 article and I’ll focus on the more serious of these in this post. I’ll also be commenting on the Rule of 5 (Ro5; see L1997), logP/logD differences, and the drug discovery “sweet spot” reported in the HK2012 article. My view is that a number of the assertions and recommendations made by the authors of Y2022 are not supported by the analyses or the data that they’ve presented. Specifically, the authors present results of analyses that had been performed using proprietary and undocumented models and, in my view, they have grossly over-interpreted the predictions made using the models. At times, the authors appear to be treating natural products as if these occupy a distinct and contiguous region of chemical space (this is a pitfall into which drug-likeness advocates also frequently stumble). The authors of Y2022 discuss physicochemical properties at considerable length without making any convincing connection between this discussion and natural products. Reading the Y2022 article, I did detect a subliminal message that natural products might be infused with vital force and wouldn’t have been surprised to see Gwyneth Paltrow as a co-author.

I’ll make some general observations before examining Y2022 in detail. If you’re going to base decisions on trends in data then you need to know how strong the trends are because this tells you how much weight to give to the trends when making your decisions. In what I’ll call the ‘compound quality’ field you’ll often encounter data presentations that make it extremely difficult to see how strong (or weak) the trends in the data actually are (see KM2013: Inflation of correlation in the pursuit of drug-likeness). Since Ro5 was introduced in 1997 (see L1997) there has been a free flow of advice from self-appointed compound quality gurus as to how compounds can be made better, more developable and more beautiful (introduction of the term “Ro5 envy” in KM2013 appeared to cause some to spit feathers). This advice frequently comes in the form of dire warnings that exceeding a threshold value of a property, such as molecular weight or predicted octanol/water partition coefficient, will increase the probability of something bad happening. It’s actually very difficult to set thresholds like these objectively and you have to consider the possibility that some of these statements of probability are merely expressions of belief (to some “there is a high probability that God exists” will sound rather more convincing than “I believe in God”).

The graphical abstract is a good place to start my review of Y2022. I don’t know whether biotransformations exist that would convert the Core Scaffold into compounds that would match the Bios Collection generalized structure but a 1,3-diene in conjugation with a tertiary nitrogen is not the sort of substructure that I would want to see in a screening active that I had been charged with optimizing.

Abstract

The authors of Y2022 state:

The declining natural product-likeness of licensed drugs and the consequent physicochemical implications of this trend in the context of current practices are noted. [The authors do not make a convincing connection between natural product-likeness and physicochemical properties.] To arrest these trends, the logic of seeking new bioactive agents with enhanced natural mimicry is considered; notably that molecules constructed by proteins (enzymes) are more likely to interact with other proteins (e.g., targets and transporters), a notion validated by natural products. [I consider this claim to be extravagant and it does need to be supported by evidence. The authors’ use of “validated” reminded me of the extravagant claim made in a Future Medicinal Chemistry editorial that “ligand efficiency validated fragment-based design”. Taking the statement literally, the authors appear to be suggesting that a compound would be more likely to interact with proteins if it had been isolated from natural sources than if it had been synthesized in a laboratory (I was reminded of the "water memory" explanation for why homeopathy works). If “molecules constructed by proteins” really are more likely to interact with other proteins then they’re also more likely to interact with anti-targets like hERG and CYPs. I’m guessing that the response of medicinal chemistry teams tackling CNS targets to suggestions that they should make their compounds more like natural products so as to increase the likelihood of recognition by transporters might be to ask which natural products those offering the advice had been smoking.]

Introduction

The authors show time-dependence for the values of a number of parameters calculated for drugs in Figure 1. I see analyses like these as exercises in philately and, when I first encountered examples about two decades ago, I formed a view that some senior medicinal chemists had a bit too much time on their hands. The observation of significant time-dependency for a parameter calculated for drugs can mean one of three things. First, the parameter is irrelevant to drug discovery (however, the absence of a time-dependence shouldn't be taken as evidence that the parameter is relevant to drug discovery). Second, the old ways were best and the medicinal chemists of today have lost their way (I’m guessing this might be Jacob Rees Mogg’s interpretation if he were a medicinal chemist). Third, the old ways no longer work so well and the medicinal chemists of today have learned new ways.

I have a number of concerns about what is shown in Figure 1 (quite aside from these concerns I would question why 1b or 1c were even included in the study). The data values that have been plotted are actually mean values and, as we observed in KM2013, the presentation of mean (or median) values without showing measures of the spread in the data, such as standard deviation or inter-quartile range, makes trends look stronger than they actually are (others use the term “voodoo correlations”). This way of presenting data is specifically verboten by J Med Chem and the Author Guidelines (viewed 18-May-2024) for that journal state:

If average values are reported from computational analysis, their variance must be documented. This can be accomplished by providing the number of times calculations have been repeated, mean values, and standard deviations (or standard errors). Alternatively, median values and percentile ranges can be provided. Data might also be summarized in scatter plots or box plots.
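Reporting spread alongside averages takes very little effort. The following is a minimal sketch using Python's standard library; the two groups of values are invented for illustration and do not correspond to anything plotted in Figure 1 of Y2022.

```python
import statistics

def summarize(values):
    """Return count, mean, standard deviation, median and quartiles for a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": (q1, q3),
    }

# Hypothetical descriptor values for two groups of compounds (invented data).
groups = {
    "earlier": [1.2, 2.8, 3.1, 0.4, 4.9, 2.2, 3.7, 1.9],
    "later": [1.8, 3.3, 3.9, 0.9, 5.4, 2.6, 4.1, 2.5],
}

for label, values in groups.items():
    s = summarize(values)
    print(
        f"{label}: n = {s['n']}, mean = {s['mean']:.2f}, sd = {s['sd']:.2f}, "
        f"median = {s['median']:.2f}, IQR = {s['iqr'][0]:.2f} to {s['iqr'][1]:.2f}"
    )
```

Plotting only the two means would suggest a clean shift between the groups, whereas the standard deviations and inter-quartile ranges show how much the invented distributions overlap.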

However, the hidden variation in the response variables is not the only issue that I have with Figure 1. Let’s take a look at Figure 1a which shows “a temporal comparison of natural product likeness of approved drugs assessed by the Natural Product Scout algorithm (12) versus the year of the first disclosure of the drug” although the caption for Figure 1a is “Natural product class probability. (8)”. I think that the authors do need to explain exactly what they mean by natural product class probability because the true probability that a compound is a natural product is either 1 (it’s a natural product) or 0 (it’s not a natural product). Put another way there are differences between natural products and Prof. Schrödinger’s unfortunate feline companion. The measure of lipophilicity shown in Figure 1c is XLogP3 although no justification is given for the selection of this particular method for lipophilicity prediction nor is any reference provided.

Before continuing with my review of Y2022 I also need to examine Ro5 and discuss the difference between logP and logD (the reasons for these digressions will hopefully become clear later). Ro5 was based on physicochemical property distributions for compounds that had been taken into phase 2 of clinical development before 1997 (the year that L1997 was published). My view is that Ro5 certainly raised awareness of the problems associated with excessive lipophilicity and molecular size (A Good Thing) but I’ve never considered Ro5 to be useful in design. Although Ro5 is accepted by many (most?) drug discovery scientists as an article of faith, some are prepared to ask awkward questions and I’ll mention the S2019 study. Let’s take a look at how Ro5 was specified in the L1997 article (the graphic is slide #17 from a presentation that I gave late last year):

Ro5 is stated in terms of likelihood of poor absorption or permeation although no measured oral absorption or permeability data are given in the L1997 study and Ro5 should therefore be regarded as a statement of belief. I realise that to make such an assertion runs the risk of an appointment with the auto-da-fé and I stress that had Ro5 been stated in terms of physicochemical and molecular property distributions I would not have made the assertion.

Medieval cartographers annotated the unknown regions of their maps with “here be dragons” and Ro5’s dragons are poor absorption and poor permeation. However, there's another issue which I touched on in HBD3:

It is significant that attempts to build global models for permeability and solubility, using only the dimensions of the chemical space in which the Ro5 is specified as descriptors, do not appear to have been successful.

What I was getting at in HBD3 is that the chemical space in which Ro5 is specified was not demonstrated to be relevant to permeability or solubility (this relates to the third of the four points that I raised at the start of the post). It must be stressed that I'm definitely not denying that relationships exist between descriptors, such as logP, used to specify Ro5 and properties such as aqueous solubility and permeability that are more directly relevant to getting drugs to where they need to be. It’s just that these relationships are weak (see TY2020) and, while we don’t know exactly how weak the relationships are, we do know that they are weak because continuous data have been binned to display them (see also KM2013 and specifically the comments on HY2010). I would generally anticipate that these relationships will be stronger within structural series but in these cases you’ll generally observe different relationships for different structural series. In practical terms this means that a logP of 5 might be manageable in one structural series while in another structural series compounds with logP greater than 3.5 prove to be inadequately soluble. As I advised in NoLE:

Drug designers should not automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.
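The point about binning made above can be illustrated with a small, entirely synthetic example. The sketch below assumes a hypothetical descriptor and response (neither is taken from any of the cited studies) in which a weak trend is buried in large scatter, and compares the Pearson correlation for the individual data points with the correlation for bin averages.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def bin_means(xs, ys, bin_size):
    """Average consecutive groups of bin_size points (data assumed sorted by x)."""
    binned_x, binned_y = [], []
    for start in range(0, len(xs), bin_size):
        chunk_x = xs[start:start + bin_size]
        chunk_y = ys[start:start + bin_size]
        binned_x.append(sum(chunk_x) / len(chunk_x))
        binned_y.append(sum(chunk_y) / len(chunk_y))
    return binned_x, binned_y

# Synthetic data (hypothetical): a weak trend (slope 0.1) with scatter of +/-5
descriptor = [float(i) for i in range(20)]
response = [0.1 * x + (5.0 if i % 2 == 0 else -5.0)
            for i, x in enumerate(descriptor)]

r_raw = pearson_r(descriptor, response)
binned_descriptor, binned_response = bin_means(descriptor, response, bin_size=4)
r_binned = pearson_r(binned_descriptor, binned_response)
print(f"r for individual points: {r_raw:.2f}")
print(f"r for bin averages:      {r_binned:.2f}")
```

In this contrived data set the scatter cancels exactly within each bin, so the bin averages fall on a straight line even though the individual points show almost no correlation. Real data will be less extreme but the direction of the effect is the same: averaging within bins hides the variation that determines how strong a relationship actually is.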

I also need to discuss the distinction between logP and logD since this is a source of confusion for medicinal chemists and compound quality 'experts' alike. Here’s a graphic (it’s slide #18) from the presentation that I did at SancaMedChem in 2019 (if the piranhas did venture into the non-polar phase they'd probably end up swimming backstroke):

The partition coefficient (P) is simply the ratio of the concentration of the neutral form of the compound in the organic phase (usually octanol) to the concentration of the neutral form of the compound in water when both phases are in equilibrium. The distribution coefficient (D) is defined analogously as the ratio of the sum of concentrations of all forms of the compound in the organic phase to the sum of concentrations of all forms of the compound in water. Values of P and D are usually quoted as their logarithms logP and logD. When interpreting logD values it is commonly assumed that only neutral forms of compounds partition into organic phases and if we make this assumption the relationship between logD and logP is given by Eqn 1 (see B2017):

When we perform experiments to quantify lipophilicity it is actually logD that is measured. Values of logP and logD are identical when ionization can be neglected and logP values for ionizable compounds can be obtained by examination of measured logD-pH profiles although this is rarely done. It’s usually a safe assumption that logP values used by drug discovery scientists (and quoted in medicinal chemistry publications) have been predicted and these values vary with the method used for prediction of logP. For example, L1997 states that the upper logP limit for Ro5 is 5 when logP is calculated using the ClogP method (see L1993) but 4.15 when logP is calculated using the method of Moriguchi et al. (see M1992). Values of logD that you encounter in the literature may have been calculated or measured (you might need to dig around to see if you’re dealing with real data) and it’s also important to remember that logD depends on pH. I would argue that logD is less appropriate than logP for defining compound quality metrics because excessive lipophilicity can be countered simply by increasing the extent to which compounds are ionized (I hope you can see why that would be A Bad Thing). Another way to think about this is to consider an amine with a pKa value of 8 bound to hERG at a pH of 7. Now suppose that you can change the pKa of the amine to 11 without changing anything else in the molecular structure. What effects would you expect this pKa change to have on affinity, on logD and on logP?
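The logD part of the amine example can be worked through numerically. The sketch below assumes a monoprotic base for which only the neutral form partitions into the organic phase, in which case logD = logP − log10(1 + 10^(pKa − pH)). The logP value of 4 is hypothetical, and the function name is illustrative rather than taken from any library.

```python
from math import log10

def log_d_monoprotic_base(log_p, pka, ph):
    """logD for a monoprotic base, assuming only the neutral form partitions."""
    return log_p - log10(1.0 + 10.0 ** (pka - ph))

log_p = 4.0  # hypothetical logP, held fixed when the pKa is changed
for pka in (8.0, 11.0):
    log_d = log_d_monoprotic_base(log_p, pka, ph=7.0)
    print(f"pKa = {pka:4.1f}  logP = {log_p:.2f}  logD(pH 7) = {log_d:.2f}")
```

Under these assumptions logP is unchanged by construction while logD at pH 7 falls by about three log units when the pKa is raised from 8 to 11. The calculation says nothing about affinity, which is the part of the question that cannot be answered from a partitioning model.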

I’ll now get back to reviewing Y2022 and let’s take a look at Figure 2 which shows an adapted version of the “drug discovery sweet spot” proposed in the HK2012 study. As with Figure 1b and 1c, I would question why Figure 2 was included in the Y2022 study since the connection with natural products is tenuous. In my view the authors of the HK2012 study made a number of serious errors in their definition of the “sweet spot” and these errors have been reproduced in the Y2022 study. The authors of HK2012 claimed to have identified a “drug discovery sweet spot” in a chemical space defined by “Log P” and “Molecular mass” but they didn’t demonstrate that this chemical space is actually relevant to drug discovery (one way to demonstrate relevance is to build convincing global models for prediction of properties like permeability and aqueous solubility using only the dimensions of the chemical space as descriptors).

If claiming to have identified a drug discovery “sweet spot” it’s important that each dimension of the chemical space in which the “sweet spot” is defined corresponds to a single entity. While “Molecular mass” is unambiguous the term “Log P” does not refer to the same entity for each of the data sets from which the “sweet spot” has been derived. As noted previously ClogP (see L1993) was used to specify Ro5 while the Gleeson upper Log P limit (see G2008) and the “μM potency Log P” (see G2011) were specified respectively by values of clogP (calculated logP from ACD) and AlogP (no reference provided). In contrast the Pfizer Golden Triangle (see J2009) is specified using elogD (proprietary logD prediction method for which details were not provided). The Waring low and high logP/logD values stated in W2010 are at least partly based on analysis of AZlogD7.4 values (proprietary logD prediction method; details not provided) reported in the WJ2007 and W2009 studies. The W2010 study states that “the optimal range of lipophilicity lies between ~ 1 and 3” but these are not the values that are depicted in Figure 3 (or indeed in the original HK2012 study). The Gleeson upper limits for Log P and Molecular Mass stated in G2008 reflect the arbitrary schemes used to bin the data and should not be regarded as objectively-determined limits for these quantities. The authors of Y2022 have superimposed ellipses for "SHMs", "Antibiotic Space?" and "bRo5 / AbbVie MPS space for higher MW" on the HK2012 "sweet spot" in the creation of Figure 2 although it is not clear how these ellipses were constructed.

The Physicochemical Characteristics of Drugs

The authors assert:

A principle advocated by Hansch that drug molecules should be made as hydrophilic as possible without loss of efficacy (47) is commonly expressed and utilized as Lipophilic Ligand Efficiency (LLE). (48) [If actually using this principle advocated by Hansch you would optimize leads by varying hydrophilicity and observing efficacy. While LLE is one way to express Hansch’s principle it is by no means the only way and (pIC₅₀ – 0.5 × logP) would be equally acceptable as a lipophilic ligand efficiency metric from the perspective of Hansch’s principle.] This metric, widely accepted and exploited in drug discovery as a key metric in optimization, is expressed on a log scale as activity (e.g., −log₁₀[XC₅₀]) [The logarithm function is not defined for dimensioned quantities such as XC₅₀ (see M2011) and, while it may appear to be nitpicking to point it out, this is the source of the invalidity of the ligand efficiency metric as was discussed at length in NoLE.] minus a lipophilicity term (typically the Partition coefficient or log₁₀ P or sometimes log D_7.4). (49) [Although it is common to see LLE values quoted in the drug discovery literature it’s much less clear how (or even whether) the metric was actually used to make project decisions. In many studies, however, the focus is on plots of pIC₅₀ against logP (or logD) rather than values of the metric itself. In lead optimization, medicinal chemists typically need to balance activity against properties such as permeability, aqueous solubility, metabolic stability and off-target activity. In these situations, experienced medicinal chemists typically give much more weight to structure-activity relationships (SARs) and structure-property relationships (SPRs) that they've observed within the structural series that they're optimising than to crude metrics of questionable relevance and predictivity.
It is noteworthy that the authors of ref 49 use logD rather than logP to define LLE (which they call LiPE) and if you do this then you can make compounds more efficient simply by increasing the extent to which they are ionized.] The impact of lipophilicity on efficacy needs to be considered in the context that reducing lipophilicity (equating to increasing hydrophilicity) will generally increase the solubility, reduce the metabolism, and reduce the promiscuity of a given compound in a series. (50) [The relationships between these properties and lipophilicity shown in ref 50 are for structurally diverse data sets rather than for individual series. I consider the activity criterion (pIC₅₀ > 5) used to quantify promiscuity in ref 50 to be at least an order of magnitude too permissive to be pharmaceutically relevant.]
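The comment above that (pIC₅₀ – 0.5 × logP) would serve Hansch's principle just as well as LLE can be made concrete with a short sketch. The two compounds and their pIC₅₀ and logP values are hypothetical, and the `weight` parameter is introduced here purely for illustration.

```python
def lipophilic_efficiency(pic50, log_p, weight=1.0):
    """pIC50 offset by a weighted lipophilicity term (weight=1.0 gives LLE)."""
    return pic50 - weight * log_p

# Hypothetical compounds: name -> (pIC50, logP)
compounds = {"A": (8.0, 5.0), "B": (6.5, 3.0)}

def ranking(weight):
    """Compound names ordered from most to least 'efficient'."""
    return sorted(
        compounds,
        key=lambda name: lipophilic_efficiency(*compounds[name], weight=weight),
        reverse=True,
    )

for weight in (1.0, 0.5):
    scores = {name: lipophilic_efficiency(*values, weight=weight)
              for name, values in compounds.items()}
    print(f"weight = {weight}: scores = {scores}, ranking = {ranking(weight)}")
```

With a weight of 1.0 compound B is ranked ahead of compound A, and with a weight of 0.5 the order is reversed. Nothing in Hansch's principle selects one weight over the other, which is why the choice of metric needs to be justified rather than assumed.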

Let’s take a look at Figure 3 in which values of “Calc Chrom Log D_7.4” are plotted against “CMR”. This is what the authors say about Figure 3 in the text of Y2022:

The distribution of marketed oral drugs in terms of their lipophilicity and size, shows a remarkably similar distribution to the set of compounds designed by Kell as a representative set of natural products to investigate carrier mechanisms (Figure 3). (64) [To state “shows a remarkably similar distribution” is arm-waving given that there are methods for assessing the similarity of two distributions in an objective manner.]
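One example of an objective method for comparing two distributions is the two-sample Kolmogorov-Smirnov statistic, which is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below is a minimal pure-Python version for a single variable; the sample values are hypothetical and it is not a claim about how the data in Figure 3 should have been analysed (for two-dimensional data such as lipophilicity and size, each axis would need to be examined or a multivariate method used).

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (max gap between empirical CDFs)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The largest gap must occur at one of the observed values
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Hypothetical lipophilicity values for two compound sets
set_one = [0.5, 1.2, 1.9, 2.4, 3.1, 3.8]
set_two = [1.0, 2.2, 3.0, 3.9, 4.6, 5.3]
print(f"KS statistic = {ks_statistic(set_one, set_two):.2f}")
```

A value of 0 means that the empirical distributions coincide and a value of 1 means that they do not overlap at all. Converting the statistic to a p-value requires the sample sizes and is left to a statistics package.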

As is the case for Figure 1a, what is written in the text about Figure 3 differs significantly from the caption for this figure:

Figure 3. Natural products are found across most size lipophilicity combinations, as exemplified in a representative set designed and compiled by O’Hagan and Kell (64) superimposed on the Chrom log D_7.4 vs cmr training set of compounds with >30% bioavailability. (51) [It is unclear why this training set was restricted to compounds with >30% bioavailability. The LDF is shown in this figure with “Limits of confidence” but the level of confidence to which these limits correspond is not given.]

The first criticism that I’ll make is that the authors of Y2022 have not actually demonstrated the relevance of chemical space specified by the axes of Figure 3 (this is the essence of the third of the four points that I raised at the start of the post and the same criticism can be made of Figure 4 and Figure 5). The authors note, with some arm-waving, that cmr “largely correlates with MW” which does rather beg the question of why they consider this particular measure of molecular size to be superior to MW for this type of analysis. The authors claim that “the GSK model based on log D_7.4 vs calculated molar refraction” (it is actually molar refractivity as opposed to molar refraction that was calculated) is a useful guide to predict oral exposure. I consider this claim to be extravagant because one would need to have access to the proprietary model for calculation of Chrom Log D_7.4 in order to use the model. The proprietary nature of the GSK model means that predictions made using this model cannot credibly be presented as “evidence”.

Details of the models for calculating Chrom Log D_7.4 and for prediction of oral exposure are sketchy and I regard each of these proprietary models as undocumented. A linear discriminant function (LDF) model was reportedly used for prediction of oral exposure but it is unclear how the model was trained (or if it was even validated). An LDF is a classification model and it is not clear how the classes were defined for prediction of oral exposure. I’m assuming that the oral absorption classes used in the GSK oral exposure model have been defined by categorization of continuous data (I’m happy to be corrected on this point but, given the sketchiness of details, I can be forgiven for speculation) and setting thresholds like these is difficult to achieve in an objective manner. If this was indeed the case I'd assume that the threshold value used to categorize the continuous data was arbitrary (you’ll get a different LDF model if you use a different threshold to define the classes). My view is that an LDF is an inappropriate way to model this type of data because the categorization of the data discards a huge amount of information.

Here's the caption for Figure 4:

Figure 4. Proposed regions of size/lipophilicity space for an oral drug set, (51) using the effectual combination of Chrom Log D_7.4vs calculated molar refraction (cmr) as a description of chemical space. [It’s actually molar refractivity as opposed to molar refraction that was calculated. It is unclear what the authors mean by "bRo5 principles".] The highlighted regions suggest likely absorption mechanisms, based on ref (65) with compounds colored by binned NPScout probability scores. [The authors of Y2022 appear to be using a proprietary and undocumented LDF model of unknown predictivity to infer absorption mechanisms (this is what I was getting at in the fourth of the four points that I raised at the start of the post). The depiction of data shown in Figure 4 would be much more informative had compounds known (as opposed to believed) to be orally absorbed by one of these mechanisms been plotted in this chemical space.] Below the LDF line, then mean NPScout score is 0.45, (median 0.33) and above it (indicative of likely oral exposure) the mean is 0.31 and median 0.17 (p < 0.01) [It is unclear what (p < 0.01) refers to.]

Here's the caption for Figure 5:

Figure 5. Illustration of antibiotic drug space, expressed as Calculated Chrom Log D_7.4 vs cmr adapted from data in ref (65) colored by antibiotics (circles) and TB drugs (diamonds) which are sized by NP class probabilities and colored by prediction of likelihood of oral exposure (either side of the diagonal “linear discriminant function line” so to be oral, transporters a likely mechanism for the red colored compounds, which mostly have a high NPScout score). [As is the case for Figure 4, the authors of Y2022 appear to be using a proprietary and undocumented LDF model of unknown predictivity to infer absorption mechanisms. Stating that "mostly have a high NPScout score" is arm-waving.] Vertical (cmr < 8) and horizontal lines (Chrom Log D_7.4 < 2.5) together represent likely boundaries for paracellular absorption. [The basis (measured data or belief) for this assertion is unclear. The depiction of data shown in Figure 5 would have been more convincing had compounds known to be and known not to be absorbed by the paracellular route been plotted in this chemical space. While the problems of achieving good oral absorption for antibiotics should not be underestimated, I see getting compounds into cells as the bigger issue and in some cases the transporters cause active efflux (see R2021). The depiction of data shown in Figure 5 would have been much more informative had compounds known (as opposed to believed) to exhibit active influx and active efflux been plotted in this chemical space. Although Figure 5 is presented as a description of antibiotic drug space, the study (ref 65) on which Figure 5 is based is actually focused on antitubercular drug space (one of the challenges to discovery of antitubercular drugs is that Mycobacterium tuberculosis is an intracellular pathogen; see WL2012). One article that I recommend to all drug discovery scientists, especially those working on infectious diseases, is the SM2019 review on intracellular drug concentration.]

The authors suggest:

A logical extension of this hypothesis would be to consider recognition processes with natural molecules, which are likely to have discrete interactions with carrier proteins and therapeutic targets. [The authors do need to articulate what they mean by "discrete interactions" and why "natural molecules" are likely to have "discrete interactions" with carrier proteins and therapeutic targets.] Small molecule drugs are noted to be relatively promiscuous, so making interactions with several proteins is a likely event. (76) [This assertion is not supported by ref 76 which is actually a study of nuisance compounds, PAINS filters, and dark chemical matter in a proprietary compound collection. Promiscuity of a compound is typically defined by a count of the number of targets against which activity exceeds a specific threshold and promiscuity generally increases with the permissiveness of the activity threshold (it’s therefore meaningless to describe a compound as “promiscuous” without also stating the activity threshold). The activity threshold for the analysis reported in ref 76 is ≥ 50% inhibition at a concentration of 10 µM which is appropriate if you’re worried about assay interference but, in my view, is at least an order of magnitude too permissive if considering the possibility of off-target activity for a drug in vivo.] It similarly is logical to consider that a molecule made by a recognition process in a catalytic enzyme may also interact with another protein in a similar manner. (77) [This is not quite as logical as the authors would have us believe since enzymes catalyze reactions by stabilizing transition states. A high binding affinity of an enzyme for its reaction product would generally be expected to result in inhibition of the enzyme by the reaction product.]

Natural Product Fragments in Fragment-Based Drug Discovery

The authors note:

Fragment-based drug discovery (FBDD) can be employed to rapidly explore large areas of chemical space for starting points of molecular design. (91 | 92 | 93) However, most FBDD libraries are composed of privileged substructures of known synthetic drugs and drug candidates and populate already well-explored areas of chemical space, (94 | 95 | 96) [I do not consider refs 94-96 to support this assertion (none of these three articles has a fragment screening library design focus and the most recent one was published in 2007).] often through the use of fragments with high sp2-character. (97) Underexplored areas of chemical space can be rapidly explored by employing fragments derived from NPs that are already biologically prevalidated by evolution. [The authors appear to be suggesting that the physiological effects of natural products are due more to the fragments from which they have been constructed than to the way in which the fragments have been combined.]

Molecular recognition

The authors state:

That the embedded recognition of natural products for proteins correlates with recognition of the biosynthetic enzyme is an increasingly validated concept. (118 | 119 | 120) [I have no idea what “embedded recognition” means and I’m guessing that the authors might be in a similar position.] The biosynthetic imprint translates to recognition of other proteins using similar interactions. [As I’ve already noted, high binding affinity of a natural product for the enzyme that catalysed its formation would lead to inhibition of the enzyme.] For example, the analysis of protein structures of 38 biosynthetic enzymes gave 64 potential targets for 25 natural products. (121) [Concepts are usually validated with measured data and not by making predictions.]

Conclusions and Prospects for Future Development

The authors assert:

More natural molecules will increase quality through their inherently improved permeability and solubility; [At the risk of appearing pedantic, permeability and solubility are properties of compounds as opposed to molecules. That said, the authors appear to be treating “natural molecules” as occupying a distinct and contiguous region of chemical space by making this claim and it is unclear what the improvements will be relative to. The authors do not present any measured data for permeability or solubility to support their claim.] this is a case of investing time and effort in the early stages of drug discovery to reap rewards with improvements in the later stages through more predictability in trials (and thus a greater chance of success, where quality rather than speed demonstrably impacts (170)) [Many, including me, do indeed believe that investing time and effort in the early stages of drug discovery increases the chances of success in the later stages. However, I would challenge the assertion by the authors of Y2022 that ref 170 actually demonstrates this.] and more sustainable manufacturing methods driven by the transformative power of biocatalysis. (171)

So that concludes my review of Y2022 and thanks for staying with me. I'll leave you with a selfie here in Trinidad's Maraval Valley with my faithful canine companions BB and Coco providing much-needed leadership (a few minutes earlier I had patiently explained to them why ligand efficiency is complete bollocks).

Standard states and solution thermodynamics

2024-04-01T07:46:00.005+01:00


Readers of this blog know that, on more than one occasion, I have denounced the ligand efficiency metric as physically meaningless on the grounds that perception of efficiency varies with the concentration value that defines the standard state. As I argue in NoLE this is clearly thermodynamic nonsense (Pauli might even have suggested that it wasn’t even wrong) and the equivalent cheminformatic argument is that perception shouldn’t change when you use a different unit to express a quantity.
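The dependence of perceived efficiency on the standard concentration can be shown with a short calculation. The sketch below assumes the usual definition of ligand efficiency as −ΔG° divided by the number of non-hydrogen atoms, with ΔG° = RT ln(K_d/C°). The two compounds (a millimolar fragment and a 10 nM lead) and their atom counts are hypothetical.

```python
from math import log

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)

def ligand_efficiency(kd_molar, heavy_atoms, c_standard_molar=1.0,
                      temperature=300.0):
    """-dG/N_heavy in kcal/mol per atom, with dG = RT ln(Kd / C_standard)."""
    delta_g = R_KCAL * temperature * log(kd_molar / c_standard_molar)
    return -delta_g / heavy_atoms

# Hypothetical compounds: name -> (Kd in mol/L, number of non-hydrogen atoms)
compounds = {"fragment": (1e-3, 10), "lead": (1e-8, 30)}

for c_standard in (1.0, 1e-2):  # 1 M and 10 mM standard concentrations
    values = {name: ligand_efficiency(kd, n, c_standard)
              for name, (kd, n) in compounds.items()}
    best = max(values, key=values.get)
    print(f"C = {c_standard:g} M: "
          + ", ".join(f"{k} LE = {v:.2f}" for k, v in values.items())
          + f" -> most efficient: {best}")
```

With a 1 M standard concentration the fragment appears more efficient than the lead, and with a 10 mM standard concentration the order is reversed. The K_d values have not changed, only the arbitrary choice of the concentration used to define the standard state.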

A change in perception resulting from using a different standard concentration can also be a problem when analysing thermodynamic signatures. One particular absurdity is that binding can be switched from enthalpy-driven to entropy-driven simply by using a different concentration to define the standard state. This statement in the W2014 article unintentionally highlights the issue:

Consequently, we define the dimensionless ratio (ΔH + TΔS)/ΔG as the Enthalpy–Entropy Index (I_E–E) and use it here to indicate the enthalpy content of binding. Its advantageous feature is that it is normalised by the free energy ΔG (= ΔH – TΔS), and so it can be used to compare compounds with millimolar to nanomolar binding affinities during the course of a hit-to-lead optimisation.

I do indeed think that it makes a lot of sense to use (ΔH + TΔS) and ΔG as parameters for exploring thermodynamic signatures. However, the dimensionless ratio of the two quantities is physically meaningless because of its dependence on the concentration used to define the standard state (this dependence stems from the fact that ΔS depends on the standard concentration while ΔH is invariant to change in the standard concentration).
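The switch between enthalpy-driven and entropy-driven binding can be reproduced with a few lines of arithmetic. The sketch below assumes ΔG° = RT ln(K_d/C°) and TΔS° = ΔH − ΔG°, with ΔH independent of the standard concentration. The K_d of 1 µM and ΔH of −15 kJ/mol are hypothetical values chosen for illustration.

```python
from math import log

R_KJ = 8.314e-3  # gas constant in kJ/(mol K)

def thermodynamic_signature(kd_molar, delta_h, c_standard_molar,
                            temperature=298.15):
    """Return (dG, TdS, (dH + TdS)/dG) in kJ/mol for a given standard state."""
    delta_g = R_KJ * temperature * log(kd_molar / c_standard_molar)
    t_delta_s = delta_h - delta_g  # from dG = dH - TdS
    index = (delta_h + t_delta_s) / delta_g
    return delta_g, t_delta_s, index

kd = 1e-6        # hypothetical dissociation constant, mol/L
delta_h = -15.0  # hypothetical binding enthalpy, kJ/mol

for c_standard in (1.0, 1e-3):  # 1 M and 1 mM standard concentrations
    delta_g, t_delta_s, index = thermodynamic_signature(kd, delta_h, c_standard)
    driver = "entropy" if t_delta_s > abs(delta_h) else "enthalpy"
    print(f"C = {c_standard:g} M: dG = {delta_g:.1f}, dH = {delta_h:.1f}, "
          f"TdS = {t_delta_s:.1f} kJ/mol, index = {index:.2f} ({driver}-driven)")
```

For these inputs the favourable TΔS term exceeds the magnitude of ΔH when the standard concentration is 1 M but is much smaller than it when the standard concentration is 1 mM, and the index changes sign between the two. The measured quantities (K_d and ΔH) are identical in both cases.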

One article that I’ve been particularly critical of in the past is “The role of ligand efficiency metrics in drug discovery” NRDD 13:105-121 (2014) DOI. Specifically, I have expressed concerns about this sentence in Box 1 (Ligand efficiency metrics) of the article:

Assuming standard conditions of aqueous solution at 300K, neutral pH and remaining concentrations of 1M, –2.303RTlog(K_d/C°) approximates to –1.37 × log(K_d) kcal/mol.

I do need to mention a potential source of confusion when analysing K_d values. In biochemistry, biophysics and drug discovery K_d values are conventionally quoted as dimensioned quantities in units of concentration. However, K_d values may also be quoted as dimensionless ratios and, in these cases, the K_d value depends on the concentration used to define the standard state. There seems to be an error in that the approximation appears to eliminate the dimensions of the standard concentration C°.

I should say that I’ve always been a bit nervous about denouncing the approximation as an error because the authors are all renowned thought leaders in the drug discovery field. Furthermore, the journal impact factor of NRDD is a significant multiple of my underwhelming h-index and any error of such apparent grossness would surely have been detected during the rigorous peer review process applied by this elite journal. It turns out that my nervousness was indeed well placed and, when calculated at 300 K, the product RT actually serves as an annihilation operator that eliminates the dimensionality associated with K_d. This also explains why a temperature of 300 K must be used when calculating the ligand efficiency even though biochemical assays are usually run at human body temperature (310 K).

I became convinced of the validity of the above approximation recently after examining a manuscript by the world-renowned expert on tetrodotoxin pharmacology, Prof. Angelique Bouchard-Duvalier of the Port-au-Prince Institute of Biogerontology, who is currently on secondment to the Budapest Enthalpomics Group (BEG). The manuscript has not yet been made publicly available although I was able to access it with the help of my associate ‘Anastasia Nikolaeva’ (she decamped last year from Tel Aviv to Uzbekistan and, to Derek’s likely disapproval, is currently running an open access journal out of a van in Samarkand). There is no doubt that this genuinely disruptive study will comprehensively reshape the generative AI landscape, enabling drug discovery scientists, for the very first time, to rationally design novel clinical candidates using only gene sequences as input.

Prof. Bouchard-Duvalier’s seminal study clearly demonstrates that it is indeed possible to eliminate the need to define standard states for the thermodynamic analysis of liquid solutions, provided that the appropriate temperature is used. The math is truly formidable (my rudimentary understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the quadrupole-normalized polarizability tensor before applying the Barone-Samedi transformation, followed by hepatic eigenvalue extraction using the algorithm introduced by E. V. Tooms (a reclusive Baltimore resident better known for his research in analytic topology). ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which a beaming BEG director Prof. Kígyó Olaj explains that, “possibilities are limitless now that we have eliminated the standard state from solution thermodynamics and thereby consigned the tedious and needlessly restrictive Second Law to the dustbin of history."

Leadeth me unto Truth and delivereth me from those who have already found it

2024-03-27T06:02:00.026+00:00

A theory has only the alternative of being true or false.

A model has a third possibility: it may be true, but irrelevant.

With apologies to Manfred Eigen (1927 - 2019)

******************

[This post was updated on 25-Jun-2024]

I've just returned to Cheshire from the Caribbean and, to kick off blogging for 2024, I'll share a photo of the orchids at Berwick-on-Sea on the north coast of Trinidad.

Encountering words like “truth” and “beauty” (here's a good example) in the titles of scientific articles always sets off warning bells for me and I’ll kick off blogging for 2024 with a look at FM2024 (Structure is beauty, but not always truth) that was recently published in Cell (and has already been reviewed by Derek). The authors have highlighted important issues: we typically use single conformations of targets in design and the experimentally-determined structures used for design may differ substantially from the structures of targets as they exist in vivo. These points do need to be stressed given the expanding range of modalities being exploited by drug designers and the increasing use of AI/ML in drug design. That said, it’s my view that the authors have allowed themselves to become prisoners of their article’s title. Specifically, I see “beauty” as a complete red herring and suggest that it would have been much better to have discussed structure in terms of accuracy and relevance rather than truth. Here’s the abstract for FM2024:

Structural biology, as powerful as it is, can be misleading. We highlight four fundamental challenges: interpreting raw experimental data; accounting for motion; addressing the misleading nature of in vitro structures; and unraveling interactions between drugs and “anti-targets.” Overcoming these challenges will amplify the impact of structural biology on drug discovery.

I'll start by taking a look at the introduction and my view is that the authors do need to be much clearer about what they mean by “this hydrogen bond is better than that one” when using terms like “ground truth”. For example, we can infer that the geometry of one target-ligand hydrogen bond is closer to optimal than the geometry of another target-ligand hydrogen bond. However, the energetic cost of breaking a target-ligand hydrogen bond is not something that can generally be measured and, as noted in NoLE, the contribution of an intermolecular contact to affinity is not actually an experimental observable. Ligands associate with their targets (and anti-targets) in aqueous media and this means that intermolecular contacts, for example between polar and non-polar atoms, can destabilize the target-ligand complex without being inherently repulsive. What I’m getting at here is that structures of ligand-target complexes are relatively simple and well-defined entities within the broader context of drug discovery and yet it doesn’t appear useful to discuss them in terms of truth.

The remainder of the post follows the FM2024 section headings.

A structure is a model, not experimental reality

The term “structure” can have a number of different meanings in structure-based drug design. First, drug targets (and anti-targets) have structures that exist regardless of whether they have been experimentally determined. Second, models are built for drug targets by fitting nuclear coordinates to experimental data such as electron density (these are often referred to as experimental structures although they should strictly be called models because they are abstractions of the experimental data). Third, the structure could have been predicted using computational tools such as AlphaFold2 (here's an article, cited by FM2024, on why we still need experimentally-determined structures).

In the abstract the authors identify “interpreting raw experimental data” as one of “four fundamental challenges”. However, the actual focus of this section appears to be evaluation of predicted structures rather than interpretation of raw experimental data. While I’m sure that we can find better ways to interpret raw experimental data, and indeed to evaluate predicted structures, I don’t see either as representing a fundamental challenge.

Representing wiggling and jiggling is hard

My view is that it’s the ensemble of conformations rather than the wiggling and jiggling that we actually need to represent. Simulation of the wiggling and jiggling is one way to generate an ensemble of conformations but it’s not the only way (nor is it necessarily the best way). That said, it's a lot easier to sell protein motion to venture capitalists than it is to sell ensembles of conformations.

The authors state:

Analogous to how structure-based drug design is great for optimizing “surface complementarity” and electrostatics, future protein modeling approaches will unlock ensemble-based drug design with an ability to predictably tune new and important aspects of design, including entropic contributions [7] and residence times [8] of bound ligands.

The term “entropic contributions” does come across as arm-waving (especially in a drug design context) and my view is that entropy should be seen as an effect rather than a cause. Thermodynamic signatures for binding are certainly of scientific interest but I would argue that they are essentially irrelevant to drug design (it can be instructive to consider how patients might sense the benefits of enthalpically-driven drug binding). The case for increasing residence time might not be quite as solid as many believe it to be (see the F2018 study and this blog post).

In vitro can be deceiving

The authors identify “addressing the misleading nature of in vitro structures” as a fundamental challenge and they state:

While purifying a protein out of its cellular context can be enabling for in vitro drug discovery, it can also provide a false impression. Recombinant expression can lead to missing post-translational modifications (e.g., phosphorylation or glycosylation) that are critical to understanding the function of a protein.

To this I’d add that we often don’t use the full-length proteins in design and recombinant proteins may have been engineered to make them easier to crystallize or more robust for soaking experiments. Furthermore, target engagement may require the drug to interact with two or more proteins (see HC2017) which will probably be more amenable individually to structure determination than their complex. I fully agree that it is important for drug designers to be aware that the experimentally-determined structures that they're using differ from the structures of the targets as they exist in vivo. However, I don't believe that it makes any sense to talk about “the misleading nature of in vitro structures” (or indeed about “in vitro drug discovery”) because target structures are never experimentally determined in vivo and are only misleading to the extent that users overinterpret them. As a more general point users of experimental data do need to be very careful about describing the experimental data that they’re using as “misleading” or "deceiving".

When we use structures to represent targets the issue is much less about the truth of the structures that we’re using and much more about their relevance to the targets that we’re trying to represent. This is not just an issue for structural biology and we might, for example, use the catalytic domain of an enzyme as a model for the full-length protein when running biochemical assays. We have to make assumptions in these situations and we also need to check that these assumptions are reasonable. For example, we might examine the structure-activity relationship in a cell-based assay for consistency with the structure-activity relationship that we’ve observed in the enzyme inhibition assay. It's also worth pointing out that what we observe in cells is usually a coarse approximation to what actually happens in vivo and we can't even measure the intracellular concentration of a drug in vivo.

Drugs mingle with many different receptors

Drugs do indeed mingle with many receptors in vivo but it’s important to be aware that the consequences of this mingling depend on the drug concentration (a spatiotemporal quantity) at the site of action. Drug discovery scientists use the term exposure when talking about drug concentration at the site of action and one underappreciated challenge in drug design is that intracellular drug concentration cannot generally be measured in vivo (here’s an open access article that I recommend to everybody working in drug discovery). I argue in NoLE that controllability of exposure should be seen as a drug design objective although the current impossibility of measuring intracellular concentration means that we can only assess how effectively the objective has been achieved in an indirect manner. Alternatively, drug design can be seen in terms of minimization of the dose at which therapeutically beneficial effects can be observed.

One assumption often made in drug design is that the drug concentration at the site of action is equal to the unbound concentration in plasma and this assumption is referred to as the free drug hypothesis (FDH) although the term “free drug theory” is also used. The basis for the FDH is the assumption that the drug can move freely between plasma and the target compartment. In reality the drug concentration at the site of action will generally lag behind its unbound plasma concentration and the lag time is inversely related to the ease with which the drug permeates through the barriers which separate the target from the plasma. There are a couple of scenarios under which you can’t assume that the drug concentration in the target compartment will be the same as its unbound plasma concentration. The first of these is when active transport is significant and this is a scenario with which drug designers tackling targets within the central nervous system (CNS) are familiar. The second scenario is that there is an ionizable functional group (as is the case for amines) in the molecular structure of the drug and the pH at the site of action differs significantly from plasma pH (as is the case for lysosomes).
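The second scenario can be made a little more concrete with a back-of-the-envelope calculation. The sketch below assumes the simplest pH-partition picture: a monoprotic base, only the neutral form crossing the membrane, and the neutral form at the same concentration on both sides at steady state. The function name, the pKa of 9 and the lysosomal pH of 5 are illustrative assumptions rather than values taken from any of the articles discussed here.

```python
def ion_trapping_ratio(pka, ph_compartment, ph_plasma=7.4):
    """Ratio of total unbound concentration (compartment / plasma) for a
    monoprotic base under a simple pH-partition model.

    Assumptions (illustrative only): the neutral form alone permeates and
    its concentration is equal on both sides of the membrane at steady state.
    """
    # Henderson-Hasselbalch: [BH+]/[B] = 10**(pKa - pH) for a base
    total_in_compartment = 1 + 10 ** (pka - ph_compartment)
    total_in_plasma = 1 + 10 ** (pka - ph_plasma)
    return total_in_compartment / total_in_plasma


# Hypothetical amine (pKa 9) in a hypothetical acidic compartment (pH 5)
ratio = ion_trapping_ratio(pka=9.0, ph_compartment=5.0)
print(f"Predicted accumulation relative to plasma: {ratio:.0f}-fold")
```

Under these assumptions the amine is predicted to accumulate by a couple of orders of magnitude in the acidic compartment, which is why equating concentration at the site of action with unbound plasma concentration would not be safe in this situation. Real systems are messier (active transport, membrane binding and finite permeability of the charged form are all ignored here).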

There are two general types of undesirable outcome that can result when a drug encounters receptors with which it mingles. First, the receptor is an anti-target and the encounter results in binding of the drug, leading to toxicity (patients are harmed). Second, the receptor is a metabolic enzyme or a transporter and the encounter leads to the drug either being turned over or pumped from where it needs to function (patients do not benefit from the treatment).

I've inserted some comments (italicised in red) into the following quoted text:

The sad reality that all drug discoverers must face is that however well designed we may believe our compounds to be, they will find ways to interact with many other proteins or nucleic acids in the body and interfere with the normal functions of those biomolecules. While occasionally, the ability of a medicine to bind to multiple biomolecules will increase a drug’s efficacy, such polypharmacology is far more likely to produce undesirable effects. These undesirable outcomes take two forms. Obviously, the direct binding to an anti-target can lead to a bewildering range of toxicities, many of which render the drug too hazardous for any use. [While there are well-known anti-targets such as hERG that must be avoided, my understanding is that those responsible for drug safety generally prefer not to see any off-target activity given the difficulties in prediction of toxicity. Here are a couple of relevant articles (B2012 | J2020) and a link to some information about in vitro safety pharmacology profiling panels from Eurofins. Update 25-Jun-2024: recent review on secondary pharmacology.] More subtly, the binding to anti-targets reduces the ability of the drug to reach the desired target. A drug that largely avoids binding to anti-targets will partition more effectively through the body, enabling it to accumulate at high enough concentrations in the disease-relevant tissue to effectively modulate the function of the target. [I consider it unlikely that binding to an anti-target could account for a significant proportion of the dose. In any case, I’d expect binding of a drug to anti-targets to cause unacceptable toxicity long before it results in sequestration of a significant proportion of the dose.]

A particular challenge results from the interaction of drugs with the enzymes, transporters, channels, and receptors that are largely responsible for controlling the metabolism and pharmacokinetic properties (DMPK) of those drugs—their absorption, distribution, metabolism, and elimination. Drugs often bind to plasma proteins, preventing them from reaching the intended tissues; [A degree of binding to plasma proteins is not a problem and, in the case of warfarin, is probably essential for the safe use of the drug.] they can block or be substrates for all manner of pumps and transporters, changing their distribution through the body; [Transporters can indeed prevent drugs from getting to their sites of action at therapeutically effective concentrations and limited brain exposure resulting from active efflux is a common issue for CNS drug discovery programs (see H2012 and R2015). I am not aware of any transporters that are definitely considered to be anti-targets from the safety perspective (I'm happy to be corrected on this point) and inhibition of efflux pumps is a recognized tactic (see T2021 and H2020) in drug discovery. Update 25-Jun-2024: I thank Mohamed Diwan M. AbdulHameed (google scholar profile) for making me aware that inhibition of bile salt export pump (BSEP) is considered a risk factor for drug-induced liver injury (DILI). Here's a relevant article.] xenobiotic sensors such as PXR that turn on transcriptional programs recognizing foreign substances; and they often block enzymes like cytochrome P450s, thereby changing their own metabolism and that of other medicines. [Inhibition of CYPs is generally considered undesirable from the safety perspective because of the potential for drug-drug interactions (see H2020). That said, the CYP3A inhibitor ritonavir (see CG2003) is used in the COVID-19 treatment Paxlovid to slow metabolism of the SARS-CoV-2 main protease inhibitor nirmatrelvir.]
They are themselves substrates for P450s and other metabolizing enzymes and, once altered, can no longer carry out their assigned, life-saving function. [Medicinal chemists are well aware of the challenges presented by drug-metabolizing enzymes although it must be stressed that any drug that was cleared too slowly would be considered to be an unacceptable safety risk.]

Taken together, we refer to these DMPK-related proteins, somewhat tongue-in-cheek, as the “avoidome” (Figure 2). [It is unclear why the authors have chosen to only include DMPK-related proteins in the avoidome (hERG is not a DMPK-related protein but is an anti-target that every drug discovery scientist would wish to avoid blocking). For reasons outlined in the previous paragraph I would actually argue against the inclusion of DMPK-related proteins in the avoidome.] Unfortunately, the structures of the vast majority of avoidome targets have not yet been determined. Further, many of these proteins are complex machines that contain multiple domains and exhibit considerable structural dynamism. Their binding pockets can be quite large and promiscuous, favoring distinct binding modes for even closely related compounds. [It is not clear whether this assertion is based on experimental observations.] As a consequence, multiple structures spanning a range of bound ligands and protein conformational states will be required to fully understand how best to prevent drugs from engaging these problematic anti-targets.

We believe the structural biology community should “embrace the avoidome” with the same enthusiasm that structure-based design has been applied to intended targets. [My view is that the authors need to clearly articulate their reasons for only including DMPK-related proteins in the avoidome before seeking to direct the activities of the structural biology community. I presume that the Target 2035 initiative, which aims “to create by year 2035 chemogenomic libraries, chemical probes, and/or biological probes for the entire human proteome”, will also cover anti-targets. Having chemical and/or biological probes available for anti-targets should lead to better understanding of toxicity in humans.] The structures of these proteins will shed considerable light on human biology and represent exciting opportunities to demonstrate the power of cutting-edge structural techniques. [Experimental structures of target-ligand complexes do indeed provide valuable direct evidence that a ligand is binding to a protein but the structures themselves are not particularly informative from the perspective of understanding human biology. It is actually high-quality chemical probes that are needed to shed light on human biology and here’s a link to the Chemical Probes Portal. Structures at atomic resolution for protein-ligand complexes are certainly useful for chemical probe design but are not strictly necessary for effective use of chemical probes.] Crucially, a detailed understanding of the ways that drugs engage with avoidome targets would significantly expedite drug discovery. [Experimentally-determined structures of anti-targets complexed with ligands are certainly informative when elucidating structure-activity relationships for binding to anti-targets. However, structural information of this nature is much less directly useful for addressing problems such as metabolic lability and active efflux.]
This information holds the potential to achieve a profound impact on the discovery of new and enhanced medicines.

Conclusion

The authors assert:

In drug discovery, truth is a molecule that transforms the practice of medicine. [I disagree with this assertion. In drug discovery truth may also be a compound that, despite an excellent pharmacokinetic profile, chokes comprehensively in phase 2.]

It's been a long post and this is a good place to leave things. While the authors have raised some valid points I found the 'Drugs mingle with many different receptors' section to be rather confused and I don't think that the drug discovery and structural biology communities are in desperate need of yet another 'ome' word. I hope that my review of FM2024 will be useful for readers of the article while providing helpful feedback for the authors and for the Editors of Cell.

Chemical con artists foil drug discovery

2023-12-31T17:20:00.015+00:00

One piece of general advice that I offer to fellow scientists is to not let the fact that an article has been published in Nature (or any other ‘elite’ journal for that matter) cause you to switch off your critical thinking skills while reading it and the BW2014 article (Chemistry: Chemical con artists foil drug discovery) that I’ll be reviewing in this post is an excellent case in point. My main criticism of BW2014 is that the rhetoric is not supported by data and I’ve always seen the article as something of a propaganda piece.

One observation that I’ll make before starting my review of BW2014 is that what lawyers would call ‘standard of proof’ varies according to whether you’re saying something good about a compound or something bad. For example, I would expect a competent peer reviewer to insist on measured IC50 values if I had described compounds as inhibitors of an enzyme in a manuscript. However, it appears to be acceptable, even in top journals, to describe compounds as PAINS without having to provide any experimental evidence that they actually exhibit some type of nuisance behavior (let alone pan-assay interference). I see a tendency in the ‘compound quality’ field for opinions to be stated as facts and reading some of the relevant literature leaves me with the impression that some in the field have lost the ability to distinguish what they know from what they believe.

BW2014 has been heavily cited in the drug discovery literature (it was cited as the first reference in the ACS assay interference editorial which I reviewed in K2017) despite providing little in the way of practical advice for dealing with nuisance behavior. BW2014 appears to exert a particularly strong influence on the Chemical Probes Community having been cited by the A2015, BW2017, AW2022 and A2022 articles as well as in the Toxicophores and PAINS Alerts section of the Chemical Probes Portal. Given the commitment of the Chemical Probes Community to open science, their enthusiasm for the PAINS substructure model introduced in BH2010 (New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays) is somewhat perplexing since neither the assay data nor the associated chemical structures were disclosed. My advice to the Chemical Probes Community is to let go of PAINS filters.

Before discussing BW2014, I’ll say a bit about high-throughput screening (HTS) which emerged three decades ago as a lead discovery paradigm. From the early days of HTS it was clear, at least to those who were analyzing the output from the screens, that not every hit smelt of roses. Here’s what I wrote in K2017:

Although poor physicochemical properties were partially blamed (3) for the unattractive nature and promiscuous behavior of many HTS hits, it was also recognized that some of the problems were likely to be due to the presence of particular substructures in the molecular structures of offending compounds. In particular, medicinal chemists working up HTS results became wary of compounds whose molecular structures suggested reactivity, instability, accessible redox chemistry or strong absorption in the visible spectrum as well as solutions that were brightly colored. While it has always been relatively easy to opine that a molecular structure ‘looks ugly’, it is much more difficult to demonstrate that a compound is actually behaving badly in an assay.

It has long been recognized that it is prudent to treat frequent-hitters (compounds that hit in multiple assays) with caution when analysing HTS output. In K2017 I discussed two general types of behavior that can cause compounds to hit in multiple assays: Type 1 (assay result gives an incorrect indication of the extent to which the compound affects target function) and Type 2 (compound acts on target by undesirable mechanism of action (MoA)). Type 1 behavior is typically the result of interference with the assay read-out and the hits in question can be accurately described as ‘false positives’ because the effects on the target are not real. Type 1 behaviour should be regarded as a problem with the assay (rather than with the compound) and, provided that the activity of a compound has been established using a read-out for which interference is not a problem, interference with other read-outs is irrelevant. In contrast, Type 2 behavior should be regarded as a problem with the compound (rather than with the assay) and an undesirable MoA should always be a show-stopper.

Interference with read-out and undesirable MoAs can both cause compounds to hit in multiple assays. However, these two types of bad behavior can still cause big problems whether or not the compounds are observed to be frequent-hitters. Interference with read-out and undesirable MoAs are very different problems in drug discovery and the failure to recognize this point is a serious deficiency that is shared by BW2014 and BH2010.

Although I’ve criticized the use of PAINS filters there is no suggestion that compounds matching PAINS substructures are necessarily benign (many of the PAINS substructures look distinctly unwholesome to me). I have no problem whatsoever with people expressing opinions as to the suitability of compounds for screening provided that the opinions are not presented as facts. In my view the chemical con-artistry of PAINS filters is not that benign compounds have been denounced but the implication that PAINS filters are based on relevant experimental data.

Given that the PAINS filters form the basis of a cheminformatic model that is touted for prediction of pan-assay interference, one could be forgiven for thinking that the model had been trained using experimental observations of pan-assay interference. This is not so, however, and the data that form the basis of the PAINS filter model actually consist of the output of six assays that each use the AlphaScreen read-out. As noted in K2017, a panel of six assays using the same read-out would appear to be a suboptimal design of an experiment to observe pan assay interference. Putting this in perspective, P2006 (An Empirical Process for the Design of High-Throughput Screening Deck Filters) which was based on analysis of the output from 362 assays had actually been published four years before BH2010.

After a bit of a preamble, I need to get back to reviewing BW2014 and my view is that readers of the article who didn’t know better could easily conclude that drug discovery scientists were completely unaware of the problems associated with misleading HTS assay results before the re-branding of frequent-hitters as PAINS in BH2010. Given that M2003 had been published over a decade previously, I was rather surprised that BW2014 had not cited a single article about how colloidal aggregation can foil drug discovery. Furthermore, it had been known (see FS2006) for years before the publication of BH2010 that the importance of colloidal aggregation could be assessed by running assays in the presence of detergent.

I'll be commenting directly on the text of BW2014 for the remainder of the post (my comments are italicized in red).

Most PAINS function as reactive chemicals rather than discriminating drugs. [It is unclear here whether “PAINS” refers to compounds that have been shown by experiment to exhibit pan-assay interference or simply compounds that share structural features with compounds (chemical structures not disclosed) claimed to be frequent-hitters in the BH2010 assay panel. In any case, sweeping generalizations like this do need to be backed with evidence. I do not consider it valid to present observations of frequent-hitter behavior as evidence that compounds are functioning as reactive chemicals in assays.] They give false readouts in a variety of ways. Some are fluorescent or strongly coloured. In certain assays, they give a positive signal even when no protein is present. [The BW2014 authors appear to be confusing physical phenomena such as fluorescence with chemical reactivity.]

Some of the compounds that should ring the most warning bells are toxoflavin and polyhydroxylated natural phytochemicals such as curcumin, EGCG (epigallocatechin gallate), genistein and resveratrol. These, their analogues and similar natural products persist in being followed up as drug leads and used as ‘positive’ controls even though their promiscuous actions are well-documented (8,9). [Toxoflavin is not mentioned in either Ref8 or Ref9 although T2004 would have been a relevant reference for this compound. Ref8 only discusses curcumin and I do not consider that the article documents the promiscuous actions of this compound. Proper documentation of the promiscuity of a compound would require details of the targets that were hit, the targets that were not hit and the concentration(s) at which the compound was assayed. The effects of curcumin, EGCG (epigallocatechin gallate), genistein and resveratrol on four membrane proteins were reported in Ref9 and these effects would raise doubts about activity for any of these compounds (or their close structural analogs) that had been observed in a cell-based assay. However, I don’t consider that it would be valid to use the results given in Ref9 to cast doubt on biological activity measured in an assay that was not cell-based.]

Rhodanines exemplify the extent of the problem. [Rhodanines are specifically discussed in K2017 in which I suggest that the most plausible explanation for the frequent-hitter behavior observed for rhodanines in the BH2010 panel of six AlphaScreen assays is that the singly-connected sulfur reacts with singlet oxygen (this reactivity has been reported for compounds with thiocarbonyl groups in their molecular structures).] A literature search reveals 2,132 rhodanines reported as having biological activity in 410 papers, from some 290 organizations of which only 24 are commercial companies. [Consider what the literature search would have revealed if the target substructure had been ‘benzene ring’ rather than ‘rhodanine’. As discussed in this post the B2023 study presented the diversity of targets hit by compounds incorporating fused tetrahydroquinolines in their molecular structures as ‘evidence’ for pan-assay interference by compounds based on this scaffold.] The academic publications generally paint rhodanines as promising for therapeutic development. In a rare example of good practice, one of these publications (10) (by the drug company Bristol-Myers Squibb) warns researchers that these types of compound undergo light-induced reactions that irreversibly modify proteins. [The C2001 study (Photochemically enhanced binding of small molecules to the tumor necrosis factor receptor-1 inhibits the binding of TNF-α) is actually a more relevant reference since it focuses on the nature of the photochemically enhanced binding. The structure of the complex of TNFRc1 with one of the compounds studied (IV703; see graphic below) showed a covalent bond between one of the carbon atoms of the pendant nitrophenyl and the backbone amide nitrogen of A62. The structure of the IV703–TNFRc1 complex shows that a covalent bond between the pendant aromatic ring and the protein must also be considered as a distinct possibility for the rhodanines reported in Ref10 and C2001.]
It is hard to imagine how such a mechanism could be optimized to produce a drug or tool. Yet this paper is almost never cited by publications that assume that rhodanines are behaving in a drug-like manner. [It would be prudent to cite M2012 (Privileged Scaffolds or Promiscuous Binders: A Comparative Study on Rhodanines and Related Heterocycles in Medicinal Chemistry) if denouncing fellow drug discovery scientists for failure to cite Ref10.]

In a move partially implemented to help editors and manuscript reviewers to rid the literature of PAINS (among other things), the Journal of Medicinal Chemistry encourages the inclusion of computer-readable molecular structures in the supporting information of submitted manuscripts, easing the use of automated filters to identify compounds’ liabilities. [I would be extremely surprised if ridding the literature of PAINS was considered by the JMC Editors when they decided to implement a requirement that authors include computer-readable molecular structures in the supporting information of submitted manuscripts. In any case, claims such as this do need to be supported by evidence.] We encourage other journals to do the same. We also suggest that authors who have reported PAINS as potential tool compounds follow up their original reports with studies confirming the subversive action of these molecules. [I’ve always found this statement bizarre since the BW2014 authors appear to be suggesting that authors who have reported PAINS as potential tool compounds should confirm something that they have not observed and which may not even have occurred. When using the term “PAINS” do the BW2014 authors mean compounds that have actually been shown to exhibit pan-assay interference or compounds that share structural features with compounds that were claimed to exhibit frequent-hitter behavior in the BH2010 assay panel? Would interference with the AlphaScreen read-out by a singlet oxygen quencher be regarded as a subversive action by a molecule in situations when a read-out other than AlphaScreen had been used?] Labelling these compounds clearly should decrease futile attempts to optimize them and discourage chemical vendors from selling them to biologists as valid tools. [The real problem here is compounds being sold as tools in the absence of the measured data that is needed to support the use of the compounds for this purpose.
Matches with PAINS substructures would not rule out the use of a compound as a tool if the appropriate package of measured data is available. In contrast, a compound that does not match any PAINS substructures cannot be regarded as an acceptable tool if the appropriate package of measured data is not available. Put more bluntly, you’re hardly going to be able to generate the package of measured data if the compound is as bad as PAINS filter advocates say it is.]

Box: PAINS-proof drug discovery

Check the literature. [It’s always a good idea to check the literature but the failure of the BW2014 authors to cite a single colloidal aggregation article such as M2003 suggests that perhaps they should be following this advice rather than giving it. My view is that the literature on scavenging and quenching of singlet oxygen was treated in a cursory manner in BH2010 (see earlier comment in connection with rhodanines).] Search by both chemical similarity and substructure to see if a hit interacts with unrelated proteins or has been implicated in non-drug-like mechanisms. [Chemical similarity and substructure search will identify analogs of hits and it is actually the exact match structural search that you need to do in order to see if a particular compound is a hit in assays against unrelated proteins.] Online services such as SciFinder, Reaxys, BadApple or PubChem can assist in the check for compounds (or classes of compound) that are notorious for interfering with assays. [I generally recommend ChEMBL as a source of bioactivity data.]

Assess assays. For each hit, conduct at least one assay that detects activity with a different readout. [This will only detect problems associated with interference with read-out. As discussed in S2009 it may be possible to assess and even correct for interference with read-out without having to run an assay with a different read-out.] Be wary of compounds that do not show activity in both assays. If possible, assess binding directly, with a technique such as surface plasmon resonance. [SPR can also provide information about MoA since association, dissociation and stoichiometry can all be observed directly using this detection technology.]

That concludes blogging for 2023 and many thanks to anybody who has read any of the posts this year. For too many people Planet Earth is not a very nice place to be right now and my new year wish is for a kinder, happier and more peaceful world in 2024.

On quality criteria for covalent and degrader probes

2023-12-19T10:02:00.007+00:00

I’ll be taking a look at H2023 (Expanding Chemical Probe Space: Quality Criteria for Covalent and Degrader Probes) in this post and this article has also been discussed In The Pipeline. I’ll primarily be discussing the quality criteria for covalent probes in this post although I’ll also comment briefly on chemical matter criteria proposed for degrader probes. The post is intended as a contribution to the important scientific discussion that the H2023 Perspective is intended to jumpstart:

We are convinced that now is the time to initiate similar efforts to achieve a consensus about quality criteria for covalently acting and degrader probes. This Perspective is intended to jumpstart this important scientific discussion.

Covalent bond formation between ligands and targets is a drug design tactic for exploiting molecular recognition elements in targets that are difficult to make beneficial contacts with. Cysteine SH has minimal capacity to form hydrogen bonds with polar ligand atoms and the exposed nature of catalytic cysteine SH reduces its potential to make beneficial contacts with non-polar ligand atoms. One common misconception in drug discovery is that covalent bond formation between targets and ligands is necessarily irreversible and it wasn’t clear from my reading of H2023 whether the authors were aware that covalent bond formation between targets and ligands can also be reversible. In any case, it needed to be made clear that the quality criteria proposed by the authors for covalently acting small-molecule probes only apply to probes that act irreversibly.

Irreversible covalent bond formation is typically used to target non-catalytic residues and design is a lot more complicated than for reversible covalent bond formation. First, IC50 values are time-dependent (there are two activity parameters: affinity and inactivation rate constant) which makes it much more difficult to assess selectivity or to elucidate SAR. Second, the transition state structural models required for modelling inactivation cannot be determined experimentally and therefore need to be calculated using computationally intensive quantum mechanical methods.
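The time-dependence of IC50 can be illustrated with a minimal sketch of the standard two-step scheme for irreversible inhibition (rapid reversible binding followed by first-order inactivation). The sketch assumes that inhibitor is in large excess over enzyme, that no competing substrate is present during the incubation, and that remaining activity is read out immediately after the incubation. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def k_obs(conc, k_inact, K_i):
    """Pseudo-first-order inactivation rate at inhibitor concentration conc."""
    return k_inact * conc / (K_i + conc)


def fraction_active(conc, t, k_inact, K_i):
    """Fraction of enzyme not yet inactivated after incubation time t."""
    return math.exp(-k_obs(conc, k_inact, K_i) * t)


def apparent_ic50(t, k_inact, K_i):
    """Concentration leaving 50% of enzyme active after incubation time t.

    Solves exp(-k_obs * t) = 0.5 for conc; returns infinity when even
    saturating inhibitor cannot inactivate half the enzyme within t.
    """
    k_half = math.log(2) / t
    if k_half >= k_inact:
        return math.inf
    return K_i * k_half / (k_inact - k_half)


# Hypothetical probe: k_inact = 0.01 per second, K_i = 1 (micromolar)
for minutes in (10, 60):
    ic50 = apparent_ic50(minutes * 60, k_inact=0.01, K_i=1.0)
    print(f"{minutes:>3} min incubation: apparent IC50 = {ic50:.3f} uM")
```

The same compound gives a markedly lower apparent IC50 at the longer incubation time, which is why a single IC50 value says little about an irreversible inhibitor unless the incubation time is also reported. At short times the apparent IC50 in this model approaches ln(2) / (t × k_inact/K_i), so it is the ratio of the two parameters that is most easily extracted.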

I’ll start my review with a couple of general comments. Intracellular concentration is a factor that is not always fully appreciated in chemical biology and I generally recommend that people writing about chemical probes demonstrate awareness of SR2019 (Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise). On a more pedantic note I cautioned against using ‘molecule’ as a synonym for ‘compound’ in my review of S2023 (Systematic literature review reveals suboptimal use of chemical probes in cell-based biomedical research) and I suggest that “covalent molecule” might be something that you don't want to see in the text of an article in a chemistry journal.

However, significant efforts need to be invested into characterizing and validating covalent molecules as a prerequisite for conclusive use in biomedical research and target validation studies.

The proposed quality criteria for covalently acting small-molecule probes are given in Figure 2 of H2023 although I’ll be commenting on the text of the article. Subscripting doesn't work well in blogger and so I'll use K.i and k.inact respectively throughout the post to denote the inhibition constant and the first order inactivation rate constant.

I’ll start with Section 2.1 (Criteria for Assessing Potency of Covalent Probes) and my comments are italicised in red.

When working with irreversible covalent probes, it is important to consider that target inhibition is time-dependent and therefore IC50 values, while frequently used, are a suboptimal descriptor of potency. (21) Best practice is to use k.inact (the rate of inactivation) over K.i (the affinity for the target) values instead. (22) [I recommend that values of both k.inact and K.i be reported because this enables the extent of non-covalent target engagement by the chemical probe to be assessed. Regardless of whether binding to target is covalent or non-covalent, the concentration and affinity of substrates (as well as cofactors such as ATP) need to be properly accounted for when interpreting effects of chemical probes in cell-based assays. This is a significant issue for ATP-competitive kinase inhibitors (as discussed in my review of S2023) and I recommend this tweetorial from Keith Hornberger.]

As measurement of k.inact/K.i values can be labor-intensive (or in certain cases technically impossible), IC50 values (or target engagement TE50 values) are often reported for covalent leads and used to generate structure–activity relationships (SARs). [The labor-intensive nature of the measurements is not a valid justification for a failure to measure k.inact and K.i values for a covalent chemical probe.] Carefully designed biochemical assays used in determining IC50 values can be well-suited as surrogates for k.inact/K.i measurements. (24) [It is my understanding that the primary reason for doing this is to increase the throughput of irreversible inhibition assays for SAR optimization and I would generally be extremely wary of any IC50 value measured for an irreversible inhibitor if it had not been technically impossible to measure k.inact or K.i values for the inhibitor.]

2.2. Criteria for Assessing Covalent Probe Selectivity

We propose a selectivity factor of 30-fold in favor of the intended target of the probe compared to that of other family members or identified off-targets under comparable assay conditions. [The authors need to be clearer as to which measure of ‘activity’ they propose should be used for calculating the ratio and some justification for the ratio (why 30-fold rather than 50-fold or 25-fold?) should be given. Regardless of whether binding to target is covalent or non-covalent, the concentration and affinity of substrates (as well as cofactors such as ATP) need to be properly accounted for when assessing selectivity. It is not clear how the selectivity factor should be defined to quantify selectivity of an inhibitor that binds covalently to the target but non-covalently to off-targets. My comments on the THZ1 probe in my review of the S2023 study may be relevant.]

2.3. Chemical Matter Criteria for Covalent Probes

Ideally, the on-target activity of the covalent probe is not dominated by the reactive warhead, but the rest of the molecule provides a measurable reversible affinity for the intended target. [My view is that the reversible affinity of the probe should be greater than simply what is measurable and I suggest, with some liberal arm-waving, that a K.i cutoff of ~100 nM might be more useful (a K.i value of 10 μM is usually measurable provided that the inhibitor is adequately soluble in assay buffer).] Seeing SARs over 1–2 log units of activity resulting from core, substitution, and warhead changes is an important quality criterion for covalent probe molecules. [The authors need to be clearer about which ‘activity’ they are referring to (differences in K.i and k.inact values between compounds are likely to be greater than the corresponding differences in k.inact/K.i values). The criterion “SAR for covalent and non-covalent interactions” shown in Figure 2 is nonsensical.]

3.3. Chemical Matter Criteria for Degrader Probes

When selecting chemical degrader probes, it is recommended that a chemist critically assesses the chemical structure of the degrader for the presence of chemical groups that impart polypharmacology or interfere with assay read-outs (PAINs motifs). (78) [I certainly agree that chemists should critically assess chemical structures of probes and, if performing a critical assessment of this nature for a degrader probe, I would be taking a look in ChEMBL to see what’s known for structurally-related compounds. I consider the risk of discarding acceptable chemical matter on the basis of matches with PAINS substructures to be low although there’s a lot more to critical assessment of chemical structures than simply checking for matches against PAINS substructures. My view is that genuine promiscuity (as opposed to frequent hitter behavior resulting from interference with read-out) cannot generally be linked to chemical groups. As noted in K2017 the PAINS substructure model introduced in BH2010 was actually trained on the output of six AlphaScreen assays and the applicability domain of the model should be regarded as prediction of frequent-hitter behavior in this assay panel rather than interference with assay read-outs (that said the most plausible explanation for frequent-hitter behavior in the PAINS assay panel is interference with the AlphaScreen read-out by compounds that quench or react with singlet oxygen). My recommendation is that chemical matter criteria for chemical probes should be specified entirely in terms of measured data and the models used to select/screen potentially acceptable chemical matter should not be included in the chemical matter criteria.]

This is a good point to wrap up my contribution to the important scientific discussion that H2023 is intended to jumpstart. While some of what I've written might be seen as nitpicking please bear in mind that quality criteria for chemical probes need to be defined precisely in order to be useful to the chemical biology and medicinal chemistry communities.

Are fused tetrahydroquinolines interfering with your assay?

2023-12-06T18:42:00.002+00:00

I’ll be taking a look at B2023 (Fused Tetrahydroquinolines Are Interfering with Your Assay) in this post. The article has already been discussed in posts at Practical Fragments and In The Pipeline. In anticipation of the stock straw man counterarguments to my criticisms of PAINS filters, I must stress that there is absolutely no suggestion that compounds matching PAINS filters are necessarily benign. The authors have shown that fusion of cyclopentene at C3-C4 of the tetrahydroquinoline (THQ) ring system is associated with a risk of chemical instability and I consider this to be extremely useful information for anybody thinking about using this scaffold. However, the authors do also appear to be making a number of claims that are not supported by evidence and, in my view, have not demonstrated that the chemical instability leads to pan-assay interference or even frequent-hitter behavior.

The term ‘PAINS’ crops up frequently in B2023 (the authors even refer to “the PAINS concept” although I think that’s pushing things a bit) and I’ll start by saying something about two general types of nuisance behavior of compounds in assays and these points are discussed in more detail in K2017 (Comment on The Ecstasy and Agony of Assay Interference Compounds). From the perspective of screening libraries of compounds for biological activity, the two types of nuisance behavior are very different problems that need to be considered very differently. One criticism that can be made of both BH2010 (original PAINS study) and BW2014 (Chemical con artists foil drug discovery) is that neither study considers the differing implications for drug discovery of these two types of nuisance behavior.

The first type of nuisance behavior in assays is interference with assay read-out and when ‘activity’ in an assay is due to assay interference hits can accurately be described as ‘false positives’ (this should be seen as a problem with the assay rather than the compound). Interference with assay read-outs is certainly irksome when you’re analysing output from screens because you don’t know if the ‘activity’ is real or not. However, if you’re able to demonstrate genuine activity for a compound using an assay with a read-out for which interference is not an issue then interference with other assay read-outs is irrelevant and would not rule out the compound as a viable starting point for further investigation. Interference with assay read-outs generally increases with the concentration of the compound in the assay (this is why biophysical methods are often favored for screening fragments) and I’ll direct readers to a helpful article by former colleagues. It’s also worth noting that interference with read-out can also lead to false negatives.

The second type of nuisance behavior is that the compound acts on a target by an undesirable mechanism of action (MoA) and it is not accurate to describe hits behaving in this manner as ‘false positives’ because the effect on the target is real (this should be seen as a problem with the compound rather than the assay). In contrast to interference with read-out, an undesirable MoA is a show-stopper. An undesirable MoA with which many drug discovery scientists will be familiar is colloidal aggregate formation (see M2003) and the problem can be assessed by running the assay in the absence and presence of detergent (see FS2006). In some cases patterns in screening output may point to an undesirable MoA. For example, cysteine reactivity might be indicated by compounds hitting in multiple assays for inhibition of enzymes that feature cysteine in their catalytic mechanisms.

I’ll make some comments on PAINS filters before I discuss B2023 in detail and much of what I’ll be saying has already been said in K2017 and C2017 (Phantom PAINS: Problems with the Utility of Alerts for Pan-Assay INterference CompoundS) although you shouldn’t need to consult these articles in order to read the blog post unless you want to get some more detail. The PAINS filter model introduced in BH2010 consists of a number of substructures which are claimed (I say “claimed” because the assay results and associated chemical structures are proprietary) to be associated with frequent-hitter behavior in a panel of six assays that all use the AlphaScreen read-out (compounds that react with or quench singlet oxygen have the potential to interfere with this read-out). I argued in K2017 that six assays, all using the same read-out, do not constitute a credible basis for the design of an experiment to detect pan-assay interference. Put another way, the narrow scope of the data used to train the PAINS filter model restricts the applicability domain of this model to prediction of frequent-hitter behavior in these six assays. The BH2010 study does not appear to present a single example of a compound that has actually been demonstrated by experiment to exhibit pan-assay interference.

The B2023 study reports that tetrahydroquinolines (THQs) fused at C3-C4 with cyclopentene (1) are unstable. This is valuable information for anybody who may have the misfortune to be working with this particular scaffold and the observed instability implies that drug discovery scientists should also be extremely wary of any biological activity reported for compounds that incorporate this scaffold. Furthermore, the authors show that the instability can be linked to the presence of the carbon-carbon double bond in the ‘third ring’ since 2, the dihydro analog of 1, appears to be stable. I would certainly mention the chemical instability reported in B2023 if reviewing a manuscript that reported biological activity for compounds based on this scaffold. However, I would not mention that BH2010 has stated that the scaffold matches the anil_alk_ene (SLN: C[1]:C:C:C[4]:C(:C:@1)NCC[9]C@4C=CC@9 ) PAINS substructure because the nuisance behavior consists of hitting frequently in a six-assay panel of questionable relevance and the PAINS filters were based on analysis of proprietary data.

Although I wouldn’t have predicted the chemical instability reported for 1 by B2023, this scaffold is certainly not a structural feature that I would have taken into lead optimization with any enthusiasm (a hydrogen that is simultaneously benzylic and allylic does rather look like a free lunch for the CYPs). I would still be concerned about instability even if methylene groups were added to or deleted from the aliphatic parts of 1. I suspect that the electron-releasing nitrogen of 1 contributes to chemical instability although I don’t think that changing nitrogen for another atom type would eliminate the risk of chemical instability. Put another way, the instability observed for 1 should raise questions about the stability of a number of structurally-related scaffolds. Chemical instability is (or at least should be) a show-stopper in the context of drug discovery even if it doesn't lead to interference with assay read-out, an undesirable MoA or pan-assay interference.

I certainly consider the instability observed for 1 to be of interest and relevant to a number of structurally-related chemotypes. However, I have a number of concerns about B2023 and one specific criticism is that the authors use “tricyclic/fused THQ” throughout the text as a synonym for “tricyclic/fused THQ with a carbon-carbon double bond in the ‘third’ ring”. At best this is confusing and it could lead to groundless criticism, either publicly or in peer review, of a study that reported assay results for compounds based on the scaffold in 2. A more general point is that the authors make a number of claims that, in my view, are not adequately supported by evidence. I’ll start with the significance section and my comments are italicized in red:

Tricyclic tetrahydroquinolines (THQs) are a family of lesser studied pan-assay interference compounds (PAINS) [The authors need to provide specific examples of tricyclic THQs that have actually been shown to exhibit pan-assay interference to support this claim.] These compounds are found ubiquitously throughout commercial and academic small molecule screening libraries. [The authors do not appear to have presented evidence to support this claim and the presence of compounds in vendor catalogues does not prove that the compounds are actually being screened. In my view, the authors appear to be trying to ‘talk up’ the significance of their findings by making this statement.] Accordingly, they have been identified as hits in high-throughput screening campaigns for diverse protein targets. We demonstrate that fused THQs are reactive when stored in solution under standard laboratory conditions and caution investigators from investing additional resource into validating these nuisance compounds.

Continuing with the introduction

Fused tetrahydroquinolines (THQs) are frequent hitters in hit discovery campaigns. [In my view the authors have not presented sufficient evidence to support this statement and I don’t consider claims made in BH2010 for frequent-hitter behavior by compounds matching the anil_alk_ene PAINS substructure to be admissible as evidence simply because they are based on proprietary data. In any case the numbers of compounds matching the anil_alk_ene PAINS substructure and reported in BH2010 to hit in zero (17) or one (11) assays in the PAINS assay panel suggest that 28 compounds (of a total of 51 substructural matches) cannot be regarded as frequent-hitters in this assay panel.] Pan-assay interference compounds (PAINS) have been controversial in the recent literature. While some literature supports these as nuisance compounds, other papers describe PAINS as potentially valuable leads. (1 | 2 | 3 | 4) [The C2017 study referenced as 2 is actually a critique of PAINS filters and I’m assuming that the authors aren’t suggesting that it “supports these [PAINS] as nuisance compounds”. However, I would consider it a gross misrepresentation of C2017 to imply that the study describes “PAINS as potentially valuable leads”.] There have been descriptions of many different classes of PAINS that vary in their frequency of occurrence as hits in the screening literature. [In my view, the number of articles on PAINS appears to greatly exceed the number of compounds that have actually been shown to exhibit pan-assay interference.]

The number of papers that selected this scaffold during hit discovery campaigns from multiple chemical libraries supports the idea that fused THQs are frequent hitters. [Let’s take a closer look at what the authors are suggesting by considering a selection of compounds, each of which has a benzene ring in its molecular structure. Now let’s suppose that each of a large number of targets is hit by at least one of the compounds in this selection (I could easily satisfy this requirement by selecting marketed drugs with benzene rings in their molecular structures). Applying the same logic as the authors, I could use these observations to support the idea that compounds incorporating benzene rings in their molecular structures are frequent-hitters. In my view the B2023 study doesn’t appear to have presented a single example of a fused THQ that has actually been shown experimentally to exhibit frequent-hitter behavior. As mentioned earlier in this post less than half of the compounds matching the anil_alk_ene PAINS substructure that were evaluated in the BH2010 assay panel can be regarded as frequent-hitters.] At first glance, these compounds appear to be valid, optimizable hits, with reasonable physicochemical properties. Although micromolar and reproducible activity has been reported for multiple THQ analogues on many protein targets, hit-to-lead optimization programs aimed at improving the initial hits (Supporting Information (SI), Table S1) have resulted in no improvement in potency or no discernible structure–activity relationships (SAR) [Achieving increased potency and establishing SARs are certainly important objectives in hit-to-lead studies. However, assertions that hit-to-lead optimizations “have resulted in no improvement in potency or no discernible structure–activity relationships” do need to be supported with appropriate discussion of specific hit-to-lead optimization studies.]

Examples of Fused THQs as “Hits” Are Pervasive

The diversity of protein targets captured below supports the premise that the fused THQ scaffold does not yield specific hits for these proteins but that the reported activity is a result of pan-assay interference. [I could use an argument analogous to the one that I’ve just used for frequent-hitters to ‘prove’ that compounds with benzene rings in their molecular structure do not yield specific hits and that any reported activity is due to pan-assay interference. The authors do not appear to have presented a single example of a fused THQ that has been shown by experiment to exhibit pan-assay interference.]

Concluding remarks

Our review and evidence-based experiments solidify the idea that tricyclic THQs are nuisance compounds that cause pan-assay interference in the majority of screens rather than privileged structures worthy of chemical optimization. [While I certainly agree that chemical instability would constitute a nuisance, I would consider it wildly extravagant to claim that tricyclic THQs can “cause pan assay interference” since nobody appears to have actually observed pan-assay interference for even a single tricyclic THQ.] Their widespread micromolar activities on a broad range of proteins with diverse assay readouts support our assertion that they are unlikely to be valid hits. [As stated previously, I do not consider that “widespread micromolar activities on a broad range of proteins” observed for compounds that share a particular structural feature implies that all compounds with the particular structural feature are unlikely to be valid hits.]

So that concludes my review of the B2023 study. I really liked the experimental work that revealed the instability of 1 and linked it to the presence of the double bond in the 'third' ring. Furthermore, these experimental results would (at least for me) raise questions about the chemical stability of some scaffolds that are structurally-related to 1. However, I found the analysis of the bioactivity data reported in the literature for fused THQs to be unconvincing to the extent that it significantly weakened the B2023 study.

On the misuse of chemical probes

2023-11-19T17:36:00.009+00:00

It’s now time to get back to chemical probes and I’ll be taking a look at S2023 (Systematic literature review reveals suboptimal use of chemical probes in cell-based biomedical research) which has already been reviewed in blog posts from Practical Fragments, In The Pipeline and the Institute of Cancer Research. Readers of this blog are aware that PAINS filters usually crop up in posts on chemical probes but there are other things that I want to discuss and, in any case, references to PAINS in S2023 are minimal. Nevertheless, I’ll still stress that a substructural match of a chemical probe with a PAINS filter does not constitute a valid criticism of a chemical probe (it simply means that the chemical structure of the chemical probe shares structural features with compounds that have been claimed to exhibit frequent-hitter behaviour in a panel of six AlphaScreen assays) and one is more likely to encounter a bunyip than a compound that has actually been shown to exhibit pan-assay interference.

The authors of S2023 claim to have revealed “suboptimal use of chemical probes in cell-based biomedical research” and I’ll start by taking a look at the abstract (my annotations are italicised in red):

Chemical probes have reached a prominent role in biomedical research, but their impact is governed by experimental design. To gain insight into the use of chemical probes, we conducted a systematic review of 662 publications, understood here as primary research articles, employing eight different chemical probes in cell-based research. [A study such as S2023 that has been claimed by its authors to be systematic does need to say something about how the eight chemical probes were selected and why the literature for this particular selection of chemical probes should be regarded as representative of chemical probes literature in general.] We summarised (i) concentration(s) at which chemical probes were used in cell-based assays, (ii) inclusion of structurally matched target-inactive control compounds and (iii) orthogonal chemical probes. Here, we show that only 4% of analysed eligible publications used chemical probes within the recommended concentration range and included inactive compounds as well as orthogonal chemical probes. [I would argue that failure to use a chemical probe within a recommended concentration range is only a valid criticism if the basis for the recommendation is clearly articulated.] These findings indicate that the best practice with chemical probes is yet to be implemented in biomedical research. [My view is that the best practice with chemical probes is yet to be defined.] To achieve this, we propose ‘the rule of two’: At least two chemical probes (either orthogonal target-engaging probes, and/or a pair of a chemical probe and matched target-inactive compound) to be employed at recommended concentrations in every study. [The authors of S2023 do seem to be moving the goalposts since they’ve criticized studies for not using structurally matched target-inactive control compounds but are saying that using an additional orthogonal target-engaging probe makes it acceptable not to use a structurally matched target-inactive control compound.
This suggestion does appear to contradict the Chemical Probes Portal criteria for 'classical' modulators which do require the use of a control compound defined as having a "similar structure with similar physicochemistry, non-binding against target".]

The following sentence does set off a few warning bells for me:

The term ‘chemical probe’ distinguishes compounds used in basic and preclinical research from ‘drugs’ used in the clinic, from the terms ‘inhibitor’, ‘ligand’, ‘agonist’ or ‘antagonist’ which are molecules targeting a given protein but are insufficiently characterised, and also from the term ‘probes’ which is often referring to laboratory reagents for biophysical and imaging studies.

First, the terms 'compound' and 'molecule' are not interchangeable and I would generally recommend using 'compound' when talking about biological activity or affinity. A more serious problem is that the authors of S2023 seem to be getting into homeopathic territory by suggesting that chemical probes are not ligands and this might have caused Paul Ehrlich (who died 26 years before Kaiser Wilhelm II) to spit a few feathers. Drugs and chemical probes are ligands for their targets by virtue of binding to their targets (the term 'ligand' is derived from the Latin 'ligare' which means 'to bind' and a compound can be a ligand for one target without necessarily being a ligand for another target) while the terms 'inhibitor', 'agonist' and 'antagonist' specify the consequences of ligand binding. I was also concerned by the use of the term 'in cell concentration' in S2023 given that uncertainty in intracellular concentration is an issue when working with chemical probes (as well as in PK-PD modelling). Although my comments above could be seen as nit-picking these are not the kind of errors that authors can afford to make if they’re going to claim that their “findings indicate that the best practice with chemical probes is yet to be implemented in biomedical research”.

Let’s take a look at the criteria by which the authors of S2023 have assessed the use of chemical probes. They assert that “Even the most selective chemical probe will become non-selective if used at a high concentration” although I think it’d be more correct to state that the functional selectivity of a probe depends on binding affinity of the probe for target and anti-targets as well as the concentration of the probe (at its site of action). Selectivity also depends on the concentration of anything that binds competitively with the probe and, when assessing kinase selectivity, it can be argued that assays for ATP-competitive kinase inhibitors should be run at a typical intracellular ATP concentration (here’s a recent open access review on intracellular ATP concentration). The presence of serum in cell-based assays should also be considered when setting upper concentration limits since chemical probes may bind to serum proteins such as albumin which means that the concentration of a compound that is ‘seen’ by the cells is lower than the total concentration of the compound in the assay. In my experience binding to albumin tends to increase with lipophilicity and is also favored by the presence of an acidic group such as carboxylate in a molecular structure.
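The dependence of functional selectivity on ATP concentration and serum binding can be illustrated with a minimal sketch. It assumes reversible, purely ATP-competitive inhibition so that the Cheng-Prusoff relationship applies, and it treats serum binding with a single fraction unbound. The function names, the two hypothetical kinases and all numerical values (including a 1 mM intracellular ATP concentration) are assumptions made for illustration rather than measured data.

```python
def competitive_ic50(k_i, atp, km_atp):
    """Cheng-Prusoff IC50 for a reversible ATP-competitive inhibitor.

    k_i, atp and km_atp must be in consistent concentration units.
    """
    return k_i * (1.0 + atp / km_atp)

def free_concentration(total, fraction_unbound):
    """Concentration 'seen' by the cells when part of the compound is bound to serum proteins."""
    return total * fraction_unbound

if __name__ == "__main__":
    # Hypothetical kinases: K.i in nM, Km for ATP in uM (values are illustrative only)
    target = {"k_i": 1.0, "km_atp": 100.0}
    off_target = {"k_i": 10.0, "km_atp": 5.0}

    # Compare assays run at ATP = Km with an assumed cellular ATP level of 1000 uM
    for label, atp_t, atp_o in (
        ("ATP at Km", target["km_atp"], off_target["km_atp"]),
        ("ATP at 1 mM", 1000.0, 1000.0),
    ):
        ic50_t = competitive_ic50(target["k_i"], atp_t, target["km_atp"])
        ic50_o = competitive_ic50(off_target["k_i"], atp_o, off_target["km_atp"])
        print(f"{label}: target {ic50_t:.0f} nM, off-target {ic50_o:.0f} nM, "
              f"ratio {ic50_o / ic50_t:.0f}-fold")

    # A 1000 nM total concentration with an assumed 5% unbound in serum-containing medium
    print(f"free concentration: {free_concentration(1000.0, 0.05):.0f} nM")
```

In this made-up example the selectivity ratio measured with ATP at Km differs substantially from the ratio at the assumed cellular ATP concentration because the two kinases have different Km values for ATP. The direction and size of the shift depend entirely on the values chosen, which is the reason for wanting concentration thresholds to be derived transparently.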

I’m certainly not suggesting that chemical probes be used at excessive concentrations but if you’re going to criticise other scientists for exceeding concentration thresholds then, at the very least, you do need to show that the threshold values have been derived in an objective and transparent manner. My view is that it would not be valid to criticise studies publicly (or in peer review of submitted manuscripts) simply because the studies do not comply with recommendations made by the Chemical Probes Portal. It is significant that the recommendations from different groups of chemical probe experts with respect to the maximum concentration at which UNC1999 should be used differ by almost an order of magnitude:

As the recommended maximal in-cell concentration for UNC1999 varies between the Chemical Probes Portal and the Structural Genomics Consortium sites (400 nM and 3 μM, respectively), we analysed compliance with both concentrations.

One of the eight chemical probes featured in S2023 is THZ1 which is reported to bind covalently to CDK7 and the electrophilic warhead is acrylamide-based, suggesting that binding is irreversible. Chemical probes that form covalent bonds with their targets irreversibly need to be considered differently to chemical probes that engage their targets reversibly (see this article). Specifically, the degree of target engagement by a chemical probe that binds irreversibly depends on time as well as concentration (if you wait long enough then you’ll achieve 100% inhibition). This means that it’s not generally possible to quantify selectivity or to set concentration thresholds objectively for chemical probes that bind to their targets irreversibly. It’s not clear (at least to me) why an irreversible covalent inhibitor such as THZ1 was included as one of the eight chemical probes covered by the S2023 study so I checked to see what the Chemical Probes Portal had to say about THZ1 and something doesn’t look quite right. The on-target potency is given as a Kd (dissociation constant which is a measure of affinity) value of 3.2 nM and the potency assay is described as “time-dependent binding established supporting covalent mechanism”. However, Kd is a measure of affinity (and therefore not time-dependent) and my understanding is that it is generally difficult to measure Kd for irreversible covalent inhibitors which are typically characterized by kinact (inactivation rate constant) and Ki (inhibition constant) values obtained from analysis of enzyme inhibition data. The off-target potency of THZ1 is summarized as “KiNativ profiling against 246 kinases in Loucy cells was performed showing >75% inhibition at 1 uM of: MLK3, PIP4K2C, JNK1, JNK2, JNK3, MER, TBK1, IGF1R, NEK9, PCTAIRE2, and TBK1, but in vitro binding to off-target kinases was not time dependent indicating that inhibition was not via a covalent mechanism”.
The results from the assays used to measure on-target and off-target potency for THZ1 do not appear to be directly comparable.
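A minimal sketch shows why a single selectivity ratio is hard to define when on-target binding is irreversible and off-target binding is reversible. It assumes the two-step model for covalent inhibition (reversible binding with Ki followed by inactivation with rate constant kinact), equilibrium binding to the off-target, and no substrate competition or protein resynthesis. The function names and parameter values are hypothetical and are not values for THZ1.

```python
import math

def covalent_ec50(time_s, k_inact, K_i):
    """Concentration giving 50% covalent target engagement after time_s seconds.

    Returns infinity if 50% engagement cannot be reached in that time.
    """
    k_needed = math.log(2) / time_s
    if k_needed >= k_inact:
        return math.inf
    return K_i * k_needed / (k_inact - k_needed)

def apparent_selectivity(time_s, k_inact, K_i, off_target_kd):
    """Ratio of the off-target Kd (50% reversible occupancy at equilibrium)
    to the concentration giving 50% covalent on-target engagement at time_s."""
    return off_target_kd / covalent_ec50(time_s, k_inact, K_i)

if __name__ == "__main__":
    # Illustrative values only: k_inact in 1/s, K_i and Kd in micromolar
    k_inact, K_i, off_target_kd = 0.001, 1.0, 1.0
    for hours in (1, 4, 24):
        ratio = apparent_selectivity(hours * 3600, k_inact, K_i, off_target_kd)
        print(f"{hours:2d} h incubation: apparent selectivity {ratio:.1f}-fold")
```

With these assumed values the apparent selectivity grows with incubation time even though nothing about the compound has changed, so any selectivity figure or concentration threshold quoted for such a probe is only meaningful alongside the exposure time.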

It’s now time to wrap up and I suggest that it would not be valid to criticise (either publicly or in peer review) a study simply on the grounds that it reported results of experiments in which a chemical probe was used at a concentration exceeding a recommended maximum value. The S2023 authors assert that an additional orthogonal target-engaging probe can be substituted for a matched target-inactive control compound but this appears to contradict criteria for classical modulators given by the Chemical Probes Portal.

Five days in Vermont

2023-09-27T21:27:00.012+01:00

A couple of months ago I enjoyed a visit to the US (my first for eight years) on which I caught up with old friends before and after a few days in Vermont (where a trip to the golf course can rapidly become a National Geographic Moment). One highlight of the trip was randomly meeting my friend and fellow blogger Ash Jogalekar for the first time in real life (we’ve actually known each other for about fifteen years) on the Boston T Red Line. Following a couple of nights in green and leafy Belmont, I headed for the Flatlands with an old friend from my days in Minnesota for a Larry Miller group reunion outside Chicago before delivering a short harangue on polarity at Ripon College in Wisconsin. After the harangues, we enjoyed a number of most excellent Spotted Cattle (Only in Wisconsin) in Ripon. I discovered later that one of my Instagram friends is originally from nearby Green Lake and had taken classes at Ripon College while in high school. It is indeed a small world.

The five days spent discussing computer-aided drug design (CADD) in Vermont are what I’ll be covering in this post and I think it’s worth saying something about what drugs need to do in order to function safely. First, drugs need to have significant effects on therapeutic targets without having significant effects on anti-targets such as hERG or CYPs and, given the interest in new modalities, I’ll say “effects” rather than “affinity”, although Paul Ehrlich would have reminded us that drugs need to bind in order to exert effects. Second, drugs need to get to their targets at sufficiently high concentrations for their effects to be therapeutically significant (drug discovery scientists use the term ‘exposure’ when discussing drug concentration). Although it is sometimes believed that successful drugs simply reduce the numbers of patients suffering from symptoms it has been known from the days of Paracelsus that it is actually the dose that differentiates a drug from a poison.

Drug design is often said to be multi-objective in nature although the objectives are perhaps not as numerous as many believe (this point is discussed in the introduction section of NoLE, an article that I'd recommend to insomniacs everywhere). The first objective of drug design can be stated in terms of minimization of the concentration at which a therapeutically useful effect on the target is observed (this is typically the easiest objective to define since drug design is typically directed at specific targets). The second objective of drug design can be stated in analogous terms as maximization of the concentration at which toxic effects on the anti-targets are observed (this is a more difficult objective to define because we generally know less about the anti-targets than about the targets). The third objective of drug design is to achieve controllability of exposure (this is typically the most difficult objective to define because drug concentration is a dose-dependent, spatiotemporal quantity and intracellular concentration cannot generally be measured for drugs in vivo). Drug discovery scientists, especially those with backgrounds in computational chemistry and cheminformatics, don’t always appreciate the importance of controlling exposure and the uncertainty in intracellular concentration always makes for a good stock question for speakers and panels of experts.

I posted previously on artificial intelligence (AI) in drug design and I think it’s worth highlighting a couple of common misconceptions. The first misconception is that we just need to collect enough data and the drugs will magically condense out of the data cloud that has been generated (this belief appears to have a number of adherents in Silicon Valley). The second misconception is that drug design is merely an exercise in prediction when it should really be seen in a Design of Experiments framework. It’s also worth noting that genuinely categorical data are rare in drug design and my view is that many (most?) "global" machine learning (ML) models are actually ensembles of local models (this heretical view was expressed in a 2009 article and we were making the point that what appears to be an interpolation may actually be an extrapolation). Increasingly, ML is becoming seen as a panacea and it’s worth asking why quantitative structure activity relationship (QSAR) approaches never really made much of a splash in drug discovery.

I enjoyed catching up with old friends [ D | K | S | R/J | P/M ] as well as making some new ones [ G | B/R | L ]. However, I was disappointed that my beloved Onkel Hugo was not in attendance (I continue to be inspired by Onkel’s laser-like focus on the hydrogen bonding of the ester) and I hope that Onkel has finally forgiven me for asking (in 2008) if Austria was in Bavaria. There were many young people at the gathering in Vermont and their enthusiasm made me greatly optimistic for the future of CADD (I’m getting to the age at which it’s a relief not to be greeted with: "How nice to see you, I thought you were dead!"). Lots of energy at the posters (I learned from one that Voronoi was Ukrainian) although, if we’d been in Moscow, I’d have declined the refreshments and asked for a room on the ground floor (left photo below). Nevertheless, the bed that folded into the wall (centre and right photos below) provided plenty of potential for hotel room misadventure without the ‘helping hands’ of NKVD personnel.

It'd been four years since CADD had been discussed at this level in Vermont so it was no surprise to see COVID-19 on the agenda. The COVID-19 pandemic led to some very interesting developments including the Covid Moonshot (a very different way of doing drug discovery and one I was happy to contribute to during my 19-month sojourn in Trinidad) and, more tangibly, Nirmatrelvir (an antiviral medicine that has been used to treat COVID-19 infections since early 2022). Looking at the molecular structure of Nirmatrelvir you might have mistaken trifluoroacetyl for a protecting group but it’s actually an important feature (it appears to be beneficial from the permeability perspective). My view is that the alkane/water logP (alkane is a better model than octanol for the hydrocarbon core of a lipid bilayer) for a trifluoroacetamide is likely to be a couple of log units greater than for the corresponding acetamide.

I’ll take you through how the alkane/water logP difference between a trifluoroacetamide and corresponding acetamide can be estimated in some detail because I think this has some relevance to using AI in drug discovery (I tend to approach pKa prediction in an analogous manner). Rather than trying to build an ML model for making the prediction, I’ve simply made connections between measurements for three different physicochemical properties (alkane/water logP, hydrogen bond basicity and hydrogen bond acidity) which is something that could easily be accommodated within an AI framework. I should stress that this approach can only be used because it is a difference in alkane/water logP (as opposed to absolute values) that is being predicted and these physicochemical properties can plausibly be linked to substructures.

Let’s take a look at the triptych below which I admit is not quite up to the standards of Hieronymus Bosch (although I hope that you find it to be a little less disturbing). The first panel shows values of polarity (q) for some hydrogen bond acceptors and donors (you can find these in Tables 2 and 3 in K2022) that have been derived from alkane/water logP measurements. You could, for example, use these polarity values to predict that reducing the polarity of an amide carbonyl oxygen to the extent that it looks like a ketone will lead to a 2.2 log unit increase in alkane/water logP. The second panel shows measured hydrogen bond basicity values for three hydrogen bond acceptors (you can find these in this freely available dataset) and the values indicate that a trifluoroacetamide is an even weaker hydrogen bond acceptor than a ketone. Assuming a linear relationship between polarity and hydrogen bond basicity, we can estimate that the trifluoroacetamide carbonyl oxygen is 2.4 log units less polar than the corresponding acetamide. The final panel shows measured hydrogen bond acidity values (you can find these in Table 1 of K2022) that suggest that an imide NH (q = 1.3; 0.5 log units more polar than typical amide NH) will be slightly more polar than the trifluoroacetamide NH of Nirmatrelvir. So to estimate the difference in alkane/water logP values you just need to subtract the additional polarity of the trifluoroacetamide NH (0.5 log units) from the lower polarity of the trifluoroacetamide carbonyl oxygen (2.4 log units) to get 1.9 log units.
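For anybody who prefers to see the bookkeeping written out, here is a minimal Python sketch of the estimate. The function names are my own inventions for illustration and the two-point linear scaling is only one plausible reading of "assuming a linear relationship" (it is not a method taken from K2022). The only numbers used are the ones quoted in the paragraph above and no basicity values are hard-coded because none are quoted in the text.

```python
def scaled_polarity_drop(ref_drop, basicity_amide, basicity_ref, basicity_target):
    """Scale a known polarity drop (amide -> reference acceptor) to another acceptor.

    Assumes polarity changes linearly with hydrogen bond basicity, anchored on
    the amide and the reference acceptor (an illustrative assumption).
    """
    span = basicity_amide - basicity_ref
    if span == 0:
        raise ValueError("amide and reference acceptor need different basicities")
    return ref_drop * (basicity_amide - basicity_target) / span


def delta_alkane_water_logp(carbonyl_polarity_drop, nh_polarity_gain):
    """Estimated logP(trifluoroacetamide) - logP(acetamide) in log units."""
    return carbonyl_polarity_drop - nh_polarity_gain


# Values quoted in the text: carbonyl oxygen 2.4 log units less polar,
# NH 0.5 log units more polar.
estimate = delta_alkane_water_logp(carbonyl_polarity_drop=2.4, nh_polarity_gain=0.5)
print(f"Estimated increase in alkane/water logP: {estimate:.1f} log units")
```

The point of keeping the two steps separate is that each one can be traced back to a different set of measurements, which is rather different from asking a single ML model for the answer.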

Chemical space is a recurring theme in drug design and its vastness, which defies human comprehension, has inspired much navel-gazing over the years (it’s actually tangible chemical space that’s relevant to drug design). In drug discovery we need to be able to navigate chemical space (ideally without having to ingest huge quantities of Spice) and, given that Ukrainian chemists have revolutionized the world's idea of tangible chemical space (and have also made it a whole lot larger), it is most appropriate to have a Ukrainian guide who is most ably assisted by a trusty Transylvanian sidekick. I see benefits from considering molecular complexity more explicitly when mapping chemical space.

AI (as its evangelists keep telling us) is quite simply awesome at generating novel molecular structures although, as noted in a previous post, there’s a little bit more to drug design than simply generating novel molecular structures. Once you’ve generated a novel molecular structure you need to decide whether or not to synthesize the compound and, in AI-based drug design, molecular structures are often assessed using ML models for biological activity as well as absorption, distribution, metabolism and excretion (ADME) behaviour. It’s well-known that you need a lot of data for training these ML models but you also need to check that the compounds for which you’re making predictions lie within the chemical space occupied by the training set (one way to do this is to ensure that close structural analogs of these compounds exist in the training set) because you can’t be sure that the big data necessarily cover the regions of chemical space of interest to drug designers using the models. A panel discusses the pressing requirement for more data although ML modellers do need to be aware that there’s a huge difference between assembling data sets for benchmarking and covering chemical space at sufficiently high resolution to enable accurate prediction for arbitrary compounds.
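The check for close structural analogs in the training set can be sketched in a few lines of Python. This is a toy illustration rather than anybody's production method: fingerprints are represented as sets of 'on' bit indices, the bit indices are made up, and the 0.7 similarity threshold is an arbitrary assumption that you would need to justify for your own models.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two fingerprints held as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def nearest_neighbour_similarity(query_fp, training_fps):
    """Highest Tanimoto similarity between the query and any training compound."""
    return max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)


def in_domain(query_fp, training_fps, threshold=0.7):
    """True if at least one close analog of the query is in the training set."""
    return nearest_neighbour_similarity(query_fp, training_fps) >= threshold


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
training = [frozenset({1, 2, 3, 4, 5}), frozenset({2, 3, 4, 8, 9})]
close_analog = frozenset({1, 2, 3, 4, 5, 6})
outlier = frozenset({20, 21, 22})

print(in_domain(close_analog, training))  # a close analog is in the training set
print(in_domain(outlier, training))       # nothing similar in the training set
```

A prediction for the second compound might still turn out to be right but you would have no good reason to trust it.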

There are other ways to think about chemical space. For example, differences in biological activity and ADME-related properties can also be seen in terms of structural relationships between compounds. These structural relationships can be defined in terms of molecular similarity (Tanimoto coefficient for the molecular fingerprints of X and Y is 0.9) or substructure (X is the 3-chloro analog of Y). Many medicinal chemists think about structure-activity relationships (SARs) and structure-property relationships (SPRs) in terms of matched molecular pairs (MMPs: pairs of molecular structures that are linked by specific substructural relationships) and free energy perturbation (FEP) can also be seen in this framework. Strong nonadditivity and activity cliffs (large differences in activity observed for close structural analogs) are of considerable interest as SAR features in their own right and because prediction is so challenging (and therefore very useful for testing ML and physics-based models for biological activity). One reason that drug designers need to be aware of activity cliffs and nonadditivity in their project data is that these SAR features can potentially be exploited for selectivity.
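Nonadditivity is easy to quantify once you have activity measurements for all four corners of a double transformation cycle (parent, each single change, and both changes together). The sketch below is my own minimal illustration in Python and the pIC50 values are hypothetical.

```python
def mmp_delta(p_activity_from, p_activity_to):
    """Change in pIC50 (or pKd) across one matched molecular pair."""
    return p_activity_to - p_activity_from


def nonadditivity(p_parent, p_a, p_b, p_ab):
    """Nonadditivity for a double transformation cycle, in log units.

    Zero means the two structural changes contribute independently; a large
    absolute value means the effect of one change depends on the other.
    """
    return mmp_delta(p_b, p_ab) - mmp_delta(p_parent, p_a)


# Hypothetical pIC50 values for parent, two single changes and the double change.
cycle = {"parent": 5.0, "a": 6.0, "b": 5.5, "ab": 8.0}
value = nonadditivity(cycle["parent"], cycle["a"], cycle["b"], cycle["ab"])
print(f"Nonadditivity: {value:+.1f} log units")
```

In practice you would also want to compare the result with assay uncertainty before getting too excited about it.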

Cheminformatic approaches can also help you to decide how to synthesize the compounds that you (or your AI Overlords) have designed and automated synthetic route planning is a prerequisite for doing drug discovery in ‘self-driving’ laboratories. The key to success in cheminformatics is getting your data properly organized before starting analysis and the Open Reaction Database (ORD), an open-access schema and infrastructure for structuring and sharing organic reaction data, facilitates training of models. One area that I find very exciting is the use of high-throughput experimentation in the search for new synthetic reactions which can lead to better coverage of unexplored chemical space. It’s well known in industry that the process chemists typically synthesize compounds by routes that differ from those used by the medicinal chemists and data-driven multi-objective optimization of catalysts can lead to more efficient manufacturing processes (a higher conversion to the desired product also makes for a cleaner crude product).

It’s now time to wrap up what’s been a long post. Some of what is referred to as AI appears to already be useful in drug discovery (especially in the early stages) although non-AI computational inputs will continue to be significant for the foreseeable future. I see a need for cheminformatic thinking in drug discovery to shift from big data (global ML models) to focused data (generate project specific data efficiently for building local ML models) and also see advantages in using atom-based descriptors that are clearly linked to molecular interactions. One issue for data-driven approaches to prediction of biological activity such as ML and QSAR modelling is that the need for predictive capability is greatest when there's not much relevant data and this is a scenario under which physics-based approaches have an advantage. In my view, validation of ML models is not a solved problem since clustering in chemical space can cause validation procedures to make optimistic assessments of model quality. I continue to have significant concerns about how relationships (which are not necessarily linear) between descriptors are handled in ML modelling and remain generally skeptical of claims for interpretability of ML models (as noted in NoLE, the contribution of a protein–ligand contact to affinity is not, in general, an experimental observable).
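One common way to make validation less optimistic when compounds cluster in chemical space is to keep whole clusters together when splitting data into training and test sets. The Python sketch below is a minimal illustration of that idea under stated assumptions: the function name is hypothetical, the clustering itself (for example by scaffold or fingerprint similarity) is assumed to have been done elsewhere, and the cluster labels are made up.

```python
import random


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split compound indices so that no cluster spans both training and test sets.

    cluster_ids[i] is the cluster label of compound i.
    """
    clusters = {}
    for index, label in enumerate(cluster_ids):
        clusters.setdefault(label, []).append(index)

    shuffled = sorted(clusters)
    random.Random(seed).shuffle(shuffled)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for label in shuffled:
        # Fill the test set with whole clusters until it reaches the target size.
        (test if len(test) < target else train).extend(clusters[label])
    return sorted(train), sorted(test)


# Hypothetical cluster labels for ten compounds.
cluster_labels = ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3", "s4", "s5"]
train_idx, test_idx = cluster_split(cluster_labels, test_fraction=0.3)
print(train_idx, test_idx)
```

A random split of the same ten compounds would typically put close analogs on both sides of the divide, which is exactly the situation that flatters the model.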

Many thanks for staying with me to the end and I hope to see many of you at EuroQSAR in Barcelona next year. I'll leave you with a memory from the early days of chemical space navigation.

Blogger Meets Blogger

2023-07-26T16:29:00.004+01:00

Over the years I’ve had some cool random encounters (some years ago I bumped into a fellow member of the Macclesfield diving club in the village of Pai in the north of Thailand) but the latest is perhaps the most remarkable (even if it's not quite in the league of Safecracker Meets Safecracker in Surely You’re Joking). I was riding the Red Line on Boston’s T en route to Belmont from a conference in Vermont when my friend Ash Jogalekar, well known for The Curious Wavefunction blog, came over and introduced himself. Ash and I have actually known each other for about 15 years but we’d never before met in real life.

The odds against such an encounter would appear to be overwhelming since Ash lives in California while this was my first visit to the USA since 2015. I had also explored the possibility of getting a ride to Boston (some of those attending had driven to the conference from there) because the bus drops people off at the airport. Furthermore, I was masked on the T which made it more difficult for Ash to recognize me. However, I was carrying my poster tube (now re-purposed for the transport of unclean underwear) and, fortuitously, the label with my name was easy for Ash to spot. Naturally, we discussed the physics of ligand efficiency.

AI-based drug design?

2023-07-18T20:20:00.009+01:00

I’ll start this post by stressing that I’m certainly not anti-AI. I actually believe that drug design tools that are being described as AI-based are potentially very useful in drug discovery. For example, I’d expect natural language processing capability to enable drug discovery scientists to access relevant information without actually having to create database queries. I actually have a long-standing interest in automated molecular structure editing (see KS2005) and see the ability to build chemical structures in an automated manner using Generative AI as a potentially useful addition to the drug designer’s arsenal. Physical chemistry is very important in drug design and there are likely benefits to be had from building physicochemical awareness into the AI tools (one approach would be to use atom-based measures of interaction potential and I’ll direct you to some relevant articles: A1989 | K1994 | LB2000 | H2004 | L2009 | K2009 | L2011 | K2016 | K2022).

All that said, the AI field does appear to be associated with a degree of hype and a number of senior people in the drug discovery field seem to have voluntarily switched off their critical thinking skills (it might be a trifle harsh to invoke terms like “herding instinct” although doing so will give you a better idea of what I’m getting at). Trying to deal with the diverse hype of AI-based drug design in a single blog post is likely to send any blogger on a one-way trip to the funny farm so I’ll narrow the focus a bit. Specifically, I’ll be trying to understand the meaning of the term “AI-designed drug”.

The prompt for this post came from the publication of “Inside the nascent industry of AI-designed drugs” DOI in Nature Medicine and I don’t get the impression that the author of the article is too clued up on drug design:

Despite this challenge, the use of artificial intelligence (AI) and machine learning to understand drug targets better and synthesize chemical compounds to interact with them has not been easy to sell.

Apparently, AI is going to produce the drugs as well as design them:

“We expect this year to see some major advances in the number of molecules and approved drugs produced by generative AI methods that are moving forward”, Hopkins says.

I’d have enjoyed being a fly on the wall at this meeting although perhaps they should have been asking “why” rather than “how”:

“They said to me: Alex, these molecules look weird. Tell us how you did it”, Zhavaoronkov [sic] says. "We did something in chemistry that humans could not do.”

So what I think it means to claim that a drug has been “AI-designed” is that the chemical structure of the drug has been initially generated by a computer rather than a human (I’ll be very happy to be corrected on this point). Using computers to generate chemical structures is not exactly new and people were enumerating combinatorial libraries from synthetic building blocks over two decades ago (that’s not to deny that there has been considerable progress in the field of generating chemical structures). Merely conceiving a structure does not, however, constitute design and I’d question how accurate it would be to use the term “AI-designed” if structures generated by AI had subsequently been evaluated using non-AI methods such as free energy perturbation.

One piece of advice that I routinely offer to anybody seeking to transform or revolutionize drug discovery is to make sure that you understand what a drug needs to do. First, the drug needs to interact to a significant extent with one or more therapeutic targets (while not interacting with anti-targets such as hERG and CYPs) and this is why molecular interactions (see B2010 | P2015) are of great interest in medicinal chemistry. Second, the drug needs to get to its target(s) at a sufficiently high concentration (the term exposure is commonly used in drug discovery) in order to have therapeutically useful effects on the target(s). This means that achieving controllability of exposure should be seen as a key objective of drug design. One of the challenges facing drug designers is that it’s not generally possible to measure intracellular concentration for drugs in vivo and I recommend that AI/ML leaders and visionaries take a look at the SR2019 study.

Given that this post is focused on how AI generates chemical structures, I thought it might be an idea to look at how human chemists currently decide which compounds are to be synthesized. Drug design is incremental which reflects the (current) impossibility of accurately predicting the effects that a drug will have on a human body directly from its molecular structure. Once a target has been selected, compounds are screened for having a desired effect on the target and the compounds identified in the screening phase are usually referred to as hits.

The screening phase is followed by the hit-to-lead phase and it can be helpful to draw an analogy between drug discovery and what is called football outside the USA. It’s not generally possible to design a drug from screening output alone and to attempt to do so would be the equivalent of taking a shot at goal from the centre spot. Just as the midfielders try to move the ball closer to the opposition goal, the hit-to-lead team use the screening hits as starting points for design of higher affinity compounds. The main objective in the hit-to-lead phase is to generate information that can be used for design and mapping structure-activity relationships for the more interesting hits is a common activity in hit-to-lead work.

The most attractive lead series are optimized in the lead optimization phase. In addition to designing compounds with increased affinity, the lead optimization team will generally need to address specific issues such as inadequate oral absorption, metabolic liability and off-target activity. Each compound synthesized during the course of a lead optimization campaign is almost invariably a structural analog of a compound that had already been synthesized. Lead optimization tends to be less ‘generic’ than lead identification because the optimization path is shaped by these specific issues which implies that ML modelling is likely to be less applicable to lead optimization than to lead identification.

This post is all about how medicinal chemists decide which compounds get synthesized and these decisions are not made in a vacuum. The decisions made by lead optimization chemists are constrained by the leads identified by the hit-to-lead team just as the decisions made by lead identification chemists are constrained by the screening output. While AI methods can easily generate chemical structures, it's currently far from clear that AI methods can eliminate the need for humans to make decisions as to which compounds actually get synthesized.

This is a good point at which to wrap up. One error commonly made by people with an AI/ML focus is to consider drug design purely as an exercise in prediction while, in reality, drug design should be seen more in a Design of Experiments framework.

Archbishop Ussher's guide to efficient selection of development candidates

2023-06-08T19:37:00.008+01:00

One piece of advice I gave in NoLE is that “drug designers should not automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working” and the L2021 study that I’m reviewing in this post will give you a good idea of what I was getting at when I wrote that. I see a fair amount of relatively harmless “stamp collecting” in L2021 but there are also some rather less harmless errors of the type that you really shouldn’t be making if cheminformatics is your day job.

I’ll start the review of L2021 with annotation of the abstract:

"Physicochemical descriptors commonly used to define ‘drug-likeness’ and ligand efficiency measures are assessed for their ability to differentiate marketed drugs from compounds reported to bind to their efficacious target or targets. [I would argue that differentiating an existing drug from existing compounds that bind to the same target is not something that medicinal chemists need to be able to do. It is also incorrect to describe efficiency metrics such as LE and LLE as physicochemical descriptors because they are derived from biological activity measurements such as binding affinity or potency.] Using ChEMBL version 26, a data set of 643 drugs acting on 271 targets was assembled, comprising 1104 drug−target pairs having ≥100 published compounds per target. Taking into account changes in their physicochemical properties over time, drugs are analyzed according to their target class, therapy area, and route of administration. Recent drugs, approved in 2010−2020, display no overall differences in molecular weight, lipophilicity, hydrogen bonding, or polar surface area from their target comparator compounds. Drugs are differentiated from target comparators by higher potency, ligand efficiency (LE), lipophilic ligand efficiency (LLE), and lower carboaromaticity. [I may be missing something but stating that drugs tend to differ in potency from non-drugs that hit the same targets does rather seem to be stating the obvious. The same point can also be made about efficiency metrics such as LE and LLE since these are derived, respectively, by scaling potency with respect to molecular size and offsetting potency with respect to lipophicity (LLE).] Overall, 96% of drugs have LE or LLE values, or both, greater than the median values of their target comparator compounds.” [What is the corresponding figure for potency?]

I must admit to never having been a fan of drug-likeness studies such as L2021 (when I first encountered analyses of time dependency of drug properties about 20 years ago I was left with an impression that some senior medicinal chemists had a bit too much time on their hands) and it is now ten years since the term "Ro5 envy" was introduced in a notorious JCAMD article. My view is that the data analysis presented in L2021 has minimal relevance to drug discovery so I’ll be saying rather less about the data analysis than I’d have done had J Med Chem asked me to review the study.

The L2021 study examines property differences between marketed drugs and compounds reported to bind to efficacious target(s) of each drug. Specifically, the property differences are quantified by the difference between the value of the property for the drug and the median of the values of the property for the target comparator compounds. You really do need to account for the spread in the distribution if you’re going to interpret property differences like these (a large difference in values of a property for the drug and the median property for the target may simply reflect a wide spread in the property distribution for the target). However, I would argue that a more sensible starting point for analysis like this would be to locate (e.g., as a percentile) the value of each drug property within the corresponding property distribution for the target comparator compounds.
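To show why locating a drug as a percentile tells you more than an offset from the median, here is a small Python sketch. The function is a simple 'at or below' percentile rank (other definitions exist) and the logP values are hypothetical rather than taken from L2021 or ChEMBL.

```python
def percentile_rank(value, comparator_values):
    """Percentage of comparator values that are less than or equal to value."""
    if not comparator_values:
        raise ValueError("need at least one comparator value")
    at_or_below = sum(1 for v in comparator_values if v <= value)
    return 100.0 * at_or_below / len(comparator_values)


# Hypothetical logP values for compounds reported against two targets.
narrow = [2.8, 2.9, 3.0, 3.1, 3.2]
wide = [0.5, 1.5, 3.0, 4.5, 5.5]
drug_logp = 4.0  # same 1.0 unit offset from the median (3.0) in both cases

print(percentile_rank(drug_logp, narrow))  # beyond every comparator
print(percentile_rank(drug_logp, wide))    # well inside the distribution
```

The offset from the median is identical for the two targets but the drug is an outlier for one target and unremarkable for the other.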

Let’s take a look now at how the authors of L2021 suggest their study be used.

“This study, like all those looking at marketed drug properties, is necessarily retrospective. Nevertheless, those small molecule drug properties that show consistent differentiation from their target compounds over time, namely, potency, ligand efficiencies (LE and LLE), and the aromatic ring count and lipophilicity of carboaromatic drugs, are those that are most likely to remain future-proof. Candidate drugs emerging from target-based discovery programs should ideally have one, or preferably both, of their LE and LLE values greater than the median value for all other compounds known to be acting at the target.”

I would argue that the L2021 study has absolutely no relevance whatsoever to the selection of compounds for development since the team will have data available that enables them to rule out the vast majority of the project compounds for nomination. A discovery team nominating a compound for development will have achieved a number of challenging objectives (including potency against target and in one or more cell-based assays) and the likely response of team members to a suggestion that they calculate medians for LE and LLE for comparison with nomination candidate(s) is bemused eye-rolling. In general, a discovery team nominating a development candidate has access to a lot of unpublished potency measurements (which won’t be in ChEMBL) and it’s usually a safe assumption that the development candidate will be selected from the most potent compounds (LE and LLE values for these compounds are also likely to be above average). In the extremely unlikely event that the discovery team nominates a compound with LE or LLE values below the magic median values then you can be confident that the decision has been based on examination of measured data (consider the likelihood of the discovery team members acting on a suggestion that they should pick another compound with LE or LLE value above the magic median values because doing so will increase the probability of success in clinical development).

At the start of the post, I did mention some errors that you don’t want to be making if cheminformatics is your day job and regular readers of this blog will have already guessed that I’m talking about ligand efficiency (LE). I should point out that the problem is with the ligand efficiency metric and not the ligand efficiency concept which is both scientifically sound and useful, especially in fragment-based design where molecular size often increases significantly in the hit-to-lead phase.

The problem with the LE metric is that perception of efficiency changes when you express affinity (or potency) using a different unit and this is shown clearly in Table 1 in NoLE. Expressing a quantity using a different unit doesn’t change the quantity so any change in perception is clearly physical nonsense. That’s why I appropriate a criticism (it’s not even wrong) usually attributed to Pauli when taking gratuitous pot shots at the LE metric. The change in perception is also cheminformatic nonsense and that’s why it’s rather unwise to use the LE metric if cheminformatics is your day job. L2021 does cite NoLE but simply notes the LE metric’s “scientific basis and application have provoked a literature debate”.
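The unit dependence is easy to demonstrate for yourself. The Python sketch below is my own illustration with three hypothetical compounds (these are not the values from Table 1 in NoLE) and it works in log units: multiplying by RT ln(10) would convert to the usual energy units without changing any of the rankings. The function and parameter names are assumptions made for the example.

```python
import math


def ligand_efficiency(pkd_molar, heavy_atoms, c_ref_molar=1.0):
    """Scaled affinity per heavy atom, in log units.

    pkd_molar is -log10(Kd / 1 M); c_ref_molar is the concentration used to
    make Kd dimensionless, so the numerator is -log10(Kd / c_ref).
    """
    return (pkd_molar + math.log10(c_ref_molar)) / heavy_atoms


# Hypothetical compounds: (name, pKd with Kd in M, heavy atom count).
compounds = [("fragment", 3.0, 10), ("lead", 6.0, 20), ("candidate", 9.0, 30)]

for c_ref in (0.1, 1.0, 10.0):
    values = {name: ligand_efficiency(pkd, n, c_ref) for name, pkd, n in compounds}
    summary = ", ".join(f"{name}={le:.3f}" for name, le in values.items())
    print(f"reference concentration {c_ref} M: {summary}")
```

With a 1 M reference the three compounds look equally efficient, with 0.1 M the largest compound looks the most efficient, and with 10 M the smallest compound does, even though nothing about the compounds has changed.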

The L2021 study asserts that “the absolute LE value of a drug candidate is less important” but the problem is that even differences in LE change when you express affinity (or potency) using a different concentration unit. This is shown in Table 2 in NoLE and the problem is that there is no objective way to select a particular concentration unit as ‘better’ than all the other concentration units. To conclude, can we say that a medicinal chemistry leader’s choice of concentration unit (1 M) is any better (or any worse) than that of Archbishop Ussher (4.004 μM)?

A clear demonstration of the benefits of long residence time

2023-04-01T06:43:00.009+01:00

Residence time is a well-established concept in drug discovery and the belief that off-rate is more important than affinity has many adherents in both academia and industry. The concept has been articulated as follows in a Nature Reviews Drug Discovery article:

“Biochemical and cellular assays of drug interactions with their target macromolecules have traditionally been based on measures of drug–target binding affinity under thermodynamic equilibrium conditions. Equilibrium binding metrics such as the half-maximal inhibitory concentration (IC50), the effector concentration for half-maximal response (EC50), the equilibrium dissociation constant (Kd) and the inhibition constant (Ki), all pertain to in vitro assays run under closed system conditions, in which the drug molecule and target are present at invariant concentrations throughout the time course of the experiment [1 | 2 | 3 | 4 | 5]. However, in living organisms, the concentration of drug available for interaction with a localized target macromolecule is in constant flux because of various physiological processes.”

I used to be highly skeptical about the argument that equilibrium binding metrics are not relevant in open systems in which the drug concentration varies with time. The key question for me was always how the rate of change in the drug concentration compares with the rate of binding/unbinding (if the former is slower than the latter then the openness of the in vivo system would seem to be irrelevant). I also used to wonder why an equilibrium binding measurement made in an open system (e.g., Kd from isothermal titration calorimetry) should necessarily be more relevant to the in vivo system than an equilibrium binding measurement made in a series of closed systems (e.g., Ki from an enzyme inhibition assay). Nevertheless, I always needed to balance my concerns against the stark reality that the journal impact factor of Nature Reviews Drug Discovery is a multiple of my underwhelming h-index.

Any residual doubts about the relevance of residence time completely vanished recently after I examined a manuscript by Prof Maxime de Monne of the Port-au-Prince Institute of Biogerontology who is currently on secondment to the Budapest Enthalpomics Group (BEG). The manuscript has not yet been made publicly available although, with the help of my associate ‘Anastasia Nikolaeva’ in Tel Aviv, I was able to access it and there is no doubt that this genuinely disruptive study will forever change how we use AI to discover new medicines.

Prof de Monne’s study clearly demonstrates that it is possible to manipulate off-rate independently of on-rate and dissociation constant, provided that binding is enthalpically-driven to a sufficient degree. The underlying mechanism is back-propagation of the binding entropy deficit along the reaction coordinate to the transition state region where the resulting unidirectional conformational changes serve to suppress dissociation of the ligand. The math is truly formidable (my rudimentary understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the polarizability tensor before applying the Barone-Samedi transformation for hepatic eigenvalue extraction. ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which a beaming BEG director Prof Kígyó Olaj explains, “Possibilities are limitless now that we have consigned the tedious and needlessly restrictive Principle of Microscopic Reversibility to the dustbin of history".