Tuesday, 21 April 2026

Comparing ML models in small molecule drug discovery

To start the post I'll share a photo that I took in 2012 of incense sticks at the Truc Lam pagoda near Da Lat. Not long after taking this photo I lost a lens cap (although thankfully not the lens) riding a luge through a forest and would later visit a cricket farm (this was particularly welcome because I had developed a taste for fried crickets during a visit to Cambodia in 2005).

  

I’ll be reviewing A2025 (Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery) in this post. I consider the issues addressed by the authors to be extremely important and I think that the credibility of the Machine Learning (ML) field would be greatly enhanced if Editors declared words like 'outperform' to be verboten in manuscripts submitted to their journals. However, I will make a couple of criticisms of the study. First, ML modellers need to properly account for the number of adjustable parameters used to fit training data (the S2006 study goes further than this by arguing that one should also account for the size of the descriptor pool). Second, ML modellers need to recognize that cross-validation can make optimistic assessments of model quality when there is a high degree of clustering in training data. I’ll point you toward earlier Molecular Design blog posts (Sep2024 | Oct2024 | Jul2025) that may be relevant to the discussion. As is usual for posts here at Molecular Design, quoted text is indented with my comments italicised in red.

The ML models that form the focus of the A2025 study aim to predict properties (more generally behaviour) of compounds from their chemical structures. Although there is currently a lot of hype around ML models for drug discovery it’s worth bearing in mind that people have been building quantitative structure-activity/property (QSAR/QSPR) models for decades (the inaugural EuroQSAR conference was held in Prague a mere five years after Czechoslovakia had been invaded by forces from the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). As I see it QSAR/QSPR approaches never really made much of a splash in real world drug discovery and my challenge to those who tout ML models as a panacea for the ills of Pharma/Biotech would be to ask why they think it’s going to be any different this time.

One of the difficulties that QSAR/QSPR practitioners faced when working within drug discovery project teams was that projects had often delivered (or had been put out of their misery) by the time there was enough data to build predictively useful models. It’s also worth pointing out that drug discovery teams have frequently delivered (and continue to deliver) clinical development candidates without ever having sufficient data for building usefully predictive QSAR/QSPR models. Something that many QSAR/QSPR practitioners never seemed to get is that much drug design is actually hypothesis-driven (I discussed this point 16 years ago in K2009 and I’ll point you to the P2012 article by former colleagues). A significant part of hypothesis-driven drug design is identification of exploitable features in structure activity/property relationships (SARs/SPRs) such as activity cliffs and instances of increased polarity not resulting in loss of potency. A simple plot of potency against lipophilicity might not be predictively useful but it can still be used to quantify the extent to which the potency of a compound beats the trend in the data (see ‘Alternatives to ligand efficiency for normalization of affinity’ section in NoLE). My view is that hypothesis-driven drug design actually fits very naturally into an AI framework and those who tout AI as a drug design panacea appear to be missing a trick by seeing drug design as essentially an exercise in prediction.

Many of the properties of compounds of interest to ML modellers in drug discovery can be modelled as if they are equilibrium constants or rate constants (continuous-valued, dimensioned quantities) and typically fall into three general categories: 

  1. In vitro bioactivity is usually quantified in terms of potency (the concentration at which a compound exhibits a specified effect in a bioactivity assay) and, despite the views expressed in a rather bizarre JMC Editorial (a recent JMC Perspective provides a useful counterview and this blog post is also relevant), is the most important of the properties because you can’t compensate for inadequate potency by increasing the quality of compounds or by making them more beautiful (see B2012) and I touch on this point in a recent blog post. It is important that ML modellers be aware that for some ‘new’ modalities such as irreversible covalent inhibition and targeted protein degradation the effect of a compound on the target depends on time as well as concentration. I discuss some of the issues that you need to think about when combining potency and affinity data for ML modelling of bioactivity in this blog post.
  2. Properties considered to be relevant to ADME (absorption, distribution, metabolism, and excretion) include lipophilicity, aqueous solubility, permeability (both passive and active efflux) and plasma protein binding. While these properties are often described collectively as a compound's 'ADME profile' it's not actually accurate to do so because the ADME acronym refers to behaviour of compounds in vivo. Lipophilicity is the single most fundamental physicochemical property in drug design and it’s very important that ML modellers be aware that it's log D, rather than log P, that is measured and that the choice of octanol/water for log D measurement is entirely arbitrary.
  3. Toxicity is typically assessed by measuring potency against anti-targets such as hERG and CYPs and cell-based assays are often used for assessment of toxicity. Generally it is more difficult to find suitable assay data for ML modelling of toxicity than is the case for modelling bioactivity against potential therapeutic targets. One reason for this is that responses in the cell-based assays commonly used to assess toxicity can't generally be linked to engagement of specific anti-targets (this is not to deny the value of the information provided by the assays for decision-making by drug discovery scientists). Furthermore, observations of potency in toxicity assays are likely to steer project teams away from the associated chemotypes and so it is very unlikely that ML modellers will encounter datasets for individual structural series with sufficient variance for building models.      

When modelling properties of compounds that you believe to be relevant to small molecule drug discovery it’s important to bear in mind that even with a complete set of measured properties available it’s not generally feasible to predict what will happen when compounds are dosed in vivo. One reason for this is that the therapeutic (and adverse) effects of a drug are driven by its concentration at its site(s) of action which is a time-dependent quantity that cannot generally be measured in live humans. I argue in NoLE that the objective of the ADME-based aspects of drug design is actually to achieve controllability of exposure and one article that I recommend to all drug discovery scientists and chemical biologists is SR2019 (Intracellular and Intraorgan Concentrations of Small Molecule Drugs: Theory, Uncertainties in Infectious Diseases and Oncology, and Promise).

A number of assays are available for measuring properties of interest in drug discovery and management of the ‘assay budget’ for projects is an important activity in drug discovery (especially when running assays is an outsourced activity). Drug discovery scientists typically use assays to identify and address specific design issues such as low solubility or unacceptable binding affinity for anti-targets.  

In vitro assays used in drug discovery are generally configured for decision-making, rather than for building ML models, and in some cases what some might refer to as the ‘quality’ of the assay might be traded off against throughput (this doesn’t mean that the assays are somehow ‘bad’). In vitro drug discovery assays generally have both lower and upper quantitation limits and an assay’s dynamic range (you can draw an analogy between assays and analytical instruments) is given by the difference between the two values. Needless to say it is very important that ML modellers be fully aware of the lower and upper quantitation limits in the assays used to generate the data from which they will build models. This generally requires careful examination of assay details which might not have been captured by the curation processes used for databases such as ChEMBL (nor even been disclosed in the original publications). For example, the maximum potency that can be quantified in a conventional enzyme inhibition assay is limited by the concentration of enzyme in the assay (see WM1979) and you’ll still need a 5 nM concentration of a picomolar inhibitor to achieve 50% inhibition of an enzyme that is present in the assay at a concentration of 10 nM. I generally advise ML modellers to carefully examine the distributions in the datasets that they are modelling for evidence of cut-offs that might indicate quantitation limits in the assays used to generate the data.
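The arithmetic here is worth making explicit. Here’s a minimal sketch in Python of the standard tight-binding (Morrison) expression for the fraction of enzyme bound by inhibitor (the enzyme concentration, inhibitor concentration and Ki are just the illustrative numbers from the paragraph above):

```python
from math import sqrt

def fraction_inhibited(E, I, Ki):
    """Fraction of enzyme bound by inhibitor from the Morrison
    (tight-binding) quadratic; all concentrations in the same units (nM)."""
    s = E + I + Ki
    EI = (s - sqrt(s * s - 4.0 * E * I)) / 2.0
    return EI / E

# Even a 1 pM (0.001 nM) inhibitor needs ~5 nM to half-inhibit
# 10 nM enzyme, so the assay cannot report an IC50 below ~[E]/2.
print(round(fraction_inhibited(E=10.0, I=5.0, Ki=0.001), 3))  # → 0.5
```

However potent the inhibitor, the measured IC50 is floored at roughly half the enzyme concentration, which is exactly the sort of cut-off worth looking for when examining potency distributions.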

The effects of a drug in vivo are typically driven by its unbound concentration in plasma and assays for properties of interest in drug discovery are generally run in buffered aqueous media. It is well-known that measured values for physicochemical properties such as log D and aqueous solubility generally vary with pH for compounds with ionizable groups in their chemical structures. However, values measured for these properties can, in some scenarios, also depend on both the nature and concentration of counter-ion(s). This becomes an issue for log D measurement in cases where significant proportions of compounds are present in the organic phase in ionized forms and for aqueous solubility measurement when the measured value is limited by the solubility of a salt form (as opposed to the neutral form). Dependence of measured property values on the nature and concentration of counter-ions is likely to be more of an issue when the degree of ionization (in aqueous media) is relatively high and my default advice is to consider pKa when models underpredict log D or overpredict aqueous solubility values.
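For the simplest case of a monoprotic acid (and assuming that only the neutral form partitions into the organic phase) the pH dependence of log D follows the Henderson-Hasselbalch relationship. Here’s a minimal sketch (the log P and pKa values are purely illustrative):

```python
from math import log10

def logD_acid(logP, pKa, pH):
    """log D for a monoprotic acid, assuming that only the neutral
    form partitions into the organic phase."""
    return logP - log10(1.0 + 10.0 ** (pH - pKa))

# A carboxylic acid (pKa ~4.4) is >99.9% ionized at pH 7.4,
# so its log D sits about 3 units below its log P.
print(round(logD_acid(logP=4.0, pKa=4.4, pH=7.4), 2))  # → 1.0
```

The sketch also makes the earlier point about ionization tangible: below the pKa the correction term vanishes and log D converges on log P.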

Before addressing what I consider to be the main problems with A2025 I’ll make some specific comments on the study. While these comments might appear to be pedantic (some might even use the term ‘nit-picking’) I would argue that the authors have raised the bar for themselves by claiming that their proposed “guidelines, accompanied by annotated examples using open-source software tools, lay a foundation for robust ML benchmarking and thus the development of more impactful methods”.  By way of an example, if you're trying to persuade an analytical chemist to modify an aqueous solubility assay to make it more suitable for generating data to build ML models then it's not such a great idea to describe aqueous solubility as a molecular property or to confuse the range in a data set with the dynamic range of the assay used to generate the data.    

In the Introduction (Section 1) the Authors state:

In drug discovery, expensive and time-consuming experiments are used to profile molecules [While it is common for drugs to be described as ‘molecules’, especially in promotional material, I generally recommend that ‘molecule’ not be used as a synonym for ‘compound’ in articles with a cheminformatic (or indeed a chemical) focus.] and gain insights into their therapeutic potential. Such experimental assays are typically organized in a cascade, where subsequent experiments test fewer molecules at a higher cost per molecule. As in silico surrogates to such experiments, both regression and classification Machine Learning (ML) models can be trained to estimate molecular properties [These are properties of compounds, as opposed to molecules, and should neither be described as ‘molecular properties’ nor as ‘small molecule properties’.] (i.e., experimental results) from chemical structure. Such models could inform drug design and prioritize experiments by scoring a set of candidate molecules. [The term ‘candidate molecules’ is as clumsy as it is inaccurate, and its meaning will not be clear to some readers. I recommend that the term ‘chemical structures’ be used instead.] These ML models thus inform high-stakes decisions [The ML models that are the focus of this study inform decisions as to which compounds should be synthesized and these decisions would not automatically be considered to be high-stakes decisions in contemporary drug discovery given developments in automation and high-throughput synthetic chemistry. It’s also important to be aware that in real life drug discovery many decisions to synthesize compounds are made with the knowledge that structural analogs have already been synthesized and shown to be active against the targets of interest. I would argue that genuinely high-stakes decisions, such as prioritization of compounds for in vivo studies, are only made after compounds have actually been synthesized and evaluated in relevant in vitro assays.] 
and help drug discovery research progress more quickly and efficiently. Hence, it is important that models provide reliable forecasting of experimental results.

In Section 3.3.1.3 (Dynamic Range) the Authors state:

Both correlation and error metrics are influenced by the dynamic range of the data being modeled. [I consider this use of the term ‘dynamic range’ to be incorrect and, as a reviewer, I would have pressed the Authors to explain the difference between the range of a data set and its dynamic range. As noted earlier I see dynamic range as a characteristic of an analytical instrument or an assay (which can be considered to be a type of analytical instrument) and I would argue that the term should not be applied to data sets. That said, it may be possible to infer the dynamic range of an assay through careful examination of the data.]  Achieving a high correlation on data sets with a broader range of experimental values is generally easier, whereas data sets with a smaller dynamic range can produce unrealistically small values for error metrics. [While the range of a data set certainly imposes limits on variance it’s important to remember that measures of correlation are defined in terms of variance (as opposed to range) of the data. For a data set to be useful for building ML models the variance for replicate measurements needs to be small in comparison with the overall variance for the data set.] This can lead to deceptive conclusions.

With the pedantry (or nit-picking if you prefer) out of the way it’s time to take a look at what I consider to be the principal flaws of A2025. First, I consider it important to account for the number of adjustable parameters used to fit training data and, at the very least, the authors should have acknowledged this as an issue. Second, I have concerns that cross-validation can lead to optimistic assessment of model quality when there is a high degree of clustering in training data and a post from July of last year might be relevant.

It’s well known that you can achieve a better fit to your data by simply using more adjustable parameters (I recommend that all ML modellers take a look at H2004 (DM Hawkins, The Problem of Overfitting, JCICS 2004 44:1-12)) and my position is that it’s generally not meaningful to compare performance for models that differ in the number of adjustable parameters used to fit the training data without properly accounting for numbers of adjustable parameters. A criticism that I was making of the QSAR/QSPR field many years ago (long before ML modelling came to be touted as a panacea for the ills of Pharma/Biotech) was that many of those building models appeared to dismiss the accounting for numbers of adjustable parameters as a non-issue. It’s worth noting that building ML models typically involves selection of a subset of descriptors from a larger pool and the S2007 study argues that you also need to account for the number of descriptors in the pool when assessing model quality. Accounting for the number of adjustable parameters is not just an issue when you’re building ML models for small molecule drug discovery and this point is made in MHG2017 (Mardirossian and Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Molecular Physics 2017 115:2315–2372):

With semi-empirical density functionals, a measure that is commonly reported upon publication is the total number of parameters. Existing functionals based on the B97 concept have anywhere between 5 and 75 parameters. However, counting the number of parameters is often a confusing and unclear task.

The need to properly account for the number of adjustable parameters (the term 'degrees of freedom' is also used, especially in the older literature) when modelling data has actually been recognised for many years. The agrarian economist Mordecai Ezekiel (1899-1974), who shaped much of FDR’s agricultural policy, introduced adjusted R2 (link1 | link2) in Methods of Correlation Analysis which was published in 1930. The F-test (link1 | link2) can be used to assess whether the use of additional adjustable parameters is justified although I’m not aware of exactly when this particular use of the F-test was introduced. It’s also worth pointing out that the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) appeared in the statistics literature in 1974 and 1978 respectively. I certainly wouldn’t claim to have comprehensively reviewed the importance of accounting for the number of adjustable parameters when comparing ML model performance nor am I suggesting that this is something that would be easy to do. Nevertheless, I do hope that it's clear that this is not something that can simply be swept under the carpet (or even ejected from the window of an upper floor Moscow apartment).
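To make these corrections concrete, here’s a minimal sketch (plain Python, using the Gaussian-likelihood forms of AIC and BIC) showing how each statistic penalizes the number of adjustable parameters k used to fit n observations:

```python
from math import log

def adjusted_r2(r2, n, k):
    """Ezekiel's adjustment: R^2 shrinks as adjustable parameters are added."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def aic(rss, n, k):
    """Akaike information criterion (Gaussian likelihood, up to a constant)."""
    return n * log(rss / n) + 2 * k

def bic(rss, n, k):
    """Bayesian information criterion; penalizes each extra parameter by
    log(n) rather than AIC's constant 2."""
    return n * log(rss / n) + k * log(n)

# Two models with identical raw R^2 = 0.80 fitted to n = 50 observations:
print(round(adjusted_r2(0.80, 50, 3), 3))   # 3 parameters  → 0.787
print(round(adjusted_r2(0.80, 50, 20), 3))  # 20 parameters → 0.662
```

The point is immediately visible: with identical raw R2 values the 20-parameter model earns a markedly lower adjusted R2, and for data sets of realistic size BIC punishes extra parameters more heavily than AIC does.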

This is a good point at which to say something about validation of ML models and I would argue that it is actually very difficult to demonstrate objectively that one protocol for validation is better than another. Two general approaches for validation of ML models are to use cross-validation and to split data into a training set and an external test set (that the model never sees). A view that I’ve held since the late 1990s is that many ‘global’ models for predicting properties of compounds relevant to drug discovery are actually ensembles of local models (this view was expressed publicly in the B2009 study). I would anticipate that clustering in data sets will cause cross-validation to give optimistic assessments of model quality which in turn can lead to overfitting. I would also expect principal component analysis (PCA) to be less meaningful for highly clustered data (this is relevant because correlations between chemical structure descriptors need to be accounted for in order to calculate meaningful distances between chemical structures in the space). Something that I do need to make clear is that ‘clustering’ in the context of this post simply refers to the distribution of compounds within the chemical structure descriptor space of a model.

The Authors of A2025 recommend "using a 5 × 5 repeated cross-validation procedure to sample the performance distribution” and one point that I’ll make is that they haven’t demonstrated that this protocol is more effective than 4 × 4 repeated cross-validation or 6 × 6 repeated cross-validation. While this might appear to be nit-picking I will point out that it would not be valid to invoke A2025 when criticising a future ML modelling study for using 4 × 4 repeated cross-validation (bear in mind that a substructural match against even a single PAINS filter would be considered by some to constitute the basis for a valid criticism in medicinal chemistry and K2017 might be of interest in this context).

The general approach to cross-validation is to repeatedly split the data into training sets and test sets before assessing how well on average the test data are predicted (algorithms differ as to exactly how this is done). When there is a high degree of clustering the data splits are likely to retain some members of each cluster in the training sets, which can ‘anchor’ the models. Here’s what H2004 has to say: 

If the collection of compounds consists of, or includes, families of close analogues of some smaller number of ‘lead’ compounds, then a sample reuse cross-validation will need to omit families and not individual compounds.
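The family-wise sample reuse that H2004 recommends can be sketched in a few lines of Python: folds are assembled from whole clusters rather than from individual compounds so that no close analogue of a test compound remains in the training set (the cluster labels are assumed to come from some prior structural clustering; for real data sets scikit-learn’s GroupKFold does the same job):

```python
def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which every member of a
    cluster falls on the same side of the split, so that no analogue
    of a test compound can 'anchor' the model from the training set."""
    labels = sorted(set(groups))
    held_out_sets = [set(labels[i::n_splits]) for i in range(n_splits)]
    for held_out in held_out_sets:
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test

# Ten compounds drawn from three structural series (clusters A, B, C):
groups = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
for train, test in group_kfold(groups, n_splits=3):
    # No cluster ever appears on both sides of a split.
    assert not {groups[i] for i in train} & {groups[i] for i in test}
```

Contrast this with naive row-wise k-fold splitting, which would leave members of every series in every training set.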

Another approach to validating ML models is to use external test sets although this can still lead to optimistic assessments of model quality when the available data are highly clustered. One advantage of this approach to validation is that external test sets can be ‘structured’ to provide a more detailed view of model performance (one criticism that I would make of cross-validation is that it gives a rather ‘one-dimensional’ assessment of model performance). One way to structure test sets is to characterize (by size and closeness) the neighbourhood within the training set for each object in the test set. The motivation for structuring the test sets in this manner is that it enables you to analyse relationships between prediction performance and the degree of coverage of space around test set objects by training set data. There are, however, other ways to structure test sets and my view is that classifying test set compounds according to whether they are neutral, cationic or anionic would potentially be informative when assessing models for log D, aqueous solubility, permeability, plasma protein binding, volume of distribution and hERG blockade. Although it’s not directly relevant to this post I would generally recommend that ML model predictions be presented to users along with training set data for the nearest neighbours in the model space and the most similar chemical structures in the training set.
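Here’s a minimal sketch of the neighbourhood characterization described above: for each test set object compute the distance to its nearest training set neighbour in descriptor space, which can then be used to bin test compounds before calculating per-bin error statistics (plain Euclidean distance is used purely for illustration; as noted above, descriptor correlations would need to be accounted for in a real analysis):

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def nn_distances(test_X, train_X):
    """Distance from each test set object to its nearest training set
    neighbour in descriptor space; larger values indicate sparser
    coverage of the space around the test object."""
    return [min(dist(x, t) for t in train_X) for x in test_X]

# Toy 2-descriptor example: the first test object sits inside a training
# cluster while the second is on the fringe of another.
train_X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
test_X = [(0.5, 0.0), (9.0, 10.0)]
print([round(d, 2) for d in nn_distances(test_X, train_X)])  # → [0.5, 1.0]
```

Plotting prediction error against these distances gives a view of how performance degrades as test compounds move away from the training data, which is precisely the detail that a single cross-validated statistic hides.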

This is a good point at which to wrap up and I concede that it’s difficult to account for numbers of adjustable fitting parameters and to meaningfully validate models when distributions of objects within the relevant chemical spaces are very uneven. That said, I would argue that creators of ML models do at least need to acknowledge these issues given that many tout models like these as essential for AI-based drug design.

Anticipating a future blog post on chemical space coverage I'll finish the post by noting that coverage is also of historical relevance. The B-52 in the photo is not in the best state of repair and this shouldn't surprise you because I took the photo during a 2005 visit to Hanoi. In those days it was considered to be good form to show disrespect for the enemy's military hardware and so I gave the wreckage a good kick. I also paid my respects to Uncle Ho whom I’m told is in much better shape than Chairman Mao (owing to the then frosty Sino-Soviet relations the latter was pickled by inexperienced compatriots rather than by the Russian experts who had pickled the former and it is said that the embalming team arrived from Moscow before Uncle Ho had actually expired). A few days later in Dien Bien Phu I caused minor consternation by demonstrating that the barrel of an American-made 155 mm howitzer that had been captured from the French in 1954 could still be elevated (admittedly it was a little stiff). Apparently, the French had asked the Americans if they would be so kind as to drop lots of bombs (or perhaps one very big bomb) on the Viet Minh but President Eisenhower wisely denied the request. The B-52 in the photo was one of a number sent by President Nixon (who had been President Eisenhower’s VP) to bomb North Vietnam during Operation Linebacker II (aka the Christmas Bombings) and it's my understanding that all crew members survived their encounter with the SAM.


 

Wednesday, 1 April 2026

PAINS and Prejudice


PAINS (pan assay interference compounds) filters have exerted a hold over the drug discovery community ever since the BH2010 study appeared over 15 years ago. Initially I didn’t take much notice of PAINS filters and, in any case, I’d already moved on from analysis of high-throughput screening (HTS) output by that point (I might add ‘thankfully’ because looking at too much HTS output is a sure-fire route to the funny farm). I started analysing HTS output from about 1993 at what was then Zeneca. I used the Daylight toolkit to create the Struct_Anal SMARTS-based chemical structure profiler in 1995 and, at that time, we were already using in house software named Flush (even at that stage it was clear that much of the HTS output being generated was going to disappear round the S-bend and our friends at what was then Rhône-Poulenc Rorer developed HARPick to ensure that nothing remained stuck to the porcelain).

Photo from 2011 at 'The Black Hole' (Los Alamos NM)

Something that had always worried me was that it was very easy to opine that a compound looked nasty but it was much more difficult to demonstrate objectively that the compound was indeed nasty. Late in 2014 a blog post, which fell well short of the standards that the drug discovery community has come to expect from Practical Fragments, prompted me to take a more forensic look at PAINS filters. What I found was that PAINS filters were based on the output from screening compounds in just six AlphaScreen assays (if a panel of six assays that all use the same read-out strikes you as suboptimal design of an experiment to detect pan-assay interference then you’re not alone). After blogging periodically about PAINS filters for a couple of years I wrote a Perspective on the topic (as noted in this blog post: from time to time, every blogger should write a journal article “pour encourager les autres”).

Nevertheless, doubts about the correctness of my position started to creep in when I was denounced for being insufficiently thoughtful in my published comments on PAINS by the authors, one of whom is a former colleague, of the seminal, insightful and Nobel-worthy ‘Seven Year Itch’ article (BN2017) which oozes wisdom and penetrating insight. Although stung by the criticism and wracked by self-doubt to the extent that I considered therapy, it was a recent study led by the world-renowned expert on tetrodotoxin pharmacology, Prof. Angelique Bouchard-Duvalier of the Port-au-Prince Institute of Biogerontology, working in collaboration with the Budapest Enthalpomics Group (BEG), that removed any lingering doubts about the sublime elegance and extreme predictivity of PAINS filters. The manuscript has not yet been made publicly available although I was able to access it with the help of my associate ‘Anastasia Nikolaeva’ (not sure exactly what she’s doing these days although I understand that she’s currently visiting Port-au-Prince for a medication review with Prof. Bouchard-Duvalier). There is no doubt that this genuinely disruptive study will comprehensively reshape the predictive biochemistry landscape, enabling drug discovery scientists to accurately, meaningfully and robustly predict assay interference using only chemical structures as input for the very first time.

Prof. Bouchard-Duvalier’s seminal study clearly demonstrates that singlet oxygen quenching is actually a conserved feature for all known and unknown mechanisms of interference with assay read-outs and that PAINS filters dramatically outperform all other methods for prediction of assay interference. The math is truly formidable (the rudimentary nature of my understanding of Haitian patois didn’t help either) and involves first projecting the atomic isothermal compressibility matrix into the quadrupole-normalized polarizability tensor before applying the Barron-Samedy transformation, followed by hepatic eigenvalue extraction using the elegant algorithm devised by E. V. Tooms (a reclusive Baltimore resident and connoisseur of liver pâté whose illustrious thought leadership of the analytic topology field unravelled almost 32 years ago after he failed to comply with the safety instructions for an escalator). The incisive analysis of Prof. Bouchard-Duvalier shows without a shadow of doubt that singlet oxygen quenching as quantified by the AlphaScreen assay read-out is a fundamental principle in biomolecular assay science. Furthermore, ‘Anastasia Nikolaeva’ was also able to ‘liberate’ a prepared press release in which the grinning BEG director Prof. Kígyó Olaj explains: 

Possibilities are limitless now that we can accurately and robustly predict the assay interference that compounds will exhibit directly from their chemical structures and we can safely consign experimental biochemical assays to the dustbin of history. Surely the Journal of Medicinal Chemistry Editors will now finally recognize the colossal impact that PAINS filters have made on real world drug discovery and development when they make their FIFA Prize nominations later this year.

 

Wednesday, 31 December 2025

Hit to Lead best practice?

I'm now in Trinidad and I'll share a 180° panorama from Paramin where I walk for exercise. This district in Trinidad's Northern Range is renowned for its agriculture and the most excellent produce is grown in 'gardens' on steep hillsides. My walk would take about two and a quarter hours if I just walked but it usually takes rather longer because I like to take photos and often stop on the ridge to gaze at corbeaux 'surfing' the updrafts. Most of all I enjoy catching up with friends in Paramin and not so long ago one of them was telling me about the sound made by douens (which have terrified me since childhood because I was never baptised). Some years ago I was struggling along the ridge with a hacking cough that I'd brought with me from the UK three days previously when I heard a familiar voice (one of my friends was visiting his sister). The conversation turned to my cough and he instructed his sister to bring some medicine. She produced a bottle of a liquid that looked like fluorescein and, as she decanted some into a shot glass my friend exclaimed "dat too much yuh go kill he". The liquid appeared to have a puncheon base and my friend's sister also gave me some bush to make tea. My cough was history after three days.             


I’ll be taking a look at The European Federation for Medicinal Chemistry and Chemical Biology (EFMC) Best Practice Initiative: Hit to Lead (Q2025) in this post. I have a number of criticisms of this work and it really shouldn’t need saying that you do raise the bar for yourself when you present your work as defining best practices. As is customary for blog posts here at Molecular Design I’ve used Q2025 reference numbers when referring to literature studies and quoted text is indented with my comments in red italics. This will be a long tedious post and strong coffee is recommended.

Best practices are, in essence, recommended ways of doing things and it’s actually very difficult to demonstrate objectively that one way of doing things is better (or worse) than another way. My general view of Q2025 is of a poorly organized article that at times lacks clarity and coherence. Some of the advice offered on how best to do Hit to Lead (H2L) work is unsound and the Authors also make a number of significant errors. Although the abstract refers to “contemporary drug discovery” the recommended best practices do, in my view, appear to be firmly rooted in the past given that fragment-based design (FBD) is not covered and there is no mention of important 'new' modalities such as irreversible covalent inhibition and targeted protein degradation. It’s worth mentioning that biological activity for some new modalities cannot be meaningfully quantified as a single parameter such as an IC50 value and this complicates the use of ligand efficiency metrics (a post on covalent ligand efficiency will give you an idea of the tangles you can get yourself into) which the Authors seem to consider important in H2L work. I consider the quantity of literature cited in Q2025 to be excessive, especially given that some of the cited articles have minimal relevance to H2L work (the failure of the Authors to cite R2009 is also noteworthy). In some cases the cited literature does not support assertions made by the Authors. In my view Figures 1, 5 and 8 are redundant.

While I see plenty wrong with Q2025 it’s worth flagging up points on which the Authors and I appear to be in agreement. I think that they put it well with the following statement: 

Leads have line of sight to a development candidate and bring an understanding of what priorities Lead Optimisation should address.

I used this football analogy in an earlier post:

The screening phase is followed by the hit-to-lead phase and it can be helpful to draw an analogy between drug discovery and what is called football outside the USA. It’s not generally possible to design a drug from screening output alone and to attempt to do so would be the equivalent of taking a shot at goal from the centre spot. Just as the midfielders try to move the ball closer to the opposition goal, the hit-to-lead team use the screening hits as starting points for design of higher affinity compounds. The main objective in the hit-to-lead phase is to generate information that can be used for design and mapping structure-activity relationships for the more interesting hits is a common activity in hit-to-lead work.

I certainly agree that it is important to establish structure-activity relationships (SARs) for structural series of interest although I have no idea what the Authors mean by “dynamic SAR”. I also agree that consideration of physicochemical properties, especially lipophilicity, is very important in H2L work (just as it is in optimisation of the leads) although the case for a Nobel Prize made in a 2024 JMC Editorial does, in my view, appear to have been overcooked.

I argue that drug discovery should be seen in a Design of Experiments framework (generate the information that you need as efficiently as possible) rather than as the prediction exercise that many who tout machine learning (ML) as a panacea for the ills of Pharma & Biotech would have you believe. Regardless of which view prevails it’s abundantly clear that generation and analysis of data are very important in contemporary drug discovery and are likely to become even more important in the future. However, if you’re going to base decisions on trends in data then it’s important that you know how strong the trends are because this tells you how much weight to give to the trends when making your decisions. Most drug discovery scientists will have encountered analyses of relationships between predictors of ADME (absorption, distribution, metabolism, and excretion) and physicochemical and chemical structure descriptors and we observed in the KM2013 perspective that:

The wide acceptance of Ro5 provided other researchers with an incentive to publish analyses of their own data and those who have followed the drug discovery literature over the last decade or so will have become aware of a publication genre that can be described as ‘retrospective data analysis of large proprietary data sets’ or, more succinctly, as ‘Ro5 envy’.

In some cases trends observed in data are presented in ways that make them appear to be stronger than they actually are (this is typically achieved by categorizing continuous-valued data prior to analysis) and [13a], [24] and [26] were criticised in this context in KM2013. When reading articles on drug-likeness and compound quality it is also important to be aware that correlation does not imply causation. One should be particularly wary of studies such as [20c] which present analyses of proprietary data as "facts" or claim that such analyses have revealed "principles". I see the weakness of these trends partly as a reflection of chemical structure diversity in datasets and would expect the corresponding trends to be stronger within structural series (I offer the following advice in NoLE):

Drug designers should not automatically assume that conclusions drawn from analysis of large, structurally-diverse data sets are necessarily relevant to the specific drug design projects on which they are working.

I see erosion of critical thinking skills as a significant problem in contemporary drug discovery and some leaders in the field appear to have lost the ability to distinguish what they know from what they believe. As I observed in a review of a 2024 JMC Editorial (Property-Based Drug Design Merits a Nobel Prize) the Rule of 5 (Ro5) is not actually supported by data in the form that it was stated. The wide acceptance of Ro5 as a definition of drug-likeness propagates what I consider to be a misleading view that drugs occupy a contiguous and distinct region of chemical space. Some of the claims made in the JMC Editorial (“a compound is more likely to be clinically developable when LipE > 5”, “a discovery compound is more likely to become a drug when Fsp3 > 0.40” and “a compound is more likely to have good developability when PFI < 7”) do not appear to be based on data. I remain sceptical that developability and likelihood of clinical success of a compound can be meaningfully assessed even when one knows that the compound actually exhibits exploitable activity against the target(s) of interest. In my view the suggestion that simple drug discovery guidelines are worthy of a Nobel Prize does a huge disservice to drug discovery scientists by trivializing the very significant challenges that they face.   

Like many in the drug discovery field, I consider lipophilicity to be the single most important physicochemical property in drug discovery and I would generally anticipate that a surfeit of lipophilicity will end in tears. That said, I don't consider lipophilicity to be usefully predictive of physicochemical properties such as permeability and aqueous solubility that are more relevant than lipophilicity from the perspective of oral absorption. When I assert that lipophilicity is not "usefully predictive" I'm certainly not denying that trends in data exist. However, I must stress that the trends are not so strong that having solubility values that have been predicted from lipophilicity means that you no longer need to measure aqueous solubility.    

In drug discovery projects I generally recommend examination of the response of potency (expressed as a logarithm) to increased lipophilicity. In the ideal situation the correlation of potency with lipophilicity will be weak, indicating that potency is driven by factors other than lipophilicity. If the correlation of potency with lipophilicity is strong then you need the response (the slope for a linear correlation) to be relatively steep. I consider it to be generally helpful to plot potency against lipophilicity with reference lines corresponding to different LipE values (see R2009 which is a lot more relevant to H2L work than much of the literature cited in the Q2025 study) and I would also suggest modelling the response and using the residuals to quantify the extent that individual potency measurements beat (or are beaten by) the trend in the data (the approach is outlined in the "Alternatives to ligand efficiency for normalization of affinity" section of NoLE).
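To make the residual-based approach concrete, here is a minimal sketch using entirely made-up pIC50 and logP values for a hypothetical series; it models the response of potency to lipophilicity with a straight line and uses the residuals to see which measurements beat the trend:

```python
import numpy as np

# Entirely hypothetical potency (pIC50) and lipophilicity (logP)
# values for eight compounds in a structural series
logp = np.array([1.2, 1.8, 2.3, 2.9, 3.4, 3.8, 4.2, 4.9])
pic50 = np.array([5.1, 5.6, 6.4, 6.2, 7.1, 6.8, 7.9, 7.5])

# On a pIC50-versus-logP plot, LipE (LLE) reference lines are lines
# of unit slope: pIC50 = logP + LipE
lipe = pic50 - logp

# Model the response of potency to lipophilicity as a straight line
slope, intercept = np.polyfit(logp, pic50, 1)

# Residuals quantify the extent to which each measurement beats
# (positive) or is beaten by (negative) the trend in the data
residuals = pic50 - (slope * logp + intercept)
print(np.round(residuals, 2))
```

The fitted slope need not be 1, which is the essential difference between ranking compounds by residual and ranking them by LipE.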

In drug discovery lipophilicity is usually quantified by the logarithm of the octanol/water partition coefficient (log P) or distribution coefficient (log D). The choice of octanol/water for quantification of lipophilicity is arbitrary and some, including me, consider saturated hydrocarbons such as cyclohexane or hexadecane to be physically more realistic than octanol as a model for the core of a lipid bilayer. It is the distribution coefficient (D) rather than the partition coefficient (P) that is measured for lipophilicity assessment although the two quantities are equivalent when ionization can be safely neglected. Values of logP for ionizable compounds can be derived from the response of log D to pH although this is not generally done routinely in drug discovery. Alternatively, you can make the assumption that only neutral forms of compounds partition into the organic phase and use (1) in the H2L best practices post graphic (see also K2013) to convert log D values to log P values (to do this you’ll also need a reliable estimate for pKa in order to calculate the neutral fraction). When log D (as opposed to log P) is used to assess the ‘quality’ of compounds you can make compounds better simply by increasing the extent to which they are ionized and I hope you can see that going down this path is likely to end as well as things did for the Sixth Army at Stalingrad.
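For readers who want to try the conversion, here is a minimal sketch (the function name and the example values are mine) that assumes a monoprotic compound for which only the neutral form partitions into the organic phase, with the neutral fraction calculated from the pKa via the Henderson-Hasselbalch relationship:

```python
import math

def log_p_from_log_d(log_d, pka, ph=7.4, base=True):
    """Convert log D to log P assuming only the neutral form of a
    monoprotic compound partitions into the organic phase."""
    # Fraction neutral from the Henderson-Hasselbalch relationship
    if base:
        f_neutral = 1.0 / (1.0 + 10.0 ** (pka - ph))
    else:  # monoprotic acid
        f_neutral = 1.0 / (1.0 + 10.0 ** (ph - pka))
    return log_d - math.log10(f_neutral)

# Hypothetical basic amine: logD7.4 = 1.5, pKa = 9.4
# ~99% ionized at pH 7.4, so log P sits ~2 units above log D
print(round(log_p_from_log_d(1.5, 9.4), 2))  # prints 3.5
```

The two-unit gap between log D and log P in this example is exactly the kind of 'improvement' you can manufacture for free by ionizing compounds more heavily when quality is assessed using log D.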


In drug discovery log P values are typically calculated and it can often be quite difficult when reading the literature to know which method has been used for the calculations (sometimes the term ‘cLogP’ appears to have been used simply to denote that log P values have been calculated).  For example, it is stated in [13a] that “Physical property data were obtained from AstraZeneca’s C-Lab tool, incorporating standard packages for LogP calculations (cLogP, ACDLogP), and an in-house algorithm for the distribution coefficient (1-octanol–water LogD at pH 7.4)”. In general, different prediction methods will give different log P values for the same compound (for example the Ro5 lipophilicity threshold is 5 when ClogP is used but 4.15 when MlogP is used). That said, choice of method for predicting log P and whether you use measured log D or predicted log P become less important issues when working within structural series because hydrogen bond donors and acceptors, and ionizable groups tend to be relatively conserved under this scenario.

That log D and log P are different quantities in the context of drug design is one of a number of things that the Authors of [34a] (Molecular Property Design: Does Everyone Get It?) just don’t seem to ‘get’ and I’ll point you toward a blog post in which this point is discussed in a bit more detail. Let’s examine Figure 2 (Impact of hydrophobicity on developability assays and the profile of marketed oral drugs) of [34a] and I’d like you to look at the upper panel (a). You’ll notice that the visualization for some of the ‘developability’ assays is based on PFI (derived from log D measured chromatographically at pH 7.4). However, the visualization for hERG (+1 charge) and promiscuity is based on iPFI (derived from ‘Chrom logP’ and it is not clear how this quantity was defined or generated). I would also argue that the activity criterion (pIC50 > 5) used in the promiscuity analysis is too permissive to be physiologically relevant (this is a common issue in the promiscuity literature). As an aside, I am unconvinced that log D values were actually measured chromatographically at pH 7.4 for all the drugs that form the basis of the analysis shown in the lower panel (b) of Figure 2.        

After a long preamble it’s time to start my review of Q2025 and comments will follow the order of the article. I see the citation of [2] and [3] as gratuitous while [4] does not appear to present evidence in support of the claim that “ensuring high quality of lead series is a large cost and time saver in the overall process of drug discovery” (it must be stressed that I certainly don’t deny the value of high quality lead series and am merely pointing out that the chosen reference does not actually demonstrate that higher-quality lead series result in cost and time savings in drug discovery).

In my view neither Figure 1 nor its caption (see below) makes any sense.

Figure 1. Illustration of the multi-objective characterisation necessary in the journey from a hit to a drug. All these necessary characteristics, described by illustrative principal components, are influenced by the physicochemical properties of the molecules.

You’ll frequently encounter graphics like Figure 1 that show low-dimensional chemical spaces in the drug discovery literature (for example, a 2-dimensional space might be specified in terms of lipophilicity and molecular size). While it’s very easy to generate graphics like these the relevance of the chemical spaces to drug design is often unclear. There are ways in which you can demonstrate the relevance of a chemical space to drug design and, for example, you might build usefully predictive models for quantities such as IC50, aqueous solubility or permeability using only the dimensions of the particular chemical space as descriptors. Alternatively, you could show that compounds in mutually exclusive categories such as ‘progressed to phase 2’ and ‘failed to progress to phase 2’ occupy different regions of the chemical space (note that it’s not sufficient to show that a single class of compounds such as ‘approved drugs’ occupies a particular region within the chemical space and this is the essence of a general criticism that I make of Ro5 and QED). It is common to depict the different categories as ellipses that enclose a given fraction of the data points corresponding to each category and the orientation of each ellipse with respect to the axes indicates the degree to which the descriptors that define the chemical space are correlated for each category. One problem with Figure 1 is that the meaning of the ellipses is unclear and I would challenge the assertion made by the Authors that “the journey of a drug discovery campaign is characterized in Figure 1, showing how the active hit needs to be modified to address the requirements impacting the efficacy and safety of the molecule”.

Potency optimisation alone is not a viable strategy towards the discovery of efficacious and safe drugs, or even high-quality leads. Concurrent optimisation of the physicochemical properties of a molecule is the most important facet of drug discovery, as these properties influence its behaviours, disposition and efficacy [12a | 12b]. [While I certainly agree that there is a lot more to drug design than maximisation of potency I would argue that controlling exposure is a more important objective than optimization of physicochemical properties (on the subject of exposure I recommend that all drug discovery scientists take a look at the SM2019 article). It's also worth bearing in mind that you can't compensate for inadequate potency with increased compound quality. I don't consider either reference as evidence that "concurrent optimisation of the physicochemical properties of a molecule is the most important facet of drug discovery" and it is not accurate to describe metabolic stability, active efflux and  affinity for anti-targets as "physicochemical properties".  I think the Authors need to say more about which physicochemical properties they recommend to be optimized and be clearer about exactly what constitutes optimization. Lipophilicity alone is not usefully predictive of properties such as bioavailability, distribution and clearance that determine the effects of drugs in vivo.] Together these outcomes define the quality of the molecule, indicative of its chances of success in the clinic, as evidenced in numerous studies [13a | 13b]. [Neither of these articles appears to provide convincing evidence of a causal relationship between “the quality of a molecule” and probability of success in the clinic.  Much of the 'analysis' in [13a] consists of plots of median values without any indication of the spreads in the corresponding distributions and to see it cited in connection with "evidenced" rings alarm bells for me. 
As explained in KM2013, presenting data in this manner exaggerates trends and I consider it unwise to base decisions on data that have been presented in this manner. Quite aside from the issue of hidden variation I do not consider the relationship between promiscuity and median cLogP reported (Figure 3a) in [13a] to be indicative of probability of success in the clinic, given that the criterion for 'activity' (> 30% inhibition at 10 µM) is far too permissive to be physiologically relevant (this is a common issue in the promiscuity literature).]
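The point about categorization exaggerating trends is easy to demonstrate with simulated data. The sketch below (all numbers invented) shows how a weak correlation in raw data can look near-perfect once the data are binned and only the per-bin medians are examined:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a weak underlying trend: the response depends only weakly
# on the descriptor, with a lot of compound-to-compound scatter
n = 5000
x = rng.uniform(0, 5, n)             # e.g. a calculated logP
y = 0.3 * x + rng.normal(0, 1.5, n)  # e.g. some assay readout

# Correlation in the raw (continuous) data is weak
r_raw = np.corrcoef(x, y)[0, 1]

# Categorize x into quintiles and keep only the median y per bin,
# as is often done in the 'retrospective data analysis' genre
edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(x, edges)
bin_centres = np.array([np.median(x[bins == b]) for b in range(5)])
bin_medians = np.array([np.median(y[bins == b]) for b in range(5)])
r_binned = np.corrcoef(bin_centres, bin_medians)[0, 1]

print(f"raw r = {r_raw:.2f}, binned-median r = {r_binned:.2f}")
```

With 1000 compounds per bin the medians are so precise that the binned plot looks like a near-perfect relationship, even though knowing x tells you very little about y for any individual compound.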

While the optimal lipophilicity range has been suggested as a log D7.4 between 1 and 3, [15] this is highly dependent on the chemical series. [The focus of the analysis was permeability and the range was actually defined in terms of AZlogD (calculated using proprietary in-house software) as opposed to log D measured at pH 7.4. The correlation between the logarithm of the A to B permeability and AZlogD is actually very weak (r2 = 0.16) which would imply a high degree of uncertainty in threshold values used to specify the optimal lipophilicity range. While I remain sceptical about the feasibility of meaningfully defining optimal property ranges the assertion that the proposed range in AZlogD of 1 to 3 “is highly dependent on the chemical series” is pure speculation and is not based on data.] Best practice would be to generate data for a diverse set of compounds in a series, if measuring it for all analogues is not possible, and determine the lipophilicity range that leads to the most balanced properties and potency [3 | 16]. [It is not clear what the Authors mean by “most balanced properties and potency” nor is it clear how one is actually supposed to use lipophilicity measurements to objectively “determine the lipophilicity range that leads to the most balanced properties and potency”. My view is that to demonstrate "balanced properties and potency" would require measurements of properties such as aqueous solubility and permeability that are more predictive than lipophilicity of exposure in vivo. I do not consider either [3] or [16] to support the assertions being made by the Authors.] Lipophilicity and pKa prediction models can then guide further designs and synthesis of analogues along the optimisation pathway (Figure 3 [17]), but measurements are advised, particularly by chromatographic methods, such as Chrom log D7.4 [18], in contemporary practice.
[In general, it is very difficult to convincingly demonstrate that one measure of lipophilicity is superior to another. Chromatographic measurement of log D is higher in throughput than the shake flask method used traditionally but it is unclear as to which solvent system the measurement corresponds. Furthermore, the high surface-area-to-volume ratio of the stationary phase means that an ionized species can interact to a significant extent with the non-polar stationary phase while keeping the ionized group in contact with the polar mobile phase and one should anticipate that the contribution of ionization to log D values might be lower in magnitude than for a shake flask measurement.]

As noted earlier in the post, I consider plotting potency against lipophilicity with reference lines corresponding to different LLE (LipE) values (as is done in Figure 3, which also serves as the graphical abstract; see R2009, which really should have been cited) to be a good way for H2L project teams to visualize potency measurements for their project compounds. That said, I consider the view of the discovery process implied by Figure 3 to be neither accurate nor of any practical value for scientists working on H2L projects. It is relatively easy to define optimization of potency and measurements in an in vitro assay are typically relevant to target engagement in vivo (uncertainty in the concentration of the drug in the target compartment, and of the species with which it competes, is likely to be the bigger issue when trying to understand why in vitro potency fails to translate to beneficial effects in vivo). One specific criticism that I will make of Figure 3 is that it appears to imply that it doesn't matter whether you use log P or log D (when you use log D you can reduce lipophilicity to acceptable levels simply by increasing the extent to which compounds are ionized).

However, there is quite a bit more to optimization of properties such as permeability, aqueous solubility, metabolic stability and pharmacological promiscuity that are believed to be predictive of ADME and toxicity, and my view is that defining optimization in terms of determining "the lipophilicity range that leads to the most balanced properties and potency" is hopelessly naive. The principal objective in H2L work (and in lead optimization) is to identify compounds for which potency and properties related to ADME and toxicity are all acceptable. Defining meaningful acceptability criteria is non-trivial and H2L teams also typically need to make decisions as to how criteria can be relaxed with a minimum of risk. It's important to be aware that you can't compensate for inadequate potency by making the other properties better and those who argue that drug discovery scientists should focus on lipophilic efficiency rather than potency are missing this point.

While plotting potency against lipophilicity with reference lines corresponding to different LLE (LipE) values is often a helpful way to visualise project data in H2L (and in lead optimization) I don't consider Figure 3 to provide an accurate or useful view of the typical H2L process. Figure 3 presents a view that a hit maps to a lead which in turn maps to a drug candidate. In reality the screening phase of a discovery project will identify multiple hits and the resulting leads are not single compounds but structural series. It is important to be aware that the practical (as opposed to conceptual) utility of a graphic such as Figure 3 is limited by the extent to which the chosen measure of lipophilicity is predictive of properties such as aqueous solubility, permeability and metabolic stability.  

Although Q2025 claims to define H2L best practices the Authors don't appear to demonstrate much awareness of the nature of the H2L process. The first step in the H2L process is to follow up hits from the initial screen by assaying potential compounds of interest (summarised in Figure 2) although in some cases some follow-up might already have been done in the hit generation phase. Hits tend to group into structural families and the H2L chemists then synthesise compounds (in some organizations synthesis is outsourced) with a view to identifying compounds that are more potent than the hits. Decisions as to which compounds are to be made are typically hypothesis-based (see P2012) although in some cases genuinely predictive models might be available to the H2L team. Design hypotheses are typically based on information available to H2L teams, such as SARs derived from the hits or relevant target structures, and predictive models might be based on free energy calculations (see ASC2025). As the H2L teams generate more information design hypotheses become more specific and models based on project data become more predictive.

I would argue that establishing (and exploiting) SARs and structure-property relationships (SPRs) constitutes a basis for design in H2L work.  Certain features of SARs are especially relevant to H2L work and an observation that a reduction in log P leads to increased potency (or at least a minimal decrease in potency) is information that project teams can make good use of. Other SAR features that I would advise H2L scientists to look for are activity cliffs (relatively small changes in structure result in relatively large changes in potency) and superadditivity (effect on potency of simultaneously making two structural modifications is greater than what would be expected from the effects of making each structural modification individually).  

I see managing the 'assay budget' as a critical activity (especially when running assays is outsourced). For example, differences in lipophilicity between structurally related compounds are typically easy to predict and measuring large numbers of log D values is likely to be wasteful of resources. H2L teams need to use their assay budgets to identify and address issues efficiently and I don't consider the suggestion that H2L teams use a generic tiering approach such as the one shown in Figure 9 to be especially helpful. Something that I do suggest H2L teams consider is to try to assess responses of properties such as aqueous solubility and permeability to lipophilicity (this means making measurements for less potent compounds).                     

Figure 3. There are numerous routes to climb a mountain, as there are to discover a drug, but a measured approach to lipophilicity will guide an optimal path, [The Authors need to articulate what they mean by “a measured approach to lipophilicity” (which does come across as arm-waving) and provide evidence to support their claim that it “will guide an optimal path”.] where the outcome is usually driven by a balance of activity and lipophilicity [This appears to be a statement of belief and the Authors do need to provide evidence to support their claim. The Authors also need to say more about how the “balance of activity and lipophilicity” can be objectively assessed.] (The parallel lines represent LLE, i.e. pIC50 - log P). [This way of visualizing data was introduced in the R2009 study which, in my view, should have been cited.]

Thus the Distribution Coefficient, (log D at a given pH) is a highly influential physical property governing ADMET profiles [20a | 20b | 20c] such as on- and off-target potency, solubility, permeability, metabolism and plasma protein binding (Figure 4) [14b]. [I recommend that the term ‘ADMET’ not be used in drug discovery because ADME (Absorption, Distribution, Metabolism, and Excretion) and T (Toxicity) are completely different issues that need to be addressed differently in design. I would argue that the ADME profile of a drug is actually defined by its in vivo characteristics such as fraction absorbed (which may vary with dose and formulation), volume of distribution and clearance (the Authors appear to be confusing ADME with in vitro predictors of ADME) and I would also argue that toxicity is an in vivo phenomenon. In order to support the claim that log D “is a highly influential physical property governing ADMET profiles” it would be necessary to show that log D is usefully predictive of what happens to drugs in vivo. My view is that the cited literature does not support the claim that log D “is a highly influential physical property governing ADMET profiles” given that  [20a] does not even mention log D and neither [20b] nor [20c] provides any evidence that log D is usefully predictive of in vivo behaviour of drugs.]

Figure 4. The impact of increasing lipophilicity on various developability outcomes [14b] [It is unclear as to whether lipophilicity is defined for this graphic in terms of log P or log D. It would be necessary to show more than just the ‘sense’ of trends for the term “impact” to be appropriate in this context. I do not consider the use of the term “developability outcomes” to be either accurate or helpful.]

Aqueous solubility is certainly an important consideration in H2L work although I think that the Authors could have articulated the relevant physical chemistry rather more clearly than they have done. You can think of the process of dissolution as occurring in two steps (sublimation of the solid followed by transfer from the gas phase to water). Lipophilicity usually features in models for prediction of aqueous solubility although I consider wet octanol to be a thoroughly unconvincing model for the gas phase. We generally assume that aqueous solubility is limited by the solubility of the neutral form (which is why ionization tends to be beneficial) but when this assumption breaks down the solubility that you measure will depend on both the nature and concentration of the counter-ion. As I note in HBD3 optimization of intrinsic aqueous solubility (the solubility of the neutral form of the compound) is still a valid objective for ionizable compounds because we're typically assuming that only neutral species can cross the cell membrane by passive permeation.
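The assumption that measured solubility is limited by the solubility of the neutral form leads to the familiar Henderson-Hasselbalch treatment of pH-dependent solubility, sketched below for hypothetical monoprotic compounds (the function name and example values are mine; the treatment breaks down once the salt becomes the solubility-limiting species, at which point the measured value depends on the nature and concentration of the counter-ion):

```python
def total_solubility(s0, pka, ph, acid=True):
    """pH-dependent solubility (mol/L) from the Henderson-Hasselbalch
    relationship, assuming solubility is limited by the intrinsic
    solubility s0 of the neutral form (i.e. below the pH at which
    the salt starts to precipitate)."""
    if acid:
        return s0 * (1.0 + 10.0 ** (ph - pka))
    else:  # monoprotic base
        return s0 * (1.0 + 10.0 ** (pka - ph))

# Hypothetical carboxylic acid: intrinsic solubility 10 uM, pKa 4.4.
# Three units above the pKa the total solubility is ~1000-fold higher
# than the intrinsic solubility (~0.01 M here).
print(total_solubility(10e-6, 4.4, 7.4))
```

Note that the intrinsic solubility s0 is still the quantity worth optimizing for ionizable compounds, which is the point made in HBD3 about passive permeation of the neutral species.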

Some general advice that I would offer to drug discovery scientists encountering solubility issues is that they should try to think about molecular structures from the perspectives of molecular interactions in the solid state and crystal packing. I would expect the left hand 'Reduce crystal packing' structure in Figure 6 to be able to easily adopt a conformation in which the planes corresponding to the aromatic rings and amide are all mutually coplanar (this is a scenario in which a non-aromatic replacement for an aromatic ring might be expected to have a relatively large impact). In HBD3 I suggest that deleterious effects of aromatic rings on aqueous solubility might be due to molecular interactions of the aromatic rings rather than their planarity. I also suggest in HBD3 that elimination of non-essential hydrogen bond donors be considered as a tactic for improving aqueous solubility because it tends to increase the imbalance between hydrogen bond donors and acceptors while minimizing the resulting increase in lipophilicity.

Rational [this use of "rational" is tautological] reasons for poor solubility were succinctly described by Bergstrom, who coined "Brick Dust and Greaseballs" as two limiting phenomena in drug discovery [22] which are in line with the empirical findings that led to General Solubility Equation [23] (Figure 5). [I don’t consider the General Solubility Equation to have any relevance to H2L work because it has not been shown to be usefully predictive of aqueous solubility for compounds of interest to medicinal chemists and the inclusion of Figure 5, which merely shows how predicted solubility values map on to an arbitrary categorisation scheme, appears to be gratuitous.] Succinctly, three factors influence solubility: lipophilicity, solid state interactions and ionisation. [It is solvation energy as opposed to lipophilicity that influences solubility and wet octanol is a poor model for the gas phase.] Determining which are the strongest drivers of low solubility will guide the optimisation (Figure 6). Using the analysis in Figure 5 the Solubility Forecast Index emerged, using the principle that an aromatic ring is detrimental to solubility, roughly equivalent to an extra log unit of lipophilicity for each aromatic ring (Thus SFI = clog D7.4 + #Ar) [24]. [I consider the use of the term “principle” in this context to be inaccurate given that the basis for SFI is subjective interpretation of a graphic generated from proprietary aqueous solubility data and I direct readers to the criticism of SFI in KM2023.] Minimising aromatic ring count is an important and statistically significant metric to consider [25] [The importance of minimizing aromatic ring count is debatable and it is meaningless to describe metrics as “statistically significant”.] 
- consistent with the "escape from flatland" concept [26] that focusses on increasing the sp³ (versus sp²) ratio in molecules, [The focus in the “escape from flatland” study is actually on the fraction of carbon atoms that are sp3 (Fsp3) and not on “the sp³ (versus sp²) ratio”.] even though no significant trends are apparent in detailed analyses of sp³ fractions [27]. [The “analyses of sp³ fractions” in [27] consist of comparisons of drug - target medians for the periods 1939-1989, 1990-2009 and 2010-2020 and all appear to be statistically significant (although I don't consider these analyses to have any relevance to H2L work). I consider the citation of [27] in this context to be gratuitous and this blog post might be of interest.]

An important factor in hit selection is to prioritise compounds with higher ligand efficiency. Ligand efficiency, defined as activity [LE is actually defined in terms of Gibbs free energy of binding and not activity.] per heavy atom (LE=1.37 * pKi/Heavy Atom Count, Figure 7a), is commonly considered in discovery programmes as a quality metric [33]. [LE (Equation 3 in the H2L best practices post graphic) is actually defined as the Gibbs free energy of binding, ΔG° (Equation 2 in H2L best practices post graphic), divided by the number of non-hydrogen atoms, NnH (this is identical to heavy atom count although I consider the term to be less confusing), but the quantity is physically (and thermodynamically) meaningless because perception of efficiency varies with the arbitrary concentration, C°, that defines the standard state (see Table 1 in NoLE). Using a standard concentration enables us to calculate changes in free energy that result from changes in composition and, while the convention of  using C° = 1 M when reporting ΔG° values. is certainly useful, it would be no less (or more) correct to report ΔG° values for  C° = 1 µM. Put another way the widely held belief that 1 M is a 'privileged' standard concentration is thermodynamic nonsense (Equation 2 in the H2L best practices post graphic shows you how to interconvert ΔG° values between different standard concentrations). Given the serious deficiencies of LE as a drug design metric, I suggest modelling the response of affinity to molecular size and using the residuals to quantify the extent that individual potency measurements beat (or are beaten by) the trend in the data (the approach is outlined in the 'Alternatives to ligand efficiency for normalization of affinity' section of NoLE). There are two errors in the expression that the Authors have used for LE (the molar energy units are missing and the expression is written in terms of Ki rather than KD). 
The factor of 1.37 in the expression for LE comes from the conversion of affinity (or potency) to ΔG° at a temperature of 300 K, as recommended in [35], although biochemical assays are typically run at human body temperature (310 K). My view is that it is pointless to include the factor of 1.37 given that this entails dropping the molar energy units and using a temperature other than that at which the assay was run. Dropping the factor of 1.37 would also bring LE into line with LLE (LipE).] Various analyses suggest that, on average, this value barely change over the course of an optimisation process [20b | 27 | 34a | 34b] - so it is important to consider maintenance of any figure during any early SAR studies. [I disagree with this recommendation. These analyses are completely meaningless because the variation of LE over the course of an optimization itself varies with the concentration unit in which affinity (or potency) is expressed (Table 1 of NoLE illustrates this for three ligands that differ in molecular size and potency). In [34a] the start and finish values of LE were averaged over the different optimizations without showing variance and it is therefore not accurate to state that the study supports the assertion that LE values "barely change over the course of an optimisation process".] Lipophilic Ligand Efficiency (activity minus lipophilicity typically pKi -log P, Figure 7b), which is widely recognised as the key principle in successful drug optimisation, comes into play both for hit prioritization and optimisation. [LLE is a simple mathematical expression and I don’t consider it accurate to describe it as a “principle” let alone “the key principle in successful drug optimisation”. LLE can be thought of as quantifying the energetic cost of transferring a ligand from octanol to its target binding site although this interpretation is only valid when the ligand is predominantly neutral at physiological pH and binds in its neutral form.
LLE is just one of a number of ways to normalize potency with respect to lipophilicity and I don't think that anybody has actually demonstrated that (pIC50 – log P) is any better (or worse) as a drug design principle than pIC50 – 0.9 × log P. When drug discovery scientists report that they have used LLE it often means that they have plotted their project data in a similar manner to Figure 3 as opposed to staring at a table of LLE values for their compounds. As an alternative to LLE (LipE) for normalization of affinity (or potency) with respect to lipophilicity I suggest modelling the response of affinity (or potency) to lipophilicity and using the residuals to quantify the extent to which individual potency measurements beat (or are beaten by) the trend in the data (the approach is outlined in the 'Alternatives to ligand efficiency for normalization of affinity' section of NoLE).] Improving this value reflects producing potent compounds without adding excessive lipophilicity. Taken together, it has been shown that for any given target, the drugs mostly lie towards the leading "nose" [?] where LE and LLE are both towards higher values [20b | 35]. [This is perhaps not the penetrating insight that the Authors consider it to be, given that drugs are usually more potent than the leads and hits from which they have been derived.] However, setting aspirational targets for either metric is unwise, as analysis of outcomes indicates that the values are target dependant [20b]. [I consider target dependency to be a complete red herring in this context and a more important issue is that you can’t compensate for inadequate potency by reducing molecular size or lipophilicity.] Focusing on increasing LLE to the maximum range possible and prioritizing series with higher average values is the recommended strategy [27 | 36].
[It is not clear what is meant by “increasing LLE to the maximum range possible” and I consider it very poor advice indeed to recommend “prioritizing series with higher average values” (my view is that you actually need to be comparing the compounds from different series that have a realistic chance of matching the desired lead profile). The Authors of Q2025 appear to be misrepresenting [36] given that the study does not actually recommend “prioritizing series with higher average values”. This blog post on [27] might be relevant.]
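To make the standard-state argument concrete, here is a minimal Python sketch (the pKd values and heavy atom counts are invented for illustration and do not come from any of the studies discussed) showing that the ordering of compounds by LE depends on the arbitrary choice of standard concentration, which is the point made by Table 1 of NoLE:

```python
import math

R = 1.987e-3  # molar gas constant in kcal/(mol*K)
T = 300.0     # K; R*T*ln(10) is approximately 1.37 kcal/mol, the factor in the LE expression

def delta_g(pkd: float, std_conc_molar: float = 1.0) -> float:
    """Standard Gibbs free energy of binding (kcal/mol) for a given pKd
    and a given choice of standard concentration C° (in mol/L)."""
    return -R * T * math.log(10) * (pkd + math.log10(std_conc_molar))

def ligand_efficiency(pkd: float, n_heavy: int, std_conc_molar: float = 1.0) -> float:
    """LE = -ΔG°/NnH; note the dependence on the arbitrary standard concentration."""
    return -delta_g(pkd, std_conc_molar) / n_heavy

# Hypothetical compounds: (pKd, heavy atom count)
big_potent = (9.0, 40)   # large, potent compound
small_weak = (5.0, 10)   # small, weakly potent compound

for c0 in (1.0, 1e-6):   # C° = 1 M versus C° = 1 uM
    le_a = ligand_efficiency(*big_potent, c0)
    le_b = ligand_efficiency(*small_weak, c0)
    print(f"C0 = {c0:g} M: LE(big_potent) = {le_a:+.2f}, LE(small_weak) = {le_b:+.2f}")
```

With C° = 1 M the small, weakly potent compound appears far more 'efficient'; switching to C° = 1 µM reverses the ordering. Since neither choice of standard concentration is thermodynamically privileged, LE cannot tell you which of the two compounds better beats the trend in the data.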

One can summarize this section with a simple but critical best practice: potency and properties (physicochemical and ADMET) have to be optimized in parallel (Figure 8) [37] to get to quality leads and later drug candidates with higher chances of clinical success. Whilst seemingly trivial, this proposition is rendered challenging by an "addiction to potency" and a constant reminder of this critical concept remains useful for medicinal chemists [38]. [My view is that many medicinal chemists had already moved on from the addiction to potency when the molecular obesity article was published a decade and a half ago and I would question the article's relevance to contemporary H2L practice. The threshold values that define the GSK 4/400 rule actually come from an arbitrary scheme used to categorize the proprietary data analyzed in the G2008 study as opposed to being derived from objective analysis of the data. The study reproduces the promiscuity analysis from [13a] which I criticised earlier in this post for exaggerating the strength of the trend and using an excessively permissive threshold for ‘activity’.] With poor properties, even "good ligands" may not fully answer pharmacological questions [39a | 39b]. [These two articles focus on chemical probes and I don’t consider either article to have any relevance to H2L work.  Chemical probes need to be highly selective (more so than drugs) and permeable although solubility requirements are likely to be less stringent when using chemical probes to study intracellular phenomena than in H2L work and you don't generally need to worry about achieving oral bioavailability.] 

I agree that mapping SARs for structural series of interest is an important aspect of H2L work and activity cliffs (small modifications in structure resulting in large changes in activity) are of particular interest given the potential for beating trends and achieving greater selectivity. Instances of decreased lipophilicity resulting in increased potency (or at least minimal loss of potency) should also be of significant interest to H2L teams. When mapping SARs it is important that structural transformations change a single pharmacophore feature at a time and one should always consider potential ‘collateral effects’, such as perturbed conformational preferences, that might confound the analysis. Some of the structural transformations shown in Figure 10 change more than one pharmacophore feature at a time, which makes it impossible to determine which pharmacophore feature is required for activity.

Figure 10. Conceptual example of iterative SAR [The meaning of the term “iterative SAR” is unclear] to determine the pharmacophore. As each change may affect binding interactions, conformation and ionization state, complementary structural modification [The meaning of "complementary structural modification" is unclear] will be needed to understand the change in potency and determine the pharmacophore.

Is Nitrogen needed (e.g. HBA)? [In addition to eliminating the quinoline N hydrogen bond acceptor this structural transformation eliminates a potential pharmacophore feature (the amide carbonyl oxygen can function as a hydrogen bond acceptor) while creating a cationic centre which will incur a significant desolvation penalty.]

Is NH needed? [This structural transformation eliminates the amide NH but it is also unlikely to address the question of whether the NH is needed because the amide carbonyl has also been eliminated.]

Is carbonyl needed? [The elimination of the amide carbonyl oxygen (hydrogen bond acceptor) creates a cationic centre which will incur a desolvation penalty.] 

As a last proposition, [49a | 49b] we suggest that the progress in computational physicochemical and ADMET property predictions represents an opportunity to accelerate the optimisation of molecules with a "predict-first" mindset [4 | 50]. [I certainly agree that models should be used if they are available. However, the citation of literature does appear to be gratuitous and it is unclear why the Authors believe that scientists working on H2L projects will benefit from knowing that a proprietary system for automated molecular design has been developed at GSK.] The first step is to generate sufficient data for a series to build confidence in [51] any models, which can then be exploited in the prioritization of compounds for synthesis that fit with aspirational profiles. [My view is that it would be very unwise for H2L project teams to blindly use models without assessing how well the models predict project data although I consider the citation of [51] to be gratuitous. Typically, H2L project teams use measured data to move their projects forward and generating data purely for the purpose of model evaluation is likely to be a distraction. One piece of advice that I will offer to H2L project teams is that they attempt to characterise responses of ADME-relevant properties, such as aqueous solubility and permeability, to lipophilicity (this is likely to involve measurements for less potent compounds).] This ensures higher physicochemical quality [I consider “ensures” to be an exaggeration and I would argue that “physicochemical quality” is not something that can even be defined meaningfully or objectively (let alone quantified).], asks more pertinent questions and might reduce the total number of molecules made to get to the lead (Figure 11).
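As an illustration of what assessing a model against project data might look like in practice, the sketch below (pure Python, with invented pIC50 values) computes two simple checks that an H2L team could run before trusting a model for compound prioritisation: the mean absolute error of the predictions, and a hand-rolled Kendall rank correlation that asks whether the model orders the compounds the same way the assay does:

```python
# Hypothetical measured pIC50 values and model predictions for six compounds
# (illustrative numbers only, not from any study).
measured  = [6.2, 5.1, 7.4, 4.8, 6.9, 5.6]
predicted = [6.0, 5.5, 7.0, 5.8, 6.4, 5.9]

n = len(measured)
mae = sum(abs(m - p) for m, p in zip(measured, predicted)) / n

# Kendall rank correlation: count compound pairs ordered the same way
# (concordant) versus the opposite way (discordant) by model and assay.
concordant = discordant = 0
for i in range(n):
    for j in range(i + 1, n):
        s = (measured[i] - measured[j]) * (predicted[i] - predicted[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
tau = (concordant - discordant) / (n * (n - 1) / 2)

print(f"MAE = {mae:.2f} log units, Kendall tau = {tau:.2f}")
```

An error of half a log unit may be perfectly acceptable for triaging synthesis ideas but useless for fine-grained ranking, so it is the team's own tolerance (not a generic benchmark) that should decide whether the model has earned a place in the prioritisation workflow.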

The Authors offer advice on how to ensure that optimisation is progressing in a satisfactory manner and how to know when to stop working on the series.

A Lead is not the perfect drug, but it gives reason to believe that the chemical series might be able to deliver one. An essential part of H2L (and later lead optimisation) is to ensure that the optimisation is progressing so that further investment is justified. Some essential questions can help achieve this: Does your series show dynamic SAR [The Authors need to say exactly what they mean by “dynamic SAR” if this is indeed the essential question that they assert it to be.] and achievable desired potency? Is the preliminary ADMET data encouraging? [The Authors need to define “encouraging” if this is indeed an essential question.] Do you have evidence of in vivo effect (PK/PD) at appropriate exposures? [I would question the necessity of PK/PD studies before starting lead optimisation and there are potential ethical concerns about doing in vivo work using compounds that lack the potency required for meaningful PK/PD assessment.] Do the remaining challenges show dynamic SAR and confidence they can be optimized? [The term “remaining challenges” is vague and it is not clear how H2L scientists are supposed to assess “dynamic SAR” for remaining challenges that are not defined in terms of activity.] To answer this, it's critical to monitor the trajectory [As I pointed out previously in the post it is not generally feasible to objectively map optimization paths and I consider the use of “trajectory” to be inappropriate in this context, given that it usually applies to a well-defined path that is determined at launch (for example, a molecular dynamics trajectory).] of the optimisation: e.g. by monitoring relevant properties over time. [Typically, H2L teams assess how closely the best compounds match the lead target profile (LTP) as opposed to monitoring time dependencies of properties such as log D that have limited predictivity.]
In the absence of progress, discontinuing further work on a scaffold or series may be justified, with reason to focus on other promising structures or recommend termination on a data-driven basis. [Generally, the decision to terminate projects and series will be made on the basis of failure to satisfy the LTP.] 
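Since I have argued that H2L teams typically assess compounds against the lead target profile rather than monitoring 'trajectories', here is a hypothetical sketch of what an LTP check might look like in code (the property names and threshold values are invented for illustration and should not be read as recommended criteria):

```python
# Hypothetical lead target profile: each property maps to a (comparator,
# threshold) pair. Names and values are illustrative only.
LTP = {
    "pIC50":         (">=", 7.0),
    "logD":          ("<=", 3.0),
    "solubility_uM": (">=", 100),
    "hERG_pIC50":    ("<=", 5.0),
}

def ltp_failures(compound: dict) -> list:
    """Return the list of LTP criteria a compound fails (empty list = matches)."""
    failures = []
    for prop, (op, threshold) in LTP.items():
        value = compound[prop]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            failures.append(f"{prop} = {value} (needs {op} {threshold})")
    return failures

cpd = {"pIC50": 7.3, "logD": 3.4, "solubility_uM": 150, "hERG_pIC50": 4.6}
print(ltp_failures(cpd))  # -> ['logD = 3.4 (needs <= 3.0)']
```

Framing series assessment this way keeps the termination question data-driven: the discussion becomes "which LTP criteria are the best compounds still failing, and is there dynamic SAR against those criteria?" rather than an argument about whether a property is trending in the right direction over time.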

It's been a long post and I'll say a big thank you for staying with me until the end. I wrote this post primarily for early-career scientists as well as for drug discovery scientists in academia and students (although I hope the feedback will also be helpful for the EFMC).  One piece of advice that I will offer to all scientists regardless of the stage of their careers is to not switch off your critical thinking skills just because a study is presented as defining best practices or has been highly-cited. In particular, I urge all scientists to be extremely wary of studies in which the conclusions don't follow from the data and I'll share a recent blog post that illustrates the problem. All that said, however, confused thinking amongst drug discovery scientists is not high on the list of the problems facing many of the world's inhabitants right now and my wish for 2026 is for a kinder, gentler, fairer and more peaceful world.