Here’s a photo from one of my exercise walks in Paramin and you can see the Caribbean Sea in the distance. This is perhaps my favourite view on the walk because it means that I’ve just got to the top of a particularly brutal hill (cars sometimes struggle to get to the top and on one occasion I watched a car fail miserably in four attempts) although you can’t always see the sea as clearly as in this photo.
The current post follows up on my post on the LR2024 study (Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise). In the current post, I’ll be discussing in general terms how I might use ChEMBL to assemble data sets for training what I refer to in another post as regression-based machine learning (ML) models. These models can reasonably be described as quantitative structure-activity relationships (QSARs) because 'activity' is a continuous (as opposed to categorical) variable. However, the term 'QSAR' does appear to be used less these days, possibly reflecting the limited impact that QSAR approaches have made on real-world drug discovery, and it's also much easier to persuade people that you're doing artificial intelligence (AI) if you describe your QSAR models as ML models. In this post I shall refer to regression-based ML models for biological activity simply as 'QSAR-like ML models'.
Much of the focus of AI-based drug design appears to be on the generation of novel chemical structures and on devising synthetic routes for the relevant compounds. Many who tout AI as a panacea for the ills of drug discovery appear to assume that predictively useful QSAR-like ML models will be available, or can readily be built, even in the early stages of drug discovery projects. I remain skeptical and my view is that if sufficient data are available in ChEMBL for building useful QSAR-like ML models then it is likely that somebody else has already got to where you would like to be. Nevertheless, I do see value in automating the assembly of bioactivity data sets from ChEMBL even if it does not prove feasible to build useful QSAR-like ML models, and I'll also be discussing some of the ways that you might use such data sets in the early stages of a drug discovery project.
My first step when assembling a data set (which I'll refer to as a 'bioactivity data set') for training QSAR-like ML models would be to extract from ChEMBL all (in-range) measured values for potency and affinity in assays that have been run against the target of interest. Potency and affinity should be expressed logarithmically for modelling, as shown in the figure below, and the relevant values are often referred to collectively as ‘pChEMBL’ values (I note in posts here from September and December of 2024 that the term is used in the literature without being properly defined). I would generally anticipate that there will be only a single pChEMBL value for most compounds and, where multiple pChEMBL values are available for a compound, I would use their mean to quantify its bioactivity. For compounds with more than one pChEMBL value I would also calculate the standard deviation since this is another way to assess what is referred to as assay compatibility in the LR2024 study.
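Here's a minimal sketch of how this first step might be automated, assuming the chembl_webresource_client package is installed and that you know the ChEMBL ID of your target (the ID below is purely a placeholder):

```python
# A sketch of the first step: pull in-range pChEMBL values for a target and
# aggregate them per compound. CHEMBL279 is a placeholder target ID.
import pandas as pd
from chembl_webresource_client.new_client import new_client

TARGET_ID = "CHEMBL279"  # placeholder; substitute the ChEMBL ID of your target

activities = new_client.activity.filter(
    target_chembl_id=TARGET_ID,
    pchembl_value__isnull=False,  # pChEMBL values are only assigned to exact, in-range measurements
    standard_relation="=",
).only(["molecule_chembl_id", "pchembl_value", "standard_type", "assay_chembl_id"])

df = pd.DataFrame(activities)
df["pchembl_value"] = df["pchembl_value"].astype(float)

# One row per compound: the mean pChEMBL value quantifies bioactivity and the
# standard deviation (where more than one measurement exists) gives a crude
# handle on assay compatibility.
per_compound = (
    df.groupby("molecule_chembl_id")["pchembl_value"]
      .agg(n="count", pchembl_mean="mean", pchembl_sd="std")
      .reset_index()
)
print(per_compound.head())
```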
A bioactivity data set assembled in this manner would have a single bioactivity value for each compound and I would take a look at how many compounds data are available for, because this information might be useful for deciding whether or not to build a QSAR-like ML model. However, you need to be careful about using the size of the data set for making decisions like this because you can get away with fewer data values if these are better distributed from the perspective of model-building (a view from Orwell's Animal Farm might have been: uniform good, bimodal bad) and the comment that Stalin is alleged to have made about the T-34 tank (quantity has a special quality all of its own) is perhaps not quite the ground truth that many ML modellers believe it to be. JFK's advice to ML modellers might have been: ask not whether you have enough data but whether the available data satisfy the requirements for modelling.
My next step would be to examine the distribution of data values in the bioactivity data set. I would take a look at the spread in bioactivity values (for modelling the spread in values should be large). If the distribution of the bioactivity data set is Gaussian then a standard deviation of 0.8 log units will place 80% of the data values in a range of 2.05 log units (I used this handy Normal percentile calculator) and I wouldn't attempt to build a QSAR-like ML model if the standard deviation was less than this (unless the person 'asking' me to build the model was also going to perform my annual performance review 😁). I would also visualise the distribution of bioactivity values because a noticeably polymodal distribution should ring a few alarm bells for me (clustering in training data may cause validation procedures to arrive at optimistic assessments of model quality).
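As a rough illustration of how these checks might look in code (continuing from the per_compound table in the earlier sketch, and using scipy and matplotlib):

```python
# Check the spread and shape of the aggregated bioactivity values
# (per_compound comes from the earlier sketch).
import matplotlib.pyplot as plt
from scipy.stats import norm

values = per_compound["pchembl_mean"]
sd = values.std()

# For a Gaussian, the central 80% of values span 2 * 1.2816 standard deviations,
# so sd = 0.8 corresponds to a range of about 2.05 log units.
central_80_range = 2 * norm.ppf(0.9) * sd
print(f"sd = {sd:.2f}; central 80% range = {central_80_range:.2f} log units")

# Visualise the distribution: a noticeably polymodal histogram should ring alarm bells.
plt.hist(values, bins=30)
plt.xlabel("mean pChEMBL value")
plt.ylabel("number of compounds")
plt.show()
```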
Having established an acceptable spread in the bioactivity data I would take a look at where the distribution of bioactivity values is centred. Specifically, I would not attempt to build a QSAR-like ML model unless at least 50% of the compounds in the bioactivity data set exhibited sub-micromolar activity and, for a Gaussian distribution, this would correspond to a mean bioactivity value of 6. If this seems a bit extreme it’s worth pointing out that accurately measuring an IC50 value of 10 μM requires that the compound be soluble, while neither aggregating nor interfering with assay read-out, at a concentration of 100 μM. Problems with biochemical assays typically increase when you test compounds at higher concentrations and this is one reason that biophysical assays are generally preferred for screening fragments. With sufficient care you can run biochemical assays at high concentrations and the S2009 article by former colleagues shows how you can assess (and potentially correct for) assay interference. Inadequate aqueous solubility, however, is not something that you can generally deal with. One difficulty when assembling data sets from ChEMBL for building QSAR-like ML models is that it can be hard to assess how carefully low-affinity compounds have been assayed.
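Continuing the sketch, checking where the distribution is centred only takes a couple of lines (recall that a pChEMBL value of 6 corresponds to 1 μM):

```python
# Check where the distribution is centred (per_compound as above).
# A pChEMBL value of 6 corresponds to 1 μM, so values above 6 indicate
# sub-micromolar activity.
frac_submicromolar = (per_compound["pchembl_mean"] > 6).mean()
print(f"mean pChEMBL: {per_compound['pchembl_mean'].mean():.2f}")
print(f"fraction sub-micromolar: {frac_submicromolar:.0%}")
```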
Before starting to assemble a data set for training QSAR-like ML models I would also assess the target from an assay perspective (in a real world drug discovery scenario this assessment would be done in collaboration with bioscientists). In particular, I would be looking for indications, such as kinact values being reported, that activity is due to irreversible mechanisms of action. The bioactivity of an irreversible covalent inhibitor can be considered to be 'two-dimensional' (affinity for formation of the non-covalently bound target-ligand complex and the rate constant for covalent bond formation) and I'll point you to S2016 and McW2021 for more information. It is important to have sufficient spreads in both kinact and Ki values when building QSAR-like ML models for irreversible inhibitors and you also need to be aware of any limits that the assays place on the values that can be reliably quantified. It is common for IC50 values to be reported in the literature for irreversible inhibitors although, with care (see T2021), these can be used in drug discovery projects. However, it's important to bear in mind that using a single data value to quantify the bioactivity of an irreversible inhibitor necessarily results in information loss and that the ChEMBL curation procedures do not generally capture assay protocols at the level of detail that would be required for combining IC50 values from different studies even when the inhibition is reversible. This should not be taken as a criticism of ChEMBL and I consider recording assay protocols at this level of detail to be well beyond the call of duty for those curating the bioactivity data.
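A crude way to look for such indications is to query ChEMBL (using the same client and target ID as in the first sketch) for activity records whose standard_type hints at an irreversible mechanism. The type labels in the sketch below are my guesses at how such measurements tend to be recorded, not a definitive list:

```python
# Flag activity records against the target whose standard_type suggests an
# irreversible mechanism. The type labels below are assumptions about how such
# measurements tend to be recorded in ChEMBL, not a definitive list.
irreversible_types = ["kinact", "kinact/Ki", "k_inact", "k_obs"]

covalent_records = new_client.activity.filter(
    target_chembl_id=TARGET_ID,
    standard_type__in=irreversible_types,
).only(["molecule_chembl_id", "standard_type", "standard_value", "standard_units"])

if len(covalent_records) > 0:
    print(f"{len(covalent_records)} records hint at an irreversible mechanism; "
          "treat IC50 values for this target with extra care.")
```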
Now let’s take a look at a scenario in which the objective is to initiate a drug discovery project (as opposed to merely building QSAR-like ML models for the purpose of publication). One point that I really do need to stress is that you’re far from helpless if the data available in ChEMBL do not satisfy the requirements for building QSAR-like ML models. First, you can try to source structural analogs of bioactive compounds (there are many more options these days for doing this than when I worked in industry and you can also look beyond ChEMBL, in patents for example, when identifying bioactive compounds) and, in any case, you’re going to need to source pure samples of compounds to check that they are indeed bioactive. Second, you can use the structures of the active compounds to set up queries for pharmacophore matching and molecular shape matching (see GGP1996 | N2010). Third, if structural information is available for the target you can investigate how the active compounds might be interacting with the target and use this information to source potentially active compounds (these days it is feasible to use free energy calculations to predict affinity in addition to the scoring functions that have long been used for virtual screening and I’ll point you to C2021 | MH2023 | C2023). Fourth, you can look for structure-activity relationships (see SHC2005 for an early example of this and the more recent S2025 study which provides software) in the bioactivity data and one way of achieving this is to search for ‘activity cliffs' (significant differences in bioactivity for pairs of structurally similar compounds; see M2006 | GvD2008 | SB2012 | SHB2019 | vT2022; a rough search of this kind is sketched in the code after this paragraph) or more generally by analysing bioactivity in neighbourhoods around bioactive compounds. Fifth, you can look for instances of increased polarity (such as replacement of aromatic CH with aromatic N) being well-tolerated from the perspective of bioactivity (this can be thought of both in terms of lipophilic efficiency and as a variation on the activity cliff theme).
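To illustrate the activity-cliff idea from the fourth option, here's a rough RDKit-based sketch that flags pairs of structurally similar compounds with large differences in bioactivity. It assumes a 'smiles' column has been added to the per_compound table (for example, from ChEMBL molecule records), and the similarity and activity-difference thresholds are purely illustrative:

```python
# A rough activity-cliff search: flag pairs of structurally similar compounds
# with large differences in bioactivity. Assumes a 'smiles' column has been
# added to per_compound; thresholds are illustrative only.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

records = []
for _, row in per_compound.iterrows():
    mol = Chem.MolFromSmiles(row["smiles"])
    if mol is not None:
        records.append((row["molecule_chembl_id"], fpgen.GetFingerprint(mol), row["pchembl_mean"]))

cliffs = []
for (id1, fp1, act1), (id2, fp2, act2) in combinations(records, 2):
    similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
    if similarity >= 0.7 and abs(act1 - act2) >= 2.0:
        cliffs.append((id1, id2, round(similarity, 2), round(abs(act1 - act2), 1)))

print(f"{len(cliffs)} candidate activity cliffs found")
```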
Let’s now suppose that you can satisfy the data requirements for building QSAR-like models for the target of interest with data in ChEMBL. Does this mean that you can whip up some QSAR-like models, fire up your generative AI and get clinical candidates condensing out of the ether? I think not, and one implication of being able to build QSAR-like models using ChEMBL data is that others will have worked hard in the past trying to get to where you’d like to be in the future. Before you even start to build QSAR-like ML models you’ll need to assess the earlier work from the perspectives of both intellectual property and understanding why it didn't lead to clinical candidates. There are many rabbit holes that you can disappear down in drug discovery and here’s some advice from Otto von Bismarck (ironically it was a young, emotionally unstable, half-English Kaiser with a withered arm who brought down the Iron Chancellor):
Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.
If the available data do indeed satisfy the requirements for building QSAR-like ML models then it’s a pretty safe assumption that many of the data values will correspond to compounds from one or more structural series (see Figure 1 below which was taken from a previous post). Under this scenario the distribution of data points in the descriptor space is likely to be very uneven and you should anticipate that ‘global’ QSAR-like ML models built using such data will actually be ensembles of local models. One consequence of what I sometimes refer to as ‘clustering’ in the descriptor space is that what you might think is an interpolation is actually an extrapolation (take a look at the point highlighted by the arrow in Figure 1). Clustering in the descriptor space can also cause validation procedures to arrive at optimistic assessments of model quality because most data points have close neighbours and this can lead to overfitting (I discovered at EuroQSAR back in 2016 that some consider it rather uncouth to mention the H2003 study). Correlations between descriptors and related metrics such as Mahalanobis distance become less meaningful when there is a lot of clustering in the descriptor space. This in turn has implications for principal component analysis (commonly used to assess dimensionality of data sets and eliminate correlations between descriptors) and for methods such as PLS (see K1999) that aim to account for correlations between descriptors in regression analysis.
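One way to see the effect of clustering on validation is to compare random cross-validation with cross-validation in which compounds from the same cluster are kept in the same fold. The sketch below does this using Butina clustering of the fingerprints from the activity-cliff sketch; the model and the clustering cut-off are illustrative rather than recommendations:

```python
# Compare random cross-validation with cluster-based cross-validation to expose
# the optimism that clustering in descriptor space can produce. Uses the
# fingerprints/activities ('records') from the activity-cliff sketch.
import numpy as np
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

fps = [fp for _, fp, _ in records]
X = np.array([list(fp) for fp in fps])
y = np.array([act for _, _, act in records])

# Butina clustering on Tanimoto distances; compounds in the same cluster share a group label.
n = len(fps)
dists = [1 - DataStructs.TanimotoSimilarity(fps[i], fps[j]) for i in range(1, n) for j in range(i)]
clusters = Butina.ClusterData(dists, n, 0.4, isDistData=True)
groups = np.empty(n, dtype=int)
for label, members in enumerate(clusters):
    groups[list(members)] = label

model = RandomForestRegressor(n_estimators=200, random_state=0)
random_r2 = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2")
cluster_r2 = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(5), scoring="r2")
print(f"random-split R2:  {random_r2.mean():.2f}")
print(f"cluster-split R2: {cluster_r2.mean():.2f}")  # typically lower when the data are clustered
```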
For reasons outlined in the previous paragraph I wouldn’t generally combine data from different structural series when building QSAR-like ML models. I would, however, look for relationships between different structural series by, for example, aligning their defining scaffolds (or structural prototypes if you prefer) because this may allow the SAR observed for one scaffold to be overlaid onto another scaffold. Before attempting to build a QSAR-like ML model I would plot pIC50 against calculated logP for each structural series of interest with a view to assessing the response of bioactivity to increased lipophilicity (a weak correlation between bioactivity and lipophilicity is desirable but, if this is not the case, then the response should at least be relatively steep). I would also fit a straight line to the plot of pIC50 versus calculated logP because this allows the steepness of the response to be quantified and the residuals can be used (as discussed in the ‘Alternatives to ligand efficiency for normalization of affinity’ section of K2019) to quantify the extent to which individual pIC50 values beat the trend in the data (this information can be useful to medicinal chemists who wish to think about SAR although "the most interesting SAR is likely to be associated with the most deviant values" actually celebrates the youthful antics of the Honourable former Member for Witney).
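A sketch of the pIC50 versus calculated logP analysis might look like the following (RDKit's Crippen logP stands in for whichever logP calculator you prefer, and the 'smiles' column is assumed as before):

```python
# Fit a straight line to pIC50 versus calculated logP for a structural series
# and use the residuals to see which compounds beat the lipophilicity trend.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen

series = per_compound.dropna(subset=["smiles"]).copy()
series["clogp"] = [Crippen.MolLogP(Chem.MolFromSmiles(s)) for s in series["smiles"]]

slope, intercept = np.polyfit(series["clogp"], series["pchembl_mean"], 1)
series["residual"] = series["pchembl_mean"] - (slope * series["clogp"] + intercept)

print(f"slope = {slope:.2f} pIC50 units per logP unit")  # steepness of the lipophilicity response
print(series.nlargest(5, "residual")[["molecule_chembl_id", "pchembl_mean", "clogp", "residual"]])
```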
This is a good point at which to wrap up with some thoughts on the use of QSAR-like ML models in drug design. Back in 2009 I discussed (see K2009) the difference between hypothesis-driven molecular design and prediction-driven molecular design and I suggest that the former can be accommodated within an AI design framework. Some who assert the value of QSAR-like ML models for drug design appear to treat drug design as an exercise in prediction and what I've been arguing for quite a few years (see this Jan2015 blog post) is that it is more appropriately seen in a Design of Experiments framework (generate the necessary data as efficiently as possible). For many drug discovery projects the available data will not satisfy the requirements for building QSAR-like ML models until relatively late in the project and in some cases clinical candidates will be discovered without the data requirements for building QSAR-like ML models ever being satisfied (this is more likely to be the case when bioactivity cannot be represented by a single data value, as is the case for modalities such as irreversible inhibition and targeted protein degradation). I consider it essential to account for numbers of adjustable parameters and for correlations between descriptors (or features if you prefer) when building QSAR-like ML models, and I’m also concerned that the challenges presented by clustering in descriptor spaces are not properly acknowledged. It also needs to be said that it is consideration of exposure that differentiates drug design from ligand design and I recommend that everybody working in drug discovery and chemical biology read the SR2019 article.