Sunday 20 October 2024

Assessment of AI-generated chemical structures using ML

Previous << || >> Next  

In an earlier post I considered what it might mean to describe drug design as AI-based. In this post I’ll take a general look at using machine learning (ML) to predict biological activity (and other pharmaceutically-relevant properties) for AI-generated chemical structures. Whether or not ML models ultimately prove to be fit for this purpose it is worth pointing out that many visionaries and thought leaders who tout computation as a panacea for humanity’s ills fail to recognize the complexity of biology (take a look at In The Pipeline posts from 2007 | 2015 | 2024). One point worth emphasizing in connection with the complexity of biology is that it is not currently possible to measure the concentration of a drug at its site of action for intracellular targets in live humans (here's an article on intracellular and intraorgan drug concentration that I recommend to everybody working in drug discovery and chemical biology). While I won't actually be saying anything about AI (here's a recent post from In The Pipeline that takes a look at how things are going for early movers in the field of AI drug discovery) in the current post I'll reiterate the point with which I concluded the earlier post:

One error commonly made by people with an AI/ML focus is to consider drug design purely as an exercise in prediction while, in reality, drug design should be seen more in a Design of Experiments framework.  

In that earlier post I noted that there’s a bit more to drug design than simply generating novel molecular structures and suggesting how the compounds should be synthesized. While I'm certainly not denying the challenges presented by the complexity of biology the current post will focus on some of the challenges associated with assessing chemical structures churned out by generative AI. One way of doing this is to build models for predicting biological activity and other pharmaceutically relevant properties such as aqueous solubility, permeability and metabolic stability. This is something that people have been trying to do for many years and the term ‘Quantitative Structure-Activity Relationship’ (QSAR) has been in use for over half a century (the inaugural EuroQSAR conference was held in Prague in 1973 a mere five years after Czechoslovakia had been invaded by the Soviet Union, the Polish People's Republic, the People's Republic of Bulgaria, and the Hungarian People's Republic). My view is that many of the ML models that get built with drug design in mind could accurately be described as QSAR models and I would not describe QSAR models as AI.

In the current post, I'll be discussing ML models for predicting quantities such as potency, aqueous solubility and permeability that are continuous variables which I refer to as 'regression-based ML models' (while some readers will not be happy with this label I do need to make it absolutely clear that the post is about one type of ML model and the label 'QSAR-like' could also have been used). I’ll leave classification models for another post although it’s worth mentioning that genuinely categorical data are actually rare in drug discovery (you should always be wary of gratuitous categorization of continuous data since this is a popular way to disguise the weakness of trends and KM2013 will give you some tips on what to look out for). It also needs to be stressed that the ML is a very broad label and that utility in one area (prediction of protein-folding for example) doesn't mean that that ML models will necessarily prove useful in other area.      

To build a regression-based ML model you first need to assemble a training set of compounds for which the appropriate measurements have been made and pIC50 values are commonly used to quantify biological activity (I recommend reading the LR2024 study on combining results from different assays although, as discussed in this post, I don’t consider it meaningful to combine data from multiple pairs of assays when calculating correlation-based metrics for assay compatibility). Next, you calculate values of descriptors for the chemical structures of the compounds in your training set (descriptors are typically derived from the connectivity in the chemical structure although atom counts and predicted values of physicochemical properties are also used). Finally, you use the ML modelling tools to find a function of the descriptors that best predicts the biological activity (or a pharmaceutically-relevant property) for the compounds in the training set. Generally you should also validate your models and this is especially important for models with large numbers of adjustable parameters.

There appears to be a general consensus that you need plenty of data for building ML models and some will even say “quantity has a quality all of its own” (this is sometimes stated as Stalin’s view of the T-34 tank although I consider this unlikely and the T-34 was actually an excellent tank which also happened to get produced in large numbers). Most people building regression-based ML models are also aware that you need a sufficiently wide spread in the measured data used for training the model (the variance in the measured data should be large in comparison with the precision of the measurement). Lead optimization is typically done within structural series and building a regression-based ML model that is predictively useful is likely to require data that have been measured for compounds in the structural series of interest.  These data requirements are quite stringent and I see this as one reason that QSAR approaches do not appear to have had much impact on the discovery of drugs despite the drug discovery literature being awash with QSAR articles. Back in 2009 (see K2009) I compared prediction-driven drug design with hypothesis-driven drug design, noting that the former is often not viable and that the latter is more commonly used in pharmaceutical and agrochemical discovery (former colleagues discussed hypothesis-driven molecular design in the context of the design-make-test-analyse cycle in the P2012 article).

With freshly painted T-34 at Brest Fortress, Belarus (June 2017)

There are some other points that you need to pay attention to when building regression-based ML models.  First, replicate measurements for the response variable (the quantity that you’re trying to predict) should be normally distributed and this is one reason why we model pIC50 rather than IC50. Second, the data values for the training set should be uniformly distributed in the descriptor space (my view, expressed in B2009, is that many 'global' predictive models are actually ensembles of local models). Third, the descriptors should not be strongly correlated or the method used to build the regression-based ML model must be able to account for relationships between descriptors (while it’s relatively straightforward to handle linear relationships between descriptors in simple regression analysis it’s not clear how effectively this can be achieved with more sophisticated algorithms used for building regression-based ML models).

I’ve created a graphic (Figure 1) to illustrate some of the modelling difficulties that result from uneven coverage in the descriptor space and it goes without saying that reality will be way more complex. The entities that occupy this chemical space are compounds and the coordinates of a point show the values of the descriptors X1 and X2 that have been calculated from the corresponding chemical structures (the terms ‘2D structure’ and ‘molecular graph’ also used). I’ve depicted real compounds for which measured data are available as black circles and virtual compounds (for which predictions are to be made) as five-pointed stars. The clusters (color-coded but also labelled A, B and C in case any readers are colour blind) are much more clearly defined than would be the case in a real chemical space. Proximity in chemical space implies similarity between compounds and the clusters might correspond to three different structural series.

Let’s suppose that we’ve been able to build a useful local model to predict pIC50 for each cluster even though we’ve not been able to build a predictively useful global model. Under this scenario you’d have a relatively high degree of confidence in the pIC50 values predicted for the virtual compounds (depicted as five-pointed stars) that lie within the clusters and a much lower degree of confidence in the virtual compound that is indicated by the arrow. If, however, we were to ignore the structure of the data and take a purely global view then we would conclude that the virtual compound indicated by the arrow occupied a central location in this region of chemical space and that the other three virtual compounds occupied peripheral locations. Put another way, the applicability domain of the model is not a single contiguous region of chemical space and what would appear to be an interpolation by a model is actually an extrapolation. 

It is important to take account of correlations between descriptors when building prediction models. A commonly employed tactic is to perform principal component analysis (PCA) which generates a new set of orthogonal descriptors and also provides an assessment of the dimensionality of the descriptor space. There are also ways to deal with correlations between descriptors in the model building process (PLS is the best known of these and the K1999 review might also be of interest). Correlations between descriptors also complicate interpretation of ML models and my stock response to any claim that an ML model is interpretable would be to ask how relationships between descriptors had been accounted for in the modelling of the data. An excellent illustrative example (see L2012) of a correlation between descriptors is the tendency of the presence of a basic nitrogen in a chemical structure to be associated with higher values of the Fsp3 descriptor (which, as pointed out in this post, should really be referred to as the I_ALI descriptor).

Let’s take another look at Figure 1. The axes of the ellipse representing Cluster A are aligned with the axes of the figure which tells us that X1 and X2 are uncorrelated for the compounds in this cluster.  Cluster B is also represented by an ellipse although its axes are not aligned with the axes of the figure which implies a linear correlation between X1 and X2 for the compounds in this cluster (you can use PCA to create two new orthogonal descriptors by rotating the plot around an axis that is perpendicular to the X1-X2 plane). Cluster C is a bigger problem because the correlation between X1 and X2 is non-linear (the cluster is not represented as an ellipse) and it would be rather more difficult to generate two new orthogonal descriptors for the compounds in this cluster. My view is that  PCA is less meaningful when there is a lot of clustering in data sets and I would also question the value or PLS and related methods in these situations. 

Let’s consider another scenario by supposing that we’ve been unable to build a useful local model for prediction of any of the three clusters in Figure 1.  If, however, the average pIC50 values differ for each of the three clusters we can still extract some predictivity from the data by finding a function of X1 and X2 that correlates with the average pIC50 values for the clusters. This is one way that clustering of compounds in the descriptor space can trick you into thinking that a global model has a broader applicability domain than is actually the case. Under this scenario it would be very unwise to try to interpret the model or use it to make predictions for compounds that sit outside the clusters. 

This is a good point at which to wrap up my post on regression-based ML (or QSAR-like if you prefer) models for predicting biological activity and other properties relevant to drug design such as aqueous solubility, permeability and metabolic stability. There appears to be a general consensus that building these models requires a lot of data and, in my view, this means that models like these are actually of limited utility in real world drug design. The basic difficulty is that a project team with enough data for building useful regression-based ML models is likely to be at a relatively advanced stage (the medicinal chemists will already understand the structure-activity relationships and are aware of project-specific issues such as poor aqueous solubility or high turnover by metabolic enzymes). Drug discovery scientists tend to be less aware of the problems that arise from clustering of compounds in descriptor space and, in my view, this is a factor that should be considered by those seeking to assemble data sets for benchmarking (see W2024). I'll leave you with a suggestion (it was considered a terrible idea at the time) I made over twenty years ago that each predicted value should be accompanied by chemical structures and measured values for the three closest neighbours in the descriptor space of the model.