I’ll be taking a look at machine learning (ML) in this post, which was prompted by AI in Drug Discovery - A Practical View From the Trenches by Pat Walters in Practical Cheminformatics. Pat’s post appears to have been triggered by Artificial Intelligence in Drug Design - The Storm Before the Calm? by Allan Jordan, which was published as a viewpoint in ACS Medicinal Chemistry Letters. Some of what I said in the Nature of QSAR is relevant to what I’ll be saying in the current post and I'll also direct readers to Will CADD ever become relevant to drug discovery? by Ash at Curious Wavefunction. Pat notes that Allan “fails to highlight
specific problems or to define what he means by AI” and goes on to say that he
prefers “to focus on machine learning (ML), a relatively well-defined subfield
of AI”. Given that drug discovery scientists have been modeling activity and properties of compounds for decades now, some clarity would be welcome as to which of the methods used in the earlier work would fall under the ML umbrella.
While not denying the potential of AI and ML in drug design, I note that both are associated with a lot of hype and it would be an error to confuse skepticism about the hype with criticism of AI and ML. Nevertheless, there are some aspects of cheminformatic ML, such as chemical space coverage, that don't seem to get discussed quite as much as I think they should be, and these are what the current post is focused on. I also have a suspicion that some of the ML approaches touted for drug design may be better suited to dealing with responses that are categorical (e.g. pIC50 > 6) than with responses that are continuous (e.g. pIC50 = 6.6). When discussing ML in drug design, it can be useful to draw a distinction between 'direct' applications of ML (e.g. prediction of the behavior of compounds) and 'indirect' applications of ML (e.g. synthesis planning; image analysis). This post is primarily concerned with direct applications of ML.
As has become customary, I’ve included some photos to break up the text a bit. These all feature albatrosses and I took them on a 2009 visit to the South Island of New Zealand. Here's a live stream of a nest at the Royal Albatross Centre on the Otago Peninsula.
Spotted on Kaikoura whale watch
My comment on Pat’s post has just appeared so I’ll say pretty much what I said in that comment here. I would
challenge the characterization of ML as “a relatively well-defined
subfield of AI”. Typically, ML in cheminformatics focuses on (a) finding regions
in descriptor space associated with particular chemical behaviors or (b) relating measures of chemical behavior to values of
descriptors. I would not automatically
regard either of these activities as subfields of AI any more than I would
regard Hansch QSAR, CoMFA, Free-Wilson Analysis, Matched
Molecular Pair Analysis, Rule of 5 or PAINS filters as subfields of AI. I’m
sure that there will be some cheminformatic ML models that can accurately be
described as a subfield of AI but to tout each and every ML method
as AI would be a form of hype.
At Royal Albatross Centre, Otago Peninsula.
Pat states “In essence, machine learning can be thought of as ‘using
patterns in data to label things’” and this could be taken as implying that ML models can only handle categorical responses. In drug design, the responses that we would like to predict using ML are typically
continuous (e.g. IC50; aqueous solubility; permeability; fraction unbound;
clearance; volume of distribution) and genuinely categorical data are rarely
encountered in drug discovery projects. Nevertheless, it is common in drug
discovery for continuous data to be made categorical (sometimes we say that the
data has been binned). There are a number of reasons why this might not be such
a great idea. First, binning continuous data throws away huge amounts of
information. Second, binning continuous data distorts relationships between
objects (e.g. a pIC50 activity threshold of 6 makes pIC50 = 6.1 appear to be
more similar to pIC50 = 9 than to pIC50 = 5.9). Third, categorical analysis does
not typically account for ordering (e.g. high | medium | low) of the categories. Fourth, one needs to show that the conclusions of the analysis do not depend on how the continuous data have been categorized. The third and fourth issues are specifically addressed by the critique of Generation of a Set of Simple, Interpretable ADMET Rules of Thumb that was presented in Inflation of Correlation in the Pursuit of Drug-likeness.
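A minimal sketch of the second point, using made-up pIC50 values, shows how binning distorts the relationships between measurements:

```python
# Toy illustration (hypothetical pIC50 values): binning continuous data
# with an "active" threshold of pIC50 > 6 makes 6.1 look like 9.0
# and unlike 5.9, even though 6.1 and 5.9 are nearly identical.

def bin_activity(pic50, threshold=6.0):
    """Collapse a continuous pIC50 value to a categorical label."""
    return "active" if pic50 > threshold else "inactive"

pic50_values = [5.9, 6.1, 9.0]
labels = {v: bin_activity(v) for v in pic50_values}

# On the continuous scale, 6.1 is far closer to 5.9 than to 9.0 ...
assert abs(6.1 - 5.9) < abs(6.1 - 9.0)
# ... but after binning, 6.1 shares a category with 9.0, not 5.9.
assert labels[6.1] == labels[9.0] == "active"
assert labels[5.9] == "inactive"
```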
Royal Albatross Centre, Otago Peninsula.
Overfitting is always a concern when modelling multivariate data and
the fit to the training data generally gets better when you use more parameters.
One of my concerns with cheminformatic ML is that it is not always clear how many
parameters have been used to build the models (I’m guessing that, sometimes, even
the modelers don’t know) and one does need to account for numbers of
parameters if claiming that one model has outperformed another.
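One simple way to account for parameter counts when comparing fits is an adjusted R², which penalizes each extra parameter. The sketch below uses invented fit statistics purely for illustration:

```python
# Sketch: why parameter counts matter when comparing models. Adjusted R^2
# penalizes the extra parameters that inflate the raw fit to training data.
# The R^2 values and parameter counts below are made up for illustration.

def adjusted_r2(r2, n_samples, n_params):
    """Adjusted R^2: shrinks as the parameter count approaches the sample count."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_params - 1)

n = 20             # training set size
r2_simple = 0.80   # 2-parameter model (e.g. Hansch-style)
r2_complex = 0.88  # 12-parameter model with a better raw fit

adj_simple = adjusted_r2(r2_simple, n, 2)
adj_complex = adjusted_r2(r2_complex, n, 12)

# The raw fit favours the complex model, but once the parameter
# count is accounted for, the simple model comes out ahead.
assert r2_complex > r2_simple
assert adj_simple > adj_complex
```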
When building models from multivariate data, one also needs to account for
relationships between the molecular descriptors that define the region(s) of chemical
space occupied by the training set. In ‘traditional’ multivariate data
analysis, it is assumed that relationships between descriptors are linear and
modelers use principal component analysis (PCA) to determine the dimensions of
the relevant regions of space. If relationships between descriptors are non-linear then life gets a lot more difficult. Another of my concerns with ML models is that it is not always clear how (or if) relationships between descriptors have been accounted for.
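As a sketch of how PCA reveals the effective dimensionality of a region of descriptor space (the descriptor values below are invented, and the two-descriptor case is small enough to solve in closed form):

```python
# Sketch: two strongly correlated descriptors span essentially one
# dimension, and PCA detects this via the covariance eigenvalues.
# Descriptor values are invented; the 2x2 case has closed-form eigenvalues.

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. logP
y = [2.1, 4.0, 6.1, 8.0, 9.9]   # a second descriptor tracking roughly 2*x

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# Covariance matrix [[sxx, sxy], [sxy, syy]] and its eigenvalues
sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)
tr, det = sxx + syy, sxx * syy - sxy * sxy
disc = (tr * tr / 4.0 - det) ** 0.5
eig1, eig2 = tr / 2.0 + disc, tr / 2.0 - disc

# Nearly all the variance lies along one principal component: the region of
# descriptor space occupied by these compounds is effectively 1-dimensional.
assert eig1 / (eig1 + eig2) > 0.99
```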
At Royal Albatross Centre, Otago Peninsula.
Although an ML method may be generic and applicable to data from diverse
sources, it is still useful to consider the characteristics of cheminformatic data
that distinguish them from other types of data. As noted in Structure Modification in Chemical Databases, the molecular connection table (also known
as the 2D molecular structure) is the defining data structure of
cheminformatics. One characteristic of cheminformatic data is that it is possible
to make meaningful (and predictively useful) comparisons between structurally-related
compounds and this provides a motivation for studying molecular similarity. In cheminformatic terms we can say that differences in chemical behavior can be
perceived and modeled in terms of structural relationships between compounds. This can also be seen as a distance-geometric view of chemical space. Although
this may sound a bit abstract, it’s actually how medicinal chemists tend to relate molecular structure to activity and properties (e.g. the bromo-substitution led to practically no improvement in potency but now it sticks like shit to the proverbial blanket in the plasma protein binding assay). This is also a useful framework for analysis of
output from high-throughput screening (HTS) and design of screening libraries. It is typically difficult to
perceive structural relationships between compounds using models based on generic molecular descriptors.
At Royal Albatross Centre, Otago Peninsula
I have been sufficiently uncouth as to suggest that many ‘global’ cheminformatic models may simply be ensembles
of local models and this reflects a belief that training set compounds are
often distributed unevenly in chemical space. As we move away from traditional
Hansch QSAR to ML models, the molecular descriptors become more numerous (and
less physical). When compounds are unevenly distributed in chemical space and molecular
descriptors are numerous, it becomes unclear whether the descriptors are
capturing the relevant physical chemistry or just organizing the compounds into
groups of structurally related analogs. This is an important distinction and
the following graphic (which does not feature an albatross) shows why. The
graphic shows a simple plot of Y versus X and we want to use this to predict Y
for X = 3. If X is logP and Y is aqueous
solubility then it would be reasonable to assume that X captures (at least some
of) the physical chemistry and we would regard the prediction as an
interpolation because X = 3 is pretty much at the center of this very simple
chemical space. If X is simply partitioning the six compounds into two groups of
structurally related analogs then making a prediction for X = 3 would represent an
extrapolation. While this is clearly a very simple example, it does illustrate
an issue that the cheminformatics community needs to take a bit more notice of.
Chemical space coverage is a key consideration for anyone using ML to predict activity and properties for a series of structurally-related compounds. The term "Big Data" does tend to get over-used, and being globally "big" is no guarantee that local regions of chemical space (e.g. the structural series that a medicinal chemistry team may be working on) are adequately covered. The difficulty for the chemists is that they don't know whether their structural series sits in a cluster in the training set space or in a hole. In cheminformatic terms, it is unclear whether or not the series that the medicinal chemistry team is working on lies within the applicability domain of the model.
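A crude applicability-domain check can be sketched as follows (descriptor values are invented, and a real implementation would use proper molecular descriptors and a more robust distance criterion): flag a query compound whose nearest training-set neighbor is further away than is typical within the training set itself.

```python
# Sketch of a crude applicability-domain check (all values hypothetical).
import math

# Training set clustered in two regions of a 2-D descriptor space
training = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

def nearest(point, others):
    """Distance from point to its nearest neighbour among others."""
    return min(math.dist(point, o) for o in others)

# Largest nearest-neighbour distance *within* the training set
intra = max(nearest(t, [o for o in training if o != t]) for t in training)

def in_domain(query):
    """Is the query no further from the training set than its members are
    from each other?"""
    return nearest(query, training) <= intra

assert in_domain((1.1, 1.0))       # sits inside a training cluster
assert not in_domain((3.0, 3.0))   # falls in the hole between the clusters
```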
Validation can lead to an optimistic view of model quality when training (and validation) sets are unevenly distributed in chemical space, and I’ll ask you to have another look at Figure 1 and to think about what would happen if we performed leave-one-out (LOO) cross-validation. If we leave out any one of the data points from either group in Figure 1, the two remaining data points ensure that the model is minimally affected. Similar problems can be encountered even when an external test set is used. My view is that training and test sets need to be selected to cover chemical space as evenly as possible in order to get a realistic assessment of model quality from the validation. Put another way, ML modelers need to view the selection of training and test sets as a design problem in its own right.
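To make the point concrete, here is a sketch (with made-up data and a deliberately simple 1-nearest-neighbor "model") of how LOO cross-validation flatters a model trained on two tight clusters, while leaving out a whole cluster exposes the extrapolation problem:

```python
# LOO looks excellent because each left-out compound has a near-twin still
# in the training set; leaving out a whole cluster reveals how poorly the
# model extrapolates. Data are invented; the "model" is 1-nearest-neighbour.

data = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9),   # cluster around X = 1
        (5.0, 8.0), (5.1, 8.2), (4.9, 7.8)]   # cluster around X = 5

def predict(x, train):
    """Predict Y from the training point with the closest X."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mean_abs_error(test, train):
    return sum(abs(y - predict(x, train)) for x, y in test) / len(test)

# LOO: each point is predicted from the remaining five
loo = sum(abs(y - predict(x, [p for p in data if p != (x, y)]))
          for x, y in data) / len(data)

# Leave-one-cluster-out: predict each cluster from the other
cluster1, cluster2 = data[:3], data[3:]
lco = (mean_abs_error(cluster1, cluster2)
       + mean_abs_error(cluster2, cluster1)) / 2

assert loo < 0.5       # LOO looks near-perfect
assert lco > 10 * loo  # cluster-out error is far larger
```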
At Royal Albatross Centre, Otago Peninsula
Given that Pat's post is billed as a practical view from the trenches, it may be worth saying something about some of the challenges of achieving genuine impact with ML models in real life drug design projects. Drug discovery is incremental in nature and a big part of the process is obtaining the data needed to make decisions as efficiently as possible. In order to have maximum impact on drug discovery, cheminformaticians will need to be involved in how the data are obtained as well as in analyzing the data.
Using an ML model is a data-hungry way to predict biological activity and, at the start of a project, the team is not usually awash with data. Molecular similarity searching, molecular shape matching and pharmacophore matching can deliver useful results using much less data than you would need for building a typical ML model while docking can be used even when there are no known ligands.
ML models that simply predict whether or not a compound will be "active" are unlikely to be of any value in lead optimization. Put another way, if you suggest to lead optimization chemists that they should make compound X rather than compound Y because it is more likely to have better than micromolar activity, they may think that you'd just stepped off the shuttle from the Planet Tharg. To be useful in lead optimization, a model for prediction of biological activity needs to predict pIC50 values (rather than whether or not pIC50 will exceed a threshold) and should be specific to the region of chemical space of interest to the lead optimization team. A model satisfying these requirements may well be more like the boring old QSAR that has been around for decades than the modern ML model. One difficulty that QSAR modelers have always faced when working on real life drug discovery projects is that key decisions have already been made by the time there is enough data with which to build a reliable model.
While I do not think that ML models are likely to have significant impact for prediction of activity against primary targets in drug discovery projects, they do have more potential for prediction of physicochemical properties and off-target activity (for which measured data are likely to be available for a wider range of chemotypes than is the case for the primary project targets). Furthermore, predictions for physicochemical properties and off-target activity don't usually need to be as accurate as predictions for activity against the primary target. Nevertheless, there will always be concerns about how effectively a model covers relevant chemical space (e.g. structural series being optimized) and it may be safer to just get some measurements done. My advice to lead optimization chemists concerned about solubility would generally be to get measurements for three or four compounds spanning the lipophilicity range in the series and examine the response of aqueous solubility to lipophilicity.
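The suggested exercise amounts to fitting a local trend. A sketch, with invented logP and log(solubility) values:

```python
# Sketch: fit measured aqueous solubility for a few compounds spanning the
# series' lipophilicity range against logP, then read predictions off the
# local trend. The logP and logS values below are invented for illustration.

logp = [1.0, 2.0, 3.0, 4.0]      # lipophilicity of the measured compounds
logs = [-2.1, -3.0, -4.1, -4.8]  # measured log(aqueous solubility)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(logp, logs)

# A clear negative response of solubility to lipophilicity in the series ...
assert slope < 0
# ... which can anticipate solubility elsewhere in the same series
predicted = intercept + slope * 2.5
assert -4.1 < predicted < -3.0
```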
I do have some thoughts on how cheminformatic models can be made more intelligent but this post is already too long so I'll need to discuss these in a future post. It's "até mais" ("see you later") from me (and the Royal Albatrosses of the South Island).