Sunday 27 January 2019

Reviewing the reviewers


I recently published The Nature of Ligand Efficiency (NoLE) as a ChemRxiv preprint and this was featured (for all the right reasons) in a post at In The Pipeline. The material had been previously submitted to J Med Chem but it proved a bit too spicy for two of the three reviewers. I'll review the J Med Chem reviewers in this blog post and I hope that the feedback will be useful in the event of the journal being presented with similarly flavored material in the future. NoLE was my second publication from Berwick-on-Sea in the village of Blanchisseuse on the north coast of my native Trinidad and I'll include some photos from there to break up the text a bit.



Gate at Berwick-on-Sea in Blanchisseuse. The house was built (quite literally) by my late father (who would have been 89 today) and was named for my mother's home town of Berwick-upon-Tweed which has changed hands between England and Scotland on a number of occasions and may even still be at war with Imperial Russia.

The selection of reviewers for manuscripts that criticize previous studies presents a dilemma for journal editors. While it is prudent to consult those with a stake in what is being criticized, these may not be the best people to ask about whether or not the criticism should be made. In particular, a reviewer using his/her position as a reviewer to suppress criticism of something in which he/she has a stake raises ethical questions. A stake in ligand efficiency (LE) could take any of a number of forms. First, one could have introduced a metric for LE. Second, one could have written articles endorsing ligand efficiency metrics or asserting their validity. Third, one could have enthusiastically promoted the LE metric at one's institution (e.g. by mandating that LE values be quoted when presenting project updates at the dog and pony shows that are an essential part of modern drug discovery). Fourth, one might be a devout member of the Fragment Cult (for whom the Doctrinal Correctness of LE is an Article of Faith).

There were three reviewers for my manuscript and I'll call them A, B and C since their numbers got scrambled between different rounds of review (also using the term 'Reviewer 3' might give some readers anxiety attacks). Reviewer A had nothing constructive to say and simply spat feathers. Reviewer B was very positive about the manuscript and made a number of helpful suggestions. Reviewer C demanded that the manuscript be watered down to homeopathic levels (and that was never going to happen).

Here's my office at Berwick-on-Sea. That's a printout of NoLE on my desk (under the hanging beach towel).

The central theme of my manuscript is the argument that ligand efficiency is physically meaningless because perception of efficiency changes with the concentration unit in which affinity is expressed. This is actually a very serious criticism since a change in perception resulting from a change in a unit would normally be regarded in physical science as an error in the "not even wrong" category. It's not something that one can simply sweep under the carpet as a "limitation" of ligand efficiency. Despite their howls of protest, neither Reviewer A nor Reviewer C offered coherent counter-argument.
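The unit dependence can be illustrated with a minimal Python sketch. It assumes the usual definition of LE as the standard free energy of binding (with the sign changed) divided by the number of non-hydrogen atoms, with ΔG° = RT ln(KD/C°). The three compounds and their KD values are hypothetical and were chosen only to make the arithmetic easy to follow, and the function name is simply a label used here.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
T_KELVIN = 298.15     # assumed temperature

def ligand_efficiency(kd_molar, heavy_atoms, c_standard_molar=1.0):
    """LE = -dG0 / N_heavy, where dG0 = RT ln(Kd / C0)."""
    minus_delta_g0 = R_KCAL * T_KELVIN * math.log(c_standard_molar / kd_molar)
    return minus_delta_g0 / heavy_atoms

# Hypothetical compounds: (name, Kd in mol/L, number of heavy atoms)
compounds = [("fragment", 1e-3, 10), ("lead", 1e-6, 20), ("candidate", 1e-9, 30)]

# Same KD values, three different choices of standard concentration C0
for c0 in (1.0, 1e-3, 1e3):
    values = {name: round(ligand_efficiency(kd, n, c0), 2) for name, kd, n in compounds}
    print(f"C0 = {c0:g} M: {values}")
```

With C° = 1 M the three hypothetical compounds appear equally efficient, with C° = 1 mM the largest compound appears to be the most efficient, and with C° = 1000 M the fragment appears to be the most efficient. Nothing about the compounds or their affinities has changed between the three lines of output.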

The tactic adopted by Reviewer C was to simply dismiss the physical arguments presented in the manuscript as "opinion" without presenting counter-argument. J Med Chem really does need to make it clear to reviewers that they need to do much better than this since it reflects badly on the journal.

Reviewer C. "'Physically meaningless' is at best an inflammatory opinion whereas the fact that other choices could have been made is often under-appreciated."
PWK. This criticism appears to be doctrinal rather than scientific and I note that Reviewer C has not offered counter-argument to the argument that LE is physically meaningless.


Here's a view of the Caribbean Sea. The 20 m drop from the gap in the vegetation is just as precipitous as you would expect although we've not (yet) lost any personnel or household pets over the edge.

Reviewer A struggled woefully with rudimentary physical chemistry throughout the review process and, given that I'd suggested a number of potential reviewers with the necessary expertise in molecular recognition and chemical thermodynamics, I was at a loss to understand why a reviewer who was so ill-equipped for the task at hand had been invited to review the manuscript.

Reviewer A. Reactions are considered to be spontaneous under standard conditions when the free energy is negative, but by changing the definition of C° in an arbitrary manner, any reaction can be said to be spontaneous or not. This is true in a trivial sense, but generations of researchers have found the concept of negative or positive free energies useful.
PWK. The flaw in this argument is that if you change the value of C° then you also change whether or not the reaction is spontaneous under the standard conditions. This is the basis of the law of mass action and it is also important to remember that KD values are not measured at a single concentration. A chemical process (at constant temperature and pressure) by which the system changes from state A to state B will be spontaneous if ΔG[A→B] is negative. Regardless of the experiences of generations of researchers, medicinal chemists rarely (if ever) appear to use the sign of ΔG (e.g. for binding under assay conditions) when analyzing SAR or for making any other decisions.

This is the start to the path down to the lower deck

In one round of review, Reviewer C stated “I believe that it is incumbent on the author to argue that the choice of standard state used by medicinal chemists is not useful” and Reviewer A repeated the criticism in a subsequent round, noting that this was "the central problem with the manuscript". I thought this was a bit rich given that Reviewer A and Reviewer C had each accused me of using straw man tactics at different points in the review process. The more serious problem, however, is that we have two LE advocates each attempting to transfer the burden of proof that (in science) one accepts as soon as one advocates that people take an action (e.g. use LE metrics). Reviewers A and C appeared to do this in order to evade their responsibility as reviewers to present counter-argument to the arguments in the manuscript. This would be like a thought leader (yes, there really are people who call themselves 'thought leaders') responding to criticism of a claim that AI was going to transform drug discovery by saying that it was incumbent on the critics to argue that AI was not useful. Imagine if they ran clinical trials like this?

At this point, Reviewer A did rather lose it and I was half expecting to have to fend off a counterattack by Steiner's division. Needless to say, the latest version of the manuscript now opens with "Ligand efficiency (LE) is, in essence, a good concept that is poorly served by a bad metric." and this can be considered the equivalent of a two-fingered gesture that is mistakenly attributed to the English and Welsh longbowmen at Agincourt.

Reviewer A. Dr. Kenny dodges this challenge by stating that the burden of proof should not be on him, but by arguing that LE is a “bad metric” despite its wide usage, he does in fact have to explain why free energy is also a “bad” concept. Not doing so makes the manuscript deeply misleading and therefore inappropriate for publication.
PWK. I only used the term “bad metric” in the conclusions where I wrote “Ligand efficiency is, in essence, a good concept served by a bad metric.” so it is incorrect to state that I have argued that LE is a “bad metric”. In any case, in the revised manuscript, I now question whether LE can accurately be described as a metric since neither its creators nor its advocates appear able (or willing) to say what it measures. Wide usage does not validate rules, guidelines or metrics and I note that, at one time, the prevailing view was that the sun orbited the earth. Once again, Reviewer A is making the serious error of assuming that everything that applies to free energy also applies to any function of free energy. The simple counter to Reviewer A’s challenge is that free energy is a state function and an integral part of the framework of thermodynamics. Although defined in terms of free energy, the LE metric is not part of thermodynamics simply because it appears to require a privileged standard state.

I have occasionally stated that "useful is the last refuge of the scoundrel" and this tends to be misinterpreted as an assertion that utility of a model is unimportant. Nothing could actually be further from the truth and the statement is more a comment on the way that models can be 'validated' by simply labeling them as "useful". In some ways "useful" is analogous to the "God created it that way" statements that you will encounter if you are careless enough to become ensnared in arguments with Creationists. I should also point out that the manuscript did discuss the difficulties of demonstrating the utility of LE while neither A nor C presented any evidence (fervent belief does not usually constitute evidence in science) to support their assertion that the 1 M standard state is more useful than any other standard state.

Reviewer A appeared particularly aggrieved that one of The Great Unwashed should have the temerity to even question the value of LE and the toys were duly ejected from the pram. As my response below indicates, Reviewer A's comment is more what one might have expected from an inquisitor at a fifteenth century heresy trial than from an expert reviewer of a manuscript submitted to the premier medicinal chemistry journal. It is also worth pointing out that LE was touted as "useful" even as it was introduced in a 2004 letter to Drug Discovery Today and all three coauthors of that seminal contribution to the medicinal chemistry literature appeared to be blissfully unaware of the nontrivial dependency of their creation on the standard concentration. As such, I would argue that it would actually be a dereliction of duty not to question the utility of LE.

Reviewer A. Sixth, Dr. Kenny repeatedly questions the utility of LE; for example “The LE metric is claimed by advocates to be useful although it is rarely, if ever, shown to be predictive of pharmaceutically-relevant behavior” (p. 15) and “the LE metric is rarely, if ever, shown to be predictive of phenomena that are relevant to drug discovery” (p. 39).
PWK. This appears to be a doctrinal rather than scientific criticism.

Lower deck. I only swim from here if snorkeling because it's rocky.

Reviewer B was very positive about the "Molecular Size and Design Risk" section and made useful suggestions for its expansion. It's also worth mentioning that Derek quoted from this section in his post. However, Reviewer C suggested that the whole section be purged from the manuscript although it is possible that Reviewer C's underlying objective was to ensure that certain articles were not discussed. Reviewer C complained that my criticism of ref 48 was unfair although it may be that the reviewer considered ref 48 to be a liability (this post will give readers an idea why some LE advocates might consider ref 48 to be a liability). Another possibility is that the objection to criticism of ref 48 was actually a smokescreen and the real reason for suggesting that the section be purged was actually to avoid discussion of ref 45 (which might be considered to be an even greater liability by LE advocates).

Ref 58 and ref 59 are rare examples of articles that respond to criticism of LE and a study such as NoLE really does need to discuss them (especially since both articles completely miss the point). The fundamental flaw that is common to both articles is that neither addresses the problems associated with the change in perception that results from using a different unit to express affinity. Reviewer C protested that it was gratuitous to single out ref 58 and even cited this 2014 post from Molecular Design in support of the charge that I was unfairly picking on ref 58. Reviewer C did seem rather rattled and also complained that I had quoted "non-scientific sections" of ref 59. I must confess to being unfamiliar with the concept that a scientific article can have non-scientific sections that can be declared off-limits for challenge. This was, perhaps, not Reviewer C's finest moment.

Reviewer A and Reviewer C both seemed rather keen that ref 94 not be discussed and they said that I should not be "attacking" fit quality (FQ) because it is rarely, if ever, used. I suspect the real reason was that both reviewers consider the metric (and ref 94) to be a significant liability from the LE perspective. I responded by noting that FQ had got its own box in the NRDD LE review and that ref 94 was cited in ref 58 (which asserts the validity of LE), suggesting that FQ may be of greater interest than Reviewer A and Reviewer C would have us believe. Another reason that Reviewer C might have preferred that the spotlight not be focused on FQ is that the discussion further exposes the illusion that fragments bind more efficiently than ligands of greater molecular size.

This is where I go swimming. It's a 5 minute walk from the house

So that concludes my review of the reviewers. I believe that the J Med Chem editors do need to think carefully about how (or even whether) they wish to have controversial topics addressed in their journal. Dr Eric Williams, the first Prime Minister of Trinidad and Tobago, suggested that his hearing impairment was an advantage in dealing with dissent because he could simply switch off his hearing aid. However, dealing with controversial topics in drug discovery might not be quite so simple. In particular, a journal needs to consider the potential vested interests of those from whom it seeks advice. For example, the Editors of a number of ACS journals may find it quite instructive to take a very close look at exactly how their journals came to endorse a frequent hitter model (trained on results from a panel of only six assays that all use the same readout) as a predictor of pan-assay interference...

I'll leave you with a selfie taken on the roof. A few minutes earlier I'd seen off a determined counter-attack by some jack spaniards (or should that be jacks spaniard?). Normally, I'd leave them alone but they were too close to where I needed to work. The technique is simple but its execution takes some nerve. First, arm yourself with a can of Baygon (don't forget to test it beforehand) and a broom. Second, with Baygon aimed, prod nest with broom. Third, spray a protective curtain of Baygon as the jack spaniards attack you (they are aggressive and they always attack). 

PWK one, jack spaniards nil 


Monday 21 January 2019

Response to Pat Walters on ML in drug discovery

Thanks again for your response, Pat, and I’ll try to both clarify my previous comments and respond to the challenges that you’ve presented (my comments are in square brackets).

In defining ML as “a relatively well-defined subfield of AI” I was simply attempting to establish the scope of the discussion. I wasn’t implying that every technique used to model relationships between chemical structure and physical or biological properties is ML or AI.

[As a general point, it may be helpful to say what differentiates ML from other methods (e.g. partial least squares) that have been used for decades for modeling multivariate data in drug discovery. Should CoMFA be regarded as ML? If not, why not?]

You make the assertion that ML may be better for classification than regression, but don't explain why: "I also have a suspicion that some of the ML approaches touted for drug design may be better suited for dealing with responses that are categorical (e.g. pIC50 > 6 ) rather than continuous (e.g. pIC50 = 6.6)"

[My suspicions are aroused when I see articles like this in which the authors say “QSAR” but use a categorical definition of activity. At very least, I think modelers do need to justify the application of categorical methods to continuous data rather than presenting it fait accompli. J Med Chem addresses the categorization of continuous data in section 8g of the guidelines for authors.]

In my experience, the choice of regression vs classification is often dictated by the data rather than the method. If you have a dataset with 3-fold error and one log of dynamic range, you probably shouldn’t be doing regression. If you have a dataset that spans a reasonable dynamic range and isn’t, as you point out, bunched up at the ends of the distribution, you may be able to build a regression model.

[The trend that one is likely to observe in such a data set is likely to be very weak and I would still generally start with regression analysis because this shows the weakness in the trend clearly. The 3-fold error doesn’t magically disappear when you transform the continuous data to make it categorical (it translates to uncertainty in the categorization). Categorization of a data set like this may be justified if the distribution of the data suggests that it is highly clustered.]
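[A minimal sketch of how measurement error carries over into the categories. It assumes, purely for illustration, that a "3-fold error" corresponds to normally distributed error in pIC50 with a standard deviation of log10(3), which is about 0.48 log units. The threshold of 6 and the function name are also illustrative choices.]

```python
import math

def prob_measured_below(true_pic50, threshold=6.0, sd=math.log10(3)):
    """Probability that a measured pIC50 falls below the threshold,
    assuming normally distributed measurement error with the given sd."""
    z = (threshold - true_pic50) / sd
    # Cumulative distribution function of the standard normal at z
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for true_value in (6.0, 6.2, 6.5, 7.0):
    p = prob_measured_below(true_value)
    print(f"true pIC50 = {true_value}: P(labelled inactive) = {p:.2f}")
```

[Under these assumptions, a compound with a true pIC50 of 6.2 would be labelled as inactive roughly one time in three.]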

Your argument about the number of parameters is interesting: "One of my concerns with cheminformatic ML is that it is not always clear how many parameters have been used to build the models (I’m guessing that, sometimes, even the modelers don’t know) and one does need to account for numbers of parameters if claiming that one model has outperformed another."

I think this one is a bit more tricky than it appears. In classical QSAR, many people use a calculated LogP. Is this one parameter? There were scores of fragment contributions and dozens of fudge factors that went into the LogP calculation, how do we account for these? Then again, the LogP parameters aren't adjustable in the QSAR model. I need to ponder the parameter question and how it applies to ML models which use things like regularization and early stopping to prevent overfitting.

[I would say that logP, whether calculated or measured, is a descriptor, rather than a parameter, in the context of QSAR (and ML) and that the model-building process does not ‘see’ the ‘guts’ of the logP prediction. In a multiple linear regression model (like a classical Hansch QSAR) there will be a single parameter (e.g. a1*logP) associated with logP. However, models that are non-linear with respect to logP will have more than one parameter associated with logP (e.g. a1*logP + a2*logP^2). In some cases, the model may appear to have a huge number of parameters although this may be an illusion because some methods for modeling do not allow the parameters to be varied independently of each other during the fitting process. The term ‘degrees of freedom’ is used in classical regression analysis to denote the number of parameters in a model (I don’t know if there is an analogous term for ML models).

As noted in my original post, the number of parameters used by ML models is not usually accounted for. Provided that the model satisfies validation criteria, the number of parameters is effectively treated as irrelevant. My view is that, unless the number of fitting parameters can be accounted for, it is not valid to claim that one model has outperformed another.]
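[One conventional way to account for the number of fitting parameters when comparing regression models is the adjusted R², sketched below. This is a textbook statistic rather than anything specific to the models under discussion, and the R² values are hypothetical. It also presumes that the number of independently adjustable parameters is known, which is exactly what is often unclear for ML models.]

```python
def adjusted_r2(r2, n_observations, n_parameters):
    """Adjusted R-squared for a model with an intercept plus
    n_parameters fitted coefficients."""
    dof = n_observations - n_parameters - 1
    if dof <= 0:
        raise ValueError("too many parameters for the number of observations")
    return 1.0 - (1.0 - r2) * (n_observations - 1) / dof

# Hypothetical models fitted to the same 20 compounds
simple_model = adjusted_r2(r2=0.80, n_observations=20, n_parameters=1)
complex_model = adjusted_r2(r2=0.82, n_observations=20, n_parameters=5)

print(f"1 parameter:  adjusted R2 = {simple_model:.3f}")
print(f"5 parameters: adjusted R2 = {complex_model:.3f}")
```

[In this made-up comparison the five-parameter model has the higher raw R² but the lower adjusted R², so the claim that it has outperformed the one-parameter model does not survive the accounting.]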

I’m not sure I understand your arguments regarding chemical space. You conclude with the statement: “It is typically difficult to perceive structural relationships between compounds using models based on generic molecular descriptors”.

[I wasn’t nearly as clear here as I should have been. I meant molecular descriptors that are continuous-valued and define the dimensions of a space. By “generic” I mean descriptors that are defined for any molecular structure which has advantages (generality) and disadvantages (difficult to interpret models).  SAR can be seen in terms of structural relationships (e.g. X is the aza-substituted analog of Y) between compounds and the affinity differences that correspond to those relationships. What I was getting at is that it is difficult to perceive SAR using generic molecular descriptors (as defined above).] 

Validation is a lot harder than it looks. Our datasets tend to contain a great deal of hidden bias. There is a great paper from the folks at Atomwise that goes into detail on this and provides some suggestions on how to measure this bias and to construct training and test sets that limit the bias.

[I completely agree that validation is a lot harder than it looks and there is plenty of scope for debate about the different causes of the difficulty. I get uncomfortable when people declare models to be validated according to (what they claim are) best practices and suggest that the models should be used for regulatory purposes. I seem to remember sending an email to the vice chair of the 2005 or 2007 CADD GRC suggesting a session on model validation although there was little interest at the time. At EuroQSAR 2010, I suggested to the panel that the scientific committee should consider model validation as a topic for EuroQSAR 2012. The panel got a bit distracted by another point and, after I was sufficiently uncouth as to make the point again, one of the panel declared that validation was a solved problem.]

I have to disagree with the statement that starts your penultimate paragraph: “While I do not think that ML models are likely to have significant impact for prediction of activity against primary targets in drug discovery projects, they do have more potential for prediction of physicochemical properties and off-target activity (for which measured data are likely to be available for a wider range of chemotypes than is the case for the primary project targets).”

Lead optimization projects where we are optimizing potency against a primary target are often places where ML models can make a significant impact. Once we’re into a lead-opt effort, we typically have a large amount of high-quality data, and can often identify sets of molecules with a consistent binding mode. In many cases, we are interpolating rather than extrapolating. These are situations where an ML model can shine. In addition, we are never simply optimizing activity against a primary target. We are simultaneously optimizing multiple parameters. In a lead optimization program, an ML model can help you to predict whether the change you are making to optimize a PK liability will enable you to maintain the primary target activity. This said, your ML model will be limited by the dynamic range of the observed data. The ML model won't predict a single digit nM compound if it has only seen uM compounds.

[I see LO as a process of SAR exploration and would not generally expect an ML model to predict the effects on affinity of forming new interactions and scaffold hops. While I would be confident that the affinity data for an LO project could be modelled, I am much less confident that the models will be useful in design. My guess is that, in order to have significant impact in LO, models for prediction of affinity will need to be specific to the structural series that the LO team is working on. Simple models (e.g. plot of affinity against logP) can be useful for defining the trend in the data which, in turn, allows us to quantify the extent to which the affinity of a compound beats the trend in the data (this is discussed in more detail in the Nature of Ligand Efficiency which proved a bit too spicy for two of the J Med Chem reviewers). Put another way, a series-specific model with a small number of parameters may be more useful than a model with many parameters that is (apparently) more predictive. I would argue that we’re searching for positive outliers in drug design. It can also be helpful to draw a distinction between prediction-driven design and hypothesis-driven design.]
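[A minimal sketch of quantifying the extent to which a compound beats the trend: fit a straight line to affinity against logP by ordinary least squares and take the residual for each compound. The five compounds and their values are hypothetical, and the helper name is just a label for this example.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical structural series: (compound, logP, pKd)
series = [("cpd1", 1.0, 5.1), ("cpd2", 2.0, 5.9), ("cpd3", 3.0, 7.6),
          ("cpd4", 4.0, 7.0), ("cpd5", 5.0, 7.9)]

slope, intercept = fit_line([x for _, x, _ in series], [y for _, _, y in series])

# Residual = observed affinity minus the affinity expected from the trend
residuals = {name: pkd - (slope * logp + intercept) for name, logp, pkd in series}
best = max(residuals, key=residuals.get)

for name, value in residuals.items():
    print(f"{name}: residual = {value:+.2f}")
print(f"largest positive outlier: {best}")
```

[A positive residual indicates a compound that is more potent than its lipophilicity would lead one to expect, which is the sort of positive outlier that the design process is searching for.]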

In contrast, there are a couple of confounding factors that make it more difficult to use ML to predict things like off-target activity. In some (perhaps most) cases, the molecules known to bind to an off-target may look nothing like the molecules you’re working on. This can make it difficult to determine whether your molecules fall within the applicability domain of the model. In addition, the molecules that are active against the off-target may bind to a number of different sites in a number of different ways.

[My suggestion that ML approaches may be better suited for prediction of physical properties and off-target activity was primarily a statement that data is likely to be available for a wider range of chemotypes in these situations than would be the case for primary target. My preferred approach to assessing potential for off-target activity would actually be to search for known actives that were similar (substructural; fingerprint; pharmacophore; shape) to the compounds of interest. Generally, I would be wary of predictions made by a model that had not ‘seen’ anything like the compounds of interest.] 
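[A minimal sketch of the fingerprint flavour of such a similarity search, using the Tanimoto coefficient on sets of 'on' bits. The fingerprints, the compound names and the 0.5 cutoff are all hypothetical. In practice the bits would come from a cheminformatics toolkit and the cutoff would need to be justified for the fingerprint in question.]

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as
    collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical fingerprints (indices of bits that are set)
query = {1, 4, 7, 9, 12}
known_off_target_actives = {
    "active_1": {1, 4, 7, 9, 15},
    "active_2": {2, 3, 20, 21},
}

cutoff = 0.5  # illustrative similarity cutoff
similarities = {name: tanimoto(query, bits)
                for name, bits in known_off_target_actives.items()}
neighbours = [name for name, sim in similarities.items() if sim >= cutoff]

print(similarities)
print("known actives resembling the query:", neighbours)
```

[An empty list of neighbours would not show that the compound is clean. It would only show that the known actives say little about it, which is also the situation in which a model's prediction deserves the least trust.]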

At the end of the day, ML is one of many techniques that can enable us to make better decisions on drug discovery projects. Like any other computational tool used in drug discovery, it shouldn’t be treated as an oracle. We need to use these tools to augment, rather than replace, our understanding of the SAR.

[Agreed although I believe that ML advocates need to be clearer about what ML can do that the older methods can’t do. However, I do not see ML methods augmenting our understanding of SAR because neither the models nor the descriptors can generally be interpreted in structural terms.]

Thursday 17 January 2019

Thoughts on AI in Drug Discovery - A Practical View From the Trenches


I’ll be taking a look at machine learning (ML) in this post which was prompted by AI in Drug Discovery - A Practical View From the Trenches by Pat Walters in Practical Cheminformatics. Pat’s post appears to be triggered by Artificial Intelligence in Drug Design - The Storm Before the Calm? by Allan Jordan that was published as a viewpoint in ACS Medicinal Chemistry Letters. Some of what I said in the Nature of QSAR is relevant to what I’ll be saying in the current post and I'll also direct readers to Will CADD ever become relevant to drug discovery? by Ash at Curious Wavefunction.  Pat notes that Allan “fails to highlight specific problems or to define what he means by AI” and goes on to say that he prefers “to focus on machine learning (ML), a relatively well-defined subfield of AI”. Given that drug discovery scientists have been modeling activity and properties of compounds for decades now, some clarity would be welcome as to which of the methods used in the earlier work would fall under the ML umbrella.

While not denying the potential of AI and ML in drug design, I note that both are associated with a lot of hype and it would be an error to confuse skepticism about the hype with criticism of AI and ML. Nevertheless, there are some aspects of cheminformatic ML, such as chemical space coverage, that don't seem to get discussed quite as much as I think they should be and these are what the current post is focused on. I also have a suspicion that some of the ML approaches touted for drug design may be better suited for dealing with responses that are categorical (e.g. pIC50 > 6 ) rather than continuous (e.g. pIC50 = 6.6). When discussing ML in drug design, it can be useful to draw a distinction between 'direct applications' of ML (e.g. prediction of behavior of compounds) and 'indirect applications' of ML (e.g. synthesis planning; image analysis). This post is primarily concerned with direct applications of ML.

As has become customary, I’ve included some photos to break up the text a bit. These all feature albatrosses and I took them on a 2009 visit to the South Island of New Zealand. Here's a live stream of a nest at the Royal Albatross Centre on the Otago Peninsula.

Spotted on Kaikoura whale watch

My comment on Pat’s post has just appeared so I’ll say pretty much what I said in that comment here. I would challenge the characterization of ML as “a relatively well-defined subfield of AI”. Typically, ML in cheminformatics focuses on (a) finding regions in descriptor space associated with particular chemical behaviors or (b) relating measures of chemical behavior to values of descriptors.  I would not automatically regard either of these activities as subfields of AI any more than I would regard Hansch QSAR, CoMFA, Free-Wilson Analysis, Matched Molecular Pair Analysis, Rule of 5 or PAINS filters as subfields of AI. I’m sure that there will be some cheminformatic ML models that can accurately be described as a subfield of AI but to tout each and every ML method as AI would be a form of hype.

At Royal Albatross Centre, Otago Peninsula.   

Pat states “In essence, machine learning can be thought of as ‘using patterns in data to label things’” and this could be taken as implying that ML models can only handle categorical responses. In drug design, the responses that we would like to predict using ML are typically continuous (e.g. IC50; aqueous solubility; permeability; fraction unbound; clearance; volume of distribution) and genuinely categorical data are rarely encountered in drug discovery projects. Nevertheless, it is common in drug discovery for continuous data to be made categorical (sometimes we say that the data has been binned). There are a number of reasons why this might not be such a great idea. First, binning continuous data throws away huge amounts of information. Second, binning continuous data distorts relationships between objects (e.g. a pIC50 activity threshold of 6 makes pIC50 = 6.1 appear to be more similar to pIC50 = 9 than to pIC50 = 5.9). Third, categorical analysis does not typically account for ordering (e.g. high | medium | low) of the categories. Fourth, one needs to show that the conclusions of analysis do not depend on how the continuous data has been categorized. The third and fourth issues are specifically addressed by the critique of Generation of a Set of Simple, Interpretable ADMET Rules of Thumb that was presented in Inflation of Correlation in the Pursuit of Drug-likeness.
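The second and fourth points can be seen in a few lines of Python. This is only a sketch: the three pIC50 values are the ones used in the example above, the alternative threshold of 6.2 is an arbitrary choice, and the function name is a label used here.

```python
def bin_activity(pic50, threshold=6.0):
    """Convert a continuous pIC50 into a binary activity label."""
    return "active" if pic50 >= threshold else "inactive"

values = [5.9, 6.1, 9.0]

# Binning with a threshold of 6 groups 6.1 with 9.0 rather than with 5.9
labels = {v: bin_activity(v) for v in values}
print(labels)

# On the continuous scale 6.1 is far closer to 5.9 than it is to 9.0
print(f"|6.1 - 5.9| = {abs(6.1 - 5.9):.1f}, |9.0 - 6.1| = {abs(9.0 - 6.1):.1f}")

# Moving the threshold slightly changes which compounds share a category
shifted = {v: bin_activity(v, threshold=6.2) for v in values}
print(shifted)
```

If the conclusions of an analysis change when the threshold moves from 6.0 to 6.2 then those conclusions are telling us more about the binning scheme than about the compounds.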

Royal Albatross Centre, Otago Peninsula. 

Overfitting is always a concern when modelling multivariate data and the fit to the training data generally gets better when you use more parameters. One of my concerns with cheminformatic ML is that it is not always clear how many parameters have been used to build the models (I’m guessing that, sometimes, even the modelers don’t know) and one does need to account for numbers of parameters if claiming that one model has outperformed another. When building models from multivariate data, one also needs to account for relationships between the molecular descriptors that define the region(s) of chemical space occupied by the training set. In ‘traditional’ multivariate data analysis, it is assumed that relationships between descriptors are linear and modelers use principal component analysis (PCA) to determine the dimensions of the relevant regions of space. If relationships between descriptors are non-linear then life gets a lot more difficult. Another of my concerns with ML models is that it is not always clear how (or if) relationships between descriptors have been accounted for.

At Royal Albatross Centre, Otago Peninsula. 

Although an ML method may be generic and applicable to data from diverse sources, it is still useful to consider the characteristics of cheminformatic data that distinguish them from other types of data. As noted in Structure Modification in Chemical Databases, the molecular connection table (also known as the 2D molecular structure) is the defining data structure of cheminformatics. One characteristic of cheminformatic data is that it is possible to make meaningful (and predictively useful) comparisons between structurally-related compounds and this provides a motivation for studying molecular similarity. In cheminformatic terms we can say that differences in chemical behavior can be perceived and modeled in terms of structural relationships between compounds. This can also be seen as a distance-geometric view of chemical space. Although this may sound a bit abstract, it’s actually how medicinal chemists tend to relate molecular structure to activity and properties (e.g. the bromo-substitution led to practically no improvement in potency but now it sticks like shit to the proverbial blanket in the plasma protein binding assay). This is also a useful framework for analysis of output from high-throughput screening (HTS) and design of screening libraries. It is typically difficult to perceive structural relationships between compounds using models based on generic molecular descriptors.

At Royal Albatross Centre, Otago Peninsula

I have been sufficiently uncouth as to suggest that many ‘global’ cheminformatic models may simply be ensembles of local models and this reflects a belief that training set compounds are often distributed unevenly in chemical space. As we move away from traditional Hansch QSAR to ML models, the molecular descriptors become more numerous (and less physical). When compounds are unevenly distributed in chemical space and molecular descriptors are numerous, it becomes unclear whether the descriptors are capturing the relevant physical chemistry or just organizing the compounds into groups of structurally related analogs. This is an important distinction and the following graphic (which does not feature an albatross) shows why. The graphic shows a simple plot of Y versus X and we want to use this to predict Y for X = 3. If X is logP and Y is aqueous solubility then it would be reasonable to assume that X captures (at least some of) the physical chemistry and we would regard the prediction as an interpolation because X = 3 is pretty much at the center of this very simple chemical space. If X is simply partitioning the six compounds into two groups of structurally related analogs then making a prediction for X = 3 would represent an extrapolation. While this is clearly a very simple example, it does illustrate an issue that the cheminformatics community needs to take a bit more notice of.


Chemical space coverage is a key consideration for anyone using ML to predict activity and properties for a series of structurally-related compounds. The term "Big Data" does tend to get over-used but being globally "big" is no guarantee that local regions of chemical space (e.g. the structural series that a medicinal chemistry team may be working on) are adequately covered. The difficulty for the chemists is that they don't know whether their structural series is in a cluster in the training set space or in a hole. In cheminformatic terms, it is unclear whether or not the series that the medicinal chemistry team is working on lies within the applicability domain of the model.
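
One common (and far from the only) way to get a feel for whether a compound sits in a cluster or in a hole is to look at its similarity to the nearest training set compound. The Python sketch below assumes fingerprints stored as sets of 'on' bit positions and uses the Tanimoto coefficient. The fingerprints are invented and the function names are hypothetical rather than those of any particular toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two fingerprints held as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    # Two empty fingerprints are treated as dissimilar (a convention chosen here).
    return len(fp_a & fp_b) / union if union else 0.0


def nearest_neighbour_similarity(query_fp, training_fps):
    """Highest Tanimoto similarity between the query and any training compound."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


# Invented training set: two tight clusters of structurally related analogs.
training_fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},   # cluster A
    {10, 11, 12, 13}, {10, 11, 12, 14},         # cluster B
]

in_cluster = {1, 2, 3, 6}     # shares most bits with cluster A
in_hole = {1, 20, 21, 22}     # shares almost nothing with either cluster

sim_cluster = nearest_neighbour_similarity(in_cluster, training_fps)
sim_hole = nearest_neighbour_similarity(in_hole, training_fps)
print(f"in cluster: {sim_cluster:.2f}  in hole: {sim_hole:.2f}")
```

Where to draw the line between 'cluster' and 'hole' is a judgment call that depends on the fingerprint and the model, so no cutoff is suggested here.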

Validation can lead to an optimistic view of model quality when training (and validation) sets are unevenly distributed in chemical space and I’ll ask you to have another look at Figure 1 and to think about what would happen if we did leave-one-out (LOO) cross-validation. If we leave out any one of the data points from either group in Figure 1, the two remaining data points in that group ensure that the model is minimally affected. Similar problems can be encountered even when an external test set is used. My view is that training and test sets need to be selected to cover chemical space as evenly as possible in order to get a realistic assessment of model quality from the validation. Put another way, ML modelers need to view the selection of training and test sets as a design problem in its own right.
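
The LOO point can be sketched in a few lines of Python. The six (X, Y) values below are made up to mimic two tight groups with no trend within either group (they are not the data plotted in Figure 1), the model is a plain least-squares line, and the function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_predictions(xs, ys):
    """Predict each point from a line fitted to all of the other points."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def q2(ys, preds):
    """Cross-validated q2: 1 - PRESS / (sum of squares about the mean of ys)."""
    my = mean(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1.0 - press / sum((y - my) ** 2 for y in ys)


# Made-up data: two groups of three, well separated in X and in Y.
xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
ys = [2.1, 1.9, 2.0, 6.0, 6.1, 5.9]

preds = loo_predictions(xs, ys)
q2_all = q2(ys, preds)                  # judged over all six compounds
q2_group_a = q2(ys[:3], preds[:3])      # judged within the first group only
q2_group_b = q2(ys[3:], preds[3:])      # judged within the second group only
print(f"all: {q2_all:.3f}  group A: {q2_group_a:.2f}  group B: {q2_group_b:.2f}")
```

With these numbers the overall q² comes out well above 0.9 while the within-group values are negative: the model separates the two groups but says nothing useful about variation within either of them.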

At Royal Albatross Centre, Otago Peninsula

Given that Pat's post is billed as a practical view from the trenches, it may be worth saying something about some of the challenges of achieving genuine impact with ML models in real life drug design projects. Drug discovery is incremental in nature and a big part of the process is obtaining the data needed to make decisions as efficiently as possible. In order to have maximum impact on drug discovery, cheminformaticians will need to be involved in how the data are obtained as well as in analyzing the data.

Using an ML model is a data-hungry way to predict biological activity and, at the start of a project, the team is not usually awash with data. Molecular similarity searching, molecular shape matching and pharmacophore matching can deliver useful results using much less data than you would need for building a typical ML model while docking can be used even when there are no known ligands.

ML models that simply predict whether or not a compound will be "active" are unlikely to be of any value in lead optimization. Put another way, if you suggest to lead optimization chemists that they should make compound X rather than compound Y because it is more likely to have better than micromolar activity, they may think that you'd just stepped off the shuttle from the Planet Tharg. To be useful in lead optimization, a model for prediction of biological activity needs to predict pIC50 values (rather than whether or not pIC50 will exceed a threshold) and should be specific to the region of chemical space of interest to the lead optimization team. A model satisfying these requirements may well be more like the boring old QSAR that has been around for decades than the modern ML model. One difficulty that QSAR modelers have always faced when working on real life drug discovery projects is that key decisions have already been made by the time there is enough data with which to build a reliable model.

While I do not think that ML models are likely to have significant impact for prediction of activity against primary targets in drug discovery projects, they do have more potential for prediction of physicochemical properties and off-target activity (for which measured data are likely to be available for a wider range of chemotypes than is the case for the primary project targets). Furthermore, predictions for physicochemical properties and off-target activity don't usually need to be as accurate as predictions for activity against the primary target. Nevertheless, there will always be concerns about how effectively a model covers relevant chemical space (e.g. structural series being optimized) and it may be safer to just get some measurements done. My advice to lead optimization chemists concerned about solubility would generally be to get measurements for three or four compounds spanning the lipophilicity range in the series and examine the response of aqueous solubility to lipophilicity.

I do have some thoughts on how cheminformatic models can be made more intelligent but this post is already too long so I'll need to discuss these in a future post. It's "até mais" from me (and the Royal Albatrosses of the South Island).