Thursday 13 September 2018

On the Nature of QSAR

With EuroQSAR2018 fast approaching, I'll share some thoughts from Brazil since I won't be there in person. I don't have any QSAR-related graphics handy, so I'll include a few random photos to break up the text a bit.



East of Marianne River on north coast of Trinidad

Although Corwin Hansch is generally regarded as the "Father of QSAR", it is helpful to look further back to the work of Louis Hammett in order to see the prehistory of the field. Hammett introduced the concept of the linear free energy relationship (LFER) which forms the basis of the formulation of QSAR by Hansch and Toshio Fujita. However, the LFER framework encodes two other concepts that are also relevant to drug design. First, the definition of a substituent constant relates a change in a property to a change in molecular structure and this underpins matched molecular pair analysis (MMPA). Second, establishing an LFER allows the sensitivity of physicochemical behavior to structural change to be quantified and this can be seen as a basis for the activity cliff concept.
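The Hammett LFER can be made concrete with a few lines of code. This is a minimal sketch that fits the reaction constant ρ by least squares from the relationship log10(k/k0) = ρσ; the σ values are approximate para-substituent constants and the rate ratios are invented for illustration:

```python
# Illustrative Hammett-style LFER: log10(k/k0) = rho * sigma.
# The sigma values are approximate para-substituent constants; the
# log rate ratios are synthetic, chosen to follow rho ~ 1 with a
# little noise, purely to show the fitting step.
sigma = {"H": 0.00, "Cl": 0.23, "NO2": 0.78, "OMe": -0.27}
log_k_ratio = {"H": 0.02, "Cl": 0.25, "NO2": 0.74, "OMe": -0.30}

xs = [sigma[s] for s in sigma]
ys = [log_k_ratio[s] for s in sigma]

# Least-squares slope (rho) of log10(k/k0) against sigma:
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
rho = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
      sum((x - mean_x) ** 2 for x in xs)
print(round(rho, 2))  # 0.98
```

The slope ρ is exactly the sensitivity-to-structural-change idea mentioned above: it quantifies how strongly the equilibrium or rate responds to a substituent swap.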


Kasbah cats in Ouarzazate 

As David Winkler and the late Prof. Fujita noted in this 2016 article, QSAR has evolved into "two QSARs":

Two main branches of QSAR have evolved. The first of these remains true to the origins of QSAR, where the model is often relatively simple and linear and interpretable in terms of molecular interactions or biological mechanisms, and may be considered “pure” or classical QSAR. The second type focuses much more on modeling structure–activity relationships in large data sets with high chemical diversity using a variety of regression or classification methods, and its primary purpose is to make reliable predictions of properties of new molecules—often the interpretation of the model is obscure or impossible.

I'll label the two branches of QSAR as "classical" (C) and "machine learning" (ML). As QSAR evolved from its origins into ML-QSAR, the descriptors became less physical and more numerous. While I would not attempt to interpret ML-QSAR models, I'd still be wary of interpreting a C-QSAR model if there was a high degree of correlation between the descriptors. One significant difficulty for those who advocate ML-QSAR is that machine learning is frequently associated with (or even equated to) artificial intelligence (AI) which, in turn, oozes hype. Here are a couple of recent In The Pipeline posts (don't forget to look at the comments) on machine learning and AI.

One difference between C-QSAR models and ML-QSAR models is that the former are typically local (training set compounds are closely related structurally) while the latter are typically non-local (although not as global as their creators might have you believe). My view is that most 'global' QSAR models are actually ensembles of local models, although many QSAR modelers would have me dispatched to the auto-da-fé for this heresy. A C-QSAR model is usually defined for a particular structural series (or scaffold) and the parameters are often specific (e.g. π value for the C3-substituent) to the structural series. Provided that relevant data are available for training, one might anticipate that, within its applicability domain, a local model will outperform a global model since the local model is better able to capture the structural context of the scaffold.
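To illustrate the 'ensemble of local models' view, here is a toy sketch (synthetic data, hypothetical series labels) in which two scaffolds share the same slope for some descriptor but differ in intercept, so per-series fits capture what a single pooled fit blurs:

```python
# Two hypothetical series with the same slope but different intercepts.
series = {
    "A": [(0.0, 5.0), (1.0, 6.0), (2.0, 7.0)],   # pIC50 = x + 5.0
    "B": [(0.0, 7.5), (1.0, 8.5), (2.0, 9.5)],   # pIC50 = x + 7.5
}

def fit_line(pts):
    """Least-squares slope and intercept for (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = sum((x - mx) * (y - my) for x, y in pts) / \
            sum((x - mx) ** 2 for x, _ in pts)
    return slope, my - slope * mx

local = {name: fit_line(pts) for name, pts in series.items()}
global_fit = fit_line(series["A"] + series["B"])

print(local["A"])    # (1.0, 5.0)  recovers series A exactly
print(local["B"])    # (1.0, 7.5)  recovers series B exactly
print(global_fit)    # (1.0, 6.25) pooled intercept fits neither series
```

The pooled model gets the slope right but misses every compound by over a log unit of intercept, which is the sort of scaffold-specific context a local model absorbs.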

I would guess that most chemists would predict the effect on logP of chloro-substituting a compound more confidently than they would predict logP for the compound itself. Put another way, it is typically easier to predict the effect of a relatively small structural change (a perturbation) on chemical behavior than it is to predict chemical behavior directly from molecular structure. This is the basis for using free energy calculations to predict relative affinity and it also provides a motivation for MMPA (which can be seen as the data-analytic equivalent of free energy perturbation). This suggests viewing activity and properties in terms of structural relationships between compounds. I would argue that C-QSAR models are better able than ML-QSAR models to exploit structural relationships between compounds.
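To make the perturbation idea concrete, here is a minimal MMPA-style sketch; the compound names and logP values are invented for illustration:

```python
# Hypothetical matched pairs: measured logP for a parent (H) and its
# chloro analogue across three unrelated compounds. The absolute logP
# values span a wide range, but the H -> Cl deltas cluster tightly,
# which is the observation that MMPA (and perturbation-based
# prediction generally) exploits.
pairs = {
    "cpd1": {"H": 1.2, "Cl": 1.9},
    "cpd2": {"H": 3.4, "Cl": 4.0},
    "cpd3": {"H": -0.5, "Cl": 0.2},
}
deltas = [v["Cl"] - v["H"] for v in pairs.values()]
mean_delta = sum(deltas) / len(deltas)
print([round(d, 2) for d in deltas])  # [0.7, 0.6, 0.7]
print(round(mean_delta, 2))           # 0.67
```

The deltas vary by a tenth of a log unit while the parent logP values span four, which is why predicting the effect of the substitution is the easier problem.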


Down the islands with Venezuela in the distance 

ML-QSAR models typically use many parameters to fit the data and this means that more data are needed to build them. One of the issues that I have with machine learning approaches to modeling is that it is not usually clear how many parameters have been used to build the models (and it's not always clear that the creators of the models know). You can think of the number of parameters as the currency in which you pay for the quality of fit to the training data, and you need to account for the number of parameters when comparing the performance of different models. This is an issue that I think ML-QSAR advocates need to address.

Overfitting of training data is an issue even for C-QSAR models that use small numbers of parameters. Generally, it is assumed that a model satisfying validation criteria has not been over-fitted. However, cross-validation can lead to an optimistic assessment of model quality if the distribution of compounds in the training space is very uneven. An analogous problem can arise even when using external test sets. Hawkins advocated creating test sets by removing all representatives of particular chemotypes from training sets, and I was sufficiently uncouth to mention this to one of the plenary speakers at EuroQSAR 2016. Training set design and model validation do not appear to be solved problems in the context of ML-QSAR.
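In the spirit of the Hawkins suggestion, a leave-one-chemotype-out split can be sketched as follows; the scaffold labels and compound names are hypothetical:

```python
# All compounds sharing a scaffold label are held out together, so the
# model is never validated on a chemotype it has seen during training.
compounds = [
    ("cpd1", "quinoline"), ("cpd2", "quinoline"),
    ("cpd3", "pyrazole"),  ("cpd4", "pyrazole"),
    ("cpd5", "biaryl"),
]

def leave_chemotype_out(data):
    """Yield (held-out scaffold, training ids, test ids) folds."""
    scaffolds = sorted({s for _, s in data})
    for held_out in scaffolds:
        train = [c for c, s in data if s != held_out]
        test = [c for c, s in data if s == held_out]
        yield held_out, train, test

for scaffold, train, test in leave_chemotype_out(compounds):
    print(scaffold, train, test)
```

Contrast this with a random split, which would typically leave a close analog of every test compound in the training set and flatter the model accordingly.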


The Corniche in Beirut 

I get the impression that machine learning algorithms may be better suited to classification than to regression, and it is common to see potency (or affinity) values classified as 'active' or 'inactive' for modeling. This creates a number of difficulties, and I'll point you towards the correlation inflation article that explains why gratuitous categorization of continuous data is very, very naughty. First, transformation of continuous data to categorical data throws away huge amounts of information, which would seem to be the data science equivalent of shooting yourself in the foot. Second, categorization distorts your perception of the data (e.g. a pIC50 value of 6.5 might be regarded as more similar to one of 9.0 than to one of 5.5). Third, a constant uncertainty in potency translates to a variable uncertainty in the classification. Fourth, if you categorize continuous data then you need to demonstrate that the conclusions of the analysis do not depend on the categorization scheme.
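The pIC50 example above can be made concrete; the threshold of 6 and the compound names are arbitrary choices for illustration:

```python
# With an 'active' cutoff at pIC50 = 6, the 6.5 compound lands in the
# same class as the 9.0 compound but a different class from the 5.5
# compound, even though 6.5 sits far closer to 5.5 on the continuous
# scale. The categorization, not the data, creates that perception.
THRESHOLD = 6.0
pic50 = {"cpdA": 6.5, "cpdB": 9.0, "cpdC": 5.5}
labels = {name: ("active" if value > THRESHOLD else "inactive")
          for name, value in pic50.items()}
print(labels)  # {'cpdA': 'active', 'cpdB': 'active', 'cpdC': 'inactive'}
```

Shifting the threshold to 7 would flip cpdA's label without any change to the measurements, which is exactly why conclusions need to be shown to be robust to the categorization scheme.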

In the machine learning area not all QSAR is actually QSAR. This article reports that "the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods". However, the QSAR methods used appear to be based on categorical rather than quantitative definitions of activity. Even when more than two activity categories (e.g. high, medium, low) are defined, analysis might not be accounting for the ordering of the categories and this issue was also discussed in the correlation inflation article. Some clarification from the machine learning community may be in order as to which of their offerings can be used for modelling quantitative activity data.


I'll conclude the post by taking a look at where QSAR fits into the framework of drug design. Applying QSAR methods requires data and one difficulty for the modeler is that the project may have delivered its endpoint (or been put out of its misery) by the time that there is sufficient data for developing useful models. Simple models can be useful even if they are not particularly predictive. For example, modelling the response of pIC50 to logP makes it easy to see the extent to which the activity of each compound beats (or is beaten by) the trend in the data. Provided that there is sufficient range in the data, a weak correlation between pIC50 and logP is actually very desirable and I'll leave it to the reader to ponder why this might be the case. My view is that ML-QSAR models are unlikely to have significant impact for predicting potency against therapeutic targets in drug discovery projects.  
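The simple pIC50-versus-logP model described above can be sketched as follows; the data are invented, and the residuals show how far each compound beats (positive) or is beaten by (negative) the trend:

```python
# Fit pIC50 against logP and inspect residuals from the trend line.
data = [  # (logP, pIC50) pairs, invented for illustration
    (1.0, 6.0), (2.0, 6.3), (3.0, 7.2), (4.0, 7.1),
]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
slope = sum((x - mx) * (y - my) for x, y in data) / \
        sum((x - mx) ** 2 for x, _ in data)
intercept = my - slope * mx

# Residual = observed pIC50 minus the value the trend predicts.
residuals = [round(y - (slope * x + intercept), 2) for x, y in data]
print(residuals)  # [-0.02, -0.14, 0.34, -0.18]
```

Here the third compound beats the trend by about a third of a log unit, which is the kind of compound-level insight a simple model can offer even when its predictions are not to be trusted.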

So that's just about all I've got to say. Have an enjoyable conference and make sure to keep the speakers honest with your questions. It'd be rude not to.


Early evening in Barra 
