Monday, 21 January 2019

Response to Pat Walters on ML in drug discovery

Thanks again for your response, Pat, and I’ll try to both clarify my previous comments and respond to the challenges that you’ve presented (my comments are in red italics).

In defining ML as “a relatively well-defined subfield of AI” I was simply attempting to establish the scope of the discussion. I wasn’t implying that every technique used to model relationships between chemical structure and physical or biological properties is ML or AI.

[As a general point, it may be helpful to say what differentiates ML from other methods (e.g. partial least squares) that have been used for decades for modeling multivariate data in drug discovery. Should CoMFA be regarded as ML? If not, why not?]

You make the assertion that ML may be better for classification than regression, but don't explain why: "I also have a suspicion that some of the ML approaches touted for drug design may be better suited for dealing with responses that are categorical (e.g. pIC50 > 6 ) rather than continuous (e.g. pIC50 = 6.6)"

[My suspicions are aroused when I see articles like this in which the authors say “QSAR” but use a categorical definition of activity. At very least, I think modelers do need to justify the application of categorical methods to continuous data rather than presenting it fait accompli. J Med Chem addresses the categorization of continuous data in section 8g of the guidelines for authors.]

In my experience, the choice of regression vs classification is often dictated by the data rather than the method. If you have a dataset with 3-fold error and one log of dynamic range, you probably shouldn’t be doing regression. If you have a dataset that spans a reasonable dynamic range and isn’t, as you point out, bunched up at the ends of the distribution, you may be able to build a regression model.

[The trend that one is likely to observe in such a data set is likely to be very weak and I would still generally start with regression analysis because this shows the weakness in the trend clearly. The 3-fold error doesn’t magically disappear when you transform the continuous data to make it categorical (it translates to uncertainty in the categorization). Categorization of a data set like this may be justified if the distribution of the data suggests that it is highly clustered.]

Your argument about the number of parameters is interesting: "One of my concerns with cheminformatic ML is that it is not always clear how many parameters have been used to build the models (I’m guessing that, sometimes, even the modelers don’t know) and one does need to account for numbers of parameters if claiming that one model has outperformed another."

I think this one is a bit more tricky than it appears. In classical QSAR, many people use a calculated LogP. Is this one parameter? There were scores of fragment contributions and dozens of fudge factors that went into the LogP calculation, how do we account for these? Then again, the LogP parameters aren't adjustable in the QSAR model. I need to ponder the parameter question and how it applies to ML models which use things like regularization and early stopping to prevent overfitting.

[I would say that logP, whether calculated or measured, is a descriptor, rather than a parameter, in the context of QSAR (and ML) and that the model-building process does not ‘see’ the ‘guts’ of the logP prediction. In a multiple linear regression model (like a classical Hansch QSAR) there will be a single parameter (e.g. a1*logP) associated with logP. However, models that are non-linear with respect to logP will have more than one parameter associated with logP (e.g. a1*logP + a2*logP^2). In some cases, the model may appear to have a huge number of parameters although this may be an illusion because some methods for modeling do not allow the parameters to be varied independently of each other during the fitting process. The term ‘degrees of freedom’ is used in classical regression analysis to denote the number of parameters in a model (I don’t know if there is an analogous term for ML models).

As noted in my original post, the number of parameters used by ML models is not usually accounted for. Provided that the model satisfies validation criteria, the number of parameters is effectively treated as irrelevant. My view is that, unless the number of fitting parameters can be accounted for, it is not valid to claim that one model has outperformed another.]

I’m not sure I understand your arguments regarding chemical space. You conclude with the statement: “It is typically difficult to perceive structural relationships between compounds using models based on generic molecular descriptors”.

[I wasn’t nearly as clear here as I should have been. I meant molecular descriptors that are continuous-valued and define the dimensions of a space. By “generic” I mean descriptors that are defined for any molecular structure which has advantages (generality) and disadvantages (difficult to interpret models).  SAR can be seen in terms of structural relationships (e.g. X is the aza-substituted analog of Y) between compounds and the affinity differences that correspond to those relationships. What I was getting at is that it is difficult to perceive SAR using generic molecular descriptors (as defined above).] 

Validation is a lot harder than it looks. Our datasets tend to contain a great deal of hidden bias. There is a great paper from the folks at Atomwise that goes into detail on this and provides some suggestions on how to measure this bias and to construct training and test sets that limit the bias.

[I completely agree that validation is a lot harder than it looks and there is plenty of scope for debate about the different causes of the difficulty. I get uncomfortable when people declare models to be validated according to (what they claim are) best practices and suggest that the models should be used for regulatory purposes. I seem to remember sending an email to the vice chair of the 2005 or 2007 CADD GRC suggesting a session on model validation although there was little interest at the time. At EuroQSAR 2010, I suggested to the panel that the scientific committee should consider model validation as a topic for EuroQSAR 2012. The panel got a bit distracted by another point and, after I was sufficiently uncouth as make the point again, one of the panel declared that validation was a solved problem.]

I have to disagree with the statement that starts your penultimate paragraph: “While I do not think that ML models are likely to have significant impact for prediction of activity against primary targets in drug discovery projects, they do have more potential for prediction of physicochemical properties and off-target activity (for which measured data are likely to be available for a wider range of chemotypes than is the case for the primary project targets).”

Lead optimization projects where we are optimizing potency against a primary target are often places where ML models can make a significant impact. Once we’re into a lead-opt effort, we typically have a large amount of high-quality data, and can often identify sets of molecules with a consistent binding mode. In many cases, we are interpolating rather than extrapolating. These are situations where an ML model can shine. In addition, we are never simply optimizing activity against a primary target. We are simultaneously optimizing multiple parameters. In a lead optimization program, an ML model can help you to predict whether the change you are making to optimize a PK liability will enable you to maintain the primary target activity. This said, your ML model will be limited by the dynamic range of the observed data. The ML model won't predict a single digit nM compound if it has only seen uM compounds.

[I see LO as a process of SAR exploration and would not generally expect an ML model to predict the effects on affinity of forming new interactions and scaffold hops. While I would be confident that the affinity data for an LO project could be modelled, I am much less confident that hat the models will be useful in design. My guess is that, in order to have significant impact in LO, models for prediction of affinity will need to be specific to the structural series that the LO team is working on. Simple models (e.g. plot of affinity against logP) can be useful for defining the trend in the data which, in turn, allows us to quantify the extent to which to which the affinity of a compound beats the trend in the data (this is discussed in more detail in the Nature of Ligand Efficiency which proved a bit too spicy for two of the J Med Chem reviewers). Put another way a series-specific model with a small number of parameters, may be more useful than model with many parameters that is (apparently) more predictive. I would argue that we’re searching for positive outliers in drug design. It can also be helpful to draw a distinction between prediction-driven design and hypothesis-driven design.]

In contrast, there are a couple of confounding factors that make it more difficult to use ML to predict things like off-target activity. In some (perhaps most) cases, the molecules known to bind to an off-target may look nothing like the molecules you’re working on. This can make it difficult to determine whether your molecules fall within the applicability domain of the model. In addition, the molecules that are active against the off-target may bind to a number of different sites in a number of different ways.

[My suggestion that ML approaches may be better suited for prediction of physical properties and off-target activity was primarily a statement that data is likely to be available for a wider range of chemotypes in these situations than would be the case for primary target. My preferred approach to assessing potential for off-target activity would actually be to search for known actives that were similar (substructural; fingerprint; pharmacophore; shape) to the compounds of interest. Generally, I would be wary of predictions made by a model that had not ‘seen’ anything like the compounds of interest.] 

At the end of the day, ML is one of many techniques that can enable us to make better decisions on drug discovery projects. Like any other computational tool used in drug discovery, it shouldn’t be treated as an oracle. We need to use these tools to augment, rather than replace, our understanding of the SAR.

[Agreed although I believe that ML advocates need be clearer about what ML can do that the older methods can’t do. However, I do not see ML methods augmenting our understanding of SAR because neither the models nor the descriptors can generally be interpreted in structural terms.]

Thursday, 17 January 2019

Thoughts on AI in Drug Discovery - A Practical View From the Trenches

I’ll be taking a look at machine learning (ML) in this post which was prompted by AI in Drug Discovery - A Practical View From the Trenches by Pat Walters in Practical Cheminformatics. Pat’s post appears to be triggered by Artificial Intelligence in Drug Design - The Storm Before the Calm? by Allan Jordan that was published as a viewpoint in ACS Medicinal Chemistry Letters. Some of what I said in the Nature of QSAR is relevant to what I’ll be saying in the current post and I'll also direct readers to Will CADD ever become relevant to drug discovery? by Ash at Curious Wavefunction.  Pat notes that Allan “fails to highlight specific problems or to define what he means by AI” and goes on to say that he prefers “to focus on machine learning (ML), a relatively well-defined subfield of AI”. Given that drug discovery scientists have been modeling activity and properties of compounds for decades now, some clarity would be welcome as to which of the methods used in the earlier work would fall under the ML umbrella.

While not denying the potential of AI and ML in drug design, I note that both are associated with a lot of hype and it would be an error to confuse skepticism about the hype with criticism of AI and ML. Nevertheless, there are some aspects of cheminformatic ML, such as chemical space coverage, that don't seem to get discussed quite as much as much as I think they should be and these are what the current post is focused on. I also have a suspicion that some of the ML approaches touted for drug design may be better suited for dealing with responses that are categorical (e.g. pIC50 > 6 ) rather than continuous (e.g. pIC50 = 6.6). When discussing ML in drug design, it can be useful to draw a distinction between 'direct applications' of ML (e.g. prediction of behavior of compounds) and 'indirect applications' of ML (e.g. synthesis planning; image analysis). This post is primarily concerned with direct applications of ML.

As has become customary, I’ve included some photos to break up the text a bit. These are all feature albatrosses and I took them on a 2009 visit to the South Island of New Zealand. Here's a live stream of nest at the Royal Albatross Centre on the Otago Peninsula.  

Spotted on Kaikoura whale watch

My comment on Pat’s post has just appeared so I’ll say pretty much what I said in that comment here. I would challenge the characterization of ML as “a relatively well-defined subfield of AI”. Typically, ML in cheminformatics focuses on (a) finding regions in descriptor space associated with particular chemical behaviors or (b) relating measures of chemical behavior to values of descriptors.  I would not automatically regard either of these activities as subfields of AI any more than I would regard Hansch QSAR, CoMFA, Free-Wilson Analysis, Matched Molecular Pair Analysis, Rule of 5 or PAINS filters as subfields of AI. I’m sure that there will be some cheminformatic ML models that can accurately be described as a subfield of AI but to tout each and every ML method as AI would be a form of hype.

At Royal Albatross Centre, Otago Peninsula.   

Pat states “In essence, machine learning can be thought of as ‘using patterns in data to label things’” and this could be taken as implying that ML models can only handle categorical responses. In drug design, the responses that we would like to predict using ML are typically continuous (e.g. IC50; aqueous solubility; permeability; fraction unbound; clearance; volume of distribution) and genuinely categorical data are rarely encountered in drug discovery projects. Nevertheless, it is common in drug discovery for continuous data to be made categorical (sometimes we say that the data has been binned). There are a number of reasons why this might not be such a great idea. First, binning continuous data throws away huge amounts of information. Second, binning continuous data distorts relationships between objects (e.g. a pIC50 activity threshold of 6 makes pIC50 = 6.1 appear to be more similar to pIC50 = 9 than to pIC50 = 5.9). Third, categorical analysis does not typically account for ordering (e.g. high | medium | low) of the categories. Fourth, one needs to show that the conclusions of analysis do not depend on how the continuous data has been categorized. The third and fourth issues are specifically addressed by the critique of Generation of a Set of Simple,Interpretable ADMET Rules of Thumb that was presented in Inflation of Correlation in the Pursuit of Drug-likeness.

Royal Albatross Centre, Otago Peninsula. 

Overfitting is always a concern when modelling multivariate data and the fit to the training data generally gets better when you use more parameters. One of my concerns with cheminformatic ML is that it is not always clear how many parameters have been used to build the models (I’m guessing that, sometimes, even the modelers don’t know) and one does need to account for numbers of parameters if claiming that one model has outperformed another. When building models from multivariate data, one also needs to account for relationships between the molecular descriptors that define the region(s) of chemical space occupied by the training set. In ‘traditional’ multivariate data analysis, it is assumed that relationships between descriptors are linear and modelers use principal component analysis (PCA) to determine the dimensions of the relevant regions of space. If relationships between descriptors are non-linear then life gets a lot more difficult. Another of my concerns with ML models is that it is not always clear how (or if) relationships between descriptors have been accounted for.

At Royal Albatross Centre, Otago Peninsula. 

Although an ML method may be generic and applicable to data from diverse sources, it is still useful to consider the characteristics of cheminformatic data that distinguish them from other types of data. As noted in Structure Modification in Chemical Databases, the molecular connection table (also known as the 2D molecular structure) is the defining data structure of cheminformatics. One characteristic of cheminformatic data is that is possible to make meaningful (and predictively useful) comparisons between structurally-related compounds and this provides a motivation for studying molecular similarity. In cheminformatic terms we can say that differences in chemical behavior can be perceived and modeled in terms of structural relationships between compounds. This can also be seen as a distance-geometric view of chemical space. Although this may sound a bit abstract, it’s actually how medicinal chemists tend to relate molecular structure to activity and properties (e.g. the bromo-substitution led to practically no improvement in potency but now it sticks like shit to the proverbial blanket in the plasma protein binding assay). This is also a useful framework for analysis of output from high-throughput screening (HTS) and design of screening libraries. It is typically difficult to perceive structural relationships between compounds using models based on generic molecular descriptors.

At Royal Albatross Centre, Otago Peninsula

I have been sufficiently uncouth as to suggest that many ‘global’ cheminformatic models may simply be ensembles of local models and this reflects a belief that training set compounds are often distributed unevenly in chemical space. As we move away from traditional Hansch QSAR to ML models, the molecular descriptors become more numerous (and less physical). When compounds are unevenly distributed in chemical space and molecular descriptors are numerous, it becomes unclear whether the descriptors are capturing the relevant physical chemistry or just organizing the compounds into groups of structurally related analogs. This is an important distinction and the following graphic (which does not feature an albatross) shows why. The graphic shows a simple plot of Y versus X and we want to use this to predict Y for X = 3.  If X is logP and Y is aqueous solubility then it would be reasonable to assume that X captures (at least some of) the physical chemistry and we would regard the prediction as an interpolation because X = 3 is pretty much at the center of this very simple chemical space. If X is simply partitioning the six compounds into two groups of structurally related analogs then making a prediction for X = 3 would represent an extrapolation. While this is clearly a very simple example, it does illustrate an issue that the cheminformatics community needs to take a bit more notice of.

Chemical space coverage is a key consideration for anyone using ML to predict activity and properties of for a series of structurally-related compounds. The term "Big Data" does tend to get over-used but being globally "big" is no guarantee that local regions of chemical space (e.g. the structural series that a medicinal chemistry team may be working on) are adequately covered. The difficulty for the chemists is that is they don't know whether their structural series is in a cluster in the training set space or in a hole. In cheminformatic terms, it is unclear whether or not the series that the medicinal chemistry team is working on lies within the applicability domain of the model.

Validation can lead to an optimistic view of model quality when training (and validation) sets are unevenly distributed in chemical space and I’ll ask you to have another look at Figure 1 and to think about what would happen if we did leave one out (LOO) cross validation. If we leave out any one of the data points from either group of in Figure 1, the two remaining data points ensure that the model is minimally affected. Similar problems can be encountered even when an external test set is used. My view is that training and test sets need to be selected to cover chemical space as evenly as possible in order to get a realistic assessment of model quality from the validation.  Put another way, ML modelers need to view the selection of training and test sets as a design problem in its own right.

At Royal Albatross Centre, Otago Peninsula

Given that Pat's post is billed as a practical view from the trenches, it may be worth saying something about some of the challenges of achieving genuine impact with ML models in real life drug design projects. Drug discovery is incremental in nature and a big part of the process is obtaining the data needed to make decisions as efficiently as possible. In order to have maximum impact on drug discovery, cheminformaticians will need to be involved how the data is obtained as well as analyzing the data.

Using an ML model is a data-hungry way to predict biological activity and, at the start of a project, the team is not usually awash with data. Molecular similarity searching, molecular shape matching and pharmacophore matching can deliver useful results using much less data than you would need for building a typical ML model while docking can be used even when there are no known ligands.

ML models that simply predict whether or not a compound will be "active" are unlikely to be of any value in lead optimization. Put another way, if you suggest to lead optimization chemists that they should make compound X rather than compound Y because it is more likely to have better than micromolar activity, they may think that you'd just stepped off the shuttle from the Planet Tharg. To be useful in lead optimization, a model for prediction of biological activity needs to predict pIC50 values (rather than whether or not pIC50 will exceed a threshold) and should be specific to the region of chemical space of interest to the lead optimization team. A model satisfying these requirements may well be more like the boring old QSAR that has been around for decades than the modern ML model. One difficulty that QSAR modelers have always faced when working on real life drug discovery projects is that key decisions have already been made by the time there is enough data with which to build a reliable model.

While I do not think that ML models are likely to have significant impact for prediction of activity against primary targets in drug discovery projects, they do have more potential for prediction of physicochemical properties and off-target activity (for which measured data are likely to be available for a wider range of chemotypes than is the case for the primary project targets). Furthermore, predictions for physicochemical properties and off-target activity don't usually need to be as accurate as predictions for activity against the primary target. Nevertheless, there will always be concerns about how effectively a model covers  relevant chemical space (e.g. structural series being optimized) and it may be safer to just get some measurements done. My advice to lead optimization chemists concerned about solubility would generally be to get measurements for three or four compounds spanning the lipophilicity range in the series and examine the response of aqueous solubility to lipophilicity.

I do have some thoughts on how cheminformatic models can be made more intelligent but this post is already too long so I'll need to discuss these in a future post. It's "até mais" from me (and the Royal Albatrosses of the South Island).

Friday, 30 November 2018

Ligand efficiency and fragment-to-lead optimizations

The third annual survey (F2L2017) of fragment-to-lead (F2L) optimizations was published last week. Given that it was the second survey (F2L2016) in this series, that prompted me to write 'The Nature of Ligand Efficiency' (NoLE), I thought that some comments would be in order. F2L2017 presents analysis of data that had been aggregated from all three surveys and I'll be focusing on the aspects of this analysis that relate to ligand efficiency (LE).

As noted in NoLE, perception of efficiency changes when affinity is expressed in different concentration units and I have argued that this is an undesirable feature for a quantity that is widely touted as useful for design. At very least, it does place a burden of proof on those who advocate the use of LE in design to either show that the change in perception of efficiency with concentration unit is not a problem or to justify their choice of the 1 M concentration unit. One difficulty that LE advocates face is that the nontrivial dependency of LE on the concentration unit only came to light a few years after LE was introduced as "a useful metric for lead selection" and, even now, some LE advocates appear to be in a state of denial. Put more bluntly, you weren't even aware that you were choosing the 1 M concentration unit when you started telling medicinal chemists that they should be using LE to do their jobs but you still want us to believe that you made the correct choice?

I'm assuming that the authors of F2L2017 would all claim familiarity with the fundamentals of physical chemistry and biophysics while some of the authors may even consider themselves to be experts in these areas. I'll put the following question to each of the authors of F2L2017: what would your reaction be to analysis showing that the space group for a crystal structure changed if the unit cell parameters were expressed using different units? I can also put things a bit more coarsely by noting that to examine the effect on perception of changing a unit is, when applicable, a most efficacious bullshit detector.

The analysis in F2L2017 that I'll focus on is the the comparison between fragment hits and leads. As I showed in NoLE, it is meaningless to compare LE values because LE has a nontrivial dependency on the concentration unit used to express affinity. LE advocates can of course declare themselves to be Experts (or even Thought Leaders) and invoke morality in support of their choice of the 1 M concentration unit. However, this is a risky tactic because physical science can't accommodate 'privileged' units and an insistence that quantities have to be expressed in specific units might be taken as evidence that one is not actually an Expert (at least not in physical science).

So let's take a look at what F2L2017 has to say about LE in the context of F2L optimizations.

"The distributions for fragment and lead LE have also remained reasonably constant. On average there is no significant change in LE between fragment and lead (ΔLE = 0.004, p ≈ 0.8). Figure 5A shows the distribution of ΔLE, which is approximately centered around zero, although interestingly there are more examples where LE increases from fragment to lead (40) than where a decrease is seen (25). Some caution is warranted when interpreting these data, as our minimum criterion for 100-fold potency improvement may have introduced some selection bias. Nevertheless, there is no clear evidence in this data set that LE changes systematically during fragment optimization. Although the average change in LE from fragment to lead is small, Figure 5B shows that the correlation between fragment and lead LE is modest (R2 = 0.22), with a mean absolute difference between fragment and lead LE of 0.08."

This might be a good point at which to remind the authors of F2L2017 about some of the more extravagant claims that have been made for LE. It has been asserted that “fragment hits typically possess high ‘ligand efficiency’ (binding affinity per heavy atom) and so are highly suitable for optimization into clinical candidates with good drug-like properties”.  It has also been claimed that "ligand efficiency validated fragment-based design".  However, the more important point is that it is completely meaningless to compare values of LE of hits and leads because you will come to different conclusions if you express affinity using a different concentration unit (see Table 2 in NoLE). It is also worth noting that expressing affinity in units of 1 M introduces selection bias just as does the requirement for 100-fold potency improvement. 

Had I been reviewing F2L2017, I'd have suggested that the authors might think a bit more carefully about exactly why they are analyzing differences between LE values for fragments and leads. A perspective on fragment library design (reviewed in this post) correctly stated that a general objective of optimization projects is “ensuring that any additional molecular weight and lipophilicity also produces an acceptable increase in affinity". If you're thinking along these lines then scaling the F2L potency increase by the corresponding increase in molecular size makes a lot more sense than comparing LE for the fragments and leads. This quantifies how efficiently (in terms of increased molecular size) the potency gains for the F2L project have been achieved. This is not a new idea and I'll direct readers toward a 2006 study in which it was noted that a tenfold increase in affinity corresponded to a mean increase in molecular weight of 64 Da (standard deviation = 18 Da) for 73 compound pairs from FBLD projects. This is how group efficiency (GE) works and I draw the attention of the two F2L2017 authors from Astex to a perceptive statement made by their colleagues that GE is “a more sensitive metric to define the quality of an added group than a comparison of the LE of the parent and newly formed compounds”.

The distinction between a difference in LE and a difference in affinity that has been scaled by a difference in molecular size becomes a whole lot clearer if you examine the relevant equations. Equation (1) defines the F2L LE difference and first thing that you'll notice is that is that it is algebraically more complex than equation (2). This is relevant because LE advocates often tout the simplicity of the LE metric. However, the more significant difference between the two is that the concentration that defines the standard state is present in equation (1) but absent in equation (2). This means that you get the same answer when you scale affinity difference by the corresponding molecular size difference regardless of the units in which you express affinity.

So let's see how things look if you're prepared to think beyond LE when assessing F2L optimizations. Here's a figure from NoLE in which I've plotted the change in affinity against the change in number of non-hydrogen atoms for the F2L optimizations surveyed in F2L2016. The molecular size efficiency for each optimization can be calculated by dividing the change in affinity by the change in in number of non-hydrogen atoms. I've drawn lines corresponding to minimum and maximum values of molecular size efficiency and have also shown the quartiles.

So now it's time to wrap things up. A physical quantity that is expressed in a different unit is still the same physical quantity and I presume that all the authors of F2L2017 would have been aware of this while they were still undergraduates. LE was described as thermodynamically indefensible in comments on Derek's post on NoLE and choosing to defend an indefensible position usually ends in tears (just as it did for the French at Dien Bien Phu in 1954). The dilemma facing those who seek to lead opinion in FBDD is that to embrace the view that the 1 M concentration unit is somehow privileged requires that they abandon fundamental physicochemical principles that they would have learned as undergraduates.   

Sunday, 14 October 2018

A PAINful itch

I've been meaning to take a look at the Seven Year Itch (SYI) article on PAINS for some time. SYI looks back over the the preceding 7 years of PAINS while presenting a view of future directions. One general comment that I would make of SYI is that it appears to try to counter criticisms of PAINS filters without explicitly acknowledging these criticisms.

This will a long post and strong coffee may be required. Before starting, it must be stressed that I neither deny that assay interference is a significant problem nor do I assert that compounds identified by PAINS filters are benign. The essence of my criticism of much of the PAINS analysis is that the rhetoric is simply not supported by the data.  It has always been easy to opine that chemical structures look unwholesome but it has always been rather more difficult to demonstrate that compounds are behaving pathologically in assays. One observation that I would make about modern drug discovery is that fact and opinion often become entangled to the extent that those who express (and seek to influence) opinions are no longer capable of distinguishing what they know from what they believe.

I've included some photos to break up the text a bit and these are from a 2016 visit to the north of Vietnam.  I'll start with this one taken from the western shore of Hoan Kiem Lake the night after the supermoon.

Hanoi moon

I found SYI to be something of a propaganda piece with all the coherence of a six-hour Fidel Castro harangue. As is typical for articles in the PAINS literature, SYI is heavy in speculation and opinion but is considerably lighter in facts and measured data. It wastes little time in letting readers know how many times the original PAINS article was cited. One criticism that I have made about the original PAINS article (that also applies to SYI and the articles in between) is that the article neither defines the term PAINS (other than to expand the acronym) nor does it provide objective criteria by which a compound can be shown experimentally to be (or not to be) a PAINS (or is that a PAIN). An 'unofficial' definition for the term PAINS has actually been published and I think that it's pretty good:

"PAINS, or pan-assay interference compounds, are compounds that have been observed to show activity in multiple types of assays by interfering with the assay readout rather than through specific compound/target interactions."

While PAINS purists might denounce the creators of the  unofficial PAINS definition for heresy and unspecified doctrinal errors, I would argue that the unofficial definition is more useful than the official definition (PAINS are pan-assay interference compounds). I would also point out that some of those who introduced the unofficial definition actually use experiments to study assay interference when much of the official PAINSology (or should that be PAINSomics) consists of speculation about the causes of  frequent-hitter behavior. One question that I shall put to you, the reader, is how often, when reading an article on PAINS, do you see real examples of experimental studies that have clearly demonstrated that specific compounds exhibit pan-assay interference?

Restored bunker and barbed wire at Strongpoint Béatrice which was the first to fall to the Viet Minh.

Although the reception of PAINS filters has generally been positive, JCIM has published two articles (the first by an Associate Editor of that journal and the second by me) that examine the PAINS filters critically from a cheminformatic perspective. The basis of the criticism is that the PAINS filters are predictors of frequent hitter behavior for assays using an AlphaScreen readout and they have been developed using proprietary data. It's a quite a leap from frequent-hitter behavior when tested at single concentrations in a panel of six AlphaScreen assays to pan-assay interference. In the language of cheminfomatics, we can state that the PAINS filters have been extrapolated out of a narrow applicability domain  and they have been reported (ref and ref) to be less predictive of frequent-hitter behavior in these situations. One point that I specifically made was that a panel of six assays all using the same readout is a suboptimal design of an experiment to detect and quantify pan-assay interference.

In my article, bad behavior in assays was classified as Type 1 ( assay result gives an incorrect indication of the extent to which the compound affects the function of the target) or Type 2 (compounds affect target function by an undesirable mechanism of action). I used these rather bland labels because I didn't want to become ensnared in a Dien Bien Phu of nomenclature and it must be stressed that there is absolutely no suggestion that other people use these labels. My own preference would actually be to only use the term interference for Type 1 bad behavior and it's worth remembering that Type 1 bad behavior can also lead to false negatives.

The distinction between Type 1 and and Type 2 behaviors is an important and useful one to make from the perspective of drug discovery scientists who are making decisions as to which screening hits to take forward. Type 1 behavior is undesirable because it means that you can't believe the screening result for hits but, provided that you can find an assay (e.g. label-free measurement of affinity) that is not interfered with, Type 1 behavior is a manageable, although irksome, problem. Running a second assay that uses an orthogonal readout may shed light on whether Type 1 behavior is an issue although, in some cases, it may be possible to assess, and even correct for, interference without running the orthogonal second assay. Type 2 behavior is a much more serious problem and a compound that exhibits Type 2 behavior needs to be put out of its misery as swiftly and mercifully as possible. The challenge presented by Type 2 behavior is that you need to establish the mechanism of action simply to determine whether or not it is desirable. Running a second assay with an orthogonal readout is unlikely to provide useful information since the effect on target function is real.

Barbed wire at Strongpoint Béatrice. I'm guessing that it was not far from here that, on the night of 13th/14th March, 1954, Captain Riès would have made the final transmission: "It's all over - the Viets are here. Fire on my position. Out."

Most (all?) of the PAINSology before SYI failed to make any distinction between Type 1 and Type 2 bad behavior. SYI states "There does not seem to be an industry-accepted nomenclature or ontology of anomalous binding behavior" and makes some suggestions as to how this state of affairs might be rectified. SYI recommends that "Actives" be first classified as "target modulators" or "readout modulators". The "target modulators" are all considered to be "true positives" and these are further classified as "true hits" or "false hits". All the "readout modulators" are labelled as "false positives". Unsurprisingly, the authors recommend that all the "false hits" and "false positives" be labelled as pan-assay interference compounds regardless of whether the compounds in question actually exhibit pan-assay interference. In general, I would advise against drawing a distinction between the terms "hit" and "positive" in the context of screening but, if you chose to do so, then you do really do need to define the terms much more precisely than the authors have done.

I think the term "readout modulator" is reasonable and is equivalent to my definition of Type 1 behavior (assay result gives an incorrect indication of the extent to which the compound affects the function of the target). However, I strongly disagree with the classification of compounds showing "non-specific interaction with target leading to active readout" as readout modulators since I'd regard any interaction with the target that affects its function to be modulation. My understanding is that the effects of colloidal aggregators on protein function are real (although not exploitable) and that it is often possible to observe reproducible concentration responses. My advice to the authors is that, if you're going to appropriate colloidal aggregators as PAINS, then you might at least put them in the right category.

While the term "target modulator" is also reasonable, it might not be a such great idea to use it in connection with assay interference since it's also quite a good description of a drug. Consider the possibility of homeopaths and anti-vaxxers denouncing the pharmaceutical industry for poisoning people with target modulators. However, I disagree with the use of the term "false hit" since the modulation of the target is real even when the mechanism of action is not exploitable. There is also a danger of confusing the "false hits" with the "false positives" and SYI is not exactly clear about the distinction between a "hit" and a "positive". In screening both terms tend to be used to specify results for which the readout exceeds a threshold value.

The defensive positions on one of the hills of Strongpoint Béatrice have not been restored. Although the trenches have filled in with time, they are not always as shallow as they appear to be in this photo (as I discovered when I stepped off the path).

It's now time to examine what SYI has to to say and singlet oxygen is as good a place as any to start from. One criticism of PAINS filters that I have made, both in my article and the Molecular Design blog, is that some of the frequent-hitter behavior in the PAINS assay panel may be due to quenching or scavenging of singlet oxygen which is an essential component of the the AlphaScreen readout. SYI states:

"However, while many PAINS classes contain some member compounds that registered as hits in all the assays analyzed and that therefore could be AlphaScreen-specific signal interference compounds, most compounds in such classes signal in only a portion of assays. For these, chemical reactivity that is only induced in some assays is a plausible mechanism for platform-independent assay interference."

The authors seem to be interpreting the observation that a compound only hits in a portion of assays as evidence for platform-independent assay interference. This is actually a very naive argument for a number of reasons. First, compounds do not all appear to have been assayed at the same concentration in the original PAINS assay panel and there may be other sources of variation that were not disclosed. Second, different readout thresholds may have been used for the assays in the panel and noise in the readout introduces a probabilistic element to whether or not the signal for a compound exceeds the threshold. Last, but definitely not least, the molecular structure of a compound does influence the efficiency with which it quenches or scavenges singlet oxygen. A recent study observed that PAINS "alerts appear to encode primarily AlphaScreen promiscuous molecules"

If you read enough PAINS literature, you'll invariably come across sweeping generalizations made about PAINS. For example, it has been claimed that "Most PAINS function as reactive chemicals rather than discriminating drugs." SYI follows this pattern and asserts:

"Another comment we frequently encounter and very relevant to this journal is that PAINS may not be appropriate for drug development but may still comprise useful tool compounds. This is not so, as tool compounds need to be much more pharmacologically precise in order that the biological responses they invoke can be unambiguously interpreted."

While it is encouraging that the authors have finally realized the significance of the distinction between readout modulators and target modulators, they don't seem to be fully aware of the implications of making this distinction. Specifically, one can no longer make the sweeping generalizations about PAINS that are common in PAINS literature. Consider a hypothetical compound that is an efficient quencher of singlet oxygen and that has shown up as a hit in all six AlphaScreen assays of the original PAINS assay panel. While many would consider this compound to be a PAINS (or PAIN), I would strongly challenge a claim that observation of frequent-hitter behavior in this assay panel would be sufficient to rule out the use of the compound as a tool.

SYI notes that PAINS are recognized by other independently developed promiscuity filters.

"The corroboration of PAINS classes by such independent efforts provides strong support for the structural filters and subsequent recognition and awareness of poorly performing compound classes in the literature. It is instructive therefore to introduce two more recent and fully statistically validated frequent-hitter analytical methods that are assay platform-independent. The first was reported in 2014 by AstraZeneca(16) and the second in 2016 by academic researchers and called Badapple.(27)"

I don't think it is particularly surprising (or significant) that some of the PAINS classes are recognized as frequent-hitters by other models for frequent-hitter behavior. What is not clear is how many of the PAINS classes are recognized by the other frequent-hitter models or how 'strong' the recognition is. I would challenge the description of the AstraZeneca frequent-hitter model as "fully statistically validated" since validation was performed using proprietary data. I made a similar criticism of the original PAINS study and would suggest that the authors take a look at what this JCIM editorial has to say about the use of proprietary data in modeling studies.       

The French named this place Eliane and it was quieter when I visited than it would have been on 6th May, 1954 when the Viet Minh detonated a large mine beneath the French positions. It has been said that the alphabetically-ordered (Anne-Marie to Isabelle) strongpoints at Dien Bien Phu were named for the mistresses of the commander, Colonel (later General) Christian de Castries although this is unlikely.

SYI summarizes as follows:

"In summary, we have previously discussed a variety of issues key to interpretation of PAINS filter outputs, ranging from HTS library design and screening concentration, relevance of PAINS-bearing FDA-approved drugs, issues in SMARTS to SLN conversion, the reality of nonfrequent hitter PAINS, as well as PAINS and non-PAINS that are respectively not recognized or recognized in the PAINS filters as originally published. However, nowhere has a discussion around these key principles been summarized in one article, and that is the point of the current article. Had this been the case, we believe some recent contributions to the literature would have been more thoughtfully directed. (21,32)"

I must confess that reference to the reality of nonfrequent hitter pan assay interference compounds would normally prompt me to advise authors to stay off the peyote until the manuscript has been safely submitted. However, the bigger problem embedded in the somewhat Rumsfeldesque first sentence is that you need objective and unambiguous criteria by which compounds can be determined to be PAINS or non-PAINS before you can talk about "key principles". You also need to acknowledge that interference with readout and and undesirable mechanisms of action are entirely different problems requiring entirely different solutions.

I noted that recent contributions to the literature from me and from a JCIM Associate Editor (who might know a bit more about cheminformatics than the authors) were criticized for being insufficiently thoughtful. To be criticized in this manner is, as the late, great Denis Healey might have observed, "like being savaged by a dead sheep". Despite what the authors believe, I can confirm that my contribution to the literature would have been very similar even if SYI had been published beforehand. Nevertheless, I would suggest to the authors that dismissing the feedback from a JCIM Associate Editor as if he were a disobedient schoolboy might not have been such a smart move. For example, it could get the JMC editors wondering a bit more about exactly what they'd got themselves into when they decided to endorse a frequent-hitter model as a predictor of pan-assay interference. The endorsement of a predictive model by a premier scientific journal represents a huge benefit to the creators of the model but the flip side is that it also represents a huge risk to the journal. 

So that's all that I want to say about PAINS and it's a  good point to wrap things up so that I can return to Vietnam for the remainder of the post.       

I'm pretty sure that neither General Giap nor General de Castries visited the summit of Fansipan which at 3143 meters is the highest point in Vietnam (I wouldn't have either had a cable car had not been installed a few months before I visited). It's a great place to enjoy the sunset.

Back in Hanoi, I attempted to pay my respects to Uncle Ho, as I've done on two previous visits to this city, but timing was not great (they were doing the annual formaldehyde change). Uncle Ho is in much better shape than Chairman Mao who is actually seven years 'younger' and this is a consequence of having been embalmed by the Russians (the acknowledged experts in this field). Chairman Mao had the misfortune to expire when Sino-Soviet relations were particularly frosty and his pickling was left to some of his less expert fellow citizens. It is also said that the Russian embalming team arrived in Hanoi before Uncle Ho had actually expired...

Catching up with Uncle Ho


Sunday, 7 October 2018

More hydrogen bonding asymmetries

I examined an article on the polarized nature of protein-ligand binding interfaces previously and promised that I'd discuss a completely different type of hydrogen bond asymmetry which is based not on structure but on energetics. Readers may be familiar with hydrogen bond (HB) acidity and basicity which may be quantified by measurement of 1:1 association constants for hydrogen bonded complexes in non-hydrogen bonding solvents (e.g. carbon tetrachloride). Here are three references (1 | 2 | 3) and I'll also mention that molecular electrostatic potential (MEP) can be used for prediction of both HB acidity and HB basicity.

As discussed in this article, measurements of HB acidity and basicity have their limitations when trying to use them to understand and predict solvation behavior in aqueous media. First, measuring the association constant for a 1:1 complex does not tell us what will happen when two water molecules simultaneously donate hydrogen bonds to the oxygen atom of a carbonyl group. Second, the measured association constants cannot be used to compare HB acceptors with HB donors. This may seem a perverse sort of thing to want to do but one of the things that drug designers are interested in is the ease of dragging different HB donors and acceptors out of water.

Lake Liadskoye

Prediction of alkane/water partition coefficients (logPalk) has been a long standing interest (1 | 2 | 3) of mine for a number of years. It turns out that analysis of logPalk values measured for structurally prototypical model compounds can tell us quite a lot about what happens when you drag individual HB donors and acceptors out of water. The analysis is based on the observation of a very strong correlation between molecular surface area (MSA) and logPalk. The figure below shows the response of logPalk to MSA for saturated hydrocarbons, aliphatic alcohols (single hydroxyl group) and aliphatic diols. The lines of fit are essentially parallel and equally spaced which suggests that the effect on logPalk of adding a hydroxyl group to a saturated hydrocarbon or to an aliphatic alcohol is constant. This suggests treating polar groups as perturbations of saturated hydrocarbons for prediction of logPalk and analysis of data like what is shown in Figure 1 can be used to parameterize the perturbations for different polar groups. The approach, described in this article is to first calculate logPalk for a hypothetical saturated hydrocarbon with the same MSA as the compound of interest and then to sum the parameters for the polar groups in the molecular structure to account for the introduction of these polar groups. 

Figure 1. Relationship between alkane/water logP and molecular surface area (MSA) for saturated hydrocarbons, saturated alcohols and saturated diols. Neither of the aliphatic diols (1,4-butanediol and 1,6-hexanediol) would be expected to form intramolecular HBs in water.  

I think we'll need more data (especially for heterocycles and species with intramolecular HBs) to make this approach to prediction of logPalk generally useful.  However, the size of the effect on logPalk of introducing an HB acceptor or donor into a saturated hydrocarbon does tell us how strongly the HB donor or acceptor interacts with water. It was actually this article which was published after our article on logPalk prediction that got me thinking along these lines. In our article, we showed how polarity can be defined for HB acceptors and donors and calculated from measured alkane/water partition coefficients. Polarity defined in this manner brings HB donors and acceptors onto the same scale and allows us to explore another type of hydrogen bonding asymmetry.

Insects exploiting surface tension in Belovezhskaya Pushcha

For HB acceptors, the approach is simple. First you need to identify appropriate model compounds for which logPalk has been measured. These have only the HB acceptor functional group of interest, saturated carbon and hydrogen in their molecular structures. Next, calculate logPalk for a saturated hydrocarbon with the same MSA as that for the model compound (use line for saturated hydrocarbons in Figure 1 to do this) and subtract the measured logPalk value from the calculated value. Things are a bit more complicated for HB donors because you can't usually have these without an HB acceptor (this is the 'baggage' I discussed in the previous post) and you need to deal with these on a case-by-case basis. For example, you might estimate the polarity of an amide NH by subtracting the polarity of the tertiary amide group from that of the secondary amide group.  Here's a table of polarity estimates for some hydrogen bond acceptors and donors (our article explains how these were derived).  

Table 1. Polarity of HB acceptors and donors estimated from measured alkane/water partition coefficient and molecular surface area

The results in Table 1 show that HB donors are typically more easily pulled out of water than HB acceptors and this can be seen as another hydrogen bonding asymmetry. This appears to go against the folklore that HB donors are somehow worse than HB acceptors from the perspective of drug-likeness. The polarity values for the NH (0.8) and carbonyl O (6.8) of the amide group may have some relevance to protein folding. This a good place to wrap up and I'll conclude by noting that, in the supplemental information for our article, you'll find an archive that contains files (in plain text format) of measured values for logPalk, hydrogen bond basicity and pKa that we extracted from the literature (DOI links are included). Here are some more photos from Belarus.  

Até mais!

Flora and fauna of Belovezhskaya Pushcha

Sunday, 30 September 2018

Hydrogen bonding asymmetries

Next >>

Have you ever wondered why the Rule of 5 (Ro5) specifies hydrogen bond (HB) thresholds of 10 acceptors but only 5 donors? This is, perhaps, the prototypical example of what I'll call a 'hydrogen bonding asymmetry' and it is sometimes invoked in support of the folklore that HB donors are somehow 'worse' than HB acceptors in drug design. I have, on occasion, tried to track down the source of this folklore but that trail has always gone cold on me. In any case, I don't think the HB asymmetry in Ro5 has any physical significance since HB acceptors (especially as defined for Ro5) tend to be more common in chemical structures of interest to medicinal chemists than HB donors. This was discussed in our correlation inflation article and the bigger Ro5 question for me is why the high polarity limit is defined by counts of HB donors and acceptors while the low polarity limit is defined in terms of lipophilicity. As may become a blogging habit, I'll include some random photos (these are from a visit to India late in 2013) to break up the text a bit. 

Drum fest at Buland Darwaza

It was this article in JCAMD about the 'polarized' nature of protein-ligand interfaces that got me thinking again about hydrogen bonding asymmetries. The study found that proteins donate twice as many HBs as they accepted. While the observation is certainly interesting, I do think that the authors might be over-interpreting it. For example, the authors suggest that it appears to be an underlying explanation for Ro5 and they may find that there are significant differences in their definitions of HB acceptors and those used to apply Ro5. The authors also state "Peptidyl ligands, on the other hand, showed no strong preference for donating versus accepting H-bonds". This observation would more be consistent with 'polarization' of protein-ligand interfaces being determined by nature of the ligand.

The authors assert that "lone pairs available to accept H-bonds are actually 1.6 times as prevalent as protons available to donate, both on the protein and ligand side of the interface." While it is appropriate to count lone pairs in situations where only one lone pair accepts an HB (e.g. when considering 1:1 hydrogen bonded complexes in low polarity solvents), I would argue that it is not appropriate to do so when considering biomolecular recognition in aqueous media because the acceptance of an HB by one oxygen lone pair makes the other lone pair less able to accept an HB. You can see this effect using molecular electrostatic potential as discussed in this article (see polarization effects section and Table 4). Put another way, how often is a carbonyl oxygen observed to accept two HBs from a binding partner? How many docking tools would explictly penalize a pose in which a carbonyl oxygen accepted two HBs?

As I see it, a typical protein is more likely to have a surplus of HB donors under normal physiological conditions. Some parts (e.g. serine, threonine, tyrosine and histidine side chains and the backbone) of a protein can be regarded as having equal numbers of HB donor and acceptor atoms. While the anionic side chains of aspartate and glutamate cannot donate HBs, the cationic side chains of arginine and lysine have five and three donor hydrogen atoms respectively while lacking HB acceptors. The tryptophan side chain has only a single HB donor (although its p-system is likely to be able to accept HBs) while each side chain of aspargine and glutamine has two donor hydrogen atoms and one acceptor oxygen atom. The histidine side chain is sometimes observed to be protonated in X-ray crystal structures which means that it should be considered to be more HB donor than HB acceptor in the constext of protein-ligand recognition. The tyrosine hydroxyl would be expected to be a stronger HB donor (and weaker HB acceptor) than the hydroxyls of either serine or threonine.  

A magical place

The study considers the "possibility is that nature avoids the presence of chemical groups bearing both H-bond donor and acceptor capacity, such as hydroxyl groups, in the binding sites of proteins or ligands" although it is not clear what glycobiologists would have to say about this. Let's think a bit about what happens when a hydroxyl group donates its hydrogen atom. Let's suppose you've spotted a nice juicy hydrogen bond acceptor at the bottom of a deep binding pocket that is otherwise hydrophobic. The ligandability is eye-wateringly awesome (the ligandometer is beeping loudly and appears to have gone into dynamic range overload). Even the tiresome Mothers Against Molecular Obesity (MAMO) are impressed and have recommended that you deploy a hydroxyl group since this will be great for property forecast index (PFI). What could possibly go wrong?

The main problem is that the hydroxyl HB donor comes with baggage. In order to donate an HB to the acceptor at the bottom of that pocket, you're going to need to force an HB acceptor into contact with the non-polar part of that binding pocket. Although this contact is not inherently repulsive, it is destabilizing. Another factor is that donation of an HB by the hydroxyl group is likely to increase the HB basicity of the oxygen (which will exacerbate the problem). You can think of other neutral HB donors (e.g. amide NH) but the vast majority of them come with baggage the form of an accompanying HB acceptor. Exceptions such as NH in pyrrole (not renowned for stability) and indole (steric demands) come with baggage of their own. In contrast, the drug designer has access to a diverse set (e.g. heteroaromatic N, nitrile N, tertiary amide O, sulfoxide O, ether O) of HB acceptors that are not accompanied by HB donors. If you use one of these, you don't have the problem of having to also accommodate a ligand HB donor.

This is a good place to wrap up. In the next post, I'll talk about a completely different type of hydrogen bonding asymmetry, but for now, I'll leave you with some photos from an afternoon spent admiring asses in the Rann of Kutch. 

Até mais!