Monday, 29 July 2013

Some reflections on computer-aided drug design (after attending CADD Gordon conference)

I’ve just returned to Trinidad where I’ve been spending the summer. I was in the USA at the Computer-Aided Drug Design (CADD) Gordon Conference (GRC) organized by Anthony Nicholls and Martin Stahl. The first thing that I should say is that this will not be a conference report because what goes on at all Gordon Conferences is confidential and off the record. This is intended to make discussions freer and less inhibited and you won’t see proceedings of GRCs published. Nevertheless, the conference program is available online so I think that it’ll be OK to share some general views of the CADD field (which have been influenced by what I picked up at the conference) even though commenting on specific lectures, posters or discussions would be verboten.

The focus of the conference was Statistics in CADD and what I took most notice of was the use of Bayesian methods, although I still need to get my head round things a bit more. Although modelling affinity/potency and the output of ADMET assays (e.g. hERG blockade) tends to dominate the thinking of CADD scientists, the links between these properties and clinical outcomes in live humans are not as strong as many assume. Could the wisdom of Rev Bayes be applied to PK/PD modeling and the design of clinical trials? I couldn’t help worrying about closet Bayesians in search of a good posterior and what would be the best way to quantify the oomph of a ROC curve...

Reproducibility is something that needs to be addressed in CADD studies and if we are to improve this we must be prepared to share both data and methods (e.g. software).  This open access article should give you an idea of some of the issues and directions in which we need to head. Journal editors have a part to play here and must resist the temptation to publish retrospective analyses of large proprietary data sets because of the numbers of citations that they generate.  At the same time, journal editors should not be blamed for supplemental information ending up in PDF format.  For example, I had no problems (I just asked) getting JMC (2008), JCIM (2009) and JCAMD (2013 and 2013) to publish supplemental information in text (or zipped text) format.

When you build models from data, it is helpful to think of signal and noise. The noise can be thought of as coming from both the model and the data and in some cases it may be possible to resolve it into these two components. The function of Statistics is to provide an objective measure of the relative magnitudes of signal and noise but you can’t use Statistics to make noise go away (not that this stops people from trying). Molecular design can be defined as control of the behavior of compounds and materials by manipulation of molecular properties and can be thought of as being prediction-driven or hypothesis-driven. Prediction-driven molecular design is about building predictive models but it is worth remembering that much (most?) pharmaceutical design involves a significant hypothesis-driven component. One way of thinking about hypothesis-driven molecular design is as a framework for assembling structure activity/property relationships (SAR/SPR) as efficiently as possible but this is not something that statistical methodology currently appears equipped to do particularly well.
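On the signal-and-noise point, here is a minimal sketch of how the noise might be resolved into data and model components. It assumes that the two sources of noise are independent and additive (so their variances add) and that replicate measurements are available to estimate the data noise. The function name and its arguments are my own inventions for illustration and are not taken from any conference material.

```python
from statistics import mean, variance


def resolve_noise(residuals, replicate_sets):
    """Split residual variance into data and model parts.

    residuals: (measured - predicted) values for single measurements.
    replicate_sets: lists of repeat measurements, one list per compound.
    Assumes independent, additive noise so that variances simply add.
    """
    # Total noise: mean squared residual of the model against the data.
    total_var = mean(r * r for r in residuals)
    # Data noise: average within-compound variance of the replicates.
    data_var = mean(variance(reps) for reps in replicate_sets)
    # Model noise: whatever is left over (floored at zero).
    model_var = max(total_var - data_var, 0.0)
    return total_var, data_var, model_var
```

If the data variance turns out to be about as large as the total, no amount of extra modelling effort is going to make the remaining noise go away.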

The conference has its own hashtag (#grccadd) and appeared to out-tweet the Sheffield Cheminformatics conference which ran concurrently. Some speakers have shared their talks publicly and a package of statistical tools created especially for the conference is available online.

Literature cited and links to talks
WP Walters (2013) Modeling, informatics, and the quest for reproducibility. JCIM 53:1529-1530 DOI

CC Chow, Bayesian and MCMC methods for parameter estimation and model comparison. Link

N Baker, The importance of metadata in preserving and reusing scientific information. Link

PW Kenny, Tales of correlation inflation.  Link

CADD Stat Link

Sunday, 14 July 2013

Prediction of alkane/water partition coefficients

Those of you who follow this blog will know that I have a long-standing interest in alkane/water partition coefficients and I’d like to tell you a bit about the ClogPalk model for predicting these from molecular structure that we published during my time in Brasil. Some years ago we explored prediction of ΔlogP (logPoct - logPalk) from calculated molecular electrostatic potentials and this can be thought of as treating the alkane/water partition coefficient as a perturbation of the octanol/water partition coefficient. One disadvantage of this approach is that it requires access to logPoct and I was keen to explore other avenues. The correlation of logPalk with computed molecular surface area (MSA) is excellent for saturated hydrocarbons and I wondered if this class of compound might represent a suitable reference state for another type of perturbation model. Have a look at Fig 1 which shows plots of logPalk against MSA for saturated hydrocarbons (green), aliphatic alcohols (red) and aliphatic diols (blue). You can see how adding a single hydroxyl group to a saturated hydrocarbon shifts logPalk down by about 4.5 units and adding two hydroxyl groups shifts logPalk further still.

The perturbations are defined substructurally using SMARTS notation. Specifically, each perturbation term consists of a SMARTS definition for the relevant functional group and a decrement term (e.g. 4.5 units for alcohol hydroxyl). The model also allows functional groups to interact with each other. For example, an intramolecular hydrogen bond ‘absorbs’ some of a molecule’s polarity and manifests itself as an unexpectedly high logPalk value. Take a look at this article if you’re interested in this sort of thing. The interaction terms can be thought of as perturbations of perturbations. The ClogPalk model is shown in Fig 2.
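To make the structure of this kind of perturbation model a bit more concrete, here is a rough Python sketch. It is not the published ClogPalk code (which uses the OpenEye toolkits). The reference-line coefficients are placeholders, the SMARTS pattern is just an illustrative alcohol definition, and I've assumed that each occurrence of a group contributes the same decrement, which is a simplification. The only number taken from the text above is the 4.5 unit decrement for an alcohol hydroxyl. Substructure matching is left out, so the group counts have to be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perturbation:
    name: str
    smarts: str       # substructure definition (not matched in this sketch)
    decrement: float  # logPalk units subtracted per occurrence (assumption)


# Hypothetical reference line for saturated hydrocarbons:
# logPalk = REF_INTERCEPT + REF_SLOPE * MSA. Both values are placeholders.
REF_INTERCEPT = 0.0
REF_SLOPE = 0.03  # per square angstrom, illustrative only

# The 4.5 unit decrement comes from the text; the SMARTS is illustrative.
ALCOHOL_OH = Perturbation("alcohol hydroxyl", "[OX2H][CX4]", 4.5)


def predict_logpalk(msa, group_counts, perturbations, interaction_correction=0.0):
    """Reference state minus group decrements plus interaction corrections."""
    reference = REF_INTERCEPT + REF_SLOPE * msa
    penalty = sum(p.decrement * group_counts.get(p.name, 0) for p in perturbations)
    # A positive interaction_correction would represent, for example, an
    # intramolecular hydrogen bond 'absorbing' some of the polarity.
    return reference - penalty + interaction_correction
```

Calling predict_logpalk(100.0, {"alcohol hydroxyl": 1}, [ALCOHOL_OH]) gives a value 4.5 units below the hydrocarbon reference line at the same surface area, which is the sort of shift you see between the green and red series in Fig 1.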

The performance of the model against external test data is shown in Figure 3. There do appear to be some issues with some of the data and measured values of logPalk were found to differ by two or more units for some compounds (Atropine, Propranolol, Papaverine). Also there are concerns about the self-consistency of the measurements for Cortexolone, Cortisone and Hydrocortisone. Specifically, the logPalk of Cortexolone (-1.00) is actually lower than that for its keto analogue Cortisone (-0.55).
The software was built using OpenEye programming toolkits (OEChem and Spicoli) and you’ll find the source code and makefiles in the supplementary information with all the data used to parameterize and test the models. It’s not completely open source because you’ll need a license from OpenEye to actually run the software. However, the documentation for the toolkits is freely available online and you may even be able to get an evaluation license to see how things work. You’ll also find the source code for SSProFilter in the supplemental material and this is an improved (it also profiles) version of the Filter program that I put together with the Daylight toolkit back in 1996. It's very useful for designing screening libraries and you might want to take a look at this post on SMARTS from a couple of years ago.

There's some general discussion in the article that is not specific to the ClogPalk model and I'll mention it briefly since I think this is relevant to molecular design. Those of you who believe that the octanol/water partition coefficient is somehow fundamental might like to trace how we ended up with this particular partitioning system. We also address the question of whether logP or logD is the more appropriate measure of lipophilicity and some ligand efficiency stuff from an earlier post makes its journal debut.
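For anybody who needs a reminder of how logP and logD differ, here is a small sketch using the standard textbook relationship for a compound with a single ionizable group. It assumes that only the neutral form partitions into the organic phase and it is not taken from the article. The function and argument names are mine.

```python
import math


def log_d(log_p, pka, ph, is_acid):
    """logD for a monoprotic acid or base, assuming ions do not partition."""
    # Acids ionize as pH rises above pKa; bases ionize as pH falls below pKa.
    exponent = (ph - pka) if is_acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)
```

When pH equals pKa, half of the compound is ionized and logD sits log10(2) (about 0.3 units) below logP. Far from the pKa on the ionized side, logD falls by roughly one unit for each unit of pH.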
That’s about all I wanted to say for now and I’ll finish by noting that the manuscript was originally submitted to another journal but that's going to be the subject of a post all of its very own...
Literature cited
Toulmin, Wood, Kenny (2008) Toward prediction of alkane/water partition coefficients. J Med Chem 51:3720-3730 DOI
Kenny, Montanari, Prokopczyk (2013) ClogPalk: A method for predicting alkane/water partition coefficient. JCAMD 27:389-402 DOI