Monday, 13 April 2015


I will never call myself an expert, for to do so would be to take the first step down a very slippery slope.  It is important to remember that each and every expert has both an applicability domain and a shelf life.  It’s not the experts themselves to whom I object but what I’ll call ‘expert-driven decision-making’ and the idea that you can simply delegate your thinking to somebody else.

I’m going to take a look at an article that describes attempts to model an expert’s evaluation of chemical probes identified by NIH-funded high throughput screening.  The study covers some old ground, and one issue facing authors of articles like these is how to deal with the experts in what might be termed a ‘materials and methods’ context.   Experts who are also coauthors of studies like these become, to some extent, self-certified experts.   One observation that can be made about this article is that its authors appear somewhat preoccupied with the funding levels for the NIH chemical probe initiative, and I couldn't help wondering how this might have shaped the analysis.

A number of approaches were used to model the expert’s assessment of the 322 probe compounds, of which 79% were considered desirable.   The approaches used by the authors ranged from simple molecular property filtering to more sophisticated machine learning models.  I noted two errors (missing molar energy units; taking the logarithm of a quantity with units) in the formula for ligand efficiency and it’s a shame they didn’t see our article on ligand efficiency metrics, which became available online about six weeks before they submitted their article (the three-post series starting here may be helpful).  The authors state, “PAINS is a set of filters determined by identifying compounds that were frequent hitters in numerous high throughput screens”, which is pushing things a bit because the PAINS filters were actually derived from analysis of the output of six high throughput screening campaigns (this is discussed in detail in the three-post series that starts here).  Mean pKa values of 2.25 (undesirable compounds) and 3.75 (desirable compounds) were reported for basic compounds, and it certainly wasn’t clear to me how compounds were deemed to be basic given that these values are well below neutral pH. In general, one needs to be very careful when averaging pKa values.  While these observations might be seen as nit-picking, using terms like ‘expert’, ‘validation’ and ‘due diligence’ in the title and abstract does set the bar high.
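For what it's worth, here's how I'd write these two calculations to avoid the pitfalls (a minimal sketch of my own, not taken from the article: it assumes a standard concentration of 1 M and a temperature of 298.15 K, and the function name is mine):

```python
from math import log, log10

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)
C_STD = 1.0         # standard concentration in mol/L

def ligand_efficiency(kd_molar, n_heavy, temp_k=298.15):
    """LE = -RT*ln(Kd/C) / N_heavy, in kcal/mol per heavy atom.

    Dividing Kd by the standard concentration keeps the logarithm's
    argument dimensionless; RT supplies the molar energy units.
    """
    delta_g = R_KCAL * temp_k * log(kd_molar / C_STD)  # binding free energy, kcal/mol
    return -delta_g / n_heavy

# A 1 nM binder with 25 heavy atoms: LE of about 0.49 kcal/mol per heavy atom
le = ligand_efficiency(1e-9, 25)

# Averaging pKa values: the arithmetic mean of pKa corresponds to the
# geometric mean of Ka, not the arithmetic mean of Ka.
pkas = [2.0, 6.0]
mean_pka = sum(pkas) / len(pkas)              # 4.0
kas = [10.0 ** -p for p in pkas]
pka_of_mean_ka = -log10(sum(kas) / len(kas))  # about 2.3, dominated by the stronger acid
```

The two averages differ by well over a pKa unit here, which is one reason a mean pKa for a set of 'basic' compounds needs careful interpretation.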

A number of machine learning models were described and compared in the article and it’s worth saying something about models like these.  A machine learning model is usually the result of an optimization process.  When we build a machine learning model, we search for a set of parameters that optimizes an objective, such as fit or discrimination, for the data with which we train the model.  The parameters may be simple coefficients (as in a regression model) but they might also be threshold values for rules.  The more parameters you use to build a model (machine learning or otherwise), the more highly optimized the resulting model will be, and we use the term ‘degrees of freedom’ to say how many parameters we’ve used when training the model.   You have to be very careful when comparing models that have different numbers of degrees of freedom associated with them, and one criticism that I would make of machine learning models is that the number of degrees of freedom is rarely (if ever) given.  Over-fitting is always a concern with models and it is customary to validate machine learning models using one or more of a variety of protocols. Once a machine learning model has been validated, the number of degrees of freedom is typically considered to be a non-issue. Clustering in data can cause validation to make optimistic assessments of model quality and the predictive chemistry community does need to pay more attention to Design of Experiments. Here’s a slide that I sometimes use in molecular design talks.
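The clustering point can be made with a toy demonstration (a sketch of my own, not anything from the article: synthetic clustered data stands in for chemical series, and a 1-nearest-neighbour 'model' stands in for something fancier). When near-replicates of a test compound sit in the training set, a random split rewards memorization; holding out whole clusters gives a much harsher, more honest estimate:

```python
import random

random.seed(0)

def make_clustered_data(n_clusters=20, per_cluster=10, noise=0.05):
    """Synthetic 'chemical series': points within a cluster are
    near-replicates sharing a label; the label is assigned per cluster
    at random, so it carries no information across clusters."""
    data = []
    for g in range(n_clusters):
        cx, cy = random.uniform(0, 10), random.uniform(0, 10)
        label = random.choice([0, 1])
        for _ in range(per_cluster):
            point = (cx + random.gauss(0, noise), cy + random.gauss(0, noise))
            data.append((point, label, g))
    return data

def knn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier (squared distance)."""
    correct = 0
    for (px, py), label, _ in test:
        nearest = min(train, key=lambda d: (d[0][0] - px) ** 2 + (d[0][1] - py) ** 2)
        correct += (nearest[1] == label)
    return correct / len(test)

data = make_clustered_data()

# Random split: near-replicates of each test point remain in the training set.
shuffled = data[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
acc_random = knn_accuracy(shuffled[:half], shuffled[half:])

# Group split: whole clusters are held out, as in a scaffold or series split.
train = [d for d in data if d[2] < 10]
test = [d for d in data if d[2] >= 10]
acc_group = knn_accuracy(train, test)

# acc_random comes out near perfect; acc_group hovers around chance.
```

The model hasn't learned anything that transfers between clusters, yet the random split reports excellent performance, which is exactly the optimistic assessment that clustered data can induce.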

Let’s get back to the machine learning models in the featured article.  Comparisons were made between models (see Figure 4 and Table 5 in the article) but no mention is made of the numbers of degrees of freedom for the models.   I took a look in the supplementary information to see if I could get this information by examining the models themselves and discovered that the models had not actually been reported. In fact, the expert’s assessment of the probes had not been reported either, and I don't believe that this article scores highly for either reproducibility or openness.   Had this come to me as a manuscript reviewer, the response would have been swift and decisive, and you probably wouldn’t be reading this blog post. How much weight should those responsible for the NIH chemical probes initiative give to the study?  I’d say they can safely ignore it because the data set is proprietary and the models trained on it are only described, not actually specified.  Had the expert's opinion on the desirability (or otherwise) of the probes been disclosed, then it would have been imprudent for the NIH folk to ignore what the expert had to say. At the same time, it's worth remembering that we seek different things from probes and from drugs, and one expert's generic opinion of a probe needs to be placed in the context of any specific design associated with the probe's selection.

However, there are other reasons that the NIH chemical probes folk might want to be wary of data analysis from this source and I'll say a bit more about these in the next post.
