I will never call myself an expert, for to do so would be to
take the first step down a very slippery slope.
It is important to remember that each and every expert has both an applicability
domain and a shelf life. It’s not the
experts themselves to whom I object but what I’ll call ‘expert-driven
decision-making’ and the idea that you can simply delegate your thinking to somebody
else.
I’m going to take a look at an article that describes
attempts to model an expert’s evaluation of chemical probes identified by NIH-funded
high throughput screening. The study covers some old ground, and one issue facing the authors of articles like these is how to deal
with the experts in what might be termed a ‘materials and methods’
context. Experts who are also coauthors of studies like
these become, to some extent, self-certified experts. One observation that can be made about this
article is that its authors appear somewhat preoccupied with the funding levels
for the NIH chemical probe initiative, and I couldn’t help wondering how this might have shaped
the analysis.
A number of approaches were used to model the expert’s assessment
of the 322 probe compounds, 79% of which were considered to be desirable, and these ranged
from simple molecular property filtering to more sophisticated machine learning
models. I noted two errors (missing
molar energy units; taking the logarithm of a quantity with units) in the formula for
ligand efficiency and it’s a shame the authors didn’t see our article on ligand
efficiency metrics, which became available online about six weeks before they
submitted their article (the three-post series starting here may be helpful).
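For concreteness, a dimensionally consistent way to write ligand efficiency (a sketch of the usual definition, in which Kd is taken as the measure of affinity, N_heavy is the number of non-hydrogen atoms and C° is the standard concentration, typically 1 M) is:

LE = −ΔG°/N_heavy = −(RT/N_heavy) × ln(Kd/C°)

The RT term supplies the molar energy units and dividing Kd by C° keeps the argument of the logarithm dimensionless, which addresses the two issues noted above.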
The authors state, “PAINS is a set of filters
determined by identifying compounds that were frequent hitters in numerous high
throughput screens”, which is pushing things a bit because the PAINS filters
were actually derived from analysis of the output from six high throughput
screening campaigns (this is discussed in detail in the three-post series that starts here).
Mean pKa values of 2.25 (undesirable compounds)
and 3.75 (desirable compounds) were reported for basic compounds and it certainly wasn’t
clear to me how compounds were deemed to be basic given that these values are
well below neutral pH. In general, one needs to be very careful when averaging
pKa values (see the sketch below). While these observations
might be seen as nit-picking, using terms like ‘expert’, ‘validation’ and ‘due
diligence’ in the title and abstract does set the bar high.
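To illustrate one reason for that caution (a minimal sketch with invented numbers, and not necessarily the only issue): because pKa is a logarithmic quantity, the arithmetic mean of a set of pKa values is not, in general, the pKa corresponding to the mean of the underlying Ka values.

```python
import numpy as np

# Invented pKa values for two compounds (illustrative only)
pka_values = np.array([2.0, 6.0])

# Arithmetic mean of the pKa values
mean_pka = pka_values.mean()                      # 4.0

# pKa corresponding to the arithmetic mean of the Ka values
ka_values = 10.0 ** (-pka_values)                 # 1e-2 and 1e-6
pka_of_mean_ka = -np.log10(ka_values.mean())      # ~2.3

print(f"mean of pKa values:   {mean_pka:.2f}")
print(f"pKa of mean Ka value: {pka_of_mean_ka:.2f}")
```

Here the two ‘averages’ differ by about 1.7 log units, so it matters what is actually being averaged and how the averaged value is to be interpreted.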
A number of machine learning models were described and
compared in the article and it’s worth saying something about models like
these. A machine learning model is
usually the result of an optimization process.
When we build a machine learning model, we search for a set of
parameter values that optimizes an objective, such as fit or discrimination, for the
data with which we train the model. The
parameters may be simple coefficients (as in a regression
model) but they might also be threshold values for rules. The more parameters you use to build a model
(machine learning or otherwise), the more highly optimized the resulting model
will be, and we use the term ‘degrees of freedom’ to say how many parameters
we’ve used when training the model (the sketch below illustrates why this matters).
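Here is a minimal sketch of why the count matters (invented one-dimensional data and plain NumPy, so just an illustration of the general point rather than anything from the featured article): each extra polynomial coefficient is an extra fitted parameter, the fit to the training data can only improve as parameters are added, but performance on data that were not used for training need not.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_quadratic(x):
    # Invented 'true' relationship plus noise (illustrative only)
    return 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.2, size=x.size)

x_train = np.linspace(0.0, 1.0, 20)
y_train = noisy_quadratic(x_train)
x_test = np.linspace(0.025, 0.975, 50)
y_test = noisy_quadratic(x_test)

for degree in (1, 2, 5, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # degree + 1 fitted parameters
    rmse_train = np.sqrt(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"{degree + 1:2d} parameters: train RMSE {rmse_train:.3f}, test RMSE {rmse_test:.3f}")
```

The training error can only fall as parameters are added; the error on the held-out points is what tells you whether anything has actually been learned.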
You have to be very careful when comparing models that have different numbers of degrees
of freedom associated with them, and one criticism that I would make of machine learning models is that the number of degrees of freedom is rarely (if ever)
given.
Over-fitting is always a concern
with models and it is customary to validate machine learning models using one
or more of a variety of protocols. Once a machine learning model has been validated,
the number of degrees of freedom is typically considered to be a non-issue.
Clustering in data can cause validation to make optimistic assessments of model
quality (the sketch below shows the effect) and the predictive chemistry community does need to pay more attention
to Design of Experiments. Here’s a slide
that I sometimes use in molecular design talks.
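To make the clustering point concrete, here’s a minimal sketch (invented data; scikit-learn, a random forest and R² are my choices for illustration, not anything taken from the featured article). The response depends only on a cluster-level value, so a model can learn nothing that generalizes to new clusters, yet a random split scatters near-duplicates across folds and rewards memorisation, while holding out whole clusters does not.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)

# Invented data set: 40 clusters of 5 near-duplicate 'compounds' each.
n_clusters, cluster_size, n_features = 40, 5, 10
centres = rng.normal(size=(n_clusters, n_features))
cluster_effect = rng.normal(size=n_clusters)

groups = np.repeat(np.arange(n_clusters), cluster_size)
X = centres[groups] + 0.05 * rng.normal(size=(groups.size, n_features))
y = cluster_effect[groups] + 0.1 * rng.normal(size=groups.size)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Random 5-fold CV: members of the same cluster land in both training and test folds.
random_r2 = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

# Cluster-aware 5-fold CV: whole clusters are held out together.
grouped_r2 = cross_val_score(
    model, X, y, groups=groups, cv=GroupKFold(n_splits=5), scoring="r2"
)

print(f"random K-fold mean R2:  {random_r2.mean():.2f}")
print(f"grouped K-fold mean R2: {grouped_r2.mean():.2f}")
```

The random split flatters the model because each test compound has near-twins in the training set; the grouped split is closer to asking how the model would fare on genuinely new chemistry, which is usually the question of interest.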
Let’s get back to the machine learning models in the
featured article. Comparisons were made
between models (see Figure 4 and Table 5 in the article) but no mention is made
of the numbers of degrees of freedom for the models. I took a look in the supplementary
information to see if I could get this information by looking at the models themselves
and discovered that the models had not actually been reported. In fact, the
expert’s assessment of the probes had not been reported either and I don't believe that this article scores highly for either reproducibility or openness.
Had this come to me as a manuscript reviewer,
the response would have been swift and decisive and you probably wouldn’t be reading
this blog post. How much weight should those responsible for the NIH chemical probes initiative give to the study? I’d say they can safely ignore it because the
data set is proprietary and models trained on it are only described and not
actually specified. Had the expert’s opinion on the desirability (or otherwise) of the probes been disclosed, then it would have been imprudent for the NIH folk to ignore what the expert had to say. At the same time, it’s worth remembering that we seek different things from probes and from drugs, and one expert’s generic opinion of a probe needs to be placed in the context of any specific design associated with the probe’s selection.
However, there are other reasons that the NIH chemical probes folk might want to be wary of data analysis from this source and I'll say a bit more about these in the next post.