Molecular Design: September 2024

I'll open the post with a panorama from the summit of Shutlingsloe, sometimes referred to as Cheshire's Matterhorn, which at 506 m above sea level, is the third highest point in the county. When in the UK, I usually come here to mark the solstices and there's usually a good crowd here for the occasion (the winter solstice tends to be less well attended).

The LR2024 study (Combining IC₅₀or K_i Values from Different Sources Is a Source of Significant Noise) that I’ll be discussing in this post highlights one of the issues that you’re likely to encounter should as you be using public domain databases such as ChEMBL to create datasets for building machine learning (ML) models for biological activity. The LR2024 study has already been reviewed in a Practical Fragments post (The limits of published data) and, using the same reference numbers as were used in the study, I’ll also mention 10 (The Experimental Uncertainty of Heterogeneous Public K_i Data) and 11 (Comparability of Mixed IC₅₀ Data – A Statistical Analysis). The variability in biological activity data highlighted by LR2024 stems in part from the fact that the term IC₅₀ may refer to different quantities even when measurements are performed for the same target and inhibitor/ligand (the issue doesn’t entirely disappear when you use K_i values). I have two general concerns with the analysis LR2024 study. First, it is unclear whether the ChEMBL curation process captures assay conditions in sufficient detail to enable the user to establish that two IC₅₀ values can be regarded as replicates of the same experiment (I stress that this is not a criticism of the curation process). Second, combining data for different pairs of assays for calculation of correlation-based measures of assay compatibility can lead to correlation inflation. One minor gripe that I do have with the LR2024 study concerns the use of the term “noise” which, in my view, should only refer to variation in values measured under identical conditions.

I'll review LR2024 in the first part of the post before discussing points not covered by the study such as irreversible inhibition and assay interference (these can cause systematic differences in IC₅₀ values to be observed for a particular combination of target and inhibitor even when the assays use the same substrate at the same concentration). There will be a follow up post covering how I would assemble data sets for building ML models for biological activity with some thoughts on assessment and curation of published biological activity data. As is usual for blog posts here at Molecular Design, quoted text is indented with my comments enclosed in square brackets in red italics.

In the Compatibility Issues section the authors state:

Looking beyond laboratory-to-laboratory variability of assays that are nominally the same, there are numerous reasons why literature results for different assays measured against the same “target” may not be comparable. These include the following:

Different assay conditions: these can include different buffers, experimental pH, temperature, and duration. [Biochemical assays are usually run at human body temperature (37°C) although assay temperature is not always reported. The term 'duration' is pertinent to irreversible inhibition and one has to be very careful when comparing IC₅₀ values for irreversible inhibitors. It's worth mentioning that a significant reduction in activity when an assay is run in the presence of detergent (see FS2006) is diagnostic of inhibition by colloidal aggregates (see McG2003). I categorized inhibition of this nature as “type 2 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial.]
Substrate identity and concentration: these are particularly relevant for IC₅₀ values from competition assays, where the identity and concentration of the substrate being competed with play an important role in determining the results. K_i measures the binding affinity of a ligand to an enzyme and so its values are, in principle, not sensitive to the identity or concentration of the substrate. [My view is that one would generally need to establish that IC₅₀ values had been determined using the same substrate and same substrate concentration if interpreting variation in the IC₅₀ values as "noise" and it's not clear that the substrate-related information needed to establish the comparability of IC₅₀ determinations is currently stored in ChEMBL. If concentrations and K_m values are known it may be practical to use the Cheng Prusoff equation (see CP1973) to combine IC₅₀ values measured that have been measured using different concentrations of substrate (or cofactor). It's worth noting that enzyme inhibition studies are commonly run with the substrate concentration at its K_m value (see Assay Guidance Manual: Basics of Enzymatic Assays for HTS NBK92007) and there is a good chance that assays against a target using a particular substrate will have been run using very similar concentrations of the substrate. It is important to be specially careful when analysing kinase IC₅₀ data because assays are sometimes run at high ATP concentration in order to simulate intracellular conditions (see GG2021).]
Different assay technologies: since typical biochemical assays do not directly measure ligand–protein binding, the idiosrasies of different assay technologies can lead to different results for the same ligand–protein pair. (7) [Significant differences in IC₅₀ (or K_i) values measured for a particular combination of target and compound using different assay read-outs are indicative of interference and I’ll discuss this point in more detail later in the post.]
Mode of action for receptors: EC₅₀ values can correspond to agonism, antagonism, inverse agonism, etc. [The difficulty here stems from not being able to fully characterize the activity in terms of a concentration response (for example, agonists are characterised by both affinity and efficacy).]

The situation is further complicated when working with databases like ChEMBL, which curate literature data sets:

Different targets: different variants of the same parent protein are assigned the same target ID in ChEMBL [My view is that one needs to be absolutely certain that assays have been performed using identical (including with respect to post-translational modifications) targets before interpreting differences in IC₅₀or K_i values as noise or experimental error.]
Different assay organism or cell types: the target protein may be recombinantly expressed in different cell types (the target ID in ChEMBL is assigned based on the original source of the target), or the assays may be run using different cell types. [There does appear to be some confusion here and it would not generally be valid to valid to assign a ChEMBL target ID to a cell-based assay.]
Any data source can contain human errors like transcription errors or reporting incorrect units. These may be present in the original publication─when the authors report the wrong units or include results from other publications with the wrong units─or introduced during the data extraction process.

The authors describe a number of metrics for quantifying compatibility of pairs of assays in the Methods section of LR2024. My view is that compatibility between assays should be quantified in terms of differences between pIC₅₀ (or pK_i) values and I consider correlation-based metrics to be less useful for this purpose. The degree to which pIC₅₀ values for two assays run against a target are correlated reflects the (random) noise in each assay and the range (more accurately the variance) in the pIC₅₀ values measured for all the compounds in each assay. Let’s consider a couple of scenarios. First, results from two assays are highly correlated but significantly offset from each other to a consistent extent (the assays might, for example, measure IC₅₀ for a particular target using different substrates). Under this scenario it would be valid to include results from both assays in a single analysis (for example, by using the observed offset between pIC₅₀ values as a correction factor) even though it would not be valid to treat the pIC₅₀ values for compounds in the two assays as equivalent. In the second scenario, the correlation between the assays is limited by the narrowness of the range in the IC₅₀ values measured for the compounds in the two assays. Under this scenario, differences between the pIC₅₀ values measured for each compound can still be used to assess the compatibility of the two assays even though the range in the IC₅₀ values is too narrow for a correlation-based metric to be useful.

The compatibility between the two assays was measured by comparing pchembl values of overlapping compounds. [The term pchembl does need to be defined.] In addition to plotting the values, a number of metrics were used to quantify the degree of compatibility between assay pairs:

R²: the coefficient of determination provides a direct measure of how well the “duplicate” values in the two assays agree with each other. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment compatibility of assays in the preceding paragraph.]
Kendall τ: nonparametric measure of how equivalent the rankings of the measurements in the two assays are. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment compatibility of assays in the preceding paragraph.]
f > 0.3: fraction of the pairs where the difference is above the estimated experimental error. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC₅₀ values is greater than the uncertainty in either pIC₅₀ value (an uncertainty of 0.3 in ΔpIC50 would correspond to an uncertainty of 0.2 in each of the IC₅₀ values from which the difference had been calculated.]
f > 1.0: fraction of the pairs where the difference is more than one log unit. This is an arbitrary limit for a truly meaningful activity difference. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC₅₀ values is greater than the uncertainty in either pIC₅₀ value (an uncertainty of 1.0 in ΔpIC₅₀ would correspond to an uncertainty of 0.7 in each of the IC₅₀ values from which the difference had been calculated.]
κbin: Cohen’s κ calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment compatibility of assays in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because the operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
MCCbin: Matthew’s correlation coefficient calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment compatibility of assays in the preceding paragraph. I generally advise against binning continuous prior to assessment of correlations because this operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]

Let’s take a look at some of the results reported in the LR2024 study and it’s interesting that f > 0.3 and f > 1.0 values were comparable for IC₅₀ and K_i measurements. This is an important result since K_i values do not depend on the concentration and K_m of the substrate (or cofactor) and I would generally anticipate greater variation in IC₅₀ values measured for each compound-target pair than for the corresponding K_i values.

We first looked at the variation in the data sets when IC₅₀ assays are combined using “only activity” curation (top panels in Figure 2). The noise level in this case is very high: 64% of the Δpchembl values are greater than 0.3, and 27% are greater than 1.0. The analogous plot for the K_i data sets is shown in Figure S1 in the Supporting Information. The noise level for K_i is comparable: 67% of the Δpchembl values are greater than 0.3, and 30% are greater than 1.0.

I consider it valid to combine data for different pairs of assays for analysis of ΔpIC₅₀ or ΔpK_i values. However, I have significant concerns about the validity of combining data for different pairs of assays for analysis of correlations between pIC₅₀ or pK_i values. The authors of LR2024 state:

In Figure 2 and all similar plots in this study, the points are plotted such that the assay on the x-axis has a higher assay_id (this is the assay key in the SQL database, not the assay ChEMBL ID that is more familiar to users of the ChEMBL web interface) in ChEMBL32 than the assay on the y-axis. Given that assay_ids are assigned sequentially in the ChEMBL database, this means that the x-value of each point is most likely from a more recent publication than the y-value. We do not believe that this fact introduces any significant bias into our analysis.

I see two problems (one minor and one major) in preparing data in this manner for plotting and analysis of correlations over a number of assay pairs. The minor problem is that exchanging assay1 with assay2 for some of the assay pairs will generally result in different values for the correlation-based metrics for compatibility of assays. While I don’t anticipate that the differences would be large the value of a correlation-based metric for assay compatibility really shouldn’t depend on the ordering of the assays. Furthermore, the issue can be resolved by symmetrizing the dataset so that each of the pair of assay results for each compound is included both as the x-value and as the y-value. Symmetrizing the dataset in this manner doubles the number of data points and one would need to be careful if estimating confidence intervals for the correlation-based metrics for assay compatibility. I think that it would be appropriate apply a weight of 0.5 to each data point for estimation of confidence intervals although I would certainly be consulting a statistician before doing this.

However, there is also another problem (which I don't consider to be minor) with combining data for assay pairs when analysing correlations. The value of a correlation-based metric for assay compatibility reflects the variance in ΔpIC₅₀ (or ΔpK_i) values and the variance in the pIC₅₀ (or pK_i) values. The variance in pIC₅₀ (or pK_i) values when different pairs of assays that have been combined would generally be expected to be greater than for the datasets corresponding to the individual assay pairs. Under this scenario I believe that it would be accurate to describe the correlation metrics calculated for the aggregated data as inflated (see KM2013 and the comments made therein on the HMO2016 , LS2007 and LBH2009 studies) and as a reviewer of the manuscript I would have suggested that the distribution over all assay pairs be shown for each correlation-based assay compatibility metric. When considering correlations between assays it can also be helpful, although not strictly correct, to think in terms of ranges in pIC₅₀ values. For example, the range in pIC₅₀ values for “only activity curation” in Figure 2 appears to be about 7 log units (I’d be extremely surprised if the range in pIC₅₀ values for any of the individual assays even approached this figure). My view is that correlation-based metrics are not meaningful when data for multiple pairs of assays have been combined although I don't think any real harm has been done given that the authors certainly weren't trying to 'talk up' strengths of trends on the basis of the values of the correlation-based metrics. However, there is a scenario under which this type of correlation inflation would be a much bigger problem and that would be when using measures of correlation to compare measured ΔG values with values that had been calculated by free energy perturbation using different reference compounds.

So far in the post the focus has been on the analysis presented in LR2024 and now I’ll change direction by discussing a couple of topics that were not covered in that study. I’ll start by looking at irreversible mechanisms of action and the (S2017 | McW2021 | H12024) articles cover irreversible covalent inhibition (this is the irreversible mechanism of action that ChEMBL users are most likely encounter). You need two parameters to characterize irreversible covalent inhibition (K_i and k_inact respectively quantify the affinity of the ligand for target and the rate at which the non-covalently bound ligand becomes covalently bound to target). While it is common to encounter IC₅₀ values in the literature for irreversible covalent inhibitors these are not true concentration responses because the IC₅₀ values also depend on factors such as pre-incubation time. Another difficulty is that articles reporting IC₅₀ values for irreversible covalent inhibitors don’t always explicitly state that the inhibition is irreversible.

As the authors of LR2024 correctly note differences between IC₅₀ values may be the result of using different assay technologies. Interference with assay read-out (I categorized this as “type 1 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial) should always be considered as a potential explanation for significant differences between IC₅₀ values measured for a given combination of target and inhibitor when different assay technologies are used. An article that I recommend for learning more about this problem is SWK2009 which explains how UV/Vis absorption and fluorescence by inhibitors can cause interference with assay read-outs (the study also shows how interference can be assessed and even corrected for). When examining differences between IC₅₀ values for the same combination of target and inhibitor it's worth bearing in mind that interference with assay read-outs tends to be more of an issue at high concentration (this is why biophysical assays tend to be favored for screening fragments). From the data analysis perspective, it’s usually safe to assume that enzyme inhibition assays using the same substrate also use the same type of assay read-out.

Differences in the technology used to prepare the solutions for assays is another potential cause of variation in IC₅₀ values. For example, a 2010 AstraZeneca patent (US7718653B2) disclosed significant differences in IC₅₀ values depending on whether acoustic dispensing or serial dilution was used for preparation of solutions for assay. Compounds were observed to be more potent when acoustic dispensing was used and the differences in IC₅₀ values point to an aqueous solubility issue. The data in US7718653B2 formed the basis for the EOW2013 study.

So that brings us to the end of my review of the LR2024 study and I’ll be doing a follow up post later in the year. One big difficulty in analysing differences between measured quantities is determining the extent to which measured values are directly comparable when IC₅₀ can be influenced by factors such as the technology used to prepare assay solutions. Something that I think would have been worth investigating is the extent to which variability of measured values depends on potency (pIC₅₀ measurements might be inherently more variable for less potent inhibitors than for highly potent inhibitors). The most serious criticism that I would make of LR2024 is it is not meaningful to combine data for different pairs of assays when calculating correlation-based measures of assay compatibility.

Molecular Design

Wednesday, 18 September 2024

Variability in biological activity measurements reported in the drug discovery literature