Wednesday 18 September 2024

Variability in biological activity measurements reported in the drug discovery literature

I'll open the post with a panorama from the summit of Shutlingsloe, sometimes referred to as Cheshire's Matterhorn, which, at 506 m above sea level, is the third highest point in the county. When in the UK, I usually come here to mark the solstices, and there's usually a good crowd for the occasion (the winter solstice tends to be less well attended).

  

The LR2024 study (Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise) that I’ll be discussing in this post highlights one of the issues that you’re likely to encounter should you be using public domain databases such as ChEMBL to create datasets for building machine learning (ML) models for biological activity. The LR2024 study has already been reviewed in a Practical Fragments post (The limits of published data) and, using the same reference numbers as were used in the study, I’ll also mention 10 (The Experimental Uncertainty of Heterogeneous Public Ki Data) and 11 (Comparability of Mixed IC50 Data – A Statistical Analysis). The variability in biological activity data highlighted by LR2024 stems in part from the fact that the term IC50 may refer to different quantities even when measurements are performed for the same target and inhibitor/ligand (the issue doesn’t entirely disappear when you use Ki values). I have two general concerns with the analysis in the LR2024 study. First, it is unclear whether the ChEMBL curation process captures assay conditions in sufficient detail to enable the user to establish that two IC50 values can be regarded as replicates of the same experiment (I stress that this is not a criticism of the curation process). Second, combining data for different pairs of assays for calculation of correlation-based measures of assay compatibility can lead to correlation inflation. One minor gripe that I do have with the LR2024 study concerns the use of the term “noise” which, in my view, should only refer to variation in values measured under identical conditions.

I'll review LR2024 in the first part of the post before discussing points not covered by the study, such as irreversible inhibition and assay interference (these can cause systematic differences in IC50 values to be observed for a particular combination of target and inhibitor even when the assays use the same substrate at the same concentration). There will be a follow-up post covering how I would assemble data sets for building ML models for biological activity, with some thoughts on assessment and curation of published biological activity data. As is usual for blog posts here at Molecular Design, quoted text is indented with my comments enclosed in square brackets in red italics.

In the Compatibility Issues section the authors state:

Looking beyond laboratory-to-laboratory variability of assays that are nominally the same, there are numerous reasons why literature results for different assays measured against the same “target” may not be comparable. These include the following:

  1. Different assay conditions: these can include different buffers, experimental pH, temperature, and duration. [Biochemical assays are usually run at human body temperature (37°C) although assay temperature is not always reported. The term 'duration' is pertinent to irreversible inhibition and one has to be very careful when comparing IC50 values for irreversible inhibitors. It's worth mentioning that a significant reduction in activity when an assay is run in the presence of detergent (see FS2006) is diagnostic of inhibition by colloidal aggregates (see McG2003). I categorized inhibition of this nature as “type 2 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial.] 
  2. Substrate identity and concentration: these are particularly relevant for IC50 values from competition assays, where the identity and concentration of the substrate being competed with play an important role in determining the results. Ki measures the binding affinity of a ligand to an enzyme and so its values are, in principle, not sensitive to the identity or concentration of the substrate. [My view is that one would generally need to establish that IC50 values had been determined using the same substrate and same substrate concentration if interpreting variation in the IC50 values as "noise" and it's not clear that the substrate-related information needed to establish the comparability of IC50 determinations is currently stored in ChEMBL. If concentrations and Km values are known it may be practical to use the Cheng-Prusoff equation (see CP1973) to combine IC50 values that have been measured using different concentrations of substrate (or cofactor). It's worth noting that enzyme inhibition studies are commonly run with the substrate concentration at its Km value (see Assay Guidance Manual: Basics of Enzymatic Assays for HTS NBK92007) and there is a good chance that assays against a target using a particular substrate will have been run using very similar concentrations of the substrate. It is important to be especially careful when analysing kinase IC50 data because assays are sometimes run at high ATP concentration in order to simulate intracellular conditions (see GG2021).]
  3. Different assay technologies: since typical biochemical assays do not directly measure ligand–protein binding, the idiosyncrasies of different assay technologies can lead to different results for the same ligand–protein pair. (7) [Significant differences in IC50 (or Ki) values measured for a particular combination of target and compound using different assay read-outs are indicative of interference and I’ll discuss this point in more detail later in the post.]
  4. Mode of action for receptors: EC50 values can correspond to agonism, antagonism, inverse agonism, etc.  [The difficulty here stems from not being able to fully characterize the activity in terms of a concentration response (for example, agonists are characterised by both affinity and efficacy).]
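To make the Cheng-Prusoff point in my comment on item 2 concrete, here is a minimal sketch that assumes simple competitive inhibition, for which Ki = IC50 / (1 + [S]/Km). The function names, the Km value and the two IC50 values are hypothetical and chosen purely for illustration (they are not taken from LR2024 or ChEMBL).

```python
import math


def ki_from_ic50(ic50_molar, substrate_conc_molar, km_molar):
    """Cheng-Prusoff conversion, ASSUMING competitive inhibition:
    Ki = IC50 / (1 + [S]/Km). Other mechanisms need other formulas."""
    if ic50_molar <= 0 or km_molar <= 0 or substrate_conc_molar < 0:
        raise ValueError("concentrations must be positive ([S] may be zero)")
    return ic50_molar / (1.0 + substrate_conc_molar / km_molar)


def p_value(conc_molar):
    """Negative log10 of a molar concentration (pIC50 or pKi)."""
    return -math.log10(conc_molar)


# Hypothetical inhibitor measured in two assays that differ only in [S].
KM = 10e-6                      # illustrative Km of 10 uM
ic50_assay_a = 100e-9           # measured with [S] = Km
ic50_assay_b = 550e-9           # measured with [S] = 10 x Km

ki_a = ki_from_ic50(ic50_assay_a, 10e-6, KM)
ki_b = ki_from_ic50(ic50_assay_b, 100e-6, KM)

# The pIC50 values disagree by roughly 0.7 log units ...
delta_pic50 = p_value(ic50_assay_a) - p_value(ic50_assay_b)
# ... while the derived pKi values agree.
delta_pki = p_value(ki_a) - p_value(ki_b)

print(f"delta pIC50 = {delta_pic50:.2f}, delta pKi = {delta_pki:.2f}")
```

In this made-up case a difference of about 0.7 log units between the IC50 values is entirely accounted for by the substrate concentration, which is why I wouldn't describe it as noise.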

The situation is further complicated when working with databases like ChEMBL, which curate literature data sets:

  1. Different targets: different variants of the same parent protein are assigned the same target ID in ChEMBL [My view is that one needs to be absolutely certain that assays have been performed using identical (including with respect to post-translational modifications) targets before interpreting differences in IC50 or Ki values as noise or experimental error.] 
  2. Different assay organism or cell types: the target protein may be recombinantly expressed in different cell types (the target ID in ChEMBL is assigned based on the original source of the target), or the assays may be run using different cell types. [There does appear to be some confusion here and it would not generally be valid to assign a ChEMBL target ID to a cell-based assay.]
  3. Any data source can contain human errors like transcription errors or reporting incorrect units. These may be present in the original publication─when the authors report the wrong units or include results from other publications with the wrong units─or introduced during the data extraction process.

The authors describe a number of metrics for quantifying compatibility of pairs of assays in the Methods section of LR2024.  My view is that compatibility between assays should be quantified in terms of differences between pIC50 (or pKi) values and I consider correlation-based metrics to be less useful for this purpose. The degree to which pIC50 values for two assays run against a target are correlated reflects the (random) noise in each assay and the range (more accurately the variance) in the pIC50 values measured for all the compounds in each assay.  Let’s consider a couple of scenarios.  First, results from two assays are highly correlated but significantly offset from each other to a consistent extent (the assays might, for example, measure IC50 for a particular target using different substrates). Under this scenario it would be valid to include results from both assays in a single analysis (for example, by using the observed offset between pIC50 values as a correction factor) even though it would not be valid to treat the pIC50 values for compounds in the two assays as equivalent. In the second scenario, the correlation between the assays is limited by the narrowness of the range in the IC50 values measured for the compounds in the two assays. Under this scenario, differences between the pIC50 values measured for each compound can still be used to assess the compatibility of the two assays even though the range in the IC50 values is too narrow for a correlation-based metric to be useful. 
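The two scenarios can be illustrated with a small sketch using made-up pIC50 values (none of these numbers come from LR2024). The first pair of assays is highly correlated but offset by about 0.8 log units, while the second pair agrees to within 0.2 log units yet shows only a weak correlation because the range in pIC50 is so narrow.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_and_max_abs_diff(x, y):
    """Mean of (y - x) and the largest absolute difference."""
    diffs = [b - a for a, b in zip(x, y)]
    return sum(diffs) / len(diffs), max(abs(d) for d in diffs)


# Scenario 1 (hypothetical): wide range, consistent offset between assays.
offset_a = [5.0, 6.0, 7.0, 8.0, 9.0]
offset_b = [5.9, 6.7, 7.8, 8.9, 9.7]

# Scenario 2 (hypothetical): good agreement, but a very narrow pIC50 range.
narrow_a = [7.0, 7.1, 7.2, 7.3, 7.2, 7.1]
narrow_b = [7.2, 7.0, 7.1, 7.2, 7.3, 7.2]

r_offset = pearson_r(offset_a, offset_b)
mean_offset, _ = mean_and_max_abs_diff(offset_a, offset_b)

r_narrow = pearson_r(narrow_a, narrow_b)
_, max_narrow = mean_and_max_abs_diff(narrow_a, narrow_b)

print(f"scenario 1: r = {r_offset:.3f}, mean offset = {mean_offset:.2f}")
print(f"scenario 2: r = {r_narrow:.3f}, max |diff| = {max_narrow:.2f}")
```

A correlation-based metric would rate the first pair as compatible and the second as incompatible, which is the opposite of what the differences between the pIC50 values are telling us about treating the measurements as equivalent.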

The compatibility between the two assays was measured by comparing pchembl values of overlapping compounds. [The term pchembl does need to be defined.] In addition to plotting the values, a number of metrics were used to quantify the degree of compatibility between assay pairs:

  • R2: the coefficient of determination provides a direct measure of how well the “duplicate” values in the two assays agree with each other. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of assay compatibility in the preceding paragraph.]
  • Kendall τ: nonparametric measure of how equivalent the rankings of the measurements in the two assays are. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of assay compatibility in the preceding paragraph.]
  • f > 0.3: fraction of the pairs where the difference is above the estimated experimental error. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC50 values is greater than the uncertainty in either pIC50 value (an uncertainty of 0.3 in ΔpIC50 would correspond to an uncertainty of about 0.2 in each of the pIC50 values from which the difference had been calculated).]
  • f > 1.0: fraction of the pairs where the difference is more than one log unit. This is an arbitrary limit for a truly meaningful activity difference. Smaller values correspond to higher compatibility. [The uncertainty in the difference between two pIC50 values is greater than the uncertainty in either pIC50 value (an uncertainty of 1.0 in ΔpIC50 would correspond to an uncertainty of about 0.7 in each of the pIC50 values from which the difference had been calculated).]
  • κbin: Cohen’s κ calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of assay compatibility in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because the operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
  • MCCbin: Matthew’s correlation coefficient calculated between the assays after binning their results into active and inactive using bin as the activity threshold. Values range from −1.0 to 1.0 with larger values corresponding to higher compatibility. [I’ve discussed limitations of correlation-based metrics for assessment of assay compatibility in the preceding paragraph. I generally advise against binning continuous data prior to assessment of correlations because this operation discards information and the values of the correlation metrics vary with the scheme used to bin the data.]
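The difference-based metrics in the list above (f > 0.3 and f > 1.0) are simple to compute, and the following sketch shows one way to do so along with the conversion between the uncertainty in a difference and the uncertainty in each measurement. The conversion assumes that the two measurements have independent errors of equal size (so the standard deviation of the difference is √2 times that of a single measurement). The pIC50 values and function names are hypothetical.

```python
import math


def fraction_above(deltas, threshold):
    """Fraction of differences whose absolute value exceeds the threshold."""
    return sum(1 for d in deltas if abs(d) > threshold) / len(deltas)


def per_measurement_sigma(sigma_delta):
    """Uncertainty in each value implied by an uncertainty in the difference,
    ASSUMING independent errors of equal size: sigma_delta = sqrt(2) * sigma."""
    return sigma_delta / math.sqrt(2.0)


# Hypothetical pIC50 values for five compounds measured in two assays.
assay_1 = [6.1, 7.4, 5.2, 8.0, 6.6]
assay_2 = [6.3, 6.9, 5.3, 6.8, 6.5]
deltas = [a - b for a, b in zip(assay_1, assay_2)]

f_03 = fraction_above(deltas, 0.3)   # two of the five differences exceed 0.3
f_10 = fraction_above(deltas, 1.0)   # one of the five differences exceeds 1.0

print(f"f > 0.3: {f_03:.2f}, f > 1.0: {f_10:.2f}")
print(f"sigma per value for sigma_delta = 0.3: {per_measurement_sigma(0.3):.2f}")
print(f"sigma per value for sigma_delta = 1.0: {per_measurement_sigma(1.0):.2f}")
```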

Let’s take a look at some of the results reported in the LR2024 study. It’s interesting that the f > 0.3 and f > 1.0 values were comparable for IC50 and Ki measurements. This is an important result since Ki values do not depend on the concentration and Km of the substrate (or cofactor) and I would generally anticipate greater variation in IC50 values measured for each compound-target pair than for the corresponding Ki values.

We first looked at the variation in the data sets when IC50 assays are combined using “only activity” curation (top panels in Figure 2). The noise level in this case is very high: 64% of the Δpchembl values are greater than 0.3, and 27% are greater than 1.0. The analogous plot for the Ki data sets is shown in Figure S1 in the Supporting Information. The noise level for Ki is comparable: 67% of the Δpchembl values are greater than 0.3, and 30% are greater than 1.0.

I consider it valid to combine data for different pairs of assays for analysis of ΔpIC50 or ΔpKi values. However, I have significant concerns about the validity of combining data for different pairs of assays for analysis of correlations between pIC50 or pKi values. The authors of LR2024 state:  

In Figure 2 and all similar plots in this study, the points are plotted such that the assay on the x-axis has a higher assay_id (this is the assay key in the SQL database, not the assay ChEMBL ID that is more familiar to users of the ChEMBL web interface) in ChEMBL32 than the assay on the y-axis. Given that assay_ids are assigned sequentially in the ChEMBL database, this means that the x-value of each point is most likely from a more recent publication than the y-value. We do not believe that this fact introduces any significant bias into our analysis.

I see two problems (one minor and one major) in preparing data in this manner for plotting and analysis of correlations over a number of assay pairs. The minor problem is that exchanging assay1 with assay2 for some of the assay pairs will generally result in different values for the correlation-based metrics for compatibility of assays. While I don’t anticipate that the differences would be large, the value of a correlation-based metric for assay compatibility really shouldn’t depend on the ordering of the assays. Furthermore, the issue can be resolved by symmetrizing the dataset so that each of the pair of assay results for each compound is included both as the x-value and as the y-value. Symmetrizing the dataset in this manner doubles the number of data points and one would need to be careful if estimating confidence intervals for the correlation-based metrics for assay compatibility. I think that it would be appropriate to apply a weight of 0.5 to each data point for estimation of confidence intervals although I would certainly be consulting a statistician before doing this.
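Here is a minimal sketch of the ordering problem and of symmetrization. It assumes that R2 is calculated as the coefficient of determination, 1 − SS_res/SS_tot, with one assay treated as the reference (this is my assumption for illustration and I have not checked it against the LR2024 code). The pIC50 values are made up.

```python
def r_squared(reference, other):
    """Coefficient of determination with `reference` treated as the 'true'
    values: 1 - SS_res / SS_tot. Not symmetric in its two arguments."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - o) ** 2 for r, o in zip(reference, other))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


def symmetrize(x, y):
    """Include every pair of results both as (x, y) and as (y, x)."""
    return x + y, y + x


# Hypothetical pIC50 values for four compounds in two assays.
assay_x = [5.0, 6.0, 7.0, 8.0]
assay_y = [5.5, 6.0, 6.5, 7.0]

r2_x_ref = r_squared(assay_x, assay_y)   # about 0.7
r2_y_ref = r_squared(assay_y, assay_x)   # about -0.2

sym_x, sym_y = symmetrize(assay_x, assay_y)
r2_sym_xy = r_squared(sym_x, sym_y)
r2_sym_yx = r_squared(sym_y, sym_x)      # same value as r2_sym_xy

print(f"x as reference: {r2_x_ref:.2f}, y as reference: {r2_y_ref:.2f}")
print(f"symmetrized: {r2_sym_xy:.3f} and {r2_sym_yx:.3f} ({len(sym_x)} points)")
```

The symmetrized data set has eight points for four compounds, which is the doubling that needs to be accounted for when estimating confidence intervals.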

However, there is also another problem (which I don't consider to be minor) with combining data for assay pairs when analysing correlations. The value of a correlation-based metric for assay compatibility reflects the variance in ΔpIC50 (or ΔpKi) values and the variance in the pIC50 (or pKi) values. The variance in pIC50 (or pKi) values when different pairs of assays have been combined would generally be expected to be greater than for the datasets corresponding to the individual assay pairs. Under this scenario I believe that it would be accurate to describe the correlation metrics calculated for the aggregated data as inflated (see KM2013 and the comments made therein on the HMO2016, LS2007 and LBH2009 studies) and as a reviewer of the manuscript I would have suggested that the distribution over all assay pairs be shown for each correlation-based assay compatibility metric. When considering correlations between assays it can also be helpful, although not strictly correct, to think in terms of ranges in pIC50 values. For example, the range in pIC50 values for “only activity curation” in Figure 2 appears to be about 7 log units (I’d be extremely surprised if the range in pIC50 values for any of the individual assays even approached this figure). My view is that correlation-based metrics are not meaningful when data for multiple pairs of assays have been combined although I don't think any real harm has been done given that the authors certainly weren't trying to 'talk up' strengths of trends on the basis of the values of the correlation-based metrics. However, there is a scenario under which this type of correlation inflation would be a much bigger problem and that would be when using measures of correlation to compare measured ΔG values with values that had been calculated by free energy perturbation using different reference compounds.
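Correlation inflation of this type is easy to reproduce with made-up numbers. In the sketch below, three hypothetical assay pairs each cover a narrow pIC50 range (centred near 5, 7 and 9) and each shows only a weak within-pair correlation, yet the correlation calculated for the pooled data is close to unity. None of the values are taken from LR2024.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# One hypothetical assay pair with a narrow pIC50 range ...
within_x = [7.0, 7.1, 7.2, 7.3, 7.2, 7.1]
within_y = [7.2, 7.0, 7.1, 7.2, 7.3, 7.2]

# ... replicated at three potency levels to mimic different assay pairs.
shifts = [-2.0, 0.0, 2.0]
pairs = [([v + s for v in within_x], [v + s for v in within_y]) for s in shifts]

per_pair_r = [pearson_r(x, y) for x, y in pairs]

# Pool all assay pairs, as when data for many pairs are analysed together.
pooled_x = [v for x, _ in pairs for v in x]
pooled_y = [v for _, y in pairs for v in y]
pooled_r = pearson_r(pooled_x, pooled_y)

print("per-pair r:", [round(r, 2) for r in per_pair_r])
print(f"pooled r: {pooled_r:.3f}")
```

The pooled value says almost nothing about how well any individual pair of assays agrees, which is why I would rather see the distribution of the metric over the assay pairs.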

So far in the post the focus has been on the analysis presented in LR2024 and now I’ll change direction by discussing a couple of topics that were not covered in that study. I’ll start by looking at irreversible mechanisms of action and the (S2017 | McW2021 | H12024) articles cover irreversible covalent inhibition (this is the irreversible mechanism of action that ChEMBL users are most likely to encounter). You need two parameters to characterize irreversible covalent inhibition (Ki and kinact respectively quantify the affinity of the ligand for target and the rate at which the non-covalently bound ligand becomes covalently bound to target). While it is common to encounter IC50 values in the literature for irreversible covalent inhibitors these are not true concentration responses because the IC50 values also depend on factors such as pre-incubation time. Another difficulty is that articles reporting IC50 values for irreversible covalent inhibitors don’t always explicitly state that the inhibition is irreversible.
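The dependence of IC50 on pre-incubation time can be illustrated with a deliberately simplified sketch. It assumes that target is pre-incubated with a large excess of inhibitor in the absence of substrate, that inactivation follows k_obs = kinact·[I]/(Ki + [I]), and that the read-out reports only the fraction of target that has not been covalently modified (competition with substrate and reversible occupancy during the read-out are both ignored). Under these assumptions the concentration giving 50% inactivation after time t is Ki·ln2/(kinact·t − ln2). The parameter values are hypothetical.

```python
import math


def remaining_activity(inhibitor_conc, preincubation_s, k_inact, k_i):
    """Fraction of target not yet covalently modified after pre-incubation,
    ASSUMING k_obs = k_inact * [I] / (K_I + [I]) and excess inhibitor."""
    k_obs = k_inact * inhibitor_conc / (k_i + inhibitor_conc)
    return math.exp(-k_obs * preincubation_s)


def apparent_ic50(preincubation_s, k_inact, k_i):
    """Concentration giving 50% inactivation in the simplified model above."""
    denom = k_inact * preincubation_s - math.log(2.0)
    if denom <= 0:
        # Even saturating inhibitor cannot inactivate half the target in time.
        raise ValueError("50% inactivation is not reached at this time")
    return k_i * math.log(2.0) / denom


K_I = 1.0e-6      # hypothetical affinity of the non-covalent complex (M)
K_INACT = 0.01    # hypothetical inactivation rate constant (1/s)

ic50_5min = apparent_ic50(300.0, K_INACT, K_I)
ic50_60min = apparent_ic50(3600.0, K_INACT, K_I)
pic50_shift = math.log10(ic50_5min / ic50_60min)

print(f"apparent pIC50 after 5 min:  {-math.log10(ic50_5min):.2f}")
print(f"apparent pIC50 after 60 min: {-math.log10(ic50_60min):.2f}")
```

With these made-up parameters the same compound appears more than ten-fold more potent after 60 minutes than after 5 minutes, so two such IC50 values can't be treated as replicates.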

As the authors of LR2024 correctly note, differences between IC50 values may be the result of using different assay technologies. Interference with assay read-out (I categorized this as “type 1 behaviour” in a Comment on "The Ecstasy and Agony of Assay Interference Compounds" Editorial) should always be considered as a potential explanation for significant differences between IC50 values measured for a given combination of target and inhibitor when different assay technologies are used. An article that I recommend for learning more about this problem is SWK2009 which explains how UV/Vis absorption and fluorescence by inhibitors can cause interference with assay read-outs (the study also shows how interference can be assessed and even corrected for). When examining differences between IC50 values for the same combination of target and inhibitor it's worth bearing in mind that interference with assay read-outs tends to be more of an issue at high concentration (this is why biophysical assays tend to be favored for screening fragments). From the data analysis perspective, it’s usually safe to assume that enzyme inhibition assays using the same substrate also use the same type of assay read-out.

Differences in the technology used to prepare the solutions for assays are another potential cause of variation in IC50 values. For example, a 2010 AstraZeneca patent (US7718653B2) disclosed significant differences in IC50 values depending on whether acoustic dispensing or serial dilution was used for preparation of solutions for assay. Compounds were observed to be more potent when acoustic dispensing was used and the differences in IC50 values point to an aqueous solubility issue. The data in US7718653B2 formed the basis for the EOW2013 study.

So that brings us to the end of my review of the LR2024 study and I’ll be doing a follow-up post later in the year. One big difficulty in analysing differences between measured quantities is determining the extent to which measured values are directly comparable when IC50 can be influenced by factors such as the technology used to prepare assay solutions. Something that I think would have been worth investigating is the extent to which variability of measured values depends on potency (pIC50 measurements might be inherently more variable for less potent inhibitors than for highly potent inhibitors). The most serious criticism that I would make of LR2024 is that it is not meaningful to combine data for different pairs of assays when calculating correlation-based measures of assay compatibility.