Monday, 22 June 2026

The OpenBind initiative

I’ll open the post on the OpenBind initiative with photos from my visit last year to Korea which was timed to coincide with the cherry blossoms (this meant that the customary April Fools post was from Seoul). Things did not start well on the day that I took these photos (having lined up the first shot for the day it became abundantly clear that the camera’s battery was still being charged at the hotel) and I wondered whether Great Leader’s grandson might have labelled me as a dotard. Fortunately, Seoul’s Metro is excellent and I was still able to get some photos at Huiujeong-ro Cherry Blossom Road and Yangjaecheon Stream.









In this post I’ll be taking a look at the OpenBind initiative and here's a summary of the concept. I certainly see great value to the drug discovery and chemical biology communities in having large quantities of this type of data (affinity measurements with X-ray crystal structures for the corresponding protein-ligand complexes). The grating-coupled interferometry (GCI) protocol used for affinity measurement enables association and dissociation to be observed in real time and presumably it is also possible to characterize stoichiometry. Generation of data for training machine learning (ML) models, which are renowned for their voracious appetite for data, appears to be the principal aim of the initiative. However, the availability of large quantities of such data will also enable more extensive evaluation of physics-based methods for calculating binding affinity and can potentially inform hypothesis-driven design by identifying bioisosteric relationships between elements of substructure. One point worth making is that having affinity measurements linked to protein-ligand structures for structurally-related compounds of varying molecular complexity (see HLH2001) enables frustration of molecular interactions to be studied (this is particularly relevant to fragment-based design) and I discussed in HBD3 how frustration of hydration might be exploited in design. Given the importance of aqueous solvation in biomolecular recognition there might be benefits to being able to measure alkane/water partition coefficients and I'll point you to a post on this topic in case it's of interest.

I've suggested that there are three objectives to drug design and the OpenBind initiative addresses the first of these which is to maximize on-target bioactivity. It's worth noting that proteins are not the only drug target class of interest (see CD2022) while bioactivity for ‘new modalities’ such as targeted protein degradation (see CC2026) and exploitation of irreversible covalent bond formation between targets and ligands cannot be quantified in terms of affinity alone. My view is that OpenBind would be more accurately described as an initiative for ligand discovery than for drug discovery given its focus on enabling methods for affinity prediction.  Modern ML models for affinity prediction are effectively quantitative structure-activity relationship (QSAR) models and I would question whether the use of the AI label is justified in either case. All that said, I would expect OpenBind to catalyse significant progress in the affinity prediction field which hopefully will translate to tangible benefits for drug discovery.

It’s perhaps appropriate to take a general look at QSAR approaches given that the main focus of OpenBind appears to be generation of data for training what could be referred to as 'QSAR-like' ML models. In my view, QSAR modelling never made much of a splash in real world drug discovery and claims that particular models have made significant impact on drug discovery projects are generally not verifiable. A difficulty faced by QSAR practitioners was that projects had delivered or been put out of their misery by the time there was sufficient data for building predictively useful models. Medicinal chemists typically perform their optimizations within specific structural series and this means that structure-activity relationships (SARs) tend to be local in nature (I’m not aware of any studies in which a QSAR model built using only data from one structural series was convincingly shown to be usefully predictive of bioactivity for compounds in a different structural series). For users of ML bioactivity models it is important to know whether chemical structures for which predictions are being made lie within the applicability domains of the models. The selection criteria for inclusion of targets and ligands by the OpenBind initiative are not currently clear.   

The availability of affinity measurements that are linked to X-ray crystal structures for the corresponding protein-ligand complexes enables affinity to be modelled in terms of the molecular interactions between proteins and their ligands. This is the approach used to create the scoring functions used in virtual screening and it provides a means to address the local nature of SARs. While this might seem to be an obvious way to model affinity data it's important to be aware that the contribution to affinity of an individual contact, such as a hydrogen bond, between the protein and ligand is not an experimental observable (see NoLE). Put another way, there is no unique way of decomposing a value of ΔG° (standard Gibbs free energy of binding) into a sum of terms based on individual noncovalent contacts between the protein and ligand. One reason for this is that association of proteins with their ligands occurs in aqueous media and this point has been clearly articulated in the S2012 study:

Molecular binding in an aqueous solvent can be usefully viewed not as an association reaction, in which only new intermolecular interactions are introduced between receptor and ligand, but rather as an exchange reaction in which some receptor–solvent and ligand–solvent interactions present in the unbound state are lost to accommodate the gain of receptor–ligand interactions in the bound complex.

However, there’s another reason why there’s no unique way to decompose binding free energy into a sum of terms based on individual noncovalent contacts and here’s a well-known equation written a bit differently to how you normally see it written:

ΔG° = RT ln(Kd / C°)

This shows that the value of ΔG° varies with the concentration, C°, that defines the standard state (R is the gas constant and T is the absolute temperature). By convention C° is set to 1 M although this is arbitrary and has no physical basis (see G1997) and this means that the binding free energy values encountered by drug discovery scientists are always negative (consider the feasibility of measuring a Kd value of greater than 1 M). Writing ΔG° as a sum of terms based on individual noncovalent contacts is challenging because each term needs to depend on C° while the sum of terms needs to reproduce the dependence of ΔG° on C°. This is discussed in NoLE and the problems can be seen more easily if you think about how you might write Kd as a product of terms based on individual noncovalent contacts.
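To make the dependence on the standard state concrete, here is a minimal Python sketch that assumes the equation referred to above is the familiar ΔG° = RT ln(Kd/C°). The function name, the 10 nM example and the alternative standard concentrations are my own choices for illustration.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def standard_binding_free_energy(kd_molar, c_standard_molar=1.0, temperature_k=298.15):
    """Return dG° in kJ/mol, assuming dG° = RT ln(Kd / C°)."""
    if kd_molar <= 0 or c_standard_molar <= 0:
        raise ValueError("Kd and C° must both be positive concentrations")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar / c_standard_molar)


# Illustrative only: a 10 nM ligand evaluated against three standard states.
kd = 1e-8
for c_standard in (1.0, 1e-3, 1e-9):
    dg = standard_binding_free_energy(kd, c_standard)
    print(f"C° = {c_standard:g} M  ->  dG° = {dg:.2f} kJ/mol")
```

With the conventional 1 M standard state the 10 nM ligand comes out at about -45.7 kJ/mol, but choosing a standard concentration below Kd (1 nM in the last case) makes the 'binding free energy' positive for exactly the same measurement. Changing C° shifts every ΔG° value by the same amount, RT ln of the ratio of the two standard concentrations, regardless of Kd.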

My understanding is that scoring functions (see GPD2018 | WBS2017 | A2015 | C2012 | S2012 | F2004 | SR2001 | GHK2000 | MM1999 | E1997 | MSK1992) used in virtual screening are generally not predictive of affinity to the extent that they can be routinely used in lead optimization. Perhaps it will be different for Boltz-2 (described in the P2025 preprint) although questions have been raised in BSR2026 as to whether Boltz-2 "truly relies on the physics of intermolecular interactions" (the term “absolute FEP” rings some alarm bells for me). Various explanations have been offered for the typically underwhelming performance of scoring functions for affinity prediction including the usual suspects (protein flexibility, solvation and entropy). However, a much simpler explanation might be that scoring functions are trained to predict the difference in free energy between two states by only using the structure corresponding to one of the states.

I remain sceptical that it will prove feasible to build genuinely universal models for prediction of binding affinity from structures of protein-ligand complexes although I'll be very happy if my scepticism is shown to be unfounded. Describing energetics of target-ligand interactions in a general manner to enable ML modelling of affinity will be challenging because of the necessity to encode factors such as interaction potential, geometric dependence and solvent exposure (bear in mind that physics-based methods for prediction of affinity are already available and I'll direct readers to the Open Free Energy and open forcefield initiatives). While modelling affinity in terms of molecular interactions circumvents the need for training data to sample every conceivable combination of structural series with target the issue of defining applicability domains does not disappear.  My view is that when affinity datasets for different targets are combined for ML modelling, data should be split at the target level for cross-validation. This would entail splitting data so that each test set consists of only (and all) the data for a single target. 
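The target-level split suggested above can be sketched in a few lines of Python. This is only an illustration: the record layout, the target names and the function name are hypothetical, and a real workflow would probably use an established cross-validation library.

```python
from collections import defaultdict


def leave_one_target_out(records, target_key="target"):
    """Yield (held_out_target, train_rows, test_rows), one fold per target.

    Each test set contains all of the rows for exactly one target and
    none of those rows appear in the corresponding training set.
    """
    by_target = defaultdict(list)
    for row in records:
        by_target[row[target_key]].append(row)
    for held_out in sorted(by_target):
        test_rows = by_target[held_out]
        train_rows = [
            row
            for target, rows in by_target.items()
            if target != held_out
            for row in rows
        ]
        yield held_out, train_rows, test_rows


# Hypothetical affinity records (target names and pKd values are made up).
records = [
    {"target": "target_A", "ligand": "L1", "pKd": 6.2},
    {"target": "target_A", "ligand": "L2", "pKd": 7.1},
    {"target": "target_B", "ligand": "L1", "pKd": 5.4},
    {"target": "target_B", "ligand": "L3", "pKd": 8.0},
    {"target": "target_C", "ligand": "L4", "pKd": 6.9},
]

for held_out, train_rows, test_rows in leave_one_target_out(records):
    print(held_out, "train:", len(train_rows), "test:", len(test_rows))
```

A split of this kind asks whether a model can predict affinity for a target that it has never seen, which is a harder (and arguably more honest) test than a random split in which every target contributes to both training and test sets.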

Datasets generated by OpenBind are likely to also prove valuable for testing and development of physics-based approaches to affinity prediction such as use of simulation to calculate ‘absolute’ (ΔG°) and ‘relative’ (ΔΔG) free energy of binding. Physics-based free energy calculations are typically more computationally demanding for ΔG° than for ΔΔG (a view expressed in B2009 is that it is generally easier to predict differences in property values for pairs of structurally-related compounds than it is to predict property values from chemical structures of compounds). Methods for calculating ΔΔG (here’s a helpful review) are especially relevant to drug design because medicinal chemists typically work within structural series, defining SARs in terms of ratios of affinity (or potency) for pairs of structurally-related compounds. Put another way, ΔΔG calculations enable project team scientists to exploit existing project data to predict affinity for potential synthetic candidates and I would argue that ML modellers really do need to be thinking more about prediction of differences in affinity (and other pharmaceutically relevant properties) between structurally related compounds. As an aside, free energy perturbation (FEP) was a major source of inspiration when I started to use the Leatherface (don't ask  😁😁😁) chemical structure editing software to do matched molecular pair analysis (MMPA) in the late 1990s even though physics-based ΔΔG calculations were still largely seen as academic curiosities at that time.
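The link between ratios of affinity and ΔΔG can be shown with a short Python sketch, assuming that ΔΔG = RT ln(Kd,modified / Kd,reference) for a pair of structurally-related compounds binding to the same target. The function name and the Kd values are hypothetical.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def relative_binding_free_energy(kd_reference, kd_modified, temperature_k=298.15):
    """Return ddG in kJ/mol for the transformation reference -> modified.

    Assumes ddG = RT ln(Kd_modified / Kd_reference). Both Kd values must be
    in the same units; the standard concentration cancels in the ratio.
    """
    if kd_reference <= 0 or kd_modified <= 0:
        raise ValueError("Kd values must be positive")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_modified / kd_reference)


# Hypothetical matched pair: the modification improves Kd from 1 uM to 100 nM.
ddg = relative_binding_free_energy(kd_reference=1e-6, kd_modified=1e-7)
print(f"ddG = {ddg:.2f} kJ/mol")
```

A ten-fold gain in affinity corresponds to a ΔΔG of about -5.7 kJ/mol at 25 °C and, unlike ΔG°, the value does not depend on the choice of standard state.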

While I’m certainly enthusiastic about physics-based methods such as FEP for calculating ΔΔG it’s not clear how generally these can handle significant modifications to the core of a structure (this is the scaffold-hopping scenario) and I would anticipate difficulties when the main effect of the structural perturbation is to alter conformational preference (as is the case for N-methylation of the secondary amide that is conserved in a number of SARS-CoV-2 main protease inhibitors). That said, the data generation capability of the OpenBind initiative should enable perceived weaknesses in FEP methodology to be addressed. I'll highlight a couple of general ways in which the data sets that OpenBind will generate might be used to validate methods for predicting relative affinity. First, you can use the relative affinity values that correspond to specific structural transformations such as chloro substitution (a good way to study activity cliffs, and focusing on specific structural transformations counters criticism that predictive models are just capturing lipophilicity or molecular size), chloro to bromo (a good way to see if you're modelling halogen bonding effectively), and aromatic nitrogen to CH (in design it is useful to determine where polarity can be introduced with minimal loss of affinity). Second, you can use relative affinity measurements to assess how well models predict non-additivity in SARs (non-additivity can also be considered in the activity cliff framework). I should point out that neither of these suggestions is novel (see L2012 and C2016) and ML modellers are already looking at activity cliffs (see vT2022).
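One common way to quantify non-additivity is a double-transformation cycle built from four compounds: a parent, the two singly modified analogues and the doubly modified analogue. The Python sketch below assumes that affinity is expressed as pKd and that a positive value means the double modification binds more tightly than additivity would predict. The function name, sign convention and numbers are my own choices for illustration and are not taken from L2012 or C2016.

```python
def nonadditivity(pkd_parent, pkd_a, pkd_b, pkd_ab):
    """Return the non-additivity (in pKd units) of a double-transformation cycle.

    pkd_parent: parent compound
    pkd_a:      parent with modification A only
    pkd_b:      parent with modification B only
    pkd_ab:     parent with both modifications
    A result of zero means that the effects of A and B are additive.
    """
    effect_a = pkd_a - pkd_parent
    effect_b = pkd_b - pkd_parent
    expected_ab = pkd_parent + effect_a + effect_b
    return pkd_ab - expected_ab


# Hypothetical measurements for two cycles (values are made up).
additive_cycle = nonadditivity(6.0, 7.0, 6.5, 7.5)
synergistic_cycle = nonadditivity(6.0, 7.0, 6.5, 8.5)
print(additive_cycle, synergistic_cycle)
```

The same calculation can be applied to predicted values, so a model can be assessed on whether it reproduces the measured non-additivity for each cycle and not just the individual affinities. Bear in mind that experimental error accumulates over the four measurements in a cycle, so small values should not be over-interpreted.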

This is a good point at which to wrap up and I'll be taking a look at the OpenADMET initiative in the next post.

Wednesday, 3 June 2026

Two new open data initiatives

I'll open the post with photos that I took in Seoul last year at the Dongdaemun Design Plaza (DDP) which was designed by Zaha Hadid (1950-2016) and I'm ashamed to admit to only having become aware of her ten years ago while wandering around the American University of Beirut (she studied mathematics at AUB and much later designed the building there that houses the Issam Fares Institute for Public Policy and International Affairs).






In the two posts that will follow the current post I’ll be taking a look at the OpenBind and OpenADMET initiatives. A key objective of each initiative is to generate large bodies of high-quality data that will be relevant to drug discovery and make these freely available in the public domain. I certainly see massive value in open data and consider it important that vital resources such as ChEMBL and BindingDB be funded generously. As we rightly celebrate the 2024 Nobel Chemistry Prize we also need to recognize the remarkable foresight of those who launched the Protein Data Bank in 1971 with just seven X-ray crystal structures. All that said, achieving a coverage of chemical space that enables usefully predictive models for diverse pharmaceutically relevant phenomena is likely to prove challenging and those leading the OpenBind and OpenADMET initiatives will need to make clear how compounds are to be selected for synthesis and assay.

Artificial Intelligence (AI) is currently touted as a panacea for the various difficulties faced by drug discovery scientists and sometimes it seems that drugs will condense out of the ether if only the experimentalists would generate enough data. It’s worth pointing out that many (most?) discovery projects that deliver candidates for clinical development do so without ever having sufficient data for building machine learning (ML) models for everything that needs to be measured. In my view this counters arguments that ML models are essential for drug discovery although I’m certainly not denying an important role for usefully predictive ML models.  Many (most?) of the ML models built for drug design are essentially what we used to call quantitative structure-activity/property (QSAR/QSPR) models and I would not label them as AI even when they are used to assess chemical structures generated by AI. One challenge for ML modellers is that they need to demonstrate that their models are usefully predictive outside the chemical spaces in which they have been trained. I argued in this blog post that ML models cannot be compared simply on the basis of how well they fit the data on which they have been trained and that it is necessary to account for model complexity (typically quantified by the number of adjustable parameters used to fit the data) when asserting that one ML model is superior to another.   

Recently I posted on the objectives of drug design in a way that I hoped would be useful for drug discovery scientists using AI and ML in design.  Drug action is driven by concentration (it’s more accurate to describe affinity as sensitivity to a driving force than as a driving force in its own right) and another way of stating this is that the effects of a drug on the human body are determined by the concentration of the drug that is ‘seen’ by its target(s) and anti-targets. A range of bioactivity assays are used by drug discovery scientists to assess the effects of compounds on targets and anti-targets (cell-based assays are also used to assess potential toxicity) and the QSAR field came into being to enable prediction of bioactivity from chemical structures.

It’s easy to understand the pharmacological objectives of drug design which can be stated as “hit the targets” and “don’t hit the anti-targets”. Thinking in terms of concentration is a bit more difficult and it’s important to be aware that the concentration of a drug at its site(s) of action (usually referred to as exposure) is generally not something that you can measure unless the target(s) are in direct contact with plasma and I recommend that everybody working in drug discovery and chemical biology read the SR2019 study. Uncertainty in exposure for intracellular targets is also an issue in clinical development because failure to meet end points in a Phase 2 trial might simply be the result of inadequate exposure. I have argued in NoLE, HBD3 and this blog post that controllability of exposure should be seen as one of the objectives of drug design.

Not being able to measure the concentration of a drug at its site of action complicates drug discovery but is an issue that can be addressed. For example, we can invoke the free drug hypothesis (‘principle’ and ‘theory’ are also used in this context although I personally prefer ‘hypothesis’) by assuming that the concentration of the drug at its site(s) of action is equal to its unbound plasma concentration (which can be measured).  Some of this has been discussed in my post on the objectives of drug design and, in any case, I’ll be covering pharmacokinetic aspects of drug design in more detail when I post on the OpenADMET initiative.

This is a good point at which to conclude. I’ll examine OpenBind in the next post and OpenADMET in the post after that.