Molecular Design: Two new open data initiatives

Wednesday, 3 June 2026

Two new open data initiatives

I'll open the post with photos that I took last year in Seoul at the Dongdaemun Design Plaza (DDP), which was designed by Zaha Hadid (1950-2016). I'm ashamed to admit that I only became aware of her ten years ago while wandering around the American University of Beirut (she studied mathematics at AUB and much later designed the building there that houses the Issam Fares Institute for Public Policy and International Affairs).

In the next two posts I’ll be taking a look at the OpenBind and OpenADMET initiatives. A key objective of each initiative is to generate large bodies of high-quality data that will be relevant to drug discovery and to make these freely available in the public domain. I certainly see massive value in open data and consider it important that vital resources such as ChEMBL and BindingDB be funded generously. As we rightly celebrate the 2024 Nobel Chemistry Prize we also need to recognize the remarkable foresight of those who launched the Protein Data Bank in 1971 with just seven X-ray crystal structures. All that said, achieving a coverage of chemical space that enables usefully predictive models for diverse pharmaceutically relevant phenomena is likely to prove challenging, and those leading the OpenBind and OpenADMET initiatives will need to make clear how compounds are to be selected for synthesis and assay.

Artificial Intelligence (AI) is currently touted as a panacea for the various difficulties faced by drug discovery scientists and sometimes it seems that drugs will condense out of the ether if only the experimentalists would generate enough data. It’s worth pointing out that many (most?) discovery projects that deliver candidates for clinical development do so without ever having sufficient data for building machine learning (ML) models for everything that needs to be measured. In my view this counters arguments that ML models are essential for drug discovery although I’m certainly not denying an important role for usefully predictive ML models. Many (most?) of the ML models built for drug design are essentially what we used to call quantitative structure-activity/property (QSAR/QSPR) models and I would not label them as AI even when they are used to assess chemical structures generated by AI. One challenge for ML modellers is that they need to demonstrate that their models are usefully predictive outside the chemical spaces in which they have been trained. I argued in this blog post that ML models cannot be compared simply on the basis of how well they fit the data on which they have been trained and that it is necessary to account for model complexity (typically quantified by the number of adjustable parameters used to fit the data) when asserting that one ML model is superior to another.
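To make the point about model complexity concrete, here is a minimal sketch that uses the Akaike information criterion (AIC) to penalize a least-squares fit for its number of adjustable parameters. AIC is my choice for illustration and is only one of several ways to account for complexity. The residuals and parameter counts below are hypothetical, and the formula assumes Gaussian errors with constant terms dropped.

```python
import math


def aic_least_squares(residuals, n_params):
    """AIC for a least-squares fit, assuming Gaussian errors.

    Uses AIC = n * ln(RSS / n) + 2 * k with additive constants dropped,
    so only differences between models fitted to the same data matter.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    if n == 0 or rss == 0.0:
        raise ValueError("need at least one residual and a non-zero RSS")
    return n * math.log(rss / n) + 2 * n_params


# Hypothetical example: 50 training compounds, two models.
# The complex model fits the training data slightly better
# but uses many more adjustable parameters.
simple_residuals = [0.50] * 50    # hypothetical, 3 parameters
complex_residuals = [0.45] * 50   # hypothetical, 20 parameters

aic_simple = aic_least_squares(simple_residuals, n_params=3)
aic_complex = aic_least_squares(complex_residuals, n_params=20)

# Lower AIC is preferred: here the better training fit does not
# compensate for the 17 extra parameters.
print(aic_simple < aic_complex)
```

A criterion of this kind only addresses fit to the training data relative to complexity. It says nothing about whether a model is usefully predictive outside the chemical space in which it was trained.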

Recently I posted on the objectives of drug design in a way that I hoped would be useful for drug discovery scientists using AI and ML in design. Drug action is driven by concentration (it’s more accurate to describe affinity as sensitivity to a driving force than as a driving force in its own right) and another way of stating this is that the effects of a drug on the human body are determined by the concentration of the drug that is ‘seen’ by its target(s) and anti-targets. A range of bioactivity assays are used by drug discovery scientists to assess the effects of compounds on targets and anti-targets (cell-based assays are also used to assess potential toxicity) and the QSAR field came into being to enable prediction of bioactivity from chemical structures.

It’s easy to understand the pharmacological objectives of drug design, which can be stated as “hit the targets” and “don’t hit the anti-targets”. Thinking in terms of concentration is a bit more difficult. It’s important to be aware that the concentration of a drug at its site(s) of action (usually referred to as exposure) is generally not something that you can measure unless the target(s) are in direct contact with plasma, and I recommend that everybody working in drug discovery and chemical biology read the SR2019 study. Uncertainty in exposure for intracellular targets is also an issue in clinical development because failure to meet end points in a Phase 2 trial might simply be the result of inadequate exposure. I have argued in NoLE, HBD3 and this blog post that controllability of exposure should be seen as one of the objectives of drug design.

Not being able to measure the concentration of a drug at its site of action complicates drug discovery but is an issue that can be addressed. For example, we can invoke the free drug hypothesis (‘principle’ and ‘theory’ are also used in this context although I personally prefer ‘hypothesis’) by assuming that the concentration of the drug at its site(s) of action is equal to its unbound plasma concentration (which can be measured). Some of this has been discussed in my post on the objectives of drug design and, in any case, I’ll be covering pharmacokinetic aspects of drug design in more detail when I post on the OpenADMET initiative.
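The free drug hypothesis can be written as a one-line calculation. The sketch below assumes that unbound plasma concentration is obtained as total plasma concentration multiplied by fraction unbound. The function name and the numerical values are hypothetical and serve only to illustrate the assumption.

```python
def assumed_site_concentration(total_plasma_conc, fraction_unbound):
    """Concentration at the site of action under the free drug hypothesis.

    Assumes the site-of-action concentration equals the unbound plasma
    concentration, taken here as total concentration * fraction unbound.
    The result has the same units as total_plasma_conc.
    """
    if not 0.0 <= fraction_unbound <= 1.0:
        raise ValueError("fraction_unbound must lie between 0 and 1")
    if total_plasma_conc < 0.0:
        raise ValueError("total_plasma_conc must be non-negative")
    return total_plasma_conc * fraction_unbound


# Hypothetical values: 2.0 uM total plasma concentration, 5% unbound.
site_conc = assumed_site_concentration(total_plasma_conc=2.0, fraction_unbound=0.05)
print(site_conc)  # unbound (and assumed site-of-action) concentration in uM
```

The calculation is trivial, which is the point: the difficulty lies in whether the assumption holds for a given target, not in the arithmetic.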

This is a good point at which to conclude. I’ll examine OpenBind in the next post and OpenADMET in the post after that.
