Sunday, 22 May 2016

Sailor Malan's guide to fragment screening library design

Today I'll take a look at a JMC Perspective on design principles for fragment libraries that is intended to provide advice for academics. When selecting compounds to be assayed, the general process typically consists of two steps. First, you identify regions of chemical space that you hope will be relevant and then you sample these regions. This applies whether you're designing a fragment library, performing a virtual screen or selecting analogs of active compounds with which to develop structure-activity relationships (SAR). Design of compound libraries for fragment screening has actually been discussed extensively in the literature and the following selection of articles, some of which are devoted to the topic, may be useful: Fejzo (1999), Baurin (2004), Mercier (2005), Schuffenhauer (2005), Albert (2007), Blomberg (2009), Chen (2009), Law (2009), Lau (2011), Schulz (2011), Morley (2013). This series of blog posts ( 1 | 2 | 3 | 4 ) on fragment screening library design may also be helpful.

The Perspective opens with the following quote:

"Rules are for the obedience of fools and the guidance of wise men"

Harry Day, Royal Air Force (1898-1977)

It isn't exactly clear what the authors are getting at here since there appears to be no provision for wise women. Also it is not clear how the authors would view rules that required darker complexioned individuals to sit at the backs of buses (or that swarthy economists should not solve differential equations on planes). That said, the quote hands me a legitimate excuse to link Malan's Ten Rules for Air Fighting and I will demonstrate that the authors of this Perspective can learn much from the wise teachings of 'Sailor' Malan.

My first criticism of this Perspective is that the authors devote an inordinate amount of space to topics that are irrelevant from the viewpoint of selecting compounds for fragment screening. Whatever your views on the value of ligand efficiency metrics and thermodynamic signatures, these are things that you think about once you've got the screening results. The authors assert, "As a result, fragment hits form high-quality interactions with the target, usually a protein, despite being weak in potency" and some readers might consider the 'concept' of high-quality interactions to be pseudoscientific psychobabble on par with homeopathy, chemical-free food and the wrong type of snow. That said, discussion of some of these peripheral topics would have been more acceptable if the authors had articulated the library design problem clearly and discussed the most relevant literature early on. By straying from their stated objective, the authors have broken the second of Malan's rules ("Whilst shooting think of nothing else, brace the whole of your body: have both hands on the stick: concentrate on your ring sight").

The section on design principles for fragment libraries opens with a slightly gushing account of the Rule of 3 (Ro3). This is unfortunate because this would have been the best place for the authors to define the fragment library design problem and review the extensive literature on the subject. Ro3 was originally stated in a short communication and the analysis that forms its basis is not shared. As an aside, you need to be wary of rules like these because the cutoffs and thresholds may have been imposed arbitrarily by those analyzing the data. For example, the GSK 4/400 rule actually reflects the scheme used to categorize continuous data and it could just as easily have been the GSK 3.75/412 rule if the data had been pre-processed differently. I have written a couple ( 1 | 2 ) of blog posts on Ro3 but I'll comment here so as to keep this post as self-contained as possible. In my view, Ro3 is a crude attempt to appeal to the herding instinct of drug discovery scientists by milking a sacred cow (Ro5). The uncertainties in hydrogen bond acceptor definitions and logP prediction algorithms mean that nobody knows exactly how others have applied Ro3. It also is somewhat ironic that the first article referenced by this Perspective actually states Ro3 incorrectly. If we assume that Ro5 hydrogen bond acceptor definitions are being used then Ro3 would appear to be an excellent way to ensure that potentially interesting acidic species such as tetrazoles and acylsulfonamides are excluded from fragment screening libraries. While this might not be too much of an issue if identification of adenine mimics is your principal raison d'etre, some researchers may wish to take a broader view of the scope of FBDD. It is even possible that rigid adherence to Ro3 may have led to the fragment starting points for this project being discovered in Gothenburg rather than Cambridge.

Although it is difficult to make an objective assessment of the impact of Ro3 on industrial FBDD, its publication did prove to be manna from heaven for vendors of compounds who could now flog milligram quantities of samples that had previously been gathering dust in stock rooms.

This is a good point to see what 'Sailor' Malan might have made of this article. While dropping Ro3 propaganda leaflets, you broke rule 7 (Never fly straight and level for more than 30 seconds in the combat area) and provided an easy opportunity for an opponent to validate rule 10 (Go in quickly - Punch hard - Get out). Faster than you can say "thought leader" you've been bounced by an Me 109 flying out of the sun. A short, accurate (and ligand-efficient) burst leaves you pondering the lipophilicity of the mixture of glycol and oil that now obscures your windscreen. The good news is that you have been bettered by a top ace whose h index is quite a bit higher than yours. The bad news is that your cockpit canopy is stuck. "Spring chicken to shitehawk in one easy lesson."

Of course, there's a lot more to fragment screening library design than counting hydrogen bonding groups and setting cutoffs for molecular weight and predicted logP. Molecular complexity is one of the most important considerations when selecting compounds (fragments or otherwise) and anybody even contemplating compound library design needs to understand the model introduced by Hann and colleagues. This molecular complexity model is conceptually very important but it is not really a practical tool for selecting compounds. However, molecular complexity can be defined in other ways that allow the general concept to be distilled into usable compound selection criteria. For example, I've used restriction of extent of substitution (as detailed in this article) to control complexity and this can be achieved using SMARTS notation to impose substructural requirements. The thinking here is actually very close to the philosophy behind 'needle screening' which was first described in 2000 by researchers at Roche although they didn't actually use the term 'molecular complexity'.
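For readers who want to get a feel for the molecular complexity argument, here is a toy Python sketch in the spirit of the Hann model. It treats a ligand and a binding site as strings of '+' and '-' features and asks how often a ligand can be placed so that every feature faces its complement. The two-letter feature alphabet, the site length of 8 (chosen only to keep exhaustive enumeration cheap) and the function names are assumptions made for illustration and are not taken from the original paper.

```python
from itertools import product


def count_matches(ligand, site):
    """Count the alignments at which every ligand feature faces its complement."""
    width = len(ligand)
    return sum(
        all(l != s for l, s in zip(ligand, site[offset:offset + width]))
        for offset in range(len(site) - width + 1)
    )


def match_probabilities(ligand_length, site_length=8):
    """Return (P(at least one match), P(exactly one match)).

    Probabilities are exact: every ligand/site pair is enumerated.
    """
    any_match = unique_match = total = 0
    for site in product("+-", repeat=site_length):
        for ligand in product("+-", repeat=ligand_length):
            n = count_matches(ligand, site)
            any_match += n >= 1
            unique_match += n == 1
            total += 1
    return any_match / total, unique_match / total


if __name__ == "__main__":
    for length in range(2, 9):
        p_any, p_unique = match_probabilities(length)
        print(f"features={length}  P(any)={p_any:.3f}  P(unique)={p_unique:.3f}")
```

In this toy version the chance of finding any match can only fall as features are added to the ligand, which is the qualitative point usually taken from the model: the more complex the compound, the less likely it is to fit a given site.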

As one would expect, the purging of unwholesome compounds such as PAINS is discussed. The PAINS field suffers from ambiguity, extrapolation and convolution of fact with opinion. This series ( 1 | 2 | 3 | 4) of blog posts will give you a better idea of my concerns. I say "ambiguity" because it's really difficult to know whether the basis for labeling a compound as a PAIN (or should that be a PAINS) is experimental observation, model-based prediction or opinion. I say "extrapolation" because the original PAINS study equates PAIN with frequent-hitter behavior in a panel of six AlphaScreen assays and this is extrapolated to pan-assay (which many would take to mean different types of assays) interference. There also seems to be a tendency to extrapolate the frequent-hitter behavior in the AlphaScreen panel to reactivity with protein although I am not aware that any of the compounds identified as PAINS in the original study were shown to react with any of the proteins in the AlphaScreen panel used in that study. This is a good point to include a graphic to break the text up a bit and, given an underlying theme of this post, I'll use this picture of a diving Stuka.

One view of the fragment screening mission is that we are trying to present diverse molecular recognition elements to targets of interest. In the context of screening library design, we tend to think of molecular recognition in terms of pharmacophores, shapes and scaffolds. Although you do need to keep lipophilicity and molecular size under tight control, the case can be made for including compounds that would usually be considered to be beyond norms of molecular good taste. In a fragment screening situation I would typically want to be in a position to present molecular recognition elements like naphthalene, biphenyl, adamantane and (especially after my time at CSIRO) cubane to target proteins. Keeping an eye on both molecular complexity and aqueous solubility, I'd select compounds with a single (probably cationic) substituent and I'd not let rules get in the way of molecular recognition criteria. In some ways compound selections like those above can be seen as compliance with Rule 8 (When diving to attack always leave a proportion of your formation above to act as top guard). However, I need to say something about sampling chemical space in order to make that connection a bit clearer.

This is a good point for another graphic and it's fair to say that the Stuka and the B-52 differed somewhat in their approaches to target engagement. The B-52 below is not in the best state of repair and, given that I took the photo in Hanoi, this is perhaps not totally surprising. The key to library design is coverage and former bombardier Joseph Heller makes an insightful comment on this topic. One wonders what First Lieutenant Minderbinder would have made of the licensing deals and mergers that make the pharma/biotech industry such an exciting place to work.  

The following graphic, pulled from an old post, illustrates coverage (and diversity) from the perspective of somebody designing a screening library. Although I've shown the compounds in a two-dimensional space, sampling is often done using molecular similarity which we can think of as inversely related to distance. A high degree of molecular similarity between two compounds indicates that their molecular structures are nearby in chemical space. This is a distance-geometric view of chemical space in which we know the relative positions of molecular structures but not where they are. When we describe a selection of molecular structures as diverse, we're saying that the two most similar ones are relatively distant from each other. The primary objective of screening library design is to cover relevant chemical space as effectively as possible and the devil is in the details like 'relevant' and 'effectively'. The stars in the graphic below show molecular structures that have been selected to cover the chemical space shown. When representing a number of molecular structures by a single molecular structure it is important, as it is in politics, that what is representative not be too distant from what is being represented. You might ask, "how far is acceptable?" and my response would be, as it often is in Brazil, "boa pergunta". One problem is that scaffolds differ in their 'contributions' to molecular similarity and activity cliffs usually provide a welcome antidote to the hubris of the library designer.
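The idea of picking representatives so that nothing is too far from one of them can be sketched in a few lines of Python. What follows is a minimal greedy MaxMin picker, one common way of doing diversity selection and not necessarily the one used for the graphic. Fingerprints are represented as plain sets of 'on' bits, distance is taken as 1 minus Tanimoto similarity, and the example fingerprints and function names are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints held as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedy MaxMin selection.

    Repeatedly adds the structure whose nearest already-picked neighbor is
    farthest away. Returns the picked indices and the coverage radius, i.e.
    the largest distance from any structure to its nearest representative.
    """
    picked = [first]
    nearest = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        best = max(candidates, key=lambda i: nearest[i])
        picked.append(best)
        for i, fp in enumerate(fingerprints):
            nearest[i] = min(nearest[i], 1.0 - tanimoto(fp, fingerprints[best]))
    return picked, max(nearest)


if __name__ == "__main__":
    # Hypothetical fingerprints: two pairs of close analogs and one singleton.
    fps = [
        {1, 2, 3, 4},
        {1, 2, 3, 5},
        {10, 11, 12, 13},
        {10, 11, 12, 14},
        {20, 21, 22, 23},
    ]
    picked, radius = maxmin_pick(fps, 3)
    print("picked:", picked, "coverage radius:", round(radius, 2))
```

The coverage radius is one way of putting a number on "how far is acceptable?", although choosing the threshold remains a judgment call and the number is only as meaningful as the similarity measure behind it.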

I would argue that property distributions are more important than cutoff values for properties and it is during the sampling phase of library design that these distributions are shaped. One way of controlling distributions is to first define regions of chemical space using progressively less restrictive selection criteria and then sample these in order, starting with the most restrictively defined region. However, this is not the only way to sample and one might also try to weight fragment selection using desirability functions. Obviously, I'm not going to provide a comprehensive review of chemical space sampling in a couple of paragraphs of a blog post but I hope to have shown that the sampling of chemical space is an important aspect of fragment screening library design. I also hope to have shown that failing to address the issue of sampling relevant chemical space represents a serious deficiency of the featured Perspective.
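Sampling regions in order of restrictiveness can be sketched as follows. This is a minimal Python illustration and not a recipe: the property names, the tier boundaries and the compound records are all hypothetical, and a real selection would sample within each tier using a diversity or coverage criterion rather than simply taking compounds in input order.

```python
def tiered_selection(compounds, tiers, quota):
    """Fill `quota` slots by sampling tiers in order, most restrictive first.

    `tiers` is a list of (name, accept) pairs where `accept` is a predicate
    on a compound record. A compound is assigned to the first tier it passes.
    """
    selected, seen = [], set()
    for name, accept in tiers:
        for cpd in compounds:
            if len(selected) == quota:
                return selected
            if cpd["id"] not in seen and accept(cpd):
                selected.append((cpd["id"], name))
                seen.add(cpd["id"])
    return selected


# Hypothetical tier definitions: the numbers are placeholders, not recommendations.
TIERS = [
    ("strict", lambda c: 9 <= c["heavy_atoms"] <= 14 and 0.0 <= c["logp"] <= 2.0),
    ("relaxed", lambda c: 9 <= c["heavy_atoms"] <= 16 and -1.0 <= c["logp"] <= 3.0),
]

if __name__ == "__main__":
    compounds = [
        {"id": "A", "heavy_atoms": 12, "logp": 1.1},
        {"id": "B", "heavy_atoms": 16, "logp": 2.6},
        {"id": "C", "heavy_atoms": 10, "logp": 0.4},
        {"id": "D", "heavy_atoms": 20, "logp": 3.5},
        {"id": "E", "heavy_atoms": 15, "logp": -0.5},
    ]
    for cpd_id, tier in tiered_selection(compounds, TIERS, quota=4):
        print(cpd_id, tier)
```

Because the strict tier is exhausted before the relaxed one is touched, the bulk of the library ends up in the preferred region while the looser criteria only fill the remaining slots, which is how the shape of the property distribution gets controlled.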

The Perspective concludes with a number of recommendations and I'll conclude the post with comments on some of these. I wouldn't have too much of a problem with the proposed 9 - 16 heavy atom range as a guideline although I would consider a requirement that predicted octanol/water logP be in the range 0.0 - 2.0 to be overly restrictive. It would have been useful for the authors to say how they arrived at these figures and I invite all of them to think very carefully about exactly what they mean by "cLogP" and "freely rotatable bonds" so we don't have a repeat of the Ro3 farce. There are many devils in the details of the statement: "avoid compounds/functional groups known to be associated with high reactivity, aggregation in solution, or false positives". My response to "known" is that it is not always easy to distinguish knowledge from opinion and "associated" (like correlated) is not a simple yes/no thing. It is not clear how "synthetically accessible vectors for fragment growth" should be defined and there is also a conformational stability issue if bonds to hydrogen are regarded as growth vectors.

This is a good point at which to wrap things up and I'd like to share some more of Sailor Malan's wisdom before I go. The first rule (Wait until you see the whites of his eyes. Fire short bursts of 1 to 2 seconds and only when your sights are definitely 'ON') is my personal favorite and it provides excellent, practical advice for anybody reviewing the scientific literature. I'll leave you with a short video in which a pre-jackal Edward Fox displays marksmanship and escaping skills that would have served him well in the later film. At the start of the video, the chemists and biologists have been bickering (of course, this never really happens in real life) and the VP for biophysics is trying to get them to toe the line. Then one of the biologists asks the VP for biophysics if they can do some phenotypic screening and you'll need to watch the video (or this version) to see what happens next...

Sunday, 8 May 2016

A real world perspective on molecular design

I'll be taking a look at a Real-World Perspective on Molecular Design which has already been reviewed by Ash. I don't agree that this study can accurately be described as 'prospective' although, in fairness, it is actually very difficult to publish molecular design work in a genuinely prospective manner. Another point to keep in mind is that molecular modelers (like everybody else in drug discovery) are under pressure to demonstrate that they are making vital contributions. Let's take a look at what the authors have to say:

"The term “molecular design” is intimately linked to the widely accepted concept of the design cycle, which implies that drug discovery is a process of directed evolution (Figure 1). The cycle may be subdivided into the two experimental sections of synthesis and testing, and one conceptual phase. This conceptual phase begins with data analysis and ends with decisions on the next round of compounds to be synthesized. What happens between analysis and decision making is rather ill-defined. We will call this the design phase. In any actual project, the design phase is a multifaceted process, combining information on status and goals of the project, prior knowledge, personal experience, elements of creativity and critical filtering, and practical planning. The task of molecular design, as we understand it, is to turn this complex process into an explicit, rational and traceable one, to the extent possible. The two key criteria of utility for any molecular design approach are that they should lead to experimentally testable predictions and that whether or not these predictions turn out to be correct in the end, the experimental result adds to the understanding of the optimization space available, thus improving chances of correct prediction in an iterative manner. The primary deliverable of molecular design is an idea [4] and success is a meaningful contribution to improved compounds that interrogate a biological system."

This is certainly a useful study although I will make some criticisms in the hope that doing so stimulates discussion. I found the quoted section to lack coherence and would argue that the design cycle is actually more of a logistic construct than a conceptual one. That said, I have to admit that it's not easy to clearly articulate what is meant by the term 'molecular design'. One definition of molecular design is control of behavior of compounds and materials by manipulation of molecular properties. Using the term 'behavior' captures the idea that we design compounds to 'do' rather than merely to 'be'. I also find it useful to draw a distinction between hypothesis-driven molecular design (ask good questions) and prediction-driven molecular design (synthesize what the models, metrics or tea leaves tell you to). Asking good questions is not as easy as it sounds because it is not generally possible to perform controlled experiments in the context of molecular design as discussed in another post from Ash. Hypothesis-driven molecular design can also be thought of as a framework in which to efficiently obtain the information required to make decisions and, in this sense, there are analogies with statistical molecular design. I believe that the molecular design that the authors describe in the quoted section is of the hypothesis-driven variety but hand-wringing about how "ill-defined" it is doesn't really help move things forward. The principal challenges for hypothesis-driven molecular design are to make it more objective, systematic and efficient. I'll refer you to a trio of blog posts ( 1 | 2 | 3 ) in which some of this is discussed in more detail.

I'll not say anything specific about the case studies presented in this study except to note that sharing specific examples of application of  molecular design as case studies does help to move the field forward even when the studies are incomplete. The examples do illustrate how the computational tools and structural databases can be used to provide a richer understanding of molecular properties such as conformational preferences and interaction potential. The CSD (Cambridge Structural Database) is a particularly powerful tool and, even in my Zeneca days, I used to push hard to get medicinal chemists using it. Something that we in the medicinal chemistry community might think about is how incomplete studies can be published so that specific learning points can be shared widely in a timely manner.  

But now I'd like to move on to the conclusions, starting with 1 (value of quantitative statements). The authors note:

"Frequently, a single new idea or a pointer in a new direction is sufficient guidance for a project team. Most project impact comes from qualitative work, from sharing an insight or a hypothesis rather than a calculated number or a priority order. The importance of this observation cannot be overrated in a field that has invested enormously in quantitative prediction methods. We believe that quantitative prediction alone is a misleading mission statement for molecular design. Computational tools, by their very nature, do of course produce numerical results, but these should never be used as such. Instead, any ranked list should be seen as raw input for further assessment within the context of the project. This principle can be applied very broadly and beyond the question of binding affinity prediction, for example, when choosing classification rather than regression models in property prediction."
This may be uncomfortable reading for QSAR advocates, metric touts and those who would have you believe that they are going to disrupt drug discovery by putting cheminformatics apps on your phone. It also is close to my view of the role of computational chemistry in molecular design (the observant reader will have noticed that I didn't equate the two activities) although, in the interests of balance, I'll refer you to a review article on predictive modelling. We also need to acknowledge that predictive capability will continue to improve (although pure prediction-driven pharmaceutical design is likely to be at least a couple of decades away) and readers might find this blog post to be relevant. 

Let's take a look at conclusion 5 (Staying close to experiment) and the authors note:

"One way of keeping things as simple as possible is to preferentially utilize experimental data that may support a project, wherever this is meaningful. This may be done in many different ways: by referring to measured parameters instead of calculated ones or by utilizing existing chemical building blocks instead of designing new ones or by making full use of known ligands and SAR or related protein structures. Rational drug design has a lot to do with clever recycling."

This makes a lot of sense although I don't recommend use of the tautological term 'rational drug design' (has anybody ever done irrational drug design?). What they're effectively saying here is that it is easier to predict the effect of structural changes on properties of compounds than it is to predict those properties directly from molecular structure. The implication of this for cheminformaticians (and others seeking to predict behaviour of compounds) is that they need to look at activity and chemical properties in terms of relationships between the molecular structures of compounds. I've explored this theme, both in an article and a blog post, although I should point out that there is a very long history of associating changes in the values of properties of compounds with modifications to molecular structures.

However, there is another side to "staying close to experiment" and that is recognizing what is and what isn't an experimental observable. The authors are clearly aware of this point when they state: 

"MD trajectories cannot be validated experimentally, so extra effort is required to link such simulation results back to truly testable hypotheses, for example, in the qualitative prediction of mechanisms or protein movements that may be exploited for the design of binders."

When interpreting structures of protein-ligand complexes, it is important to remember that the contribution of an intermolecular contact to affinity is not, in general, an experimental observable. As such, it would have been helpful if the authors had been a bit more explicit about exactly which experimental observable(s) form the basis of the "Scorpion network analysis of favorable interactions". The authors make a couple of references to ligand efficiency and I do need to point out that scaling free energy of binding has no thermodynamic basis because, in general, our perception of efficiency changes with the concentration used to define the standard state. On a lighter note there is a connection between ligand efficiency and homeopathy that anybody writing about molecular design might care to ponder and that's where I'll leave things.
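The standard state point is easy to check numerically. The sketch below, in Python, takes ligand efficiency as the negative of the standard free energy of binding divided by the number of heavy atoms, with the free energy computed as RT ln(Kd/C°). The two compounds and their Kd values are invented for illustration, and the function name and default temperature are assumptions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def ligand_efficiency(kd_molar, heavy_atoms, standard_conc_molar=1.0, temp_k=298.15):
    """Ligand efficiency in kcal/mol per heavy atom for a chosen standard concentration."""
    delta_g = R_KCAL * temp_k * math.log(kd_molar / standard_conc_molar)
    return -delta_g / heavy_atoms


if __name__ == "__main__":
    # Hypothetical compounds: (label, Kd in mol/L, heavy atom count).
    compounds = [("fragment", 1e-3, 10), ("lead", 1e-9, 30)]
    for c_std in (1.0, 1e-3, 1e3):
        values = ", ".join(
            f"{label}: {ligand_efficiency(kd, n, c_std):.3f}"
            for label, kd, n in compounds
        )
        print(f"standard concentration = {c_std:g} M -> {values}")
```

With the conventional 1 M standard state the two hypothetical compounds come out equally efficient, with a 1 mM standard state the lead looks more efficient, and with a 1000 M standard state the fragment does. Nothing about the compounds has changed, only the concentration unit, which is the sense in which the perception of efficiency depends on the standard state.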