Molecular Design: A SMARTS way to do things?

Thursday, 15 September 2011

A SMARTS way to do things?

A couple of months ago I returned from a visit to OpenEye in Santa Fe, New Mexico. I’d been helping out with tautomers and ionisation and it really was great to be back in one of my favourite States of the Union catching up with some old friends while making some new ones. However, it’s neither tautomers nor ionisation that I’ll be discussing in this post because I really want to talk about SMARTS. This is a line notation for defining substructural queries and a SMARTS parser with full capability is one of the most powerful weapons in the molecular design arsenal. One of the things that I did in Santa Fe was to learn a bit about using the OpenEye SMARTS parser. I like to think of SMARTS as empowering in that a SMARTS parser allows me to impose my will on a database of chemical structures. This really brings out my latent megalomaniac and makes me want to gaze at large wall-mounted maps of the world…

SMARTS notation is actually very simple but at the same time highly expressive, and it’s best illustrated using some examples. Let’s start with a simple definition for a neutral carboxylic acid. I’ve kept things simple by not requiring the carbon to be bonded to another carbon atom.

[OH]C=O

When dealing with commercially available collections of compounds, the carboxylic acids may be registered both in neutral and anionic (salt) forms. Although people in Pharma may whinge about this, one has to remember that a compound vendor needs to distinguish benzoic acid from sodium benzoate and I have no time for lily-livered whingers. As Marie Antoinette might have said, “Let them eat SMARTS”. Here are a couple of SMARTS queries that will match either neutral or anionic forms of carboxylic acids. [O;H,-] specifies an oxygen atom that either has a single hydrogen or a negative charge, while [OD1] specifies an oxygen atom with a single non-hydrogen connection.

[O;H,-]C=O

[OD1]C=O
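To see how the logical operators inside the brackets combine, here is a minimal Python sketch that evaluates a bracketed atom expression against a hand-written description of an atom. It is not how a real SMARTS parser works: the dictionary keys (element, hcount, charge, degree) and the function name atom_matches are made up for illustration, and only element symbols, H, D, X and charges are handled. The point is the precedence: ';' is a low-precedence AND, ',' is OR, and primitives written side by side are ANDed first.

```python
import re

# One primitive: H/D/X with an optional count, a charge, or an element symbol.
_PRIMITIVE = re.compile(
    r"(?P<key>[HDX])(?P<n>\d*)|(?P<charge>[+-])(?P<q>\d*)|(?P<elem>[A-Z][a-z]?)"
)


def _primitive_matches(m, atom):
    """Test a single parsed primitive against an atom dictionary."""
    if m["elem"]:
        return atom["element"] == m["elem"]
    if m["charge"]:
        magnitude = int(m["q"] or 1)
        return atom["charge"] == (magnitude if m["charge"] == "+" else -magnitude)
    count = int(m["n"] or 1)  # a missing count means 1, so H is the same as H1
    if m["key"] == "H":
        return atom["hcount"] == count
    if m["key"] == "D":
        return atom["degree"] == count
    # X: total connections, taken here as heavy-atom neighbours plus hydrogens
    return atom["degree"] + atom["hcount"] == count


def _all_primitives_match(alternative, atom):
    """Adjacent primitives such as 'OD1' are an implicit (high-precedence) AND."""
    pos = 0
    while pos < len(alternative):
        m = _PRIMITIVE.match(alternative, pos)
        if m is None:
            raise ValueError(f"unsupported primitive in {alternative!r}")
        if not _primitive_matches(m, atom):
            return False
        pos = m.end()
    return True


def atom_matches(expr, atom):
    """Evaluate an atom expression (without the square brackets)."""
    return all(
        any(_all_primitives_match(alt, atom) for alt in clause.split(","))
        for clause in expr.split(";")
    )


# Hypothetical atom descriptions for the oxygens discussed above.
hydroxyl_o = {"element": "O", "hcount": 1, "charge": 0, "degree": 1}
carboxylate_o = {"element": "O", "hcount": 0, "charge": -1, "degree": 1}
carbonyl_o = {"element": "O", "hcount": 0, "charge": 0, "degree": 1}
ether_o = {"element": "O", "hcount": 0, "charge": 0, "degree": 2}

print(atom_matches("O;H,-", hydroxyl_o))     # True
print(atom_matches("O;H,-", carboxylate_o))  # True
print(atom_matches("O;H,-", carbonyl_o))     # False
print(atom_matches("OD1", ether_o))          # False
```

Note that this only tests one atom at a time. Matching the rest of the pattern (the C=O part) needs a proper substructure search, which is what a toolkit's SMARTS parser provides.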

A SMARTS parser with full capability will not only match the substructural pattern but will also map individual atoms. This is really useful for atom typing; remember that you can get a lot of information (e.g. ionisation, interaction potential) about an atom from its connectivity. In a pharmacophore search I would want to treat both oxygen atoms of the carboxylic acid as anionic and might do this using recursive SMARTS as follows. Each $(...) describes an atomic environment and only the first atom of the enclosed pattern is matched.

[$([OH]C=O),$(O=C[OH])]

One of my favourite features of SMARTS is vector binding, which associates a SMARTS pattern with a label and allows you to create patterns that are much more human-readable. This is really important when creating a view of chemistry that is to be imposed on chemical databases. I’ll show how you can build a simple definition of aliphatic amines (remember that these usually protonate under normal physiological conditions) using vector bindings. First let’s define a carbon with four connections.

Csp3 [CX4]

Now we’ll use this to define primary, secondary and tertiary amines which we’ll then combine into a single all aliphatic amine definition. Notice how I ‘over-specify’ the nitrogen connectivity in order to prevent matching against amine oxides, protonated amines and quaternary ammonium.

PriAmin [N;H2;X3][$Csp3]

SecAmin [N;H;X3]([$Csp3])[$Csp3]

TerAmin [NX3]([$Csp3])([$Csp3])[$Csp3]

AllAmin [$PriAmin,$SecAmin,$TerAmin]
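If your SMARTS parser supports recursive SMARTS but not vector bindings, the definitions above can be flattened by text substitution. The Python sketch below does this under one stated assumption: that a reference such as $Csp3 behaves like a recursive SMARTS wrapped around the bound pattern, i.e. $([CX4]). The function name expand_bindings and the dictionary layout are hypothetical and do not come from any toolkit.

```python
import re

# Matches "$Name" but not "$(", which already opens a recursive SMARTS.
_BINDING_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def expand_bindings(definitions):
    """Return {name: SMARTS} with every $Name replaced by $(expanded SMARTS)."""
    expanded = {}

    def resolve(name, stack=()):
        if name in expanded:
            return expanded[name]
        if name in stack:
            raise ValueError(f"circular binding: {name}")
        if name not in definitions:
            raise KeyError(f"undefined binding: {name}")
        # Replace each reference with the fully expanded pattern it points to.
        pattern = _BINDING_REF.sub(
            lambda m: "$(" + resolve(m.group(1), stack + (name,)) + ")",
            definitions[name],
        )
        expanded[name] = pattern
        return pattern

    for name in definitions:
        resolve(name)
    return expanded


# The amine definitions from the text, keyed by label.
bindings = {
    "Csp3": "[CX4]",
    "PriAmin": "[N;H2;X3][$Csp3]",
    "SecAmin": "[N;H;X3]([$Csp3])[$Csp3]",
    "TerAmin": "[NX3]([$Csp3])([$Csp3])[$Csp3]",
    "AllAmin": "[$PriAmin,$SecAmin,$TerAmin]",
}

expanded = expand_bindings(bindings)
print(expanded["PriAmin"])  # [N;H2;X3][$([CX4])]
print(expanded["AllAmin"])
```

The flattened AllAmin pattern is correct but hard to read, which is a fair illustration of why the labelled form is worth having.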

So that finishes our quick introduction to SMARTS notation. In my own work, I’ve used SMARTS not only to locate structural features in molecules but also to modify the molecules, for example to set ionisation states in a database of structures to be docked into the binding site of a protein. Being able to modify structures automatically and in a controlled manner also makes it possible to do cool stuff like identify matched molecular pairs (Leach et al 2006; Birch et al 2009). I should mention that there is a SMARTS-like notation called SMIRKS for modifying structures although I’m not going to say anything about it right now.

There’s plenty of information about SMARTS out there, including a Wikipedia page and the Daylight SMARTS Theory Manual, Tutorial and Examples. The Daylight and OpenEye SMARTS parsers are provided as tool kits (so you can build your own software) and both support recursive SMARTS and vector bindings (not all SMARTS parsers do this so check with your software vendor). I started with the Daylight product back in 1995 and taught myself some C in order to use it. However, the OpenEye SMARTS parser can also be used with 3D structures and I’m looking forward to doing lots more with it.

I’ll finish with some comments on terminology. A substructural definition written in SMARTS notation can be called a SMARTS pattern, a SMARTS string or even a SMARTS. Whatever you do, don’t call it a SMART (you wouldn’t talk about a specie in relation to living organisms) because that will make you look half-witted (and make me cringe). Also to talk about a SMILE or a SMIRK would be equally crass so don’t say I didn’t warn you.

Literature cited

Kenny & Sadowski, Structure Modification in Chemical Databases. Methods and Principles of Medicinal Chemistry 2005, 23, 271-285

Leach et al, Matched Molecular Pairs as a Guide in the Optimization of Pharmaceutical Properties; a Study of Aqueous Solubility, Plasma Protein Binding and Oral Exposure. J. Med. Chem. 2006, 49, 6672-6682

Birch et al, Matched molecular pair analysis of activity and properties of glycogen phosphorylase inhibitors. Bioorg. Med. Chem. Lett. 2009, 19, 850-853
