Christian Lemmen; BioSolveIT GmbH, 53757 Sankt Augustin, Germany
Similarity searching is a central concept for lead finding whenever the 3D structure of the target protein is unavailable. The search space can be any compound database, such as an inventory, a vendor catalogue, or an enumerated combinatorial library. Most of the faster similarity searching methods available today are based on descriptors such as structural keys or fingerprints that represent a molecule as a linear string.
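To make the "linear string" descriptors concrete, here is a minimal sketch of a fingerprint comparison. It assumes fingerprints are stored as Python sets of on-bit indices and uses the Tanimoto coefficient, a common choice for fingerprints; the abstract itself does not say which coefficient was used, and the bit positions below are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Illustrative (made-up) bit positions for a query and a database molecule.
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}
score = tanimoto(query, candidate)  # 3 shared bits, 6 bits in the union
```

Because such a descriptor is a flat bit string, it records which features are present but not how the corresponding parts of the molecule are connected.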
Feature Trees, as the name suggests, represent a molecule as an undirected labeled tree. Each tree node represents a small fragment or building block of the molecule. These can be hydrophobic fragments, ring systems, or functional groups of the molecule. Each node in the tree is labeled with a set of features representing the physico-chemical properties of the corresponding part of the molecule, such as information about the shape, H-bonding capabilities, and the like. Two molecules are compared by applying a tree matching algorithm and calculating a similarity value from the resulting matching.
The tree matching algorithm is a recursive procedure based on matching subtrees of two feature trees onto each other. With this approach, Feature Trees comparisons are quite fast: on average, comparing a molecule to a benchmark set (a subset of the MDDR) of about 1000 molecules takes less than a minute. Feature Trees is a 2.5D descriptor and as such is conformation independent and fragment based, so it can handle local similarity. Most importantly, it is not only a similarity measure: the matching also yields a rough alignment of the molecules.
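The tree representation described above can be sketched as a small data structure. This is only an illustration: the class name `FeatureNode`, the particular feature fields (`volume`, `donors`, `acceptors`, `hydrophobic`) and the example values are assumptions made for this sketch, not the descriptor's actual feature set.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    """One fragment of a molecule, labeled with hypothetical features."""
    name: str
    volume: float              # stand-in for a shape feature
    donors: int = 0            # H-bond donor count
    acceptors: int = 0         # H-bond acceptor count
    hydrophobic: bool = False
    # Undirected edges: each node stores its neighbors.
    neighbors: list = field(default_factory=list, repr=False, compare=False)


def connect(a, b):
    """Add an undirected edge between two fragment nodes."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def count_nodes(start):
    """Count the nodes reachable from `start` by walking the undirected edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.neighbors)
    return len(seen)


# Toy molecule: ring system - amide linker - methyl group (values are illustrative).
ring = FeatureNode("phenyl", volume=6.0, hydrophobic=True)
amide = FeatureNode("amide", volume=3.0, donors=1, acceptors=1)
methyl = FeatureNode("methyl", volume=1.0, hydrophobic=True)
connect(ring, amide)
connect(amide, methyl)
```

Unlike a linear fingerprint, this structure keeps the connectivity between fragments, which is what a tree matching algorithm operates on.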
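The recursive idea of matching subtrees onto each other can be illustrated with a deliberately simplified sketch. It is not the published Feature Trees algorithm: it assumes rooted trees, matches root to root, maps child subtrees one to one by brute force, and leaves unmatched subtrees unscored. The feature vectors (size, donors, acceptors) and the normalization are likewise assumptions chosen for this example.

```python
from itertools import permutations

# A tree is a pair (features, children); features is a tuple of non-negative numbers.


def node_score(fa, fb):
    """Overlap of two feature vectors: sum of the element-wise minima."""
    return sum(min(x, y) for x, y in zip(fa, fb))


def match(ta, tb):
    """Best score for matching root to root and child subtrees one to one."""
    (fa, kids_a), (fb, kids_b) = ta, tb
    score = node_score(fa, fb)
    if len(kids_a) > len(kids_b):
        kids_a, kids_b = kids_b, kids_a  # always assign the shorter child list
    best = 0.0
    # Brute force over injective assignments; acceptable only for tiny trees.
    for chosen in permutations(kids_b, len(kids_a)):
        best = max(best, sum(match(a, b) for a, b in zip(kids_a, chosen)))
    return score + best


def size(tree):
    """Total feature mass of a tree, which equals match(tree, tree)."""
    features, kids = tree
    return sum(features) + sum(size(k) for k in kids)


def similarity(ta, tb):
    """Normalized similarity in [0, 1]; 1.0 for identical trees."""
    denom = size(ta) + size(tb)
    return 2.0 * match(ta, tb) / denom if denom else 1.0


# Toy trees with features (size, donors, acceptors); values are illustrative.
tree_a = ((6, 0, 0), [((1, 1, 1), [((1, 0, 0), [])])])
tree_b = ((6, 0, 0), [((1, 0, 2), [])])
sim = similarity(tree_a, tree_b)
```

The pairs of subtrees chosen during the recursion are what would provide the rough alignment mentioned above; this sketch returns only the score to stay short.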
On a dataset of 970 molecules assembled by Briem, Feature Trees has been shown to identify molecules belonging to the same class of inhibitors. For five different inhibitor classes mixed with randomly selected drug-like molecules, average enrichment factors improved significantly over those obtained with Daylight fingerprints in two of the classes. Two more classes performed similarly, and only one slightly worse. On a second dataset of 58 molecules with known binding modes taken from the PDB, the relevance of the matchings produced by our algorithms could be demonstrated.
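For readers unfamiliar with enrichment factors, the following sketch shows one common definition: the hit rate among the top-ranked fraction of a similarity-sorted list divided by the hit rate in the whole dataset. The abstract does not state the exact definition or cutoff used in the study, so both are assumptions here, and the ranking below is invented.

```python
def enrichment_factor(ranked_labels, top_fraction):
    """Enrichment of actives in the top-ranked fraction of a list.

    ranked_labels: 1 for an active, 0 for an inactive, sorted by decreasing similarity.
    top_fraction: fraction of the list that is inspected, e.g. 0.25.
    """
    n_total = len(ranked_labels)
    n_top = max(1, round(n_total * top_fraction))
    actives_total = sum(ranked_labels)
    if actives_total == 0:
        raise ValueError("the dataset contains no actives")
    actives_top = sum(ranked_labels[:n_top])
    # (actives_top / n_top) / (actives_total / n_total), written as a single division.
    return (actives_top * n_total) / (n_top * actives_total)


# Invented ranking: 20 molecules, 4 actives, 3 of them among the top 5.
ranking = [1, 1, 0, 1, 0] + [0] * 10 + [1] + [0] * 4
ef = enrichment_factor(ranking, 0.25)
```

A value of 1 corresponds to random selection; larger values mean that actives are concentrated near the top of the ranking.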
Using the feature tree descriptor, a method to search combinatorial libraries and large chemistry spaces has been developed. This method is based on a dynamic programming algorithm that avoids enumerating or constructing any molecules but the actual hit molecules in the final stage. Under certain conditions, the search is optimal, i.e., it is guaranteed to find the molecules most similar to the query molecule. Besides highest similarity, the method can also be applied to search for molecules on any other similarity level with respect to a query molecule. Furthermore, it can be used to create a diverse subset of molecules similar to the query on a given similarity level.
The new search method has been applied to a chemistry space created by 'shredding' the WDI into about 17000 fragments that can be assembled via 12 different types of connections. The accessible space of reasonably sized molecules contains about 2.15 × 10^18 molecules. A search in this space takes at most 20 minutes on a single-CPU workstation. For testing, several drug-like molecules have been used as queries. Searching on similarity level 1.0, the query molecules and closely related analogs are consistently retrieved. Searching on a lower similarity level, structurally more diverse molecules still related to the query are found. In an example application, we demonstrate the ability to jump between known inhibitors of different structural classes on sets of about 50 dopamine D4 antagonists and 50 histamine H1 antagonists.
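A toy example may help show why a search can find the best product without enumerating the library. It assumes, hypothetically, that the similarity decomposes additively over attachment positions, so the best reagent can be chosen independently for each position. The actual method uses dynamic programming over feature trees and is considerably more involved; the function names and the numeric "fragments" below are invented.

```python
from itertools import product


def best_product(query_parts, reagent_sets, part_score):
    """Pick the best reagent per position without building any product."""
    picks, total = [], 0.0
    for part, reagents in zip(query_parts, reagent_sets):
        best = max(reagents, key=lambda r: part_score(part, r))
        picks.append(best)
        total += part_score(part, best)
    return picks, total


def best_by_enumeration(query_parts, reagent_sets, part_score):
    """Reference result: score every product of the fully enumerated library."""
    return max(
        sum(part_score(q, r) for q, r in zip(query_parts, combo))
        for combo in product(*reagent_sets)
    )


def size_penalty(query_part, reagent):
    """Invented score: fragments are plain numbers, closer sizes score higher."""
    return -abs(query_part - reagent)


query_parts = [5, 3]
reagent_sets = [[1, 4, 6], [2, 3, 9]]
picks, total = best_product(query_parts, reagent_sets, size_penalty)
```

With n1 and n2 reagents at two positions, the per-position search evaluates n1 + n2 fragment scores instead of n1 × n2 products, and under the additivity assumption both approaches return the same optimum.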
Matthias Rarey and Marc Zimmermann are the main developers of Feature Trees; however, numerous cooperation partners and users have had, and continue to have, great impact on its development: Scott, Jens Lösel, Martin Stahl and Thorsten Naumann. BioSolveIT GmbH obtained the rights to further develop and distribute Feature Trees commercially.
1. Rarey, M.; Dixon, J.S. Feature Trees: A new molecular similarity measure based on tree matching. J. Comput.-Aided Mol. Design, 1998, 12, 471-490
2. Schneider, G.; Lee, M.L.; Stahl, M.; Schneider, P. De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J. Comput.-Aided Mol. Design, 2000, 14, 487-494
3. www.biosolveit.de -> Software -> Ftrees
RECENT AND FORTHCOMING DEVELOPMENTS IN THE CAMBRIDGE STRUCTURAL DATABASE SYSTEM
I. Bruno, K.J. Lipscomb; CCDC, CB2 1EZ Cambridge, UK
The software accompanying the Cambridge Structural Database (CSD) has undergone considerable updating and change over the past three years – a process which is ongoing.
The older QUEST3D interface to the CSD is now obsolete and has been replaced by ConQuest, a much more user-friendly system supported on Unix, Linux and Windows operating systems. ConQuest functionality is continually expanding, with a new version of the software distributed with each release of the CSD. Recent additions include peptide searching, the ability to paste 2D structure diagrams from ChemDraw and ISIS/Draw, and, from April 2002, the ability to download regular updates for the Cambridge Structural Database itself, keeping the CSD more current than ever before.
Structure visualisation is now performed using the program Mercury. Distributed as part of the CSD System, and freely available for anyone to download from the CCDC website, Mercury is ideal for examining intermolecular interactions and the role they play in crystal packing.
The IsoStar knowledge-based library of intermolecular interactions, distilled from the CSD and the Protein Data Bank (PDB), remains available, and will shortly be joined by its intramolecular sister, Mogul. Mogul is a library of molecular geometry, incorporating knowledge about bond lengths, valence angles and acyclic torsions from the CSD, and is currently under development at CCDC with a view to eventual distribution as part of the CSD System.