NOVEL POPULATION-BASED OPTIMIZATION TECHNIQUES FOR FEATURE SELECTION: ARTIFICIAL ANTS AND PARTICLE SWARMS
Sergei Izrailev, Walter Cedeno, Dimitris Agrafiotis; 3-Dimensional Pharmaceuticals, Inc., Exton, PA 19341, USA
We compare the performance of novel feature selection algorithms for structure-activity and structure-property correlation based on artificial ant colony systems and particle swarms with techniques based on simulated annealing. These algorithms are inspired by the cooperative behavior of animal populations. The artificial ant colony approach capitalizes on the mechanism that allows real ants to find the shortest path between a food source and their nest using deposits of pheromone as a communication agent. The particle swarm technique takes advantage of the observation that social interaction, which is believed to play a crucial role in human cognition, can serve as a valuable heuristic in identifying optimal solutions to difficult optimization problems. The underlying basic self-organizing principles are exploited for the construction of parsimonious QSAR models based on neural networks. The advantages of these novel optimization paradigms are demonstrated using several classical QSAR data sets.
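The pheromone mechanism described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the update rule, the parameter values, and the names `ant_colony_select`, `sample_subset` and `toy_fitness` are assumptions made for illustration, and the toy fitness function stands in for the quality of a neural network QSAR model built on the selected descriptors.

```python
import random


def sample_subset(pheromone, size, rng):
    """Draw `size` distinct feature indices with probability proportional to pheromone."""
    candidates = list(range(len(pheromone)))
    chosen = []
    for _ in range(size):
        weights = [pheromone[c] for c in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


def ant_colony_select(n_features, subset_size, fitness,
                      n_ants=20, n_iter=50, evaporation=0.1, seed=0):
    """Hypothetical ant-colony feature selection: returns (best subset, its score)."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_features          # start with no preference
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        iter_subset, iter_score = None, float("-inf")
        for _ in range(n_ants):
            subset = sample_subset(pheromone, subset_size, rng)
            score = fitness(subset)
            if score > iter_score:
                iter_subset, iter_score = subset, score
        # Evaporation lets poor choices fade; the best ant of this
        # iteration reinforces the features it used.
        pheromone = [(1.0 - evaporation) * p for p in pheromone]
        for f in iter_subset:
            pheromone[f] += max(iter_score, 0.0)
        if iter_score > best_score:
            best_subset, best_score = iter_subset, iter_score
    return sorted(best_subset), best_score


# Toy stand-in for model quality (assumption): counts how many of the
# "relevant" descriptors {1, 4} were selected.
RELEVANT = {1, 4}


def toy_fitness(subset):
    return len(RELEVANT & set(subset))


best, score = ant_colony_select(n_features=6, subset_size=2, fitness=toy_fitness)
print(best, score)
```

In a real application the fitness call would train and validate a model on the chosen descriptors, which dominates the cost; the colony itself only maintains one pheromone value per descriptor.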
ACCURATE BATCH PREDICTION OF HAMMETT SIGMA PARAMETERS FOR SUBSTITUENTS AND PKA'S FOR IONIZABLE GROUPS
Guy Desmarquets, R.S. DeWitte, D. Jouravleva, E. Kolovanov; Advanced Chemistry Development, Obernai, 67210 France
Designing experiments with congeners and applying results via QSAR relies on effective selection of members of the congeneric series. In particular, one is interested in spanning the space of relevant properties evenly so that interpretation is not confounded by bias. The classical Hammett-type QSAR parameters provide insight into the electronic impact that a substituent has on the whole, and in particular on the dissociation strength of a nearby ionizable group. Therefore, it is possible that substitutions made in one part of a molecule for steric reasons can unwittingly modulate the hydrogen bonding strength at another part.
The study of biological activity through Quantitative Structure-Activity Relationships has evolved a great deal since the seminal work of Hammett, Taft and their contemporaries. It is now more common for molecular descriptors (often topological, or bit-mapped) to be used as the key representation of chemical structure. Since the provocative studies published by Lipinski in the mid-nineties, whole molecule properties have come to the fore in experimental design, specifically as exclusion rules governing acceptable ranges of solubility, molecular weight and LogD, among others. Applying the resulting correlations can be a significant challenge because the discrete representations may not point directly to vectors of chemical improvement.
Reverting to the original concepts, however, and deriving QSAR correlations on the basis of the slowly varying electronic constants among a family of substituents within a congeneric series, provides a conceptual framework of understanding that should reveal a physical direction for molecular improvement. In particular, one may see that the first principal component that explains the variance in activity is loaded most heavily on σ* at site 1, and negatively on σpara- at site 2; the second component is loaded on the square of molecular weight and LogP; and the third component tracks the pKa of a heterocyclic nitrogen. Presuming that these results stem from a well-designed experiment in the space of these substituent constants, one could hypothesize that increased activity would be achieved by substitutions at site 1 that increase σ*, and at site 2 that decrease σpara-, while leaving molecular weight and LogP unchanged.
A software system and computation procedure are presented which enable computational chemists to automatically compute the substituent constants for all substituents among related compounds, as well as the pKa of specific ionization centers present throughout such a series. This tool should enable QSAR specialists to design and interpret series of compounds with explicit reference to these phenomena. Further, a tool is presented that allows drawing and estimation of substituent constants on an ad hoc basis, so that new compounds for synthesis can be designed whose substituents conform to the hypothesis generated from the QSAR.
If, instead, one is analyzing the results from an initial series of compounds in which there has been little consideration for the experimental design, it is possible to apply these computations together with traditional design-of-experiments statistics to arrive at a subset of compounds to use for analysis. Still more effective is to select appropriate compounds that will augment the existing data to provide a well-designed basis for analysis by Hammett sigma constants and whole molecule properties.
QUANTITATIVE STRUCTURE-ENANTIOSELECTIVITY RELATIONSHIPS (QSER) BASED ON CHIRALITY CODES. APPLICATION TO A COMBINATORIAL LIBRARY OF CATALYTIC ENANTIOSELECTIVE REACTIONS
João Aires-de-Sousa (a), Johann Gasteiger (b); (a) Secção de Química Orgânica Aplicada, Departamento de Química, CQFB, campus Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal; (b) Computer-Chemie-Centrum, Institute for Organic Chemistry, University of Erlangen-Nürnberg, D-91052 Erlangen, Germany
The biological and chemical properties exhibited by opposite enantiomers of chiral compounds are frequently different. This subtle geometrical fact has profound practical consequences in biology, environmental sciences and pharmacology. In particular, the fact that two enantiomers often have different biological activity makes chirality one of the most important factors in drug safety evaluation today. It is estimated that 40% of all dosage-form drug sales in 2000 were of single enantiomers. In 1999, the share was one-third [1]. Methods for enantioselective synthesis are therefore highly desirable, and high throughput methodologies are being used for their development.
The extraction of useful knowledge from the huge amounts of results produced by current technology can only be done by appropriate and efficient computational techniques. QSAR (Quantitative Structure-Activity Relationships) and QSPR (Quantitative Structure-Property Relationships) are well-established fields and play important roles in this endeavor. However, the development of relationships between chiral structures and chiral properties has been extremely rare in the past. One of the main reasons for this fact is certainly the lack of appropriate descriptors of chirality that can be used for the representation of chiral molecular structures, and that can distinguish between enantiomers.
We recently made contributions to that field, developing new chirality descriptors (chirality codes) and successfully applying them to the prediction of enantioselectivity in chemical reactions and chromatography [2,3]. These representations were shown to be particularly adequate as input to neural networks, opening the way to many possible applications of neural networks in the field of stereochemistry.
In an application to enantioselective reactions [2], neural networks established relationships between the chirality codes of catalysts and the major enantiomer produced by the reaction. In a second type of application, the elution order of enantiomers in chromatographic resolutions could be predicted by neural networks on the basis of the chirality codes of the analytes [3].
Now we describe quantitative relationships between the chirality codes of catalysts (and activators) and the enantiomeric excess of the product in a catalytic enantioselective reaction. This is the enantioselective addition of diethyl zinc to benzaldehyde in the presence of a racemic catalyst DB and an enantiopure chiral additive AA [4].
All the combinations of five racemic catalysts DB and 13 enantiopure additives AA were tested (a combinatorial library of 65 reactions), and the results were available in the literature [4]. Each of the 65 entries of the library was represented by molecular descriptors of the corresponding racemic catalyst (DB) and chiral additive (AA). Chiral additives AA were represented by conformation-independent chirality codes. Racemic catalysts DB were used as racemic mixtures, and were therefore represented by absolute values of conformation-independent chirality codes.
After selection of variables (by exclusion of correlated variables and removal of negligible components of the chirality codes), backpropagation neural networks (NNs) were trained using the molecular representations as the input and the enantiomeric excess (e.e.) as the output.
The 65 objects were split into three data sets (A, B, and C), each data set covering the whole range of observed e.e. (-71% to +16%).
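The abstract does not say how the three data sets were constructed. One simple way to make every subset cover the whole range of a property, shown here purely as an assumption, is to rank the objects by that property and deal them out in turn. The function name `split_covering_range` and the evenly spaced placeholder e.e. values are invented for illustration and are not the measured data.

```python
def split_covering_range(values, n_sets=3):
    """Rank objects by value and assign them round-robin to n_sets subsets.

    Returns a list of index lists; each subset receives objects from the
    low, middle and high end of the observed range.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    subsets = [[] for _ in range(n_sets)]
    for rank, idx in enumerate(order):
        subsets[rank % n_sets].append(idx)
    return subsets


# Placeholder e.e. values (assumption): 65 evenly spaced numbers
# between -71 and +16, not the literature results.
ee_values = [-71 + 87 * i / 64 for i in range(65)]
subsets = split_covering_range(ee_values, n_sets=3)

for name, members in zip("ABC", subsets):
    covered = [ee_values[i] for i in members]
    print(name, len(members), round(min(covered), 1), round(max(covered), 1))
```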
The network was trained with data sets A and B, and tested with data set C. Several experiments were performed using different parameters in the chirality codes, different criteria for variable selection, different network architectures and different training parameters. Most experiments yielded accurate predictions of the e.e.’s for the test set (RMS error 5-6% e.e.). The method not only estimates the e.e. of the product but also predicts the preferred enantiomer of the reaction.
In this communication, the chirality codes will be briefly described, and several QSER experiments will be presented and discussed.
[1] S. C. Stinson, Chemical & Engineering News 2001, 79(40), 79-97.
[2] J. Aires-de-Sousa, J. Gasteiger, J. Chem. Inf. Comp. Sci. 2001, 41(2), 369-375.
[3] J. Aires-de-Sousa, J. Gasteiger, J. Molec. Graph. Model. 2002, 20(5), 373-388.
[4] J. Long, K. Ding, Angew. Chem. 2001, 113(3), 562-565; Angew. Chem. Int. Ed. 2001, 40(3), 544-547.
PHARMAEXPERT: CLUSTERING OF CHEMICAL COMPOUNDS WITH REQUIRED BIOLOGICAL ACTIVITY SPECTRA IN LARGE DATABASES
A. A. Lagunin, D. A. Filimonov, V. V. Poroikov; Institute of Biomedical Chemistry RAMS, 119992 Moscow, Russia
Each biologically active compound reveals a wide spectrum of biological actions in biological systems (human organism, experimental animals, in vivo and in vitro assays). It is practically impossible to study each compound in all tests currently available. Therefore, the ability to select compounds with required types of biological activity and without unwanted adverse effects and toxicity is very important when working with large structural databases. The software PASS (Prediction of Activity Spectra for Substances) and PharmaExpert were developed for this purpose. PASS predicts biological activity spectra on the basis of the structural formulae of chemical compounds. The biological activity of compounds is predicted on the basis of structure-activity relationships of known biologically active substances present in the training set (the PASS 1.511 training set includes about 43000 substances). The current version of PASS can predict 783 different types of biological activity, including pharmacological effects, biochemical mechanisms, carcinogenicity, mutagenicity and teratogenicity. The mean prediction accuracy of PASS in leave-one-out cross-validation is about 85%. The PharmaExpert software was developed to cluster chemical compounds with required biological activities and without unwanted effects in large databases, based on the predictions of PASS and a knowledgebase of mechanism-effect relationships. PharmaExpert improves the prediction results of PASS and determines the existing relationships between pharmacological effects and biochemical mechanisms in predicted activity spectra. The current version of PharmaExpert covers 1587 mechanisms of action, 418 pharmacotherapeutical effects and 2664 types of relationships between them. PASS and PharmaExpert process large databases of structures very quickly (see below). Therefore, one may estimate the total chemistry-pharmacology space for new substances from a large database.
Estimated probabilities for different types of biological activity can be used as the parameters for clustering of compounds in large databases and provide criteria for the assessment of biological diversity in addition to the widely used chemical descriptors. Moreover, on the basis of predicted biological spectra we can arrange sub-sets of substances by selecting the required activities and then cluster the substances based on their structural similarity. This allows us to estimate the structural diversity of sub-sets and to observe how the probability of biological activity changes within a cluster of similar substances. The data on 'mechanism-effect' relationships allow us to select substances more intelligently: we can select compounds that may reveal both a particular pharmacological effect and its molecular mechanisms of action. For example, we searched for anxiolytics in the NCI (US National Cancer Institute) database, which contains structures of 250000 substances. The prediction for the compounds of the NCI database takes 1.5 hours on a personal computer with a 1400 MHz Athlon processor and 512 MB of RAM. Our study shows that selecting, with PharmaExpert, the substances that may reveal both an anxiolytic effect and antagonism of 5-HT1A receptors takes several minutes. We found 2867 substances with a probability of the required types of biological activity of more than 30%, 630 substances with a probability of more than 50%, and 8 substances with a probability of more than 90%. Selecting the sub-set of substances with estimated probability values closest to 50% increases the chance of finding New Chemical Entities. Thus, the combination of PASS and PharmaExpert is an effective and flexible tool for exhaustive evaluation and clustering of compounds from large databases of substances in the general chemistry-pharmacology space.
This work was supported by INTAS (grant INTAS 00-0711).
NEW METHODS FOR COMPUTER-AIDED ANALYSIS OF HTS DATA
Marc Zimmermann (1), Matthias Rarey (1), Thorsten Naumann (2), Joannis Apostolakis (1), Thomas Lengauer (3); (1) Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), 53754 Sankt Augustin, Germany; (2) Molecular Modeling, Aventis Pharma Deutschland GmbH, 65926 Frankfurt am Main, Germany; (3) Max-Planck-Institut für Informatik, 66123 Saarbrücken, Germany
We introduce a novel concept for the interpretation of high throughput screening (HTS) data and the visualization of structural classes of active compounds, called a biophore. Biophores are based on the feature tree descriptors [1] and correspond to a fragment-based description of molecules with similar activity and structure. Each fragment in a biophore is further qualified by topology (connectivity to the other fragments of the molecule) and a weight. Weights are derived from the available experimental activities. For the derivation of biophores we developed a novel software tool, HTSview. HTSview not only supports the automatic generation of biophores; it also includes a number of visualization and data analysis capabilities that support the chemist through the process of selecting suitable hits from experimental data to find new lead structures, based on the results from HTS experiments (i.e. coarse binding information and the corresponding molecular structures).
For the generation of biophores we developed new algorithms for the multiple alignment of feature trees. First the molecules are clustered to define activity regions. All molecules within an activity region are combined into a multiple topological template containing the matched fragments. A two dimensional mapping of the matches describes the resulting model and indirectly the topology of the molecules in the activity region. The fragments in the different parts of the model are subsequently weighted by learning methods that correlate occurrence of the fragments with biological activity.
data flow: HTS data -> activity region(s) -> biophore model(s)
The development of a new drug in pharmaceutical research is extremely time-consuming and expensive. With the rapid progress in the fields of combinatorial chemistry and high-throughput screening (HTS), there is an increasing need for interpreting huge amounts of binding data. As an additional complication, the data are in general of poor accuracy, containing significant systematic and statistical errors.
We developed a new software tool called HTSview for HTS data analysis combining cheminformatics methods with data mining techniques.
For the basic concepts of molecular similarity, we used our previously developed Feature Trees descriptor [1]. Feature Trees are fragment-based, non-linear, topological molecular descriptors. Molecules are described by a tree structure representing their major chemical building blocks and the way they are connected. In order to compare two molecules, efficient mapping algorithms are used to generate a matching between groups of molecules with similar steric and chemical properties. We extended the pairwise Feature Trees method to multiple ligand comparisons.
In a preprocessing step, we first reduce and refine the huge databases containing 10^5 molecules (Aventis, in-house). First, all active molecules (defined by an activity threshold) are clustered. We computed the full pairwise similarity matrix with Feature Trees and used Jarvis-Patrick clustering [2], a method suited to handling large data sets. Two molecules belong to the same cluster if the overlap between their neighborhood lists (lists of length k, sorted by increasing distance) is greater than j. k and j are parameters selected by the user (typically k = 14 and j = 8). For each cluster center (the molecule with the maximal sum of similarities to all cluster members), all highly similar inactives (similarity exceeding a threshold) were retrieved from the full database. This selection of molecules represents the activity region(s). In order to separate different structural classes within the actives, the selected molecules are clustered a second time with Jarvis-Patrick clustering combined with hierarchical clustering algorithms.
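The clustering criterion and the definition of the cluster center given above can be sketched as follows. This is a minimal illustration rather than the HTSview implementation: it follows the wording of the text (neighbor-list overlap greater than j), excludes each molecule from its own neighbor list, and omits the additional mutual-neighbor condition found in some formulations of Jarvis-Patrick clustering. The function names and the toy similarity matrix are assumptions.

```python
def jarvis_patrick(similarity, k, j):
    """Cluster objects whose k-nearest-neighbor lists share more than j members.

    `similarity` is a full symmetric matrix (list of lists); higher means closer.
    Returns clusters as sorted lists of indices.
    """
    n = len(similarity)
    neighbors = []
    for i in range(n):
        ranked = sorted((m for m in range(n) if m != i),
                        key=lambda m: -similarity[i][m])
        neighbors.append(set(ranked[:k]))

    parent = list(range(n))          # union-find forest for merging clusters

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if len(neighbors[a] & neighbors[b]) > j:
                parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def cluster_center(similarity, members):
    """Member with the maximal sum of similarities to all cluster members."""
    return max(members, key=lambda m: sum(similarity[m][o] for o in members))


# Toy similarity matrix (assumption): two groups of three molecules.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7, 0.9],
    [0.1, 0.1, 0.1, 0.7, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.9, 0.8, 1.0],
]
clusters = jarvis_patrick(sim, k=2, j=0)
centers = [cluster_center(sim, c) for c in clusters]
print(clusters, centers)
```

The pairwise loop is quadratic in the number of molecules, which matches the text's use of a full similarity matrix; for larger sets one would compare only molecules that appear in each other's neighbor lists.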
Based on the resulting clustering, multiple topological alignments were computed. We developed two new pairwise methods to incrementally build a model of all selected molecules. The first strategy is to assemble a hierarchical model bottom-up from a cluster dendrogram. In the first step, a cluster dendrogram of the selected molecules is created. Starting at the bottom of the dendrogram, the molecules are combined pairwise into a model. At the next hierarchy level, the models are likewise combined pairwise into new models. The second strategy is to incrementally add single molecules to the model (starting with two molecules). This step is iterated. Because the second procedure depends on the order in which molecules are added, we developed heuristic orderings. The main advantage of these incremental methods is that they are much faster than a multiple model generation considering all molecules at once.
In each alignment, the matched fragments of the different molecules were analyzed to determine the common core and the variable regions of the alignment. By aligning a molecule from the database to the model, a vector of similarities is generated by matching molecule fragments to parts of the model. These vectors, combined with the binding affinities or binary labels (active or inactive), were submitted as input to a support vector machine with a linear kernel, which was used to create qualitative or quantitative models of activity. Because the kernel function is linear, the trained weights can be directly associated with fragments. High positive or negative weights indicate fragments that confer or inhibit biological activity, respectively, and can help the chemist to develop new ideas for lead structures.
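The link between a linear kernel and per-fragment weights can be made concrete with a small sketch. It trains a linear classifier by full-batch subgradient descent on the regularized hinge loss, which is the objective a linear SVM optimizes but not necessarily the solver the authors used. The similarity vectors, labels, parameter values, and the names `train_linear_svm` and `predict` are invented for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2*|w|^2 + mean hinge loss by subgradient descent.

    X: list of similarity vectors (one entry per model fragment).
    y: labels, +1 for active and -1 for inactive.
    Returns (weights, bias); weights[j] belongs to fragment j.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:                      # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (active) or -1 (inactive) for one similarity vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical similarity vectors for three model fragments:
# fragment 0 matches well in actives, fragment 1 in inactives,
# fragment 2 matches about equally in both.
X = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.5],
    [0.8, 0.0, 0.4],
    [0.1, 0.9, 0.5],
    [0.0, 1.0, 0.5],
    [0.2, 0.8, 0.6],
]
y = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(X, y)
print([round(wj, 2) for wj in weights], round(bias, 2))
```

Because the decision function is a plain weighted sum, the sign and size of each weight can be read as the contribution of the corresponding fragment, which is what makes coloring fragments by trained weight possible; a non-linear kernel would not allow this direct reading.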
The trained weights can be used to reweight the pairwise comparisons leading to a model-induced similarity matrix. We defined a new pairwise similarity score s(a,b|M): the similarity of the two molecules a, b is computed with respect to the model M by aligning a and b with M. The new score s(a,b|M) can be efficiently computed for the whole matrix (actives versus inactives). For easy use and evaluation of models we developed a graphical representation of models. The Feature Trees of the model are aligned on a hexagonal board. Each field represents an alignment of fragments and can be colored according to the trained weight.
A QSAR test database was assembled from the literature, consisting of 10 diverse inhibitor classes and 545 molecules. The class sizes lie in the range from 20 to 138 molecules. As inactives, about 50,000 molecules from the Maybridge database [3] were chosen. All pair similarities between the inhibitors and the molecules from the Maybridge database were calculated (approximately 30,000,000 comparisons) overnight on a workstation cluster of 10 SUN UltraSPARC computers. We could correctly classify each active from the QSAR set with a nearest-neighbor method. Within a similarity radius of 0.9, only molecules of the same activity were found. As a first test case of the model generation and application, we focused on the class of benzamidine inhibitors. After recomputing the similarity matrix of the QSAR data with a benzamidine model, we could see a reduction of noise and a focusing on the benzamidine class. As a test case for the support vector machine (SVM) analysis, we assembled a test set of thrombin inhibitors with binding affinities [5] and a second, larger set of thrombin inhibitors without affinities. We trained an SVM on the second set and the Maybridge database for classification, and we submitted the first set to an SVM for regression analysis.
The comparison of fragment weights obtained with different machine learning approaches shows that the insight obtained by analysis of the data based on classification or regression is complementary, and therefore both techniques are relevant for the generation of new leads. For example, while classification favours the constant core of the structural class, regression will highlight fragments contributing to the diversity. Further, it is possible to obtain combined models that allow a weighting of the relative relevance of "conserved" versus diverse regions of the structural class.
[1] M. Rarey and J.S. Dixon. Feature trees: A new molecular similarity measure based on tree matching. Journal of Computer-Aided Molecular Design, 1998, 12, 471-490
[2] R.D. Brown and Y.C. Martin. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci. 1996, 36, 572-584
[3] Maybridge Catalog. Maybridge plc, Trevillett, Tintagel, Cornwall PL34 OHW, England, http://www.maybridge.co.uk
[4] D.R. Lowis. HQSAR A New, Highly Predictive QSAR Technique, Tripos, Technical Notes, 1997, 1, 1-7
[5] M. Boehm, J. Stuerzebecher and G. Klebe. Three-dimensional quantitative-activity relationship analyses using comparative molecular field analysis and comparative molecular similarity indices to elucidate selectivity differences of inhibitors binding to trypsin, thrombin and factor Xa. J. Med. Chem. 1999, 42, 458-477