by ZBH

P-2 : Descriptors of Chemical Reactivity and Application to Mutagenicity Prediction

Joao Aires-de-Sousa; Universidade Nova de Lisboa, Caparica, PT
Qing-You Zhang, Universidade Nova de Lisboa

Mutagenicity is strongly related to chemical reactivity, namely to the ability of a compound to be metabolically activated and to react with DNA. [1] Chemical reactivity depends on the properties of chemical bonds, which determine how bonds break and rearrange in the presence of certain reactants, catalysts and conditions.

In this communication we will show our studies with descriptors of molecular reactivity (physicochemical properties of bonds) for the prediction of mutagenicity in Salmonella (Ames assay). Those empirical descriptors are easily calculated from the molecular structure and can be quickly generated for large data sets of compounds.

In order to use the information concerning several properties of bonds for an entire molecule, and at the same time to keep its representation within a reasonable fixed length, all the bonds of a molecule are mapped into a fixed-length 2D self-organizing map.

A self-organizing map (SOM) is trained beforehand with a diversity of bonds from different structures (each bond described by seven bond properties calculated by PETRA [2]). Then all the bonds of one molecule are submitted to the trained SOM, and the pattern of activated neurons is interpreted as a map of the reactivity features of that molecule (MOLMAP) – a fingerprint of the bonds available in that structure.
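
The mapping step can be illustrated with a minimal sketch. It assumes that the SOM is already trained, that each bond activates only its best-matching neuron, and that the MOLMAP value of a neuron is the number of bonds that hit it. The abstract only speaks of a "pattern of activated neurons", so the counting scheme, the function names and the toy 2x2 map with two properties per bond are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def best_matching_unit(som: List[List[Vector]], bond: Vector) -> Tuple[int, int]:
    """Return (row, col) of the neuron whose weight vector is closest to the bond."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(som):
        for c, weights in enumerate(row):
            # Squared Euclidean distance between neuron weights and bond properties.
            dist = sum((w - x) ** 2 for w, x in zip(weights, bond))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def molmap(som: List[List[Vector]], bonds: List[Vector]) -> List[int]:
    """Flattened map with one value per neuron: how many bonds activated it (assumed scheme)."""
    cols = len(som[0])
    counts = [0] * (len(som) * cols)
    for bond in bonds:
        r, c = best_matching_unit(som, bond)
        counts[r * cols + c] += 1
    return counts


# Toy 2x2 SOM with two properties per bond (the abstract uses seven PETRA properties).
toy_som = [[(0.0, 0.0), (0.0, 1.0)],
           [(1.0, 0.0), (1.0, 1.0)]]
toy_bonds = [(0.1, 0.1), (0.9, 0.8), (0.2, 0.0), (0.1, 0.9)]
fingerprint = molmap(toy_som, toy_bonds)
```

Whatever the number of bonds in the molecule, the resulting vector has one entry per neuron, which is what keeps the representation at a fixed length.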

MOLMAP descriptors were generated for 548 compounds, and were complemented with 17 general molecular descriptors such as the molecular weight, maximum charge, or ring strain energy. On their basis, a random forest established a predictive model for mutagenicity. Learning in a random forest results from training an ensemble of classification trees. [3] Each tree is grown with a random subset of descriptors and a random subset of objects. The final prediction is obtained by majority voting. Random forests additionally associate a probability to every prediction, and report the importance of each descriptor in the global model.
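
As a rough illustration of the voting step, the sketch below takes the class predicted by each tree and returns the majority class together with the fraction of trees that voted for it. Reading that fraction as the probability associated with the prediction is an assumption made here, and the function name and labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def forest_predict(tree_votes: List[str]) -> Tuple[str, float]:
    """Majority vote over the trees, plus the share of trees supporting the winner."""
    counts = Counter(tree_votes)
    # most_common(1) returns the label with the highest count
    # (ties are resolved in favour of the label seen first).
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(tree_votes)


votes = ["mutagen", "non-mutagen", "mutagen", "mutagen", "non-mutagen"]
label, probability = forest_predict(votes)
```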

We used data from the Berkeley Carcinogenic Potency Database [4] consisting of SMILES strings and the corresponding outcome of the Ames test. [5] After excluding inorganic and organometallic compounds, salts, duplicates, and structures not accepted by PETRA 3.11, [2] the remaining 548 structures were partitioned into a training and a test set with 445 and 103 objects respectively. Correct predictions were achieved for 81-84% of the independent test set, and an internal cross-validation error of 22% was obtained for the training set (out-of-bag estimation). These results compare well with the experimental interlaboratory reproducibility error of ca. 15% usually associated with the Ames assay. [6]

Inspection of the results reveals that the MOLMAP descriptors do not simply correspond to a code of structural fragments.  The model has some ability to base predictions for unknown functional groups on the detection of reactivity sites.


  1. For a review of QSAR for predicting mutagenicity see: G. Patlewicz; R. Rodford; J. D. Walker. Environ. Toxicol. Chem. 2003, 22, 1885-1893.
  3. V. Svetnik; A. Liaw; C. Tong; J. C. Culberson; R. P. Sheridan; B. P. Feuston. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958.
  5. Downloaded from . Version 15Oct03.
  6. J. Kazius; R. McGuire; R. Bursi. J. Med. Chem. 2005, 48 (1), 312-320.



P-4 : Calculation of Interaction Energies Between DNA and Fluorescent Materials by Using Molecular Orbital Calculations

Mitsuyo Aota; Yamaguchi University, Ube, JP
Kenzi Hori, Yamaguchi University

The interaction between DNA and fluorescent materials (FMs) has been investigated for a long time. These studies usually use molecular dynamics (MD) or Monte Carlo (MC) simulations with empirical force fields. Because simulation programs such as Amber, CHARMM and BOSS have good parameters for DNA, they have succeeded in reproducing the dynamic features of DNA-FM complexes. However, new parameters must be derived for every FM that is designed for a specific DNA sequence. Deriving potential parameters that reproduce reasonable interaction energies requires an enormous number of trial calculations for different combinations of DNA and FM, and is very difficult in practice. Molecular orbital (MO) calculations require no such parameterization. In the present study, we adopted MO calculations to investigate interactions between FMs and eight DNA hexamers with the sequences AAAAAA, TTTTTT, AAATTT, TTTAAA, ATATAT, TATATA, GGGGGG and CCCCCC. The specific interactions between DNA and the FMs Hoechst33342, Hoechst33258, DB183, DB210 and Netropsin were investigated. Docking of the FMs into the minor groove of DNA was carried out using the BioMed CAChe program. This program was also used for optimizing the geometries of the DNA complexes at the PM3 level of theory. Interaction energies at the RHF/6-31G//PM3 level of theory were also calculated and compared with those of the PM3 calculations.



P-6 : Development of the Total System ToMoCo for 3D-QSAR and Molecular Design

Masamoto Arakawa; University of Tokyo, Bunkyo-ku, JP
Kimito Funatsu, University of Tokyo

In our laboratory, methodologies for quantitative structure-activity relationships (QSAR) and molecular design are investigated, and several related software packages have been developed. Recently, we started developing the total system ToMoCo by integrating these packages. With this system, various analyses can be run from a common user interface and their results can be easily interpreted with computer graphics. ToMoCo includes useful functions such as molecular alignment using a Hopfield neural network, QSAR by CoMFA, region selection by a genetic algorithm (GA) in CoMFA, and automatic generation of drug-like structures under the restriction of a QSAR model. At this conference, we will introduce these functions of ToMoCo and some applications.



P-8 : Incorporating Conformational Flexibility into QSAR: Validation of a Novel Alignment-Independent 4D-QSAR Technique

Knut Baumann; University of Wuerzburg, Wuerzburg, DE
J. Scheiber, University of Wuerzburg
N. Stiefl, Eli Lilly

A novel molecular descriptor called xMaP (extended MaP descriptor) is introduced and validated. The descriptor is the 4D extension of the previously published alignment-independent MaP descriptor (Mapping Property distributions onto the molecular surface) [1]. Unlike MaP, xMaP is to a great extent independent of the chosen starting conformation of the encoded molecules. This is achieved by using ensembles of conformers which are generated with molecular dynamics simulations or by conformational searches. This step of the procedure is similar to Hopfinger’s 4D-QSAR [2].

A five-step procedure is used to compute the xMaP descriptor. First, the conformer ensemble for each molecule is generated. Second, the molecular surface is computed for each conformer. Third, molecular properties are projected onto this surface and assigned to one of the following property categories: H-bond acceptor/donor, hydrophilic, weakly/strongly hydrophobic. Fourth, areas of identical properties are merged into surface patches. Finally, the distribution of the patches, which represents surface area size and surface properties, is converted into an alignment-independent descriptor based on potential two-point pharmacophores. This last step uses the information of the entire conformer ensemble.

To systematically study the influence of several important operational parameters, the novel descriptor was applied to several data sets. The results were compared to the original MaP procedure [1] and to 4D-QSAR [2]. It turns out that xMaP is more robust than MaP. In addition, it is an alignment-independent descriptor, as opposed to Hopfinger’s 4D-QSAR. The results, expressed as the average over many test set predictions (R2Test), are quite satisfactory and range between 0.5 and 0.7. Although a huge amount of structural information is encoded, the novel descriptor remains interpretable. Data processing for the interpretation step is challenging and various strategies for this purpose will be presented.

  1. N. Stiefl, K. Baumann, J. Med. Chem. 2003, 46, 1390-1407.
  2. A.J. Hopfinger, S. Wang, J.S. Tokarski, B. Jin, M. Albuquerque, P.J. Madhav, C. Duraiswami. J. Am. Chem. Soc. 1997, 119, 10509-24.



P-10 : Structure-Based Predictions of 1H NMR Chemical Shifts and Coupling Constants Using Associative Neural Networks

Yuri Binev; Bulgarian Academy of Sciences/Universidade Nova de Lisboa, Caparica, PT
João Aires-de-Sousa, Universidade Nova de Lisboa

Fast and accurate predictions of 1H NMR spectra of organic compounds are highly desired, particularly for automatic structure elucidation or validation. The large numbers of compounds prepared in parallel syntheses need to be analysed, and the structures of the products need to be verified. 1H NMR plays an important role in this endeavour, and the simulation of spectra for comparison with the experimental spectra is of high interest.

The SPINUS program has been developed for the prediction of 1H-NMR chemical shifts from the molecular structure. It is based on ensembles of Feed-Forward Neural Networks (FFNN), which were trained using a series of empirical proton descriptors (physicochemical, geometrical and topological). [1] The FFNNs were incorporated into Associative Neural Networks (ASNN). [2] An ASNN corrects a prediction obtained by the FFNNs with the observed errors for the k nearest neighbours in an additional memory. The additional memory consists of a list of protons and the corresponding experimental chemical shifts. The search for the k nearest neighbours is performed in the output space, i.e. they are the k protons with the most similar profile of outputs (each output comes from a different FFNN of the ensemble).
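
The correction step can be sketched as follows. The sketch assumes that the uncorrected prediction is the mean of the ensemble outputs, that similarity between output profiles is measured by Euclidean distance, and that the correction is the plain average of the errors observed for the k nearest memory entries. These choices, like all names and numbers in the example, are illustrative assumptions; the abstract does not specify the similarity measure or the averaging scheme.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Profile = Sequence[float]  # one output per FFNN of the ensemble


def euclidean(a: Profile, b: Profile) -> float:
    """Distance between two output profiles (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def asnn_predict(profile: Profile,
                 memory: List[Tuple[Profile, float]],
                 k: int = 2) -> float:
    """Ensemble mean corrected by the mean error of the k nearest memory protons."""
    ensemble_mean = mean(profile)
    # The neighbours are searched in the output space, not in descriptor space.
    neighbours = sorted(memory, key=lambda item: euclidean(item[0], profile))[:k]
    # Error of the ensemble for each neighbour: experimental shift minus ensemble mean.
    errors = [shift - mean(stored) for stored, shift in neighbours]
    return ensemble_mean + mean(errors)


# Toy memory: (output profile of the ensemble, experimental shift in ppm).
toy_memory = [
    ([2.0, 2.2, 2.1], 2.4),
    ([2.1, 2.2, 2.0], 2.3),
    ([7.0, 7.1, 7.2], 7.0),
]
predicted = asnn_predict([2.0, 2.2, 2.1], toy_memory, k=2)
```

Because the memory is only a lookup list, new experimental shifts can be added to it without retraining the networks.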

In this poster we evaluate a procedure to estimate coupling constants with the ASNN previously trained for chemical shifts. Now a memory of coupled protons and the corresponding coupling constants is built. The output profiles from the ASNNs are used for the prediction of coupling constants. To obtain a prediction for the coupling constant between two protons, the output profile is obtained for both protons, and the memory of coupling constants is searched to find the most similar pair of coupled protons. The prediction is based on the experimental values found.

The web-based tool for predicting 1H NMR chemical shifts and coupling constants and for simulating spectra will be presented.

ACKNOWLEDGEMENTS. Y.B. acknowledges Fundação para a Ciência e Tecnologia (Lisbon, Portugal) for financial support under a postdoctoral grant (SFRH/BPD/7162/2001).


  1. Y. Binev, J. Aires-de-Sousa, "Structure-Based Predictions of 1H NMR Chemical Shifts Using Feed-Forward Neural Networks", J. Chem. Inf. Comp. Sci. 2004, 44, 940-945.
  2. Y. Binev, M. Corvo, J. Aires-de-Sousa, "The Impact of Available Experimental Data on the Prediction of 1H NMR Chemical Shifts by Neural Networks", J. Chem. Inf. Comp. Sci. 2004, 44, 946-949.



P-12 : Optimising the Effectiveness of Similarity Measures Based on Reduced Graphs

Kristian Birchall; University of Sheffield, Sheffield, GB
Val Gillet, Stephen Pickett, and Gavin Harper, University of Sheffield

Similarity searching is widely used in an attempt to identify molecules that exhibit biological activity similar to a query molecule. Traditional approaches to similarity searching that are based on 2D descriptors tend to identify compounds that are structural analogues of the query. However, functional similarity is not limited to structural similarity, and, consequently, there is considerable interest in developing descriptors to identify compounds that share biological activity yet belong to different chemical series. Reduced Graphs (RGs) are one such descriptor. RGs summarise a molecular graph by grouping atoms into nodes based on properties that are likely to be important for bioactivity (H-bond donors/acceptors, aromatic rings etc.). Thus, they emphasise the properties of molecules in a type of topological pharmacophore. In previous work, RGs have been mapped to fingerprints and similarity has been quantified using the Tanimoto coefficient applied to the fingerprints (Gillet et al. 2003, Barker et al. 2003). The RGs were found to increase the diversity of actives retrieved in similarity searches compared to using more conventional descriptors such as Daylight fingerprints. Their performance has been improved further by combining fingerprint similarity with edit-distance similarity, in a method devised by Harper et al. (2004).

The edit-distance approach to quantifying the similarity between two RGs involves extracting linear paths of nodes from the RGs and finding the minimum cost required to transform the paths from one molecule to those derived from the other, as determined using dynamic programming. There are three basic types of transformation: insertion, deletion and substitution of nodes. Each transformation can be assigned a different cost according to the perceived severity of the operation with regard to the specific node types involved. For example, transforming an acyclic joint donor/acceptor node to an acyclic donor node may be given a low cost to signify the functional similarity between the two nodes, whereas transforming the same node to an aromatic ring node may be given a higher cost to reflect the lower functional similarity of the nodes. The edit-distance similarity measure thus copes naturally with insertions, deletions and substitutions when comparing two RGs, something which the fingerprint methods are not able to achieve.
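
A minimal sketch of the dynamic-programming step for a single pair of node paths is given below. The node labels ("Ar", "L", "DA", "D") and the cost values are hypothetical and only mimic the example in the text; they are not the penalties of the published method, and the handling of multiple paths per molecule is left out.

```python
from typing import Callable, List


def edit_distance(path_a: List[str],
                  path_b: List[str],
                  sub_cost: Callable[[str, str], float],
                  indel_cost: Callable[[str], float]) -> float:
    """Minimum total cost to transform path_a into path_b (weighted edit distance)."""
    n, m = len(path_a), len(path_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # delete every node of path_a
        d[i][0] = d[i - 1][0] + indel_cost(path_a[i - 1])
    for j in range(1, m + 1):            # insert every node of path_b
        d[0][j] = d[0][j - 1] + indel_cost(path_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(path_a[i - 1]),                  # deletion
                d[i][j - 1] + indel_cost(path_b[j - 1]),                  # insertion
                d[i - 1][j - 1] + sub_cost(path_a[i - 1], path_b[j - 1]),  # substitution
            )
    return d[n][m]


def toy_sub_cost(a: str, b: str) -> float:
    """Hypothetical penalties: cheap donor/acceptor <-> donor swap, costly otherwise."""
    if a == b:
        return 0.0
    if {a, b} == {"DA", "D"}:
        return 0.2
    return 1.0


def toy_indel_cost(node: str) -> float:
    return 1.0


query = ["Ar", "L", "DA"]
close = edit_distance(query, ["Ar", "L", "D"], toy_sub_cost, toy_indel_cost)
far = edit_distance(query, ["Ar", "L", "Ar"], toy_sub_cost, toy_indel_cost)
shorter = edit_distance(query, ["Ar", "DA"], toy_sub_cost, toy_indel_cost)
```

Passing the costs in as functions keeps the penalty scheme separate from the alignment, so a different set of penalties can be tried without touching the dynamic-programming code.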

In the published edit-distance method, Harper et al (2004) assigned the penalties based on intuition. Here, we describe the use of a Genetic Algorithm to evolve penalties that maximise the enrichments found in similarity searches. Results are presented for several different activity classes taken from the MDDR and they show significant improvement over the original penalties. The derived penalties also offer insights into the relative importance of features in active molecules and could provide suggestions for possible replacements of groups in the design of novel compounds.

  1. Gillet, V.J., Willett, P., Bradshaw, J. (2003) Similarity Searching Using Reduced Graphs. Journal of Chemical Information and Computer Sciences 43, 338-345.
  2. Barker, E., Gardiner, E., Gillet, V.J., Kitts, P., Morris, J. (2003) Further Development of Reduced Graphs for Identifying Bioactive Compounds, Journal of Chemical Information and Computer Sciences 43, 346-356.
  3. Harper, G., Bravi, G.S., Pickett, S.D., Hussain, J., Green, D.V.S. (2004) The Reduced Graph Descriptor in Virtual Screening and Data-Driven Clustering of High-Throughput Screening Data, Journal of Chemical Information and Computer Sciences, 44, 2145-2156.



P-14 : Generation of a Focussed Set of GSK Compounds Biased Towards Ligand-Gated Ion Channel Ligands

Anna Maria Capelli; GlaxoSmithKline, Verona, IT
Aldo Feriani, Giovanna Tedesco, and Alfonso Pozzan, GlaxoSmithKline

Several “datamining” methodologies have recently been developed to bias compound selection and library design for generic therapeutic targets (i.e. antimicrobials, anticancer agents etc.) in order to improve the effectiveness of high-throughput screening in the discovery of novel leads [1]. Among them, substructural analysis has been reported as a methodology that allows the identification of structure-activity relationships in large and disparate data sets, characterised by qualitative and quantitative activity [2]. A “datamining” methodology based on substructural analysis and standard 1024-bit Daylight fingerprints as descriptors, previously applied successfully both to antibacterials [3] and 7TM ligands [4], was applied to a set of known antagonists of a sub-family of ligand-gated ion channels comprising nAChRs, 5-HT3, GABAA and GlyR receptors [5]. The derived scoring function was used to generate a focussed set that was screened for alpha7 nAChR, resulting in the identification of novel and chemically tractable alpha7 ligands. Finally, the same scoring function was applied retrospectively to other in-house sets screened for the same target in the same assay. Results and performance of the method are presented in detail.

  1. a) Gillet V.J., Willett P., Bradshaw J., J. Chem. Inform. Com. Sci., 1998, 38, 165-179; b) Ajay, Walters W.P., Murcko M.A., J. Med. Chem., 1998, 41, 3314-3324; c) Sadowski J., Kubinyi H., J. Med. Chem., 1998, 41, 3325-3329; d) Ghose A.K., Viswanadhan V.N., Wendelowski J.J., J. Comb. Chem., 1999, 1, 55-67; e) Harper G., Bradshaw J., Gittins J.C., Green D.V., Leach A.R., J. Chem. Inf. Comput. Sci., 2001, 41, 1295-1300.
  2. a) Hert J., Willett P., Wilton D.J., J. Chem. Inf. Comput. Sci., 2004, 44, 1177-85; b) Ormerod A. et al., Quant. Struct.-Act. Relat., 1989, 8, 115; c) Ormerod A. et al., Quant. Struct.-Act. Relat., 1990, 9, 302.
  3. Feriani A., Pozzan A., Capelli A. and Tedesco G., XVIth International Symposium on Medicinal Chemistry,  September 18-22, 2000, Bologna, Italy , P19.
  4. Tedesco G., Feriani A., Pozzan A., Capelli A.M., EuroMUG2002, September 24-26 2002, Cambridge, UK
  5. Le Novere N., Changeux, J.-P., Nucleic Acids Research, 1999, 27(1), 340-2. See also



P-16 : QSPR Study of Melting Point and Density of Imidazolium Ionic Liquids

Gonçalo Carrera; Universidade Nova de Lisboa, Caparica, PT
Carlos M. Afonso, Universidade Técnica de Lisboa
João Aires-de-Sousa, Universidade Nova de Lisboa

Ionic liquids (ILs) are salts with melting points near room temperature. Their negligible vapour pressures allow for their potential use as environmentally friendly substitutes for volatile organic solvents [1]. A judicious choice of anion and cation makes it possible to obtain ILs with physical and chemical properties fitted to a specific problem [2]. The first decisive property is the melting point.

Others [3,4] and we [5] have reported QSPR analyses of the melting point of ionic liquids. These works have considered datasets of bromide salts. Here we present QSPR models of density and melting point, which are based on both cationic and anionic descriptors accounting for the diversity of both the cation and the anion of the salts. Random Forests [6] were used for regressions using a pool of nearly 300 descriptors.

For modeling the melting point we used a dataset of 235 imidazolium salts with mp between -88 and 370 °C, including six different anions – BF4-, Cl-, Br-, PF6-, CF3SO3-, and NTf2-. The dataset was divided into a training set with 155 objects and a test set with 80 objects. For the QSPR study of density, a dataset of 106 imidazolium salts was collected from the literature, with density ranging from 0.96 to 2.80, and covering five families of anions – BF4-, PF6-, NTf2-, CF3SO3-, CF3CO2- and CH3CH(OH)CO2-. This dataset was partitioned into a 73-object training set and a 33-object test set.

Three types of cationic descriptors were used based on 3D molecular structures generated by CORINA [7]: radial distribution function vector, surface spatial autocorrelation function vector and a set of empirical general molecular descriptors such as the molecular weight, maximum charge, or polarizabilities. Several descriptors were defined for the anion: binary descriptors each encoding a specific anion, descriptors based on the molecular weight of the anion, and descriptors based on the miscibility of an anionic family in different solvents. The miscibility descriptors assume that miscibility is mainly determined by the anion of the IL [8]; these descriptors are calculated from the proportion of salts belonging to a specific anionic family that are miscible in a certain solvent.
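
The miscibility descriptor described above amounts to a simple proportion, which can be sketched as follows. The record layout, the function name and the toy observations are assumptions made for illustration; the values are invented and are not experimental data.

```python
from typing import List, Tuple

# One record per salt and solvent: (anion family, solvent, miscible?)
Observation = Tuple[str, str, bool]


def miscibility_descriptor(observations: List[Observation],
                           anion: str,
                           solvent: str) -> float:
    """Proportion of salts of one anionic family that are miscible in one solvent."""
    outcomes = [miscible for a, s, miscible in observations
                if a == anion and s == solvent]
    if not outcomes:
        raise ValueError(f"no observations for {anion} in {solvent}")
    return sum(outcomes) / len(outcomes)


# Invented toy observations, for illustration only.
toy_observations = [
    ("BF4-", "water", True),
    ("BF4-", "water", True),
    ("BF4-", "water", False),
    ("BF4-", "water", True),
    ("PF6-", "water", False),
    ("PF6-", "water", False),
]
bf4_water = miscibility_descriptor(toy_observations, "BF4-", "water")
pf6_water = miscibility_descriptor(toy_observations, "PF6-", "water")
```

Every salt sharing an anion receives the same descriptor value for a given solvent, consistent with the stated assumption that miscibility is mainly determined by the anion.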

For the melting point, good correlations were obtained between the experimental and the predicted values for the test set (r2 = 0.80, RMSE = 34 °C). Excellent predictions were obtained for the density (r2 = 0.98 for the test set).

ACKNOWLEDGEMENTS. G.C. acknowledges Fundação para a Ciência e Tecnologia (Lisbon, Portugal) for financial support under a PhD grant (SFRH/BD/18354/2004).


  1. J. S. Wilkes; Green Chemistry; 2002; 4; 73.
  2. R. Sheldon; Chem. Commun.; 2001; 2399.
  3. A. R. Katritzky, A. Lomaka, R. Petrukhin, R. Jain, M. Karelson, A. E. Visser, R. D. Rogers; J. Chem. Inf. Comput. Sci.; 2002; 42; 71.
  4. D. M. Eike, J. F. Brennecke, E. J. Maginn; Green Chemistry; 2003; 5; 323.
  5. G. Carrera, J. Aires-de-Sousa; Green Chemistry; 2005; 7; 20.
  6. L. Breiman.; Machine Learning; 2001; 45; 5.
  7. J. Gasteiger, C. Rudolph, J. Sadowski; Tetrahedron Comput. Methodol.; 1992; 3; 537.
  8. C. Chiappe, D. Pieraccini; Journal of Physical Organic Chemistry; (early view).



P-18 : New Descriptors from Energy Decomposition in Semiempirical Level

Alexandre Carvalho; Universidade do Porto, Porto, PT
André Melo, Universidade do Porto

In this work, we used the partition method introduced by Carvalho and Melo [1], which enables the decomposition of the stabilization energies of molecular association processes into physically meaningful components (conformational rearrangement, non-bonding, bonding and polarization plus charge transfer). This partition scheme has been developed within a semi-empirical formalism, which enables a complete separability of the above-mentioned components. We have studied the complex between Cucurbita maxima trypsin inhibitor (CMTI-I) and glycerol. Every residue was considered a fragment. This computational procedure makes it possible to evaluate the range of the perturbation caused by the association process and the energetic contribution from each residue. The results allow us to conclude that the present decomposition scheme can be used for understanding cohesive phenomena and produces a new set of descriptors.

  1. Alexandre R. F. Carvalho, André Melo, “Energy partitioning in association processes”, Int. J. Quantum Chem. (2005).



P-20 : Quantitative Analysis by Spectral Data Transformation in Multivalued Fingerprints and Multivariate Calibration

Gonzalo Cerruela; University of Córdoba, Córdoba, ES
Manuel Urbano Cuadrado, María Dolores Luque de Castro, and Miguel Ángel Gómez-Nieto, University of Córdoba

Spectroscopic techniques employing multichannel detection have become a powerful tool for the characterization of materials. A number of qualitative and quantitative approaches based on the collection of the spectrum (a large data set) in a short time and the subsequent multivariate treatment have been developed with the aim of replacing time-consuming and expensive methods. The research carried out in this study is an attempt to improve multivariate calibration by means of a new chemometric technique for the quantitative prediction of sample properties, based on preprocessing the spectral data and transforming them into multivalued fingerprints. The transformation of a spectrum into a multivalued fingerprint involves the following steps:

  1. Normalization of the spectral data by standard, logarithmic and maximum methods in order to transform the absorbance matrix into a new data set within the [0,1] range.
  2. Selection of n-1 threshold values taking into account the maximum-minimum range, where n is the number of cases per variable.
  3. Assignment of a given case to each variable if the normalized value surpasses the threshold value.

Different multivalued transformations of the input spectra (i.e. binary, ternary, quaternary and quinary valued transformations) have been studied and the results compared with each other.
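
The three steps above can be sketched as follows. The sketch assumes that maximum normalization means dividing each non-negative absorbance by the largest value of the spectrum, that the n-1 thresholds are evenly spaced over the [0,1] range, and that the case assigned to a variable is the number of thresholds its normalized value surpasses. The abstract does not fix these details, so the function names and the toy spectrum are illustrative only.

```python
from typing import List


def normalize_max(spectrum: List[float]) -> List[float]:
    """Step 1 (assumed form): scale non-negative absorbances into [0, 1] by the maximum."""
    peak = max(spectrum)
    return [value / peak for value in spectrum]


def thresholds(n_cases: int) -> List[float]:
    """Step 2 (assumed form): n-1 evenly spaced thresholds over the [0, 1] range."""
    return [i / n_cases for i in range(1, n_cases)]


def to_fingerprint(spectrum: List[float], n_cases: int) -> List[int]:
    """Step 3: the case of each variable is the number of thresholds it surpasses."""
    cuts = thresholds(n_cases)
    return [sum(value > cut for cut in cuts) for value in normalize_max(spectrum)]


toy_spectrum = [0.0, 0.5, 1.0, 1.5, 2.0]
binary = to_fingerprint(toy_spectrum, 2)    # cases 0 and 1
ternary = to_fingerprint(toy_spectrum, 3)   # cases 0, 1 and 2
```

With this reading, a binary transformation uses one threshold and a ternary transformation two, so each variable of the discrete spectrum takes one of n integer values.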

The work presented here deals with the study of the statistical parameters of PLSR equations built with discrete spectra as the data matrix, aimed at enlarging spectral differences. The parameters employed for this study were the Determination Coefficient (R2), Standard Error in Cross Validation (SECV), bias and slope. The results were compared with those obtained by the authors without the proposed preprocessing (in "Comparison and Joint Use of Near Infrared Spectroscopy and Fourier Transform Mid Infrared Spectroscopy for the Determination of Wine Parameters", Talanta, accepted for publication, available on-line).

The data employed corresponded to the 3000-800 cm-1 absorbance spectra of 136 samples of wine. Each spectrum consisted of 1142 predictor variables. The properties studied were total acidity and the content of reducing sugars, which were determined by titration to obtain the reference values. Model fitting was carried out by cross validation: six series in which the training and validation sets were composed of 116 and 20 samples, respectively, in such a way that all the samples were used for validation.

The maximum normalization gave the best statistical parameters. Regarding the dimension, the ternary spectra led to the best determinations. All the statistical parameters were improved using the proposed preprocessing with the exception of R2 for total acidity. Bias was the most improved parameter, close to zero for the two properties. The software employed for spectra normalization and for building the discrete spectra was developed by the authors in the C programming language. The Unscrambler 7.8 (Camo Process AS, Oslo, Norway) was used for developing PLSR equations.



P-22 : Study and Display of the Effect of Structural Similarity Approach in the Screening of Chemical Databases

Gonzalo Cerruela; University of Córdoba, Córdoba, ES
Manuel Urbano Cuadrado, Miguel Ángel Gómez-Nieto, and Irene Luque Ruiz, University of Córdoba

It is well known and widely accepted by the scientific community that similar structures have similar properties. Molecular similarity can be assessed in conceptually different ways, involving a variety of algorithms, metrics and high-level descriptions of molecular structure, properties and conformation. Usually, molecules are represented by means of molecular graphs, from which a structural similarity measure can be obtained. This measure is generally based on the set of common subgraphs (Maximum Common Edge Subgraphs, MCES) and on applying one of the well-known similarity metrics (Tanimoto, Cosine, Simpson, etc.), which consider the sizes (nodes and edges) of the molecular graphs that are compared and the size of the common subgraphs.
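
For reference, the three indexes named above can be written in terms of graph sizes as in the sketch below, where a and b are the sizes (nodes plus edges) of the two molecular graphs and c is the size of their common subgraph. These are the standard set-based definitions; the function names and the example sizes are illustrative and are not taken from the study.

```python
import math


def tanimoto(a: int, b: int, c: int) -> float:
    """Common size divided by the size of the union of both graphs."""
    return c / (a + b - c)


def cosine(a: int, b: int, c: int) -> float:
    """Common size divided by the geometric mean of both graph sizes."""
    return c / math.sqrt(a * b)


def simpson(a: int, b: int, c: int) -> float:
    """Common size divided by the size of the smaller graph."""
    return c / min(a, b)


# Example: graphs of size 20 and 30 sharing a common subgraph of size 15.
size_a, size_b, size_common = 20, 30, 15
scores = {
    "tanimoto": tanimoto(size_a, size_b, size_common),
    "cosine": cosine(size_a, size_b, size_common),
    "simpson": simpson(size_a, size_b, size_common),
}
```

For the same pair of graphs the three indexes give clearly different values (about 0.43, 0.61 and 0.75 here), which is why a fixed similarity threshold retrieves different sets of molecules depending on the index.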

Structural similarity measures have been applied to the prediction of properties, clustering, and the screening of chemical databases. However, when structural similarity is used in screening, the size of the molecules and the similarity index used considerably affect the number of molecules recovered for a given similarity threshold.

In this work we present a study of the behavior of some structural similarity metrics as a function of the characteristics of the molecules and of the approach or algorithm used to calculate the structural similarity. We analyse the calculation of structural similarity based on the MCS (Maximum Common Subgraph) compared with the MCES (generally used), and we observe the dependence that exists among the different similarity indexes as a function of parameters such as graph size (number of nodes and edges), size of the common subgraph, and the relationship between the sizes of the MCS and MCES.

The results provide information on the similarity thresholds that should be used when screening chemical databases, as a function of the similarity index and the structural characteristics of the set of recovered molecules. The relationships obtained allow us to establish the maximum and minimum threshold values for different similarity indexes in the screening process so as to a) recover only molecules that contain a complete substructure equal to the search criterion, b) recover molecules that contain substructures that exist partially (as connected or non-connected to each other), and c) delimit the size of, or the relationship between, the MCS and MCES of the search criterion structure and the recovered molecules.



P-24 : Generation of Multiple Pharmacophore Hypotheses Using a Multiobjective Genetic Algorithm

Simon Cottrell; University of Sheffield, Sheffield, GB
Valerie J. Gillet, University of Sheffield
Robin Taylor, Cambridge Crystallographic Data Centre

Pharmacophore elucidation involves identifying a three-dimensional arrangement of features that is common to a set of ligands with the same biological activity.  This normally involves superimposing the ligands such that functional groups relevant to biological activity are overlaid.  Since most drug-like molecules are conformationally flexible, the conformational space of each ligand must be searched.

Existing methods for pharmacophore elucidation generally fall into one of two categories.  Firstly, there are methods that attempt to exhaustively generate all pharmacophore hypotheses from pre-generated sets of conformers.  Secondly, there are methods that return a single solution that optimises a scoring function, which is usually an arbitrary combination of several objectives.  These methods usually search the conformational space dynamically during the overlay, so as to find the set of conformations that produces the best overlay. The former methods generally return a large number of solutions which take considerable effort to analyse and are highly dependent on the method used to generate the conformers, whilst the single solution returned by the latter methods implies an unrealistic degree of certainty in the result.

This work has involved applying a multiobjective genetic algorithm (MOGA) to the pharmacophore elucidation problem [1].  The MOGA evaluates solutions quantitatively; however, it does not score solutions using a single fitness function but considers each objective independently, according to the principles of Pareto dominance.  Three objectives are considered in evaluating hypotheses, namely the closeness of the alignment of features in the different ligands, the volume overlap of the ligands and the internal energy of the ligands. The program generates several different hypotheses which represent different, but equally valid compromises between the objectives.
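
Pareto dominance over the three objectives can be illustrated with a short sketch. It assumes that every objective is expressed so that lower is better, which here means the feature alignment is given as a deviation, the volume overlap is negated, and the internal energy is used directly. The tuple layout and the toy values are hypothetical; the sketch shows only the dominance test and not the genetic algorithm itself.

```python
from typing import List, Tuple

# (feature deviation, negated volume overlap, internal energy): all minimised.
Objectives = Tuple[float, float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(solutions: List[Objectives]) -> List[Objectives]:
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


toy_hypotheses = [
    (0.5, -0.8, 10.0),   # good all-round compromise
    (0.7, -0.9, 12.0),   # best volume overlap
    (0.8, -0.7, 15.0),   # worse than the first hypothesis in every objective
    (0.4, -0.6, 20.0),   # best feature alignment, highest energy
]
front = pareto_front(toy_hypotheses)
```

No weights are needed to combine the objectives: the third hypothesis is discarded because the first one beats it on all three, while the other three survive as different compromises.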

An important aim of this work has been to generate a set of solutions that are diverse from a biochemical point of view. Ensuring a diverse range of different compromises between the three objectives has proved to be a necessary but not sufficient condition for achieving this.  Considerable effort has therefore been directed towards explicitly taking into account chemical diversity within the MOGA population.  A recent development of the program has been to allow the identification of pharmacophore features that are common to some, but not all of the ligands.

The results of the MOGA are illustrated using datasets for binding sites of pharmaceutical interest. In each case, the MOGA generates a manageable number of different hypotheses. Thus, it takes a realistic view of the uncertainty which is inherent when the binding site structure is unknown, but still allows a quantitative comparison of the hypotheses that it generates.

  1. Cottrell, S.J.; Gillet, V.J.; Taylor, R.; Wilton, D.J. Generation of Multiple Pharmacophore Hypotheses. Journal of Computer-Aided Molecular Design, in press.



P-26 : Advanced Structural Search Using ChemAxon Tools

Ferenc Csizmadia; ChemAxon, Budapest, HU
Gyorgy Pirok, Szilard Dorant, Miklos Vargyas, Peter Kovacs, Nora Mate, and Szabolcs Csepregi, ChemAxon

Structural search techniques are invaluable tools in all cheminformatics systems including but not limited to rational drug design, compound registration systems and laboratory information management systems.

JChem, one of ChemAxon’s major suites of programs, provides a very rich set of features related to structural search. These features are demonstrated by examples. Covered topics are: substructure, exact, superstructure, MCS (maximum common substructure) and similarity search.

Reaction and R-group search (including R-logic) are also available, complemented by a rich set of query features. SMARTS and the query features of the MDL formats are supported. An example of fast MCS-based clustering is also presented. Finally, the recently developed descriptive Chemical Terms Language is demonstrated through powerful structural searches.



P-28 : Neural Networks for the Prediction of 1H NMR Chemical Shifts of Sesquiterpene Lactones

Fernando Da Costa; Universitaet Erlangen-Nuernberg, Erlangen, DE
Y. Binev and J. Aires-de-Sousa, Universitaet Erlangen-Nuernberg
J. Gasteiger, Universitaet Erlangen-Nuernberg

The sesquiterpene lactones (STLs) comprise a large group of natural products, which are basically found in plant species of the family Asteraceae. They have ecological importance, show several biological activities and are taxonomic markers of this family [1]. 1H NMR spectroscopy plays a crucial role in the structure elucidation of STLs in all the studies concerning this class of compounds.  This work describes the estimation of 1H NMR chemical shifts of STLs using the SPINUS program [2], which is based on associative neural networks (ASNN). This system incorporates an ensemble of backpropagation neural networks (previously trained with a general set of structures and the corresponding chemical shifts) and an additional user-defined memory [3,4]. Several physicochemical, geometric and topological descriptors were used to represent the hydrogen atoms of the compounds. In a previous work, 392 1H NMR experimental chemical shifts of 20 STLs were used as additional memory to predict the chemical shifts of two different STLs. The results showed a high level of accuracy [5].

In the present work, the prediction of 1,902 chemical shifts from 100 structures of STLs was made using the same additional memory described above. The methylenic sp3 diastereotopic protons belonging to rigid substructures, as well as sp2 protons of C-C double bonds of 3D structures, can be distinguished by the geometrical descriptors. The inclusion of the user-defined additional memory led to a considerable improvement in the accuracy of the predictions. When no memory was used, the mean absolute error (MAE) for the predictions of the 1H NMR chemical shifts of the 100 STLs was 0.27 ppm. Using only the user-defined additional memory (392 1H NMR chemical shifts of 20 STLs), the MAE improved to 0.23 ppm. When the user-defined memory was combined with a large general memory, an MAE of 0.22 ppm was achieved.
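For readers unfamiliar with the error measure quoted above, the following sketch shows how a mean absolute error over predicted and observed shifts is computed. The shift values are invented for illustration and are not taken from the study.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Average of |predicted - observed| over all paired chemical shifts (ppm)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Invented 1H shifts in ppm, for illustration only.
observed = [5.50, 6.20, 2.30, 1.10]
predicted = [5.30, 6.45, 2.20, 1.35]
mae = mean_absolute_error(predicted, observed)  # about 0.20 ppm
```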


  1. F.C. Seaman. Bot. Rev. 48:123-551, 1982.
  2. A web interface of SPINUS is freely accessible at and
  3. Y. Binev, J. Aires-de-Sousa. J. Chem. Inf. Comput. Sci. 44:940-945, 2004.
  4. Y. Binev, M. Corvo, J. Aires-de-Sousa. J. Chem. Inf. Comput. Sci. 44:946-949, 2004.
  5. F.B. Da Costa, Y. Binev, J. Gasteiger, J. Aires-de-Sousa. Tetrah. Lett. 45:6931-6935, 2004.


    F.B.C. acknowledges Alexander von Humboldt-Stiftung. Y.B. acknowledges Fundação para a Ciência e Tecnologia (Lisbon, Portugal) for a post-doctoral grant under the POCTI program (SFRH/BPD/7162/2001). J.A.S. thanks Deutscher Akademischer Austauschdienst (DAAD) for travel grants.



P-30 : Indexing the Chemical Semantic Web

Nick Day; Cambridge University, Cambridge, GB
Peter Murray-Rust, Cambridge University

The semantic web is a vision where information can be retrieved and analysed robotically using agreed metadata and indexes. Molecular information is ideally suited for this and we report the use of the recently developed InChI (International Chemical Identifier) (to be released very shortly). InChI is an IUPAC project whose aim is to create a non-proprietary unique identifier for chemical structures to enable easier linking of diverse electronic data compilations. We have recently shown that the InChI is a powerful and precise tool for indexing chemical structures both in databases and on the web.

We have shown that the structure and information held in an InChI confer several advantages over current approaches. It provides layers describing the degree of certainty in our knowledge of chemical identity, which can allow variable precision in queries. Our analysis of web indexes shows that it provides a robust query in today's search engines. Documents combining Chemical Markup Language with InChIs have both high recall and high precision, making them the natural choice for publishing and hence for building the Chemical Semantic Web. We have implemented a variety of web services for generating and using InChI.
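The layered structure can be illustrated with a small sketch that splits an InChI string on its "/" separators and compares two identifiers only up to a chosen set of layers. This is a simplification written for this note, not one of the authors' web services: the function names are hypothetical, it assumes the first layer after the version is the formula, and a repeated layer prefix (as occurs in some non-standard InChIs) would overwrite the earlier one. The two example strings are the standard InChIs of L- and D-alanine as they are commonly published, in the later "1S" form.

```python
from typing import Dict, Iterable


def inchi_layers(inchi: str) -> Dict[str, str]:
    """Split an InChI into a dict: version, formula, then one entry per prefixed layer."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len("InChI="):].split("/")
    result = {"version": version}
    if layers:
        result["formula"] = layers[0]   # assumption: formula layer comes first
    for layer in layers[1:]:
        result[layer[0]] = layer[1:]    # e.g. 'c' connectivity, 'h' hydrogens, 't' stereo
    return result


def same_up_to(a: str, b: str, keys: Iterable[str] = ("formula", "c", "h")) -> bool:
    """Compare two InChIs on the selected layers only (variable-precision matching)."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return all(la.get(k) == lb.get(k) for k in keys)


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

same_constitution = same_up_to(l_alanine, d_alanine)
same_stereo = same_up_to(l_alanine, d_alanine, keys=("formula", "c", "h", "t", "m"))
```

A query that ignores the stereo layers treats the two enantiomers as the same compound, while one that includes them does not.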

We thank Steve Stein, Steve Heller, Dmitrii Tchekhovskoi (NIST) and Alan McNaught (IUPAC) for help and advice on InChI.



P-32 : VET: A Tool for Reaction Plausibility Checking

Joseph Durant; Elsevier MDL, San Leandro, CA, US
James G. Nourse, Elsevier MDL

Construction of reaction databases can involve the extraction of chemical reactions from textual descriptions. As part of this process chemical names are converted to structures, reaction roles are assigned to components, and "obvious" facts are supplied. Errors can occur in each of these activities, leading to errors in the extracted reactions. Trapping such errors in "real world" reaction databases is a complex, but important, task.

One problem is that the study of chemical reactivity lacks a simple and systematic way to partition possible reactions into plausible and implausible sets.  Instead, one is presented with a wealth of rules and examples which make construction of an expert system for reaction planning a challenging endeavor.

Further complicating the task are the properties of real world data. For example, it is common to represent a reaction as an unbalanced reaction, where one or more products (the "uninteresting ones") are not represented. In this way the reaction representation can highlight the important features of the reaction. However, the increased clarity resulting from this flexibility in reaction representation introduces ambiguity into the interpretation of the reaction and complicates the evaluation of its plausibility.

In order to support creation of new reaction databases we have created a program, VET, to evaluate the plausibility of chemical reactions to be included in the database.  This method focuses on trapping common classes of input and representation errors and minimizing the occurrence of incorrect reactions allowed into the database. VET is presently in use at Elsevier MDL as part of the Patent Chemistry database creation workflow.
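One simple check of the kind such a tool might include (an assumption; the abstract does not list VET's actual rules) is an element balance between the two sides of a reaction. The sketch below parses plain molecular formulas with a regular expression (no parentheses, charges or isotopes) and reports which elements differ. A deficit on the product side is compatible with an omitted by-product, whereas atoms that appear from nowhere are more suspicious.

```python
import re
from collections import Counter
from typing import Dict, List


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def atom_imbalance(reactants: List[str], products: List[str]) -> Dict[str, int]:
    """Per-element difference (products minus reactants); empty dict if balanced."""
    left = sum((parse_formula(f) for f in reactants), Counter())
    right = sum((parse_formula(f) for f in products), Counter())
    diff = {el: right[el] - left[el] for el in set(left) | set(right)}
    return {el: d for el, d in diff.items() if d != 0}


def creates_atoms(reactants: List[str], products: List[str]) -> bool:
    """True if some element is more abundant in the products than in the reactants."""
    return any(d > 0 for d in atom_imbalance(reactants, products).values())


# Esterification written without its water by-product: acetic acid + ethanol -> ethyl acetate
missing = atom_imbalance(["C2H4O2", "C2H6O"], ["C4H8O2"])           # H and O are short
balanced = atom_imbalance(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"])  # nothing differs
```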

We will discuss the various strategies developed, as well as their relative strengths and weaknesses.



P-34 : Accurate Geometry Optimization Method for Molecular Mechanics

Ödön Farkas; Eötvös Loránd University, Budapest, HU

The present method [1] allows the efficient utilization of the full power of Newton optimization techniques in molecular mechanics. The examples show the weakness of the currently used optimization methods: started from the same structure, they usually end in much higher-energy local minima because of their lack of accuracy. The proposed algorithm, which applies multiple optimization criteria of the kind regularly used in quantum chemistry [2], can reach final RMS forces of less than 10^-6 kcal/mol/Ångström, a previously impossible goal for biomolecules. Accurate geometry optimization is also essential as part of QM/MM and ONIOM [3,4] procedures.
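A multi-criteria convergence test can be sketched as follows, assuming the criteria are the four commonly reported by quantum chemistry packages: maximum and RMS force, and maximum and RMS displacement. The tolerance values and the sample vectors are placeholders chosen for illustration, not the settings used in the method.

```python
import math
from typing import Sequence


def rms(values: Sequence[float]) -> float:
    """Root mean square of a flat list of Cartesian components."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def largest(values: Sequence[float]) -> float:
    """Largest absolute component."""
    return max(abs(v) for v in values)


def converged(forces: Sequence[float], displacements: Sequence[float],
              max_force_tol: float, rms_force_tol: float,
              max_disp_tol: float, rms_disp_tol: float) -> bool:
    """All four criteria must hold at once for the optimization to stop."""
    return (largest(forces) <= max_force_tol
            and rms(forces) <= rms_force_tol
            and largest(displacements) <= max_disp_tol
            and rms(displacements) <= rms_disp_tol)


# Hypothetical tolerances (forces in kcal/mol/Angstrom, displacements in Angstrom).
tols = dict(max_force_tol=1.5e-6, rms_force_tol=1.0e-6,
            max_disp_tol=1.0e-5, rms_disp_tol=5.0e-6)

tight = converged([1e-7, -2e-7, 5e-8], [1e-6, -3e-6, 2e-6], **tols)
loose = converged([1e-3, -2e-3, 5e-4], [1e-6, -3e-6, 2e-6], **tols)
```

Requiring all criteria together guards against stopping on a flat region where the forces are small but the geometry is still moving, or the reverse.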

  1. Farkas, Ö "Fast and robust geometry optimization algorithm for large systems", CESTC 2004, Tihany, Hungary
  2. Gaussian 98, Revision A.7, M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, V. G. Zakrzewski, J. A. Montgomery, Jr., R. E. Stratmann, J. C. Burant, S. Dapprich, J. M. Millam, A. D. Daniels, K. N. Kudin, M. C. Strain, Ö. Farkas, J. Tomasi, V. Barone, M. Cossi, R. Cammi, B. Mennucci, C. Pomelli, C. Adamo, S. Clifford, J. Ochterski, G. A. Petersson, P. Y. Ayala, Q. Cui, K. Morokuma, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. Cioslowski, J. V. Ortiz, A. G. Baboul, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. Gomperts, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, C. Gonzalez, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, J. L. Andres, C. Gonzalez, M. Head-Gordon, E. S. Replogle, and J. A. Pople, Gaussian, Inc., Pittsburgh PA, 1998.
  3. Torrent, M.; Vreven, T.; Musaev, D. G.; Morokuma, K.; Farkas, Ö.; Schlegel, H. B. "Effects of the protein environment on the structure and energetics of active sites of metalloenzymes. ONIOM study of methane monooxygenase and ribonucleotide reductase", Journal of the American Chemical Society 2002, 124, 192-193.
  4. Vreven, T.; Morokuma, K.; Farkas, Ö.; Schlegel, H. B.; Frisch, M. J. "Geometry optimization with QM/MM, ONIOM, and other combined methods. I. Microiterations and constraints", Journal of Computational Chemistry 2003, 24, 760-769.



P-36 : “Ultra-fast” Ligand-based de novo Design Using Virtual Reaction Schemes

Uli Fechner; Goethe-Universitaet Frankfurt, Frankfurt, DE
Gisbert Schneider, Goethe-Universitaet Frankfurt

We developed a software tool that is capable of performing a virtual retro-synthesis of compounds following the RECAP procedure [1]. The employed set of eleven common organic reactions was specified in the SMIRKS language. The virtual retro-synthesis was carried out on a dataset of approximately 5000 drug molecules [2]. The SMILES strings of the resultant fragments were labeled to allow for the storage of the position where the reaction took place and the type of the reaction. These fragments were then used as building blocks in our ligand-based de novo design program Flux (Fragment-based ligand builder reaxions) which is grounded on the TOPAS method [3]. The same set of virtual reaction schemes guided the assembly of candidate compounds, thereby leading to an increased chance of designing structures that are synthetically accessible. Molecular similarity between a known active compound for a particular biological target (template) and the candidate compounds served as a scoring function. The molecular similarity was calculated using Daylight Fingerprints and the Ghose & Crippen substructure fingerprints as descriptors and the Tanimoto Index and the Euclidean Distance as similarity indices. The ligand-based scoring function facilitates the application of our design approach where the three-dimensional structure of the biological target is not available, as is the case, for example, with the large group of G-protein coupled receptors. An evolutionary algorithm with a specifically tailored mutation operator was responsible for navigation through the fitness landscape. Both the program for virtual retro-synthesis and our de novo design software extensively rely on the Daylight Toolkit (

We evaluated our method with two retrospective design studies: Gleevec, an Abelson tyrosine kinase inhibitor, and a Factor Xa inhibitor synthesized with the four-component Ugi reaction served as molecular templates. In the case of Gleevec, our program was able to re-assemble the template structure with our set of building blocks and virtual reaction schemes. For both molecular templates Flux proposed several candidate compounds with interesting chemical moieties. Visual inspection supported our hypothesis that the likelihood of chemical accessibility is indeed increased with our design approach.
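The two similarity indices named in this abstract can be sketched in a few lines. The code below is a generic illustration, not the Daylight Toolkit or the Flux scoring function: binary fingerprints are represented as sets of on-bit positions, count descriptors as plain lists, and all values are invented.

```python
import math
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by on-bits present in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Invented fingerprints: the template and two candidate structures.
template = {1, 4, 7, 9}
candidates = {"cand_a": {1, 4, 8, 9}, "cand_b": {2, 3, 7}}

# Rank candidates by decreasing similarity to the template (the fitness to maximise).
ranked = sorted(candidates, key=lambda name: tanimoto(template, candidates[name]),
                reverse=True)
score_a = tanimoto(template, candidates["cand_a"])   # 3 shared bits / 5 bits in total
distance = euclidean([3, 0, 4], [0, 0, 0])
```

Note the opposite directions: a higher Tanimoto value means more similar, whereas a lower Euclidean distance does, so the two must not be mixed in one score without converting one of them.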


  1. Lewell, X.O., Budd, D.B., Watson, S.P. & Hann, M.M. RECAP – Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry, J. Chem. Inf. Comput. Sci. 1998, 38, 511-522.
  2. Schneider, P. & Schneider, G. Collection of Bioactive Reference Compounds for Focused Library Design, QSAR Comb. Sci. 2003, 22, 713-718.
  3. Schneider, G., Lee, M.-L., Stahl, M. & Schneider, P. De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J. Comput. Aided Mol. Des. 2000, 14, 487-494.



P-38 : The Nuclear Receptor Ligand Binding Domain: A Family-Based Structural Analysis

Simon Folkertsma; University of Nijmegen, Nijmegen, NL
Gert Vriend, University of Nijmegen
Paula van Noort, Ralph Brandt and Jacob de Vlieg, Organon NV
Emmanuel Bettler, BPCP

The huge amount of sequence and structural data on nuclear receptors requires automated methods to classify and analyse the role of key amino acids in the nuclear receptor ligand binding domain. By means of automated structural analysis we identified (1) frequent ligand binding residues, (2) important homo- and hetero-dimerisation residues and (3) selective cofactor binding residues. The ligand contact data show that frequent ligand binding residues are mainly hydrophobic and form a core pocket in the nuclear receptor ligand binding domains. The identity of these frequent ligand binding residues determines the shape and the physico-chemical properties of the ligand binding pocket. Ideally, these properties are compatible with the physico-chemical properties of the ligand. Dendrograms based on the most important ligand binding residues suggest novel potential cross reactivity of ligands between different subfamilies of nuclear receptors. In addition, subfamily selective ligand binding residues can be used to guide the docking of novel ligands in these subfamilies. Finally, our contact analysis revealed positions in the ligand binding domain that interact only with antagonists and partial agonists. The contact information was translated into ligand-receptor interaction profiles that are very helpful in the design of selective compounds with a particular desired function.



P-40 : ScafReplace: Novel Tools for Scaffold Replacement

Patrick Fricker; Center for Bioinformatics, Hamburg, DE
Tanja Schulz-Gasch, Martin Stahl and Matthias Rarey, Center for Bioinformatics

A frequently occurring task during lead finding and optimization is to replace a central element (linker or ring system) of a compound series. Based on a geometric arrangement of 2-3 exit vectors and additional pharmacophoric features, the task is to find molecular fragments fulfilling these constraints. Here we present a novel approach for this task based on the concepts of 3D-shredding and geometric rank searching.

In a first step, a database of fragments suited for scaffold replacements is derived from a database of 3D structures such as the Cambridge Structural Database (CSD). A set of disconnection rules is used to split the molecular structures into molecular fragments by "cutting" the molecules at strategic acyclic bonds.

In order to avoid strained conformations for the suggested replacements, the 3D structural information of the molecules is retained. We avoid the explicit creation of fragments. Instead, we consider all connected fragments that result from any possible combination of cuts within the original compound database. In this way, the need to join fragments decreases and therefore the conformational information in the database is fully exploited.

To discard unwanted fragments on the topological level, filters are used to mark particular combinations of cuts as unwanted. These filters include fragment size, distances between cuts, number of cuts and certain substructure patterns.

Based on this new idea of 3D-shredding, a search method was developed. The method in its first version does not try to combine fragments, but searches for single fragments fulfilling "feature points" from a query. A query consists of two exit-vectors and an arbitrary number of potentially directed features, for example hydrogen-bond donors and acceptors or hydrophobic interactions.

By design, the combinatorial search algorithm finds matching fragments ordered by deviation from query features. This leads to the property that the user does not have to give tolerance ranges for the query features but can specify the number of results to be returned or can get hits incrementally.
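The user-facing behaviour described here, hits delivered in order of increasing deviation with no tolerance range to set, can be illustrated with a lazy generator over a heap. This is a stand-in written for this note, not the authors' combinatorial search: the fragment records, the single "exit_distance" feature and the deviation function are all hypothetical.

```python
import heapq
from itertools import islice
from typing import Callable, Dict, Iterator, List, Tuple

Fragment = Dict[str, float]


def ranked_hits(fragments: List[Fragment],
                deviation: Callable[[Fragment], float]) -> Iterator[Tuple[float, Fragment]]:
    """Yield (deviation, fragment) pairs, best match first, one at a time."""
    # The index breaks ties so that fragments themselves are never compared.
    heap = [(deviation(f), i, f) for i, f in enumerate(fragments)]
    heapq.heapify(heap)
    while heap:
        dev, _, frag = heapq.heappop(heap)
        yield dev, frag


# Invented fragments described only by the distance between their two exit vectors.
library = [
    {"id": 0, "exit_distance": 4.80},
    {"id": 1, "exit_distance": 6.10},
    {"id": 2, "exit_distance": 5.05},
    {"id": 3, "exit_distance": 3.90},
]
query_distance = 5.0

hits = ranked_hits(library, lambda f: abs(f["exit_distance"] - query_distance))
top_two = [frag["id"] for _, frag in islice(hits, 2)]   # ask for two results only
next_one = next(hits)[1]["id"]                           # or keep pulling incrementally
```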

We applied the 3D-shredding and rank search algorithm to a subset of 90000 structures from the CSD. The 3D-shredding routine results in 120000 potential fragments with more than 2 cuts. The construction of the database typically takes 2 minutes, and a query of the type mentioned above can be processed on this database within 6 seconds to produce the first 100 hits.



P-42 : Genomic Data Analysis Using DNA Structure

Eleanor Gardiner; University of Sheffield, Sheffield, GB
Christopher Hunter and Peter Willett, University of Sheffield

Only 1-2% of the DNA of the human genome codes for proteins. Much of the remainder may be ‘junk’, but comparative genomics suggests that a significant amount must serve some purpose. For example, recent comparisons between the mouse and human genomes found that 5% of the genome is conserved between the two species. This means that in addition to the protein-coding regions about 3% of the genome is likely to be under evolutionary selection for some as yet unknown function. Recent comparative studies have revealed sets of Conserved Non-Genic sequences (CNGs) and sets of Ultra Conserved Elements (UCEs).  CNGs are hundreds of bases long and are much more highly conserved than protein-coding genes: more than a quarter of the CNGs on human chromosome 21 have been found in at least ten other species. UCEs are also hundreds of bases long and are absolutely conserved, between human and mouse, without gaps.  Over half of the UCEs identified have no overlap with any exonic sequences. The extraordinarily high degree of conservation of these sequences strongly suggests their significance and emphasises the need to develop new approaches to understanding the function of non-coding sequences. One hypothesis is that their function could be related to protein binding, as part of a system of DNA repair, gene expression, replication, packaging or scaffold attachment.  One factor that governs protein-DNA interactions and is expressed over length scales of hundreds of bases is the set of structural properties of the DNA. Thus the identification of common DNA structural motifs could allow annotation of the functional properties of non-coding parts of the genome and provide insights into likely protein partners.

We have previously compiled a database of the structural properties of all 32,896 unique DNA octamer sequences, including information on stability, the minimum energy conformation and flexibility. We have used Fourier techniques to analyse the UCEs and CNGs in terms of their octamer structural properties, in order to reveal long-range structural correlations which may indicate possible functions for some of these sequences.
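The figure of 32,896 unique octamers can be reproduced by counting double-stranded sequences, that is, by treating an octamer and its reverse complement as the same duplex. That interpretation is an assumption of this sketch, but it matches the count: of the 4^8 = 65,536 single-strand octamers, 4^4 = 256 are their own reverse complement, giving (65,536 + 256) / 2 = 32,896.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def count_unique_duplexes(k: int) -> int:
    """Number of k-mers when a sequence and its reverse complement count once."""
    canonical = set()
    for bases in product("ACGT", repeat=k):
        seq = "".join(bases)
        # Keep the alphabetically smaller of the two strands as the representative.
        canonical.add(min(seq, reverse_complement(seq)))
    return len(canonical)


n_octamers = count_unique_duplexes(8)
```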



P-44 : Increasing the Efficiency of Chemical Structure Storage and Retrieval in Large Relational Databases

Sasha Gurke; Knovel Corp., Norwich, NY, US
Sergei Trepalin, Institute of Physiologically Active Compounds

Chemical structure search was integrated with full text and fielded text and numeric searches in a large relational database providing consistent web-based access to a variety of technical content, from e-books to property databases.

In the exact chemical structure search, exact molecular weight was used as an index. Molecular weight unambiguously defines the molecular formula of an organic molecule with molecular weight less than 1,000.  In addition, two molecular topology sensitive indices were calculated. The 12-byte numbers thus obtained were sorted in descending order and searched using a bisection algorithm. Tautomeric structures were converted to a canonical format.
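A bisection lookup over keys sorted in descending order might look like the sketch below. The key layout, a tuple of the exact mass in micro-mass units followed by two topology indices, is a hypothetical stand-in for the 12-byte numbers mentioned above, and the topology index values are invented.

```python
from typing import List, Tuple

Key = Tuple[int, int, int]   # (exact mass * 1e6, topology index 1, topology index 2)


def find_descending(keys: List[Key], target: Key) -> int:
    """Return the position of target in a descending-sorted list, or -1 if absent."""
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] > target:
            lo = mid + 1      # target sorts after mid in descending order
        else:
            hi = mid
    return lo if lo < len(keys) and keys[lo] == target else -1


# Two C2H6O isomers share a mass, so only the topology indices tell them apart.
index = sorted(
    [
        (180063388, 41, 7),   # a C6H12O6 structure
        (46041865, 3, 1),     # one C2H6O isomer
        (78046950, 12, 6),    # a C6H6 structure
        (46041865, 2, 1),     # the other C2H6O isomer
    ],
    reverse=True,
)

hit = find_descending(index, (46041865, 2, 1))
miss = find_descending(index, (46041865, 9, 9))
```

Each lookup inspects about log2(N) keys, which is why a sorted composite key works well as an exact-match index.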

In substructure search, 256 screens were used for the initial record filtration. The sorting of the atoms and bonds of molecular fragments was the same as that of the whole structures. A molecular formula search (the most efficient for fragments containing non-C atoms) was then run to eliminate irrelevant structures. Thereafter, the atoms and bonds of the fragment were matched sequentially to structures from the pre-filtered records. The results of each attempted match were recorded in a pair of Boolean matrices with the dimensions number of query atoms (bonds) × number of structure atoms (bonds). The matrices were populated with TRUE elements for a match and FALSE elements for a non-match.  As the bonds matrix was populated, a reference was made to the atoms matrix to ensure that TRUE bonds in fact join atoms of the correct type. If the matrices contained at least one TRUE element for each atom and bond of the fragment, a back-tracking algorithm was used to determine whether the fragment maps to the structure. This algorithm was applied starting with a maximally coordinated non-C atom, provided one was present. This is a critical function in terms of its performance. Although the time required for back-tracking increases exponentially with the number of nodes (atoms and bonds) in the fragment, the time required to manipulate the Boolean matrices is proportional to their size (polynomial order), and this is a considerable advantage when back-tracking to verify a possible match.
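The matrix pre-check followed by back-tracking can be sketched as below. This is a simplified reading of the description, not the authors' COM component: atoms are compared by element symbol only, bonds by order only, and the data layout (a list of element symbols plus a list of (i, j, order) triples) is an assumption.

```python
from typing import Dict, List, Tuple

Bond = Tuple[int, int, int]   # (atom index, atom index, bond order)


def atom_matrix(q_atoms: List[str], s_atoms: List[str]) -> List[List[bool]]:
    """TRUE where a query atom could map onto a structure atom (same element)."""
    return [[qa == sa for sa in s_atoms] for qa in q_atoms]


def bond_matrix(q_bonds: List[Bond], s_bonds: List[Bond],
                am: List[List[bool]]) -> List[List[bool]]:
    """TRUE where orders agree and the end atoms are compatible per the atom matrix."""
    return [[qo == so and ((am[qi][si] and am[qj][sj]) or (am[qi][sj] and am[qj][si]))
             for si, sj, so in s_bonds]
            for qi, qj, qo in q_bonds]


def is_substructure(q_atoms: List[str], q_bonds: List[Bond],
                    s_atoms: List[str], s_bonds: List[Bond]) -> bool:
    am = atom_matrix(q_atoms, s_atoms)
    bm = bond_matrix(q_bonds, s_bonds, am)
    # Cheap rejection: every query atom and bond needs at least one TRUE entry.
    if not all(any(row) for row in am) or not all(any(row) for row in bm):
        return False
    s_order = {frozenset((i, j)): o for i, j, o in s_bonds}
    degree = [0] * len(q_atoms)
    for i, j, _ in q_bonds:
        degree[i] += 1
        degree[j] += 1
    # Start with the most highly coordinated non-carbon atom, if there is one.
    order = sorted(range(len(q_atoms)), key=lambda i: (q_atoms[i] == "C", -degree[i]))
    mapping: Dict[int, int] = {}

    def extend(pos: int) -> bool:
        if pos == len(order):
            return True
        q = order[pos]
        for s in range(len(s_atoms)):
            if not am[q][s] or s in mapping.values():
                continue
            # Every query bond from q to an already mapped atom must exist in the structure.
            if all(s_order.get(frozenset((s, mapping[other]))) == bo
                   for a, b, bo in q_bonds
                   for other in ([b] if a == q else [a] if b == q else [])
                   if other in mapping):
                mapping[q] = s
                if extend(pos + 1):
                    return True
                del mapping[q]   # back-track
        return False

    return extend(0)


# Heavy-atom graph of acetic acid: C-C(=O)-O
acid_atoms = ["C", "C", "O", "O"]
acid_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

has_carbonyl = is_substructure(["C", "O"], [(0, 1, 2)], acid_atoms, acid_bonds)
has_imine = is_substructure(["C", "N"], [(0, 1, 2)], acid_atoms, acid_bonds)
has_o_c_o = is_substructure(["O", "C", "O"], [(0, 1, 2), (1, 2, 2)], acid_atoms, acid_bonds)
```

The C=N query is rejected by the matrices alone because no row for N contains TRUE. The O=C=O query passes the matrix test, since each of its bonds individually matches the C=O bond, and is only rejected by the back-tracking step.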

MS SQL Server was used to handle the database. The exact structure search was implemented as a standard SELECT command in an SQL expression. Substructure search was implemented as a stored procedure, with a COM object being used for the subgraph isomorphism search.



P-46 : Analysis of GRID Molecular Interaction Fields

Sandra Handschuh; Boehringer Ingelheim Pharma GmbH, Biberach, DE
Anne Techau Jørgensen, Kerstin Höhfeld, and Thomas Fox, Boehringer Ingelheim Pharma GmbH

The calculation of molecular interaction fields using the program GRID [1] is an important technique in structure-based drug design. These fields can be used to identify favourable interaction hot spots, and thus support the manual design and optimisation of ligands for a given target. The molecular interaction fields also provide the basis for a range of methods; amongst others, they have been used to identify regions important to ligand selectivity between various targets (GRID/CPCA) and for the calculation of in silico ADME descriptors (Volsurf).

The CCDC/Astex data set [2], a large publicly available data set of protein-ligand complexes, was used for a systematic analysis of the GRID-generated molecular interaction fields. Following protein classification based on the CATH system [3] comparisons were performed within and between protein classes.

All complexes were characterized by the molecular interaction fields for 10 GRID probes. Then the MIFs were described using a range of parameters, such as lowest energy found within the active site, and the percentage of grid points below specified energy levels. Furthermore, the presence of MIF hot spots around relevant functional groups was probed. Several aspects of the analysis are presented.
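Two of the parameters mentioned above, the lowest energy in the active site and the percentage of grid points below given energy levels, can be computed as in the following sketch. The energies and thresholds are invented, and the field is assumed to be already restricted to active-site grid points and flattened to a list.

```python
from typing import Dict, Sequence


def mif_summary(energies: Sequence[float], thresholds: Sequence[float]) -> Dict[str, object]:
    """Lowest energy plus the percentage of grid points strictly below each threshold."""
    n = len(energies)
    if n == 0:
        raise ValueError("no grid points supplied")
    percent_below = {t: 100.0 * sum(e < t for e in energies) / n for t in thresholds}
    return {"min_energy": min(energies), "percent_below": percent_below}


# Invented probe energies (kcal/mol) at eight active-site grid points.
field = [-6.2, -3.1, -0.4, 0.0, 1.5, -4.8, -1.0, 2.3]
summary = mif_summary(field, thresholds=(-1.0, -4.0))
```

Descriptors of this kind reduce a whole field to a few numbers, which makes comparisons within and between protein classes straightforward.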

  1. Goodford, P. J. J. Med. Chem. 1985, 28, 849-857.
  2. Nissink JWM, Murray C, Hartshorn M, Verdonk ML, Cole JC, Taylor R. A new test set for validating predictions of protein-ligand interaction. Proteins 2002;49:457-471.
  3. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH - A Hierarchic Classification of Protein Domain Structures. Structure 1997;5:1093-1108.



P-48 : Structural DNA Profiles

Linda Hirons; University of Sheffield, Sheffield, GB
E J Gardiner, C A Hunter, and P Willett, University of Sheffield

A DNA sequence’s function is commonly predicted by measuring its nucleotide similarity to known functional sets. However, the use of structural properties to identify patterns within families is justified by the discovery that many very different sequences have similar structural properties. This means that by looking at the information hidden within the structure, similarities between DNA sequences will be found that would otherwise go unrecognised.

A database containing structural properties of all 32,896 unique DNA octamer sequences has previously been constructed.  The calculated descriptors include the step parameters that collectively describe the energy minima conformations, the force constants and partition coefficients that describe the flexibility of an octamer and several ground state properties such as the RMSD, which measures the straightness of an octamer’s path.

The development of tools that use the contents of the octamer database to identify structural DNA activity fingerprints would be of great value in predicting unknown DNA functions. This is illustrated here by the generation of structural profiles that examine patterns common to a set of pre-aligned promoter sequences. A promoter sequence is one to which RNA polymerase II binds before travelling downstream to transcribe a gene into messenger RNA.



P-50 : The Study of Bias Fusion of Chemical Similarity Searching

John Holliday; University of Sheffield, Sheffield, GB
Jenny Chan and John Bradshaw, University of Sheffield

This study carried out experiments on bias fusion of chemical similarity searching based on four coefficients. The four coefficients were identified from a set of 13, in a previous study using Naïve Bayes Classifiers, as having the highest retrieval rates in a selection of 20 active size ranges. That study indicated that Forbes and Simple Matching are the best at retrieving smaller actives, Tanimoto is the best at retrieving medium-sized actives and Russell-Rao is the best at retrieving larger actives. The purpose of this study was to find out whether retrieval performance could be improved by altering the weightings of these four coefficients in the fusion process.

A systematic approach was used to explore all possible combinations of the weights. Previous studies indicated that the choice of coefficients is class-dependent. Therefore, ten classes have been trained using the systematic approach resulting in one best weighted combination for each class. A similar methodology was also applied using the modal fingerprint of each training set rather than the full set of fingerprints.
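To make the idea of weighted fusion concrete, the sketch below scores a small database with the four coefficients and combines the resulting rank positions using per-coefficient weights. The abstract does not say whether ranks or raw scores were fused, so rank fusion is an assumption here, as are the fingerprints, the weights and the fingerprint length. The coefficient formulas use the usual counts: a bits set in both fingerprints, b and c bits set in only one, d bits set in neither.

```python
from typing import Callable, Dict, List, Set, Tuple

Counts = Tuple[int, int, int, int]   # (a, b, c, d)


def counts(x: Set[int], y: Set[int], n_bits: int) -> Counts:
    a, b, c = len(x & y), len(x - y), len(y - x)
    return a, b, c, n_bits - a - b - c


def tanimoto(a: int, b: int, c: int, d: int) -> float:
    return a / (a + b + c) if a + b + c else 0.0


def russell_rao(a: int, b: int, c: int, d: int) -> float:
    return a / (a + b + c + d)


def simple_matching(a: int, b: int, c: int, d: int) -> float:
    return (a + d) / (a + b + c + d)


def forbes(a: int, b: int, c: int, d: int) -> float:
    denom = (a + b) * (a + c)
    return (a + b + c + d) * a / denom if denom else 0.0


def fused_ranking(query: Set[int], database: Dict[str, Set[int]], n_bits: int,
                  weights: Dict[Callable[..., float], float]) -> List[str]:
    """Order database entries by the weighted sum of their per-coefficient ranks."""
    totals = {name: 0.0 for name in database}
    for coefficient, weight in weights.items():
        scores = {name: coefficient(*counts(query, fp, n_bits))
                  for name, fp in database.items()}
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += weight * rank
    return sorted(totals, key=totals.get)   # smallest fused rank first


# Invented 16-bit fingerprints and weights.
query = {1, 2, 3, 4}
db = {"m1": {1, 2, 3, 4}, "m2": {1, 2, 9}, "m3": {10, 11, 12}}
weights = {tanimoto: 0.4, russell_rao: 0.3, simple_matching: 0.2, forbes: 0.1}
ranking = fused_ranking(query, db, n_bits=16, weights=weights)
```

A systematic search over weights would call `fused_ranking` once per weight combination on the training actives and keep the combination that retrieves the most actives near the top of the list.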

The best combinations of weightings resulting from both the pure systematic approach and the modal-fingerprint-based systematic approach were then tested. The measure of retrieval used was the number of actives retrieved in the top 500 nearest neighbours. These were compared with results using the 13 single coefficients.

The results show that equal-weighting fusion has an average improvement rate of 19% over Tanimoto, bias fusion using the results of the pure systematic approach has an average improvement rate of 25%, and bias fusion using the results of the modal-fingerprint-based systematic approach has an average improvement rate of 32%.

In a separate experiment, a genetic algorithm has been used to generate class-dependent formulae for similarity searches. The purpose of this study is to find out whether an effective formula can be determined for each active class.



P-52 : A System Fusing Computational and Information Chemistry for Developing New Synthesis Routes of Compounds: An Application to the Synthesis Routes of Tropinone

Kenzi Hori; Yamaguchi University, Ube, JP

Synthesis route design systems have been in practical use for more than ten years to create new synthesis routes of compounds. The number of routes diverges for multi-step syntheses as the systems usually offer several routes for each step. It is very difficult for experimental chemists to determine which of the created routes is the best for the target compound. Quantum mechanical calculations including searches of transition states (TSs) are very effective in clarifying the feasibility of synthesis routes, i.e., if there is a TS for a route, it is possible to synthesize the target by using the route and vice versa. Therefore, we should be able to find useful synthesis routes without experiments by use of a method fusing computational chemistry and information chemistry. The former analyzes the reaction mechanisms of the synthesis routes which the latter creates. However, there are few studies concerning this promising method. We have been developing a database named the transition state data base (TSDB) which makes it possible to use the synthesis route system effectively for developing new synthesis routes of compounds. The present study describes how to use the database for developing new synthesis routes. As an example, we will show the results for synthesis routes of tropinone from the KOSP program using DFT calculations at the B3LYP/6-31G* level of theory.



P-54 : Universal Scripted Chemical Information Processing: the CACTVS Chemoinformatics Toolkit

Wolf Ihlenfeldt; Xemistry GmbH, Lahntal, DE

While there are many established systems for the handling and manipulation of chemistry-specific data in standardized and structured environments, the researcher often encounters non-trivial problems when ad-hoc needs in the fields of data preparation for import or export, data filtering, or structure and reaction manipulation develop.

The CACTVS Chemoinformatics Toolkit was designed to provide solutions for the above scenarios by means of extensive VHLL-scripting functionality with extensible, chemistry-specific high-level objects. Many typical problems encountered in the chemical information processing environment can be solved with just a few lines of script code. We will display a sample of prototypical solutions, with a special focus on Web-related applications.



P-56 : 3D Structure Prediction and Conformational Analysis

Gabor Imre; Eotvos Lorand University, Budapest, HU
Odon Farkas, Eotvos Lorand University

Numerous theoretical methods in the field of computational chemistry fall back on the availability of 3D structural information about compounds. Determining molecular structures without human interaction is an essential component of several techniques, such as QSAR, 3D pharmacophore analysis and reaction prediction. Current computational tools used for structure determination, including force fields and quantum chemical methods, even require a complete set of initial 3D coordinates. The efficiency of 3D-structure-based HTS tools can also be enhanced by employing conformational analysis to yield multiple valid structures.

Our approach utilises a composition of several methods, ranging from a pure rule-based (as classified in [3]) multi-dimensional distance geometry method [1] to data-based stored substructure lookup features, in a flexible software framework. The actual implementation is highly portable Java software (available at [2]), which fits a broad range of applications: it is used in small web drawing applets as well as a standalone database processing component.

The coordinate determination process can be best characterized by the "divide and conquer" approach: the structure is composed of fragments, which are joined together. From the available fragment conformers, the conformers of the joined structures can be generated during the fusing step. The fragment conformers are generated either through further fragmentation or with an elemental structure/conformer prediction method; consequently, the conformational analysis is an inherent part of the building process (in contrast with methods which proceed from 3D initial structures, like [4]). The novelty of our approach lies in the diversity of the elemental methods utilised and the scalability options that arise from it.


  1. G. Imre, G. Veress, A. Volford and O. Farkas, "Molecules from the Minkowski Space: An approach to building 3D molecular structures", J. Mol. Struct. (Theochem), 666-667, 51-59 (2003)
  3. J. Sadowski and J. Gasteiger, "From Atoms and Bonds to Three-Dimensional Atomic Coordinates: Automatic Model Builders", Chem. Rev., 93, 2567-2581 (1993)
  4. J. Weiser, M. C. Holthausen, L. Fitjer, "HUNTER: A Conformational Search Program for Acyclic to Polycyclic Molecules with Special Emphasis on Stereochemistry", J. Comput. Chem., 18, 1265-1281 (1997)



P-58 : Uses and Potential Uses of Reasoning in Chemoinformatics

Julian Hayward; Lhasa Limited, Leeds, GB
Philip Judson, Lhasa Limited

There are situations in chemoinformatics where the need for reliable, sometimes numerical, information to support algorithms for the quantification or ranking of output causes problems. For example, algorithms may require a “yes” or “no” answer to the question “Is this atom aromatic”; systems for finding structures similar to a query may depend on the processing of numerical measures of similarity in order to rank output.  In connection with our work on the prediction of the toxicity of chemicals and the metabolism of xenobiotic chemicals we have developed reasoning methods tolerant of uncertainty which could be of much wider use in chemoinformatics.



P-60 : Lead Conformers, a Thermodynamics Approach

Adrian Kalaszi; Eotvos Lorand University, Budapest, HU
Odon Farkas, Eotvos Lorand University

Finding drug-like compounds is a challenging process in drug discovery. The 3D structure of the binding site of the target protein is often unknown, and dealing with the flexibility of the ligand molecules is still problematic. However, flexible ligand molecules can be considered as nanoscale scanning devices adapting to the 3D structure of the active site [1]. The special thermodynamic properties of the binding of flexible molecules, as derived here, show that the probability of the binding conformations in solution determines the likelihood of binding. The binding activities, which correlate with binding probability, can be evaluated experimentally, while the probability of conformations in solution can be obtained via molecular dynamics simulations. If both kinds of data are available for a set of flexible molecules, a model of the spatial arrangement of the pharmacophores can be constructed.

  1. A. Kalaszi, O. Farkas, Journal of Molecular Structure (Theochem) 666-667 (2003) 645-649



P-62 : Finding Discriminative Substructures Using Elaborate Chemical Representation

Jeroen Kazius; Universiteit Leiden, Leiden, NL
Siegfried Nijssen, Joost Kok, Thomas Bäck, and Ad IJzerman, Universiteit Leiden

In pharmaceutical research, knowledge of molecular substructures relevant to physico-chemical and biological properties can aid in synthesis decisions, library design for high throughput screening (HTS), hit prioritisation, lead optimisation and prioritisation of pharmacological or toxicological assays. Therefore, data mining algorithms to determine substructures discriminative of these properties are ever more important. Often, however, only limited chemical information is considered, such as linear sequences of atom and bond types. Furthermore, the increasingly large amount of available data (e.g. from HTS) requires such algorithms to be highly efficient.

We employed a means of chemical representation in which single atoms can be represented by atomic hierarchies. Any matching SMARTS expression [1] can be used to represent an atom, while further expressions can be appended as extra nodes with additional chemical information. For example, a wildcard such as a hydrogen donor label can be supplemented with specifiers for atom type, charge, ring size, number of hydrogens, etc. Molecules are consequently represented as elaborate graphs.

We have developed a novel graph-based data mining system, called GASTON. GASTON makes use of the fact that sequences, free trees and graphs are contained in each other, and it efficiently splits up the substructure finding process by finding all sequences, free trees and graphs, respectively. For extra speed, constraints can be applied to, for instance, the maximum size of a substructure, the minimum number of molecules that it needs to detect and/or its maximum p-value for a binary biological or physico-chemical classification. The combination of this representation method and a substructure finding algorithm enables the automated detection of discriminative substructures, which can consist of very general and very specific components.

A final selection step is required in order to extract a small set of discriminative, nonredundant and informative substructures from a dataset of compounds with binary classifications. Therefore, a simple p-value-based greedy selection method was written and employed. As a practical example, several results of the application of the described methods to large datasets for mutagenicity and aqueous solubility are discussed. The resulting substructures not only provide insight into relevant moieties but can also be used for predictive purposes. As such, these substructures can directly serve as either structural alerts for mutagenicity / insolubility or as chemical solutions for increasing solubility / decreasing mutagenicity. These promising findings confirm the utility of employing the discussed methods of chemical representation, substructure finding and descriptor selection.
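The abstract does not spell out the greedy selection method, so the following is only a minimal sketch of one plausible reading: score each substructure by a one-sided hypergeometric (Fisher-type) p-value for enrichment in the active class, pick the best one, remove the molecules it covers, and repeat. The function names, the `alpha` threshold, the substructure labels and the toy data are all assumptions made for illustration.

```python
from math import comb

def enrichment_p(hits, hit_actives, total, total_actives):
    """One-sided hypergeometric tail P(X >= hit_actives) for a substructure
    matching `hits` of `total` molecules, `total_actives` of which are active."""
    upper = min(hits, total_actives)
    tail = sum(
        comb(total_actives, k) * comb(total - total_actives, hits - k)
        for k in range(hit_actives, upper + 1)
    )
    return tail / comb(total, hits)

def greedy_select(matches, actives, all_ids, alpha=0.05):
    """Repeatedly pick the substructure with the lowest p-value on the
    molecules not yet covered; stop when nothing reaches `alpha`."""
    remaining = set(all_ids)
    candidates = dict(matches)
    selected = []
    while candidates and remaining:
        scored = []
        for name, mols in candidates.items():
            hits = mols & remaining
            if hits:
                p = enrichment_p(len(hits), len(hits & actives),
                                 len(remaining), len(remaining & actives))
                scored.append((p, name))
        if not scored:
            break
        p, best = min(scored)
        if p > alpha:
            break
        selected.append(best)
        # Removing covered molecules is what suppresses redundant substructures.
        remaining -= candidates.pop(best)
    return selected

# Toy data (invented): 12 molecules, the first 6 are "active".
all_ids = set(range(12))
actives = {0, 1, 2, 3, 4, 5}
matches = {
    "nitro": {0, 1, 2, 3},
    "nitro_aromatic": {0, 1, 2},   # redundant with "nitro"
    "epoxide": {4, 5},
    "methyl": {0, 4, 6, 7, 8, 9},  # not discriminative
}
selected = greedy_select(matches, actives, all_ids)
print(selected)
```

In this toy run the redundant "nitro_aromatic" pattern is never selected, because every molecule it matches has already been covered by "nitro".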

  1. Daylight Chemical Information, Inc., Santa Fe, NM, at



P-64 : scPDB: An Annotated Database of Three-Dimensional Structures of Binding Sites for Drug-Like Molecules

Esther Kellenberger; University of Strasbourg, Illkirch, FR
Guillaume Bret, Pascal Muller, and Didier Rognan, University of Strasbourg

The scPDB is a collection of about 7000 three-dimensional structures of putative binding sites found in the Protein Data Bank (PDB). Binding sites were extracted from all high resolution crystal structures in which a complex between a protein cavity and a drug-like ligand was detected. Ligands consist of small molecules like nucleotides (≤ 3 mer), peptides (≤ 5 mer), endogenous ligands and drugs, but not water, metal ions or unwanted molecules (e.g., non-specific ligands, solvents, detergents). The binding site is formed by all protein residues (including amino acids, cofactors and important metal ions) with at least one atom within 6.5 Å of a ligand atom. The scPDB was carefully annotated. Information from PDB entries and corresponding SWISSPROT files was merged in order to assign to every binding site the following features: protein name, function, source, domain and mutations, ligand name and structure.
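The 6.5 Å binding-site rule above can be illustrated with a short sketch. It assumes atoms are already available as plain coordinate tuples; the residue labels, coordinates and the function name are invented for the example, and treating "within" as "less than or equal to" is an assumption.

```python
from math import dist

CUTOFF = 6.5  # Å, the distance threshold quoted in the abstract

def binding_site_residues(protein_atoms, ligand_coords, cutoff=CUTOFF):
    """Return the residues having at least one atom within `cutoff` of any
    ligand atom. `protein_atoms` is an iterable of (residue_id, (x, y, z))."""
    site = set()
    for residue_id, coord in protein_atoms:
        if residue_id in site:
            continue  # one close atom is enough for the whole residue
        if any(dist(coord, lig) <= cutoff for lig in ligand_coords):
            site.add(residue_id)
    return sorted(site)

# Toy coordinates (invented), all on the x axis for easy checking.
protein = [
    ("ALA12", (0.0, 0.0, 0.0)),
    ("ALA12", (1.5, 0.0, 0.0)),
    ("GLY45", (10.0, 0.0, 0.0)),
    ("SER77", (20.0, 0.0, 0.0)),
]
ligand = [(4.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(binding_site_residues(protein, ligand))
```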

The scPDB was designed for docking purposes; the virtual screening of the binding sites database against a given ligand can predict the most likely target for the molecule and also suggest a selectivity profile [1]. It may also be used to analyse the similarity between cavities and to derive rules that describe the relationship between ligand pharmacophoric points and protein site properties.

The scPDB is periodically updated. It is accessible on the web at

  1. Paul, N., Bret, G., Kellenberger, E., Muller, P., Rognan, D. (2004) Recovering the true targets of selective ligands by virtual screening of the Protein Data Bank. Proteins, 54, 671-680.



P-66 : Application of Knowledge-Based Scoring Functions for Virtual Screening

Chrysi Konstantinou Kirtay; Cambridge University, Unilever Centre for Molecular Science Informatics, Cambridge, GB
J.B.O. Mitchell, Cambridge University, Unilever Centre for Molecular Science Informatics
J.A. Lumley, Arrow Therapeutics Ltd

Flexible docking algorithms, applied to virtual compound libraries, are able to predict protein-ligand complexes with reasonable accuracy and speed. However, the major weakness lies in the functions used for predicting the binding affinity between the receptor and ligand, also known as scoring functions. Scoring functions can be applied during the docking process as fitness functions for the optimization of ligand orientation and conformation, as well as in the post-docking comparison of molecules for the estimation of their binding affinity to a specific protein target [1].

We have implemented BLEEP [2, 3], a knowledge-based scoring function based on high resolution (≤2Å) structural data from the Protein Data Bank (PDB), in order to determine how well a candidate docked structure resembles a typical real protein-ligand complex.  Possible protein-ligand interactions are assessed using their atom-atom distance distributions, which are converted into pseudo-energy functions by applying the Boltzmann hypothesis. BLEEP generates good (Rs  0.6) correlations with experimental binding energies for diverse sets of non-covalent complexes and substantially better correlations (Rs  0.75) for a series of related complexes [4].
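As a rough illustration of how a distance distribution can be turned into a pseudo-energy, the sketch below applies the generic inverse-Boltzmann relation E(r) = -kT ln(p_obs(r) / p_ref(r)) to binned counts. This is not BLEEP's actual formulation: the reference distribution, the pseudocount, the unit choice kT = 1 and the toy counts are all assumptions.

```python
import math

def pseudo_energy(observed_counts, reference_counts, kT=1.0, pseudocount=1e-6):
    """Convert binned atom-atom distance counts into pseudo-energies,
    one value per distance bin, relative to a reference distribution."""
    obs_total = sum(observed_counts)
    ref_total = sum(reference_counts)
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        # The small pseudocount avoids log(0) for bins that were never observed.
        p_obs = obs / obs_total + pseudocount
        p_ref = ref / ref_total + pseudocount
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# Toy histogram (invented): contacts are most frequent in the third bin.
observed = [0, 5, 40, 30, 25]
reference = [20, 20, 20, 20, 20]
energies = pseudo_energy(observed, reference)
for i, e in enumerate(energies):
    print(f"bin {i}: {e:+.3f} kT")
```

Bins that are populated more often than the reference come out with negative (favourable) pseudo-energies, and under-populated bins come out positive.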

In a recent study we generated an updated version of the BLEEP statistical potential, using a dataset of 196 complexes, performing similarly to the existing BLEEP. An algorithm was implemented to allow the automatic calculation of bond orders, and hence of the appropriate numbers of hydrogen atoms present. A potential specific to strongly and weakly bound complexes was generated. However there was no further improvement to the prediction of binding affinities.  In addition, we also investigated the range of binding energies found as a function of either ligand molecular weight or number of heavy atoms and we derived some simple functions describing this behaviour. Our research is currently focusing on the application of BLEEP to a number of virtual screening case examples utilising different combinations of docking/scoring functions (GOLD 2.2 & Silver, FlexX).

  1. Stahl M and Rarey M. Detailed Analysis of Scoring Functions for Virtual Screening. Journal of Medicinal Chemistry 2001;44:1035-1042.
  2. Mitchell JBO, Laskowski RA, Alex A, Forster MJ, and Thornton JM. BLEEP - A potential of Mean Force Describing Protein-Ligand Interactions II. Calculation of Binding Energies and Comparison with Experimental Data. Journal of Computational Chemistry 1999;20:1177-1185.
  3. Mitchell JBO, Laskowski RA, Alex A and Thornton JM. BLEEP - A potential of Mean Force Describing Protein-Ligand Interactions. I. Generating the Potential. Journal of Computational Chemistry 1999;20:1165-1176.
  4. Marsden PM, Puvanendrampillai D, Mitchell JBO and Glen RC. Predicting protein ligand binding affinities: a low scoring game? Organic & Biomolecular Chemistry 2004;2:3267-3273.



P-68 : Selecting Potential Active Compounds by Matching Biological Profiles of Compounds with Known and Unknown Activities

Alexander Kos; AKos Consulting & Solutions GmbH, Riehen, CH
Dusan Toman, DIMENSION 5, Ltd.
Vladimir V. Poroikov, Institute Biomedical Chemistry of Rus. Acad. Med. Sci.
Ulrich Jordis, Technische Universität Wien
Timo Knuuttila, Visipoint Oy

In silico methods appear to have more merit in the early phase of compound selection than high throughput screening experiments. Whereas the first screening experiments give physical binding parameters for compounds, one does not learn anything about possible side effects or about the possible metabolic reactions of a compound. PASS (Prediction of Activity Spectra of Substances) is one of the first programs able to derive 1000 reliable biological parameters for a compound from its 2D structure. Each biological parameter is the probability, expressed as a percentage, that a compound has a certain biological activity, like being an acetylcholine-esterase inhibitor (AChEI). Every drug has many actions, known as side effects. A biological profile is the set of activities that a drug should have and the set of activities that a drug should not have, e.g. being carcinogenic. Using PharmaExpert we can develop a biological profile for a desired class of compounds by analysing the statistics of the PASS parameters. Using clustering software we match the biological profiles of compounds with known and unknown activities. Furthermore, a graphic representation tells us which functional groups or atoms of a compound are used in deriving a prediction. PASS Metabolite predicts metabolic reactions. PASS, PharmaExpert and PASS Metabolite are very effective, but they are only some of a large number of useful in silico methods. PASS CL is a command line version that can be incorporated in one's own application, and components are available for programs like Pipeline Pilot.



P-70 : Evaluation of the Diversity of Screening Libraries

Mireille Krier; CNRS, Illkirch, FR
Guillaume Bret and Didier Rognan, CNRS

High-throughput screening nowadays requires compound libraries in which the maximal chemical diversity is reached with the minimal number of molecules. Medicinal chemists have traditionally carried out the assessment of diversity and the subsequent compound acquisition, although a recent study suggests that experts are usually inconsistent in reviewing large datasets [1]. In order to analyze the chemical diversity of commercially available screening collections, we have developed a general workflow aimed at (1) identifying drug-like compounds, (2) clustering them by common substructures (scaffolds) using a phylogenetic-like tree growing algorithm [2], (3) measuring the scaffold diversity encoded by each screening collection independently of its size, and finally (4) merging all common substructures in a non-redundant scaffold library that can easily be browsed by structural and topological queries. Starting from 2.4 compounds described by 12 commercial sources, three categories of libraries could be identified: combi-chem libraries (low scaffold diversity, large size), screening libraries (medium diversity, medium size) and diverse libraries (high diversity, low size). The chemical space enclosed in the scaffold library can be easily searched to prioritize scaffold-focused libraries.

  1. Lajiness, M. S.; Maggiora, G. M.; Shanmugasundaram, V. Assessment of the consistency of medicinal chemists in reviewing sets of compounds. J Med Chem 2004, 47, 4891-4896.
  2. Nicolaou, C. A.; Tamura, S. Y.; Kelley, B. P.; Bassett, S. I.; Nutt, R. F. Analysis of large screening data sets via adaptively grown phylogenetic-like trees. J Chem Inf Comput Sci 2002, 42, 1069-1079.



P-72 : Estimation of Environmental Compartment Half-Lives from Structural Similarity

Ralph Kühne; UFZ Centre for Environmental Research, Leipzig, DE
Ralf-Uwe Ebert and Gerrit Schüürmann, UFZ Centre for Environmental Research

Environmental fate modelling requires knowledge of the total degradation rates in environmental compartments. However, the availability of respective data is rather limited. Compartment half-lives from literature have been compiled in a database. A k nearest neighbours approach is applied to obtain a prediction for a new compound from this database.

The similarity measure used to select the nearest neighbours is based on atom centred fragments. First, the molecule skeleton is separated into its individual atoms. Then, atom centred fragments are built from the atoms by considering the neighbour atoms and several atom properties such as aromaticity. Last, the compounds in the database most similar to a test chemical are selected by comparing the atom centred fragments.
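The three steps above can be sketched as follows. The abstract does not give the exact fragment definition or similarity formula, so this example assumes a first-shell fragment (element, aromaticity flag, sorted neighbour elements) and a Tanimoto-style overlap of fragment multisets; the function names, the tiny heavy-atom graphs and the property values in the database are invented for illustration.

```python
from collections import Counter

def atom_centred_fragments(atoms, bonds):
    """atoms: list of (element, is_aromatic); bonds: list of index pairs.
    Returns a multiset of (element, is_aromatic, sorted neighbour elements)."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(atoms[b][0])
        neighbours[b].append(atoms[a][0])
    return Counter(
        (element, aromatic, tuple(sorted(neighbours[i])))
        for i, (element, aromatic) in enumerate(atoms)
    )

def similarity(frags_a, frags_b):
    """Tanimoto-style overlap of two fragment multisets (an assumption)."""
    shared = sum((frags_a & frags_b).values())
    union = sum((frags_a | frags_b).values())
    return shared / union if union else 0.0

def knn_predict(query, database, k=2):
    """Average the stored property of the k most similar database entries."""
    ranked = sorted(database, key=lambda entry: similarity(query, entry[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(value for _, value in top) / len(top)

# Heavy-atom graphs only; hydrogens are left implicit.
ethanol = atom_centred_fragments(
    [("C", False), ("C", False), ("O", False)], [(0, 1), (1, 2)])
propanol = atom_centred_fragments(
    [("C", False), ("C", False), ("C", False), ("O", False)],
    [(0, 1), (1, 2), (2, 3)])
dimethyl_ether = atom_centred_fragments(
    [("C", False), ("O", False), ("C", False)], [(0, 1), (1, 2)])

# Invented property values, purely to exercise the code.
database = [(propanol, 2.0), (dimethyl_ether, 10.0)]
print(similarity(ethanol, propanol))
print(knn_predict(ethanol, database, k=1))
```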

The results of this method are compared to a few existing methods for degradation rates of individual processes, and the reliability is tested by statistical means. The impact of the prediction uncertainty on environmental fate modelling is examined.



P-74 : Water Solubility Prediction - Model Selection Based on Structural Similarity

Ralph Kühne; UFZ Centre for Environmental Research, Leipzig, DE
Ralf-Uwe Ebert and Gerrit Schüürmann, UFZ Centre for Environmental Research

To predict water solubility for organic compounds from chemical structure, a number of quite accurate estimation methods have been published. These models differ in their performance for individual compounds and compound classes. To select the optimum method for a particular chemical, a k nearest neighbours approach is suggested. For a large training set of validated experimental water solubility data, seven estimation methods have been applied. Concerning the respective method errors for the individual compounds, a scoring system for these models has been stored in a database. To select a model for prediction, the nearest neighbours are looked for in this database by a similarity approach based on atom centered fragments. The model with the best average score for these compounds is selected.
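A minimal sketch of the selection step might look like this. It assumes that each database compound is stored as a set of fragment labels together with one score per estimation method, that higher scores are better, and that similarity is a plain Tanimoto coefficient; the model names, fragment labels and scores are invented.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fragment sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_model(query_fragments, database, k=2):
    """Pick the model with the best mean score over the k nearest neighbours."""
    neighbours = sorted(
        database,
        key=lambda entry: tanimoto(query_fragments, entry["fragments"]),
        reverse=True,
    )[:k]
    models = neighbours[0]["scores"].keys()
    mean_score = {
        m: sum(n["scores"][m] for n in neighbours) / len(neighbours)
        for m in models
    }
    return max(mean_score, key=mean_score.get)

# Invented training-set entries with per-model scores (higher = better).
database = [
    {"fragments": {"C-arom", "Cl-arom"},
     "scores": {"model_A": 0.9, "model_B": 0.4}},
    {"fragments": {"C-arom", "OH-arom"},
     "scores": {"model_A": 0.7, "model_B": 0.6}},
    {"fragments": {"C-aliph", "OH-aliph"},
     "scores": {"model_A": 0.2, "model_B": 0.95}},
]

print(select_model({"C-arom", "Cl-arom", "OH-arom"}, database, k=2))
print(select_model({"C-aliph", "OH-aliph", "C-arom"}, database, k=1))
```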

The efficiency of this approach is demonstrated by an external test set, compared to the individual models, and proved by statistical means.



P-76 : The Inconsistency of Medicinal Chemists in Reviewing Sets of Compounds

Mic Lajiness; Eli Lilly & Company, Indianapolis, IN, US

Medicinal chemists are frequently asked to review lists of compounds to assess their drug- or lead-like nature, and to evaluate the suitability of lead compounds based on their attractiveness; and/or synthetic. Presumably, one medicinal chemist’s opinion is as good as any other’s -- but is it? In an attempt to answer this question, an experiment was performed in conjunction with a compound acquisition program conducted at Pharmacia. Historically, this involved a review of many thousands of compounds by medicinal chemists, who eliminated anything deemed undesirable for any reason. In the present experiment, 22,000 compounds requiring review by medicinal chemists were broken down into eleven lists of approximately 2,000 compounds each. Unbeknownst to the medicinal chemists, a subset of 250 compounds, previously rejected by a very experienced senior medicinal chemist, was added to each of the lists. Most of the thirteen medicinal chemists who participated in this process reviewed two lists, although some reviewed only a single list and one reviewed three lists. Those compounds that were deemed unacceptable were recorded and tabulated in various ways to assess the consistency of the reviews. It was found that medicinal chemists were not very consistent in the compounds they rejected as being undesirable. This has important implications for pharmaceutical project teams where individual medicinal chemists review lists of primary screening hits to identify those compounds suitable for follow-up. Once a compound is removed from a list, it and other structurally similar compounds are effectively removed from further consideration. This can also have an impact on computational chemists who are developing models for assessing the desirability or attractiveness of different classes of compounds for lead discovery.



P-78 : Chemical Clichés: Treasure Hidden by Obviousness?

Eric-Wubbo Lameijer; Universiteit Leiden, Leiden, NL
Thomas Bäck, Joost Kok, Ad P. IJzerman, and Sara van de Geer, Universiteit Leiden

Molecules can be considered to be collections of atoms connected by bonds. The arrangement of atoms and bonds is however far from arbitrary. Synthetic and medicinal chemists commonly think in “groups”, which are molecular fragments common to many molecules. Examples are the phenyl group, the methyl group and the keto group.

While most chemists know dozens of these groups by heart and hundreds by recognition, we decided to get a much more elaborate set of fragments by mining chemical databases. This has the advantages that the number of fragments found will exceed the number of fragments known to any individual chemist, and that the occurrence and therefore the relative importance of the different fragments can be determined more quantitatively.

In this project we mined the public database of the American National Cancer Institute, containing about 250,000 compounds. The molecules were split into fragments by breaking the bonds connecting the rings and the non-ring parts of the molecule. An example is shown in figure 1.

Figure 1: Decomposing folic acid into fragments

This splitting process resulted in 13574 different ring fragments and 10661 different non-ring fragments, which were sorted by occurrence and stored in separate databases. Frequencies of the different fragments varied widely, ranging from many thousands (phenyl) to one-time occurrences of the rarest groups.

We subsequently performed a co-occurrence analysis to see whether the distribution of the fragments in the molecules is mainly random or whether there are groups which co-occur uncommonly often, real “chemical clichés”. We found that there were many combinations which occurred much more often than expected by chance. Most of these were known biologically active compounds or chemical substructures which were easy to combine synthetically.
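One simple way to make "more often than expected by chance" concrete is to compare each observed pair count with the count expected if the two fragments occurred independently. The sketch below does this for a toy collection; the abstract does not state which statistic was actually used, and the function name and example molecules are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_enrichment(molecules):
    """For each fragment pair seen together, return observed / expected,
    where expected = count(a) * count(b) / N assumes independence."""
    n = len(molecules)
    single = Counter()
    pair = Counter()
    for fragments in molecules:
        unique = sorted(set(fragments))
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): observed / (single[a] * single[b] / n)
        for (a, b), observed in pair.items()
    }

# Toy fragment sets, one per molecule (invented).
molecules = [
    {"phenyl", "methyl"},
    {"phenyl", "carboxyl"},
    {"pteridine", "glutamate"},
    {"pteridine", "glutamate"},
    {"phenyl"},
    {"methyl"},
]
ratios = cooccurrence_enrichment(molecules)
for pair_key, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(pair_key, round(ratio, 2))
```

A ratio well above 1 flags a candidate "cliché"; with real data a significance test would be needed as well, since rare fragments give noisy ratios.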

These lists of clichés may extend the choice a medicinal chemist has when modifying a compound. The groups at the top of the list can be used for the initial stages of lead optimization, to give a greater chance that the derivatives will be easy to synthesize. Alternatively, the investigator can consciously avoid clichés by looking lower on the list for suitable groups which are rarer, but for which synthetic information is available nevertheless. So knowledge of clichés could lead to both synthesizable and novel compounds.



P-80 : 1H NMR – Based Classification of Photochemical Reactions

Diogo Latino; Universidade Nova de Lisboa, Caparica, PT
Filomena F. M. Freitas, Fernando M. S. Silva Fernandes, and João Aires-de-Sousa, Universidade Nova de Lisboa

Automatic analysis of changes in the 1H NMR spectrum of a mixture, and their interpretation in terms of the chemical reactions taking place, has a diversity of possible applications. For example, the changes in the 1H NMR spectrum of a stored chemical can be interpreted in terms of the chemical reactions responsible for degradation, or the alterations in the spectrum of a biofluid can be related to changes in metabolic reactions.

The SPINUS program previously developed (1-3) for the estimation of 1H NMR chemical shifts from the molecular structure allows linking a database of chemical reactions to the corresponding 1H NMR data.

Clearly 1H NMR spectroscopy has its own limitations, particularly for reactions with a small number of hydrogen atoms in the neighbourhood of the reaction centre. With this in mind, the use of 1H NMR nevertheless has considerable advantages compared to NMR of other nuclei, for example in terms of speed and amount of required sample.

Here we demonstrate the classification of photochemical reactions by Kohonen self-organizing maps (SOMs) taking as input the difference between the 1H NMR spectra of the products and the reactants. We used a data set of 189 chemical reactions, each reaction having been manually assigned to one of 7 classes –  [3+2] photocycloaddition of azirines to C=C, [2+2] photocycloaddition of triazinones to C=C, [3+2] photocycloaddition of pyridazines to C=C, [4+2] photocycloaddition of C=C to C=C (photo-Diels-Alder reactions), [2+2] photocycloaddition of C=C to C=O, [2+2] photocycloaddition of C=C to C=C, and [2+2] photocycloaddition of C=C to C=S. The chemical shifts of reactants and products were generated by SPINUS and were fuzzified to obtain a crude representation of the spectrum. All the signals arising from all the reactants of one reaction were taken together (a simulated spectrum of the mixture of reactants) and the same would hold for products (although in this work we only considered reactions yielding a single molecule as the product). The simulated spectrum of the reactants is subtracted from the spectrum of the products and the difference spectrum is taken as the representation of the chemical reaction – the input to the neural networks.
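The difference-spectrum representation can be sketched in a few lines. The fuzzification scheme below (triangular weights over fixed-width bins from 0 to 12 ppm) is an assumption, since the abstract only says the shifts were fuzzified; the chemical shifts in the example are invented rather than SPINUS output.

```python
def fuzzy_spectrum(shifts, n_bins=24, max_ppm=12.0, width=0.5):
    """Spread each chemical shift over nearby bins with a triangular weight."""
    bin_size = max_ppm / n_bins
    spectrum = [0.0] * n_bins
    for shift in shifts:
        for i in range(n_bins):
            centre = (i + 0.5) * bin_size
            spectrum[i] += max(0.0, 1.0 - abs(shift - centre) / width)
    return spectrum

def reaction_descriptor(reactant_shifts, product_shifts):
    """Products minus reactants; each argument is a list of molecules,
    each molecule a list of 1H chemical shifts in ppm."""
    reactants = fuzzy_spectrum([s for mol in reactant_shifts for s in mol])
    products = fuzzy_spectrum([s for mol in product_shifts for s in mol])
    return [p - r for p, r in zip(products, reactants)]

# Invented shifts: two alkene reactants give one cyclobutane-like product.
descriptor = reaction_descriptor(
    reactant_shifts=[[5.75], [6.25]],
    product_shifts=[[2.25, 2.75]],
)
print([(i, v) for i, v in enumerate(descriptor) if v != 0.0])
```

Signals that disappear show up as negative entries and new signals as positive ones, so the vector describes the change rather than either spectrum on its own.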

Kohonen neural networks were trained with a set of 147 reactions and then tested with the remaining 42 reactions. The test set was randomly selected to cover the whole space of reactions. A reasonable clustering of the reactions by reaction type was observed. An ensemble of five networks was considered, predictions being obtained by majority voting and associated with a prediction score. Correct predictions could be achieved for more than 90% of the training set and for 75-90% of the test set. The prediction score gave a robust indication of the reliability of the prediction.
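The ensemble step can be illustrated as below. The abstract does not define the prediction score, so taking it as the fraction of networks that agree with the winning class is an assumption, as are the function name and the example votes.

```python
from collections import Counter

def ensemble_predict(votes):
    """Majority vote over the class labels predicted by each network.
    Returns (winning label, fraction of networks that voted for it)."""
    counts = Counter(votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(votes)

# Invented votes from an ensemble of five networks for one reaction.
votes = ["[2+2] C=C to C=O"] * 4 + ["[2+2] C=C to C=C"]
label, score = ensemble_predict(votes)
print(label, score)
```

With this definition a unanimous ensemble gives a score of 1.0, and a low score marks a prediction that deserves less trust. Ties are resolved in favour of the label that appears first in the vote list.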

The results support our proposal of linking reaction and NMR data for automatic reaction classification.


    D.A.R.S. Latino acknowledges Fundação para a Ciência e Tecnologia for financial support under a PhD grant (SFRH/BD/18347). The authors thank InfoChem GmbH (Munich, Germany) for sharing the dataset of photochemical reactions.


  1. Y. Binev, J. Aires-de-Sousa, "Structure-Based Predictions of 1H NMR Chemical Shifts Using Feed-Forward Neural Networks", J. Chem. Inf. Comp. Sci., 2004, 44(3), 940-945.
  2. Y. Binev, M. Corvo, J. Aires-de-Sousa, "The Impact of Available Experimental Data on the Prediction of 1H NMR Chemical Shifts by Neural Networks", J. Chem. Inf. Comp. Sci., 2004, 44(3), 946-949.
  3. SPINUS can be accessed at: or



P-82 : SOMA – Computational Molecular Discovery Environment

Pekka Lehtovuori; CSC - Scientific Computing Ltd, Espoo, FI
Tommi Nyrönen, CSC - Scientific Computing Ltd

We are developing a computational molecular discovery environment for the Finnish universities and research institutions. The goal is to build an integrated software environment and enhance the usability of the software in the supercomputing grid of CSC – the Finnish Information Technology Center for Science. CSC's services are based on a versatile supercomputing environment, fast data communications connections and on expertise in different scientific disciplines and information technology. CSC offers a large selection of computational tools (e.g. bioinformatics, chemoinformatics, databases, quantum chemistry, QSAR, molecular dynamics).

Transferring data from one program to another is a nerve-wracking step, simply because programs are seldom designed to work together. SOMA consists of scientific applications, linked together in a www-interface to form a user-friendly computing and data management environment.

The environment currently consists of (1) an Extensible Markup Language (XML) description of the scientific programs (program-XML); (2) template scripts for the configuration files of the programs and of a possible batch job system; (3) a program piper, which moves the results from one program to the input of the next one and reports the state of each step to the job status database; (4) tools for internal data format conversion; (5) tools which build up a www-interface based on the program-XML; and (6) tools which present the results in a web browser.

The key function of the SOMA environment is to improve the data flow between programs in a supercomputing environment. The users can search and build small molecules in silico and use structural, physicochemical and biological information on the target macromolecule with fewer technical obstacles than before.



P-84 : Characterization and Clustering of Reagents for Combinatorial Library Design from the Products’ Perspectives

Uta Lessel; Boehringer Ingelheim Pharma GmbH, Biberach, DE

Reagent-biased product based design was published by Pearlman and Smith in 1999. The method is available with Diverse Solutions and works well e.g. to select a set of reagents leading to a combinatorial library with products broadly distributed in a BCUT property space. If the BCUT property space is divided in cells the products of the designed library occupy an optimized number of cells. But a new design cycle has to be started, if one or more of the selected reagents are exchanged. Manual exchange of reagents by the combi chemists usually leads to a clear loss of cell occupation shown by the final library. For this purpose a new method for reagent selection was created based on the original idea of Pearlman.

Each reagent is characterized by a fingerprint encoding the cells which are occupied by its products. In the next step the reagents can be clustered according to their fingerprints. This way reagents group together that lead to similar products. Additionally, reagents can be prioritized e.g. according to the number of cells occupied by their products, by the number of products obeying Lipinski’s rules, etc.
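A minimal sketch of such a product-based reagent fingerprint is given below. It assumes a two-dimensional property space scaled to the range 0 to 1 and divided into a regular grid, with Tanimoto similarity between the sets of occupied cells; the real BCUT space, cell definition and clustering method are not specified in the abstract, and the reagent names and coordinates are invented.

```python
def cell_of(coords, bins_per_axis=4, low=0.0, high=1.0):
    """Map a point in property space to the index tuple of its grid cell."""
    step = (high - low) / bins_per_axis
    return tuple(min(int((c - low) / step), bins_per_axis - 1) for c in coords)

def reagent_fingerprints(products):
    """products: iterable of (reagent_ids, coords). Each reagent's
    fingerprint is the set of cells occupied by the products it takes part in."""
    fingerprints = {}
    for reagents, coords in products:
        cell = cell_of(coords)
        for reagent in reagents:
            fingerprints.setdefault(reagent, set()).add(cell)
    return fingerprints

def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Invented two-component library: amines A1-A3 combined with acids B1-B2.
products = [
    (("A1", "B1"), (0.10, 0.10)),
    (("A1", "B2"), (0.60, 0.10)),
    (("A2", "B1"), (0.15, 0.20)),
    (("A2", "B2"), (0.70, 0.20)),
    (("A3", "B1"), (0.90, 0.90)),
    (("A3", "B2"), (0.90, 0.60)),
]
fps = reagent_fingerprints(products)
print(tanimoto(fps["A1"], fps["A2"]), tanimoto(fps["A1"], fps["A3"]))
```

Here A1 and A2 lead to products in the same cells, so either could replace the other without changing the cell occupancy, whereas A3 covers different cells. The number of cells in each fingerprint could also serve as a simple priority measure.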

In the presentation the similarities of reagents based on the cell occupancy of their products will be discussed in more detail and it will be illustrated by some design examples how the method can be implemented in Combinatorial Library Design.



P-86 : Using Chemistry Fragment Data Mining and Neural Networks to Perform High Throughput Environmental Assessment of Chemicals: The AISEES Approach

Mark Lewis; Environment Canada, Gatineau, CA
Stephan P. Niculescu, Drew MacDonald, Andy Atkinson, Greg Hammond, and David Morin, Environment Canada

Each year, Environment Canada prepares environmental risk assessments on chemicals for regulatory decision-making purposes. It is very common that experimental environmental toxicological or environmental fate information is not available, and various commercial QSAR-based predictive software is used instead. Unfortunately, most of the models implemented in such software are subject to very serious limitations. To overcome such limitations and answer its immediate needs, Environment Canada initiated its own research, which has led to the development of its own Artificial Intelligence Substance Evaluation Expert System (AISEES). It combines various state-of-the-art technologies such as chemistry fragment data mining (to extract necessary information on atoms and fragments of interest from chemical structure) and probabilistic neural network QSAR/QSPRs (to generate predictions). The poster will present details on the AISEES system and three of the models already implemented in it: two QSAR models for the prediction of acute toxicity to fathead minnows, and one QSPR model targeting the prediction of aqueous solubility. To allow for scientific interpretation and appropriate decisions to be rationalized, all predictions are supported by confidence intervals and an in-depth similarity analysis based on the Tanimoto similarity coefficient. Based on the data available from the thousands of substances notified to Environment Canada’s New and Existing Substances branches over the last several years, work is underway to generate new models of this kind targeting other toxicity endpoints and physical/chemical or fate properties, and then to implement them as part of this system.



Last updated 17 April, 2005
