Combinatorial Chemistry & High Throughput Screening

ISSN: 1386-2073

Combinatorial Chemistry & High Throughput Screening
Volume 12, Number 5, June 2009


Contents


Machine Learning for Virtual Screening (Part 2)
Guest Editor: Ovidiu Ivanciuc


Editorial
Pp. 451-452


How Wrong Can We Get? A Review of Machine Learning Approaches and Error Bars
Pp. 453-468
Anton Schwaighofer, Timon Schroeter, Sebastian Mika and Gilles Blanchard
[PMID: 19519325 PubMed - indexed for MEDLINE]


Bayesian Modeling in Virtual High Throughput Screening Pp. 469-483
Anthony E. Klon
[PMID: 19519326 PubMed - indexed for MEDLINE]


Virtual High Throughput Screening Using Combined Random Forest and Flexible Docking Pp. 484-489
Dariusz Plewczynski, Marcin von Grotthuss, Leszek Rychlewski and Krzysztof Ginalski
[PMID: 19519327 PubMed - indexed for MEDLINE]


The Applications of Machine Learning Algorithms in the Modeling of Estrogen-Like Chemicals Pp. 490-496
Huanxiang Liu, Xiaojun Yao and Paola Gramatica
[PMID: 19519328 PubMed - indexed for MEDLINE]


Recent Developments of In Silico Predictions of Intestinal Absorption and Oral Bioavailability Pp. 497-506
Tingjun Hou, Youyong Li, Wei Zhang and Junmei Wang
[PMID: 19519329 PubMed - indexed for MEDLINE]


Feature Selection and Classification Employing Hybrid Ant Colony Optimization/Random Forest Methodology Pp. 507-513
Diwakar Patil, Rahul Raj, Prashant Shingade, Bhaskar Kulkarni and Valadi K. Jayaraman
[PMID: 19519330 PubMed - indexed for MEDLINE]


Controlling Feature Selection in Random Forests of Decision Trees Using a Genetic Algorithm: Classification of Class I MHC Peptides Pp. 514-519
Loren Hansen, Ernestine A. Lee, Kevin Hestir, Lewis T. Williams and David Farrelly
[PMID: 19519331 PubMed - indexed for MEDLINE]


Meet the Guest Editor Pp. 520


General Articles


Profiling Human Saliva Endogenous Peptidome via a High Throughput MALDI-TOF-TOF Mass Spectrometry
Pp. 521-531
Chun-Ming Huang and Wenhong Zhu
[PMID: 19519332 PubMed - indexed for MEDLINE]


High Throughput Heme Assay by Detection of Chemiluminescence of Reconstituted Horseradish Peroxidase Pp. 532-535
Shigekazu Takahashi and Tatsuru Masuda
[PMID: 19519333 PubMed - indexed for MEDLINE]


Multicomponent One-Pot Reactions: Synthesis of Some New 6-Oxopyrano[2,3-c]isochromenes by Condensation of Homophthalic Anhydride, Dialkyl Acetylenedicarboxylate, and Isocyanides Pp. 536-542
Ali A. Mohammadi, Roya Akbarzadeh and Hamed Rouhi
[PMID: 19519334 PubMed - indexed for MEDLINE]




Abstracts


Editorial: Machine Learning for Virtual Screening (Part 2)

Note from the CCHTS Editor

This is the second of a two-part series on Machine Learning for Virtual Screening. The Introduction that was included in the first part of this special issue, written by the guest editor Ovidiu Ivanciuc, is reprinted here.

Computer-assisted drug design is used to increase the chances of finding valuable drug candidates, by applying a wide range of computational methods, such as machine learning, structure-activity relationships, quantitative structure-activity relationships, molecular mechanics, quantum mechanics, molecular dynamics, and drug-protein docking. Machine learning is an important field of artificial intelligence, and includes a diversity of methods and algorithms that extract rules and functions from large datasets. The most important algorithms are linear discriminant analysis, artificial neural networks, decision trees, lazy learning, k-nearest neighbors, Bayesian methods, Gaussian processes, support vector machines, and kernel algorithms. This special issue presents a representative selection of machine learning applications for the virtual screening of chemical libraries.

Machine learning is a rich and dynamic field, with new methods proposed constantly, which makes it difficult to estimate the quality of predictions expected from a particular algorithm. Schwaighofer et al. explore the theoretical and practical aspects of estimating the confidence (error bars) of predictions obtained with quantitative structure-activity relationships based on three prevalent nonlinear regression methods, namely support vector regression, Gaussian processes, and decision trees. This practical aspect of estimating biological activities is currently overlooked in many structure-activity models, but the algorithms presented in this paper demonstrate an efficient approach to computing confidence levels for activity predictions.

Naïve Bayesian classifiers are robust and efficient algorithms for the rapid virtual screening of large compound libraries. Klon presents a substantial and comprehensive review of Bayesian classifiers that are currently used in drug design and discovery. Bayesian models have consistently been shown to be tolerant of noisy training data, often outperforming more elaborate machine learning algorithms, and may provide reliable predictions even when trained with limited amounts of experimental data. Alternatively, Bayesian classifiers have been used as an effective post-processing technique to integrate sets of predictions obtained with other machine learning methods.

Ligand-protein docking is an effective approach for selecting promising inhibitors, but its main drawback is the long computation time needed to screen large chemical libraries. Plewczynski et al. propose a hybrid method in which a fast machine learning algorithm, random forest, is coupled with ligand-protein docking to obtain a virtual screening procedure that is both fast and reliable in practical applications. The random forest model is trained with predictions obtained from ligand-protein docking and scoring, and thus the virtual screening procedure may be applied even when trained with only a limited amount of experimental data.

Endocrine disrupting chemicals are adversely affecting human and wildlife health through a variety of mechanisms, mainly estrogen receptor-mediated mechanisms of toxicity. Liu, Yao, and Gramatica present a broad overview of classification and regression applications of machine learning in modeling of estrogen-like chemicals. The comparative analysis of published models shows that linear models are fast and easy to apply, but nonlinear approaches, such as artificial neural networks and support vector machines, provide better predictions. Effective models to identify possible estrogens are valuable tools for government regulators and for the chemical industry.

Among the absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties, unfavorable oral bioavailability is a major cause of rejecting drug candidates. Hou et al. review the most important machine learning models for the prediction of passive intestinal absorption and oral bioavailability. The article also compares a traditional classification method (recursive partitioning) with a more recent addition to the machine learning algorithms (support vector machines), showing that support vector machines give better predictions. The influence of other training parameters, such as dataset size, is also investigated.

Ant colony optimization is a metaheuristic algorithm proposed by Marco Dorigo in 1992 and inspired by the behavior of ants seeking a path between their colony and a source of food. Kulkarni, Jayaraman and co-workers propose an original classification algorithm that combines ant colony optimization with random forest, thus exploring the search space to select a feature subset with high prediction ability. The novel method was successfully tested in predicting the binding affinity of peptides for major histocompatibility complex (MHC) class I molecules.

Structural descriptor selection, or feature selection, is an important component in developing a predictive machine learning model for virtual screening. Hansen, Farrelly and co-workers describe the GenForest procedure, which performs feature selection in random forests of decision trees with a genetic algorithm. A genetic algorithm is a global optimization technique inspired by the biological processes of mutation, selection, crossover, and inheritance. GenForest was evaluated on several problems proposed for the Comparative Evaluation of Prediction Algorithms (CoEPrA, http://www.coepra.org) competition.


Ovidiu Ivanciuc

(Guest Editor)
Department of Biochemistry and Molecular Biology
University of Texas Medical Branch
301 University Boulevard
Galveston
TX 77555-0857
USA
E-mail: ivanciuc@gmail.com


[PMID: 19519325 PubMed - indexed for MEDLINE]
How Wrong Can We Get? A Review of Machine Learning Approaches and Error Bars
Anton Schwaighofer, Timon Schroeter, Sebastian Mika and Gilles Blanchard

A large number of different machine learning methods can potentially be used for ligand-based virtual screening. In our contribution, we focus on three specific nonlinear methods, namely support vector regression, Gaussian process models, and decision trees. For each of these methods, we provide a short and intuitive introduction. In particular, we will also discuss how confidence estimates (error bars) can be obtained from these methods. We continue with important aspects for model building and evaluation, such as methodologies for model selection, evaluation, performance criteria, and how the quality of error bar estimates can be verified. Besides an introduction to the respective methods, we will also point to available implementations, and discuss important issues for the practical application.


[PMID: 19519326 PubMed - indexed for MEDLINE]
Bayesian Modeling in Virtual High Throughput Screening
Anthony E. Klon

Naïve Bayesian classifiers are a relatively recent addition to the arsenal of tools available to computational chemists. These classifiers fall into a class of algorithms referred to broadly as machine learning algorithms. Bayesian classifiers may be used in conjunction with classical modeling techniques to assist in the rapid virtual screening of large compound libraries in a systematic manner with a minimum of human intervention. This approach allows computational scientists to concentrate their efforts on their core strengths of model building. Bayesian classifiers have an added advantage of being able to handle a variety of numerical or binary data such as physicochemical properties or molecular fingerprints, making the addition of new parameters to existing models a relatively straightforward process. As a result, during a drug discovery project these classifiers can better evolve with the needs of the projects from general models in the lead finding stages to increasingly precise models in the lead optimization stages that are of particular interest to a specific medicinal chemistry team. Although other machine learning algorithms abound, Bayesian classifiers have been shown to compare favorably under most working conditions and have been shown to be tolerant of noisy experimental data.
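
As an illustration of the kind of classifier described above, the following is a minimal sketch of a Bernoulli naïve Bayes model trained on binary fingerprints. It is not taken from the article: the function names, the Laplace smoothing parameter `alpha`, and the toy 4-bit fingerprints are all assumptions made for the example.

```python
import math

def train_bernoulli_nb(fingerprints, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary fingerprints (hypothetical helper)."""
    n_bits = len(fingerprints[0])
    model = {}
    for cls in set(labels):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        # Laplace-smoothed probability that each bit is set within this class
        p_on = [(sum(fp[i] for fp in rows) + alpha) / (len(rows) + 2 * alpha)
                for i in range(n_bits)]
        # Store the log class prior together with the per-bit probabilities
        model[cls] = (math.log(len(rows) / len(labels)), p_on)
    return model

def predict(model, fp):
    """Return the class with the highest log posterior score."""
    def score(cls):
        log_prior, p_on = model[cls]
        return log_prior + sum(
            math.log(p if bit else 1.0 - p) for bit, p in zip(fp, p_on))
    return max(model, key=score)

# Hypothetical toy data: 4-bit fingerprints, label 1 = active, 0 = inactive
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
y = [1, 1, 0, 0]
model = train_bernoulli_nb(X, y)
print(predict(model, [1, 1, 0, 0]), predict(model, [0, 0, 1, 1]))
```

Because each bit contributes an independent term to the score, adding a new descriptor only adds one more term per class, which is one way to read the abstract's remark that extending existing models is relatively straightforward.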


[PMID: 19519327 PubMed - indexed for MEDLINE]
Virtual High Throughput Screening Using Combined Random Forest and Flexible Docking
Dariusz Plewczynski, Marcin von Grotthuss, Leszek Rychlewski and Krzysztof Ginalski

We present here the random forest supervised machine learning algorithm applied to flexible docking results from five typical virtual high throughput screening (HTS) studies. Our approach is aimed at: i) reducing the number of compounds to be tested experimentally against the given protein target and ii) extending results of flexible docking experiments performed only on a subset of a chemical library in order to select promising inhibitors from the whole dataset. The random forest (RF) method is applied and tested here on compounds from the MDL drug data report (MDDR). The recall values for the five selected diverse protein targets are over 90% and the performance reaches 100%. This machine learning method combined with flexible docking is capable of finding 60% of the active compounds for most protein targets by docking only 10% of the screened ligands. Therefore, our in silico approach is able to scan very large databases rapidly in order to predict the biological activity of small molecule inhibitors and provides an effective alternative to more computationally demanding methods in virtual HTS.
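
One way to read the 60%/10% figures quoted above is as an enrichment factor, i.e. the fraction of actives recovered divided by the fraction of the library that had to be docked. The sketch below assumes this interpretation; the function name is hypothetical and the article may define its performance measures differently.

```python
def enrichment_factor(fraction_actives_found, fraction_screened):
    """Ratio of actives recovered to the fraction of the library examined.

    A value of 1 corresponds to what random selection would be expected to give.
    """
    if not 0 < fraction_screened <= 1:
        raise ValueError("fraction_screened must be in (0, 1]")
    if not 0 <= fraction_actives_found <= 1:
        raise ValueError("fraction_actives_found must be in [0, 1]")
    return fraction_actives_found / fraction_screened

# Figures from the abstract: 60% of actives found by docking 10% of ligands
ef = enrichment_factor(0.60, 0.10)
print(round(ef, 1))
```

Under this reading, the reported figures correspond to roughly a sixfold improvement over picking 10% of the library at random.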


[PMID: 19519328 PubMed - indexed for MEDLINE]
The Applications of Machine Learning Algorithms in the Modeling of Estrogen-Like Chemicals
Huanxiang Liu, Xiaojun Yao and Paola Gramatica

Increasing concern is being shown by the scientific community, government regulators, and the public about endocrine-disrupting chemicals that, in the environment, are adversely affecting human and wildlife health through a variety of mechanisms, mainly estrogen receptor-mediated mechanisms of toxicity. Because of the large number of such chemicals in the environment, there is a great need for an effective means of rapidly assessing endocrine-disrupting activity in the toxicology assessment process. When faced with the challenging task of screening large libraries of molecules for biological activity, the benefits of computational predictive models based on quantitative structure-activity relationships to identify possible estrogens become immediately obvious. Recently, in order to improve the accuracy of prediction, some machine learning techniques were introduced to build more effective predictive models. In this review we will focus our attention on some recent advances in the use of these methods in modeling estrogen-like chemicals. The advantages and disadvantages of the machine learning algorithms used in solving this problem, the importance of the validation and performance assessment of the built models as well as their applicability domains will be discussed.


[PMID: 19519329 PubMed - indexed for MEDLINE]
Recent Developments of In Silico Predictions of Intestinal Absorption and Oral Bioavailability
Tingjun Hou, Youyong Li, Wei Zhang and Junmei Wang

Among the absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties, unfavorable oral bioavailability is an important reason for stopping further development of drug candidates. Thus, predictions of oral bioavailability and bioavailability-related properties, especially intestinal absorption, are areas in need of progress to aid pharmaceutical drug development. In this article, we review recent developments in the prediction of passive intestinal absorption and oral bioavailability. The advances in the datasets used for model building, the molecular descriptors, the prediction models, and the statistical modeling techniques are summarized. Furthermore, we compare the performance of one machine learning method, support vector machines (SVM), and one traditional classification method, recursive partitioning (RP), on the prediction of passive absorption. Our comparisons demonstrate that the more complex machine learning method can give better predictions than the traditional approach. Finally, we discuss the current challenges that remain to be addressed.


[PMID: 19519330 PubMed - indexed for MEDLINE]
Feature Selection and Classification Employing Hybrid Ant Colony Optimization/Random Forest Methodology
Diwakar Patil, Rahul Raj, Prashant Shingade, Bhaskar Kulkarni and Valadi K. Jayaraman

Accurate classification of instances depends on the identification and removal of redundant features. Classification of high-dimensional data is usually performed in conjunction with an appropriate feature selection method. Feature selection identifies, within a vast search space, the most informative feature subset that can accurately classify the given data. We propose an ant colony optimization (ACO)/random forest based hybrid filter-wrapper search technique, which traverses the search space and selects a feature subset with high classifying ability. We evaluate the performance of our algorithm on four widely studied CoEPrA (Comparative Evaluation of Prediction Algorithms, http://coepra.org) datasets. The performance of the software-ant-mediated hybrid filter/wrapper approach compares well with the available competition results. Thus, the proposed ant colony optimization based technique can effectively find small feature subsets capable of classifying with very good accuracy and can be employed for feature subset selection with a high level of confidence.


[PMID: 19519331 PubMed - indexed for MEDLINE]
Controlling Feature Selection in Random Forests of Decision Trees Using a Genetic Algorithm: Classification of Class I MHC Peptides
Loren Hansen, Ernestine A. Lee, Kevin Hestir, Lewis T. Williams and David Farrelly

Feature selection is an important challenge in many classification problems, especially if the number of features greatly exceeds the number of examples available. We have developed a procedure, GenForest, which controls feature selection in random forests of decision trees by using a genetic algorithm. This approach was tested through our entry into the Comparative Evaluation of Prediction Algorithms 2006 (CoEPrA) competition (accessible online at: http://www.coepra.org). CoEPrA was a modeling competition organized to provide objective testing of various classification and regression algorithms via the process of blind prediction. In the competition, GenForest ranked 10/23, 5/16 and 9/16 on CoEPrA classification problems 1, 3 and 4, respectively, which involved the classification of type I MHC nonapeptides, i.e., peptides containing nine amino acids. These problems each involved the classification of different sets of nonapeptides. Associated with each amino acid was a set of 643 features, for a total of 5787 features per peptide. The method, its application to the CoEPrA datasets, and its performance in the competition are described.
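
The feature count quoted above follows from 9 residues times 643 descriptors per residue. The sketch below shows one plausible way such a vector could be assembled, by concatenating per-residue descriptors; the descriptor table is a zero-filled placeholder and `encode_peptide` is a hypothetical helper, not the encoding actually distributed with the CoEPrA datasets.

```python
N_POSITIONS = 9             # nonapeptide: nine amino acids
FEATURES_PER_RESIDUE = 643  # descriptors per amino acid, as stated in the abstract

def encode_peptide(peptide, descriptor_table):
    """Concatenate the per-residue descriptors into one flat feature vector."""
    if len(peptide) != N_POSITIONS:
        raise ValueError("expected a nonapeptide")
    vector = []
    for residue in peptide:
        vector.extend(descriptor_table[residue])
    return vector

# Placeholder descriptors (all zeros); real values would come from the dataset
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
table = {aa: [0.0] * FEATURES_PER_RESIDUE for aa in AMINO_ACIDS}

vec = encode_peptide("ACDEFGHIK", table)
print(len(vec))
```

With several thousand features per peptide, the number of features far exceeds the number of examples in a typical training set, which is the situation the abstract identifies as motivating feature selection.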


[PMID: 19519332 PubMed - indexed for MEDLINE]
Profiling Human Saliva Endogenous Peptidome via a High Throughput MALDI-TOF-TOF Mass Spectrometry
Chun-Ming Huang and Wenhong Zhu

Establishment of a saliva protein/peptide signature will provide important information for clinical diagnostics and prognosis of human disease. We digested human whole saliva with trypsin to create a tryptic digest salivary peptidome. Proteins/peptides were subsequently identified by high throughput tandem mass spectrometry in conjunction with database searching. Sixty-three saliva peptides corresponding to twenty-two saliva proteins were identified. Thirty of the sixty-three saliva peptides, which had non-specific tryptic cleavage sites, were derived from proline-rich proteins, mucin 7, statherin and collagen. Several peptides derived from proline-rich proteins exhibit proline (Pro) - glutamine (Gln) C-termini (-PQ C-termini). Seven peptides with -PQ C-termini were identified in undigested whole saliva, suggesting that peptides with -PQ C-termini occur endogenously in human saliva. Peptides with -PQ C-termini are known to bind oral bacteria and exhibit properties characteristic of innate-immunity peptides. Thus, a saliva peptidome containing peptides with -PQ C-termini, as presented here, may support the development of innate-immunity-related disease monitoring using non-invasive saliva samples and mass spectrometry-based techniques.


[PMID: 19519333 PubMed - indexed for MEDLINE]
High Throughput Heme Assay by Detection of Chemiluminescence of Reconstituted Horseradish Peroxidase
Shigekazu Takahashi and Tatsuru Masuda

In living organisms, heme is an essential molecule for various biological functions. Recent studies also suggest that heme functions as an organelle-derived signal that regulates fundamental cell processes. Furthermore, estimation of heme is widely used for studying various blood disorders. In this regard, a rapid, sensitive, and high throughput heme assay has been sought. The most frequently used method of measuring heme, by pyridine hemochrome, is time-, labor-, and material-intensive, which limits its utility for large-scale, high throughput analysis. Recently, we reported an alternative method that is sensitive and specific to heme, based on the ability of horseradish peroxidase (HRP) apo-enzyme to reconstitute with heme to form an active holo-enzyme. Here, we developed a high throughput heme assay by performing reactions on multi-well plates with highly sensitive chemiluminescence detection reagents. Detection of chemiluminescence in a charge-coupled device (CCD)-based gel doc apparatus enables simultaneous measurement of multiple samples. Furthermore, the high sensitivity of this assay allowed direct measurement of heme in solvent extracts after dilution. This assay is sensitive and quick, provides a large dynamic range, and is well suited for large-scale analysis of heme extracted from minute amounts of sample.


[PMID: 19519334 PubMed - indexed for MEDLINE]
Multicomponent One-Pot Reactions: Synthesis of Some New 6-Oxopyrano[2,3-c]isochromenes by Condensation of Homophthalic Anhydride, Dialkyl Acetylenedicarboxylate, and Isocyanides
Ali A. Mohammadi, Roya Akbarzadeh and Hamed Rouhi

A novel three-component, one-pot condensation of the zwitterion generated from dialkyl acetylenedicarboxylate and isocyanides with homophthalic anhydride is described. The reaction affords new 6-oxopyrano[2,3-c]isochromenes in good yield. Isochromenes have been reported to possess diverse biological activities such as antibacterial, antifungal, anti-inflammatory, and antiangiogenic effects. Moreover, these important compounds are found in various natural products.




Copyright © Bentham Science Publishers Ltd