Combinatorial Chemistry & High Throughput Screening
ISSN: 1386-2073

Volume 12, Number 5, June 2009
Contents
Machine Learning for Virtual Screening (Part 2)
Guest Editor: Ovidiu Ivanciuc
Editorial Pp. 451-452
How Wrong Can We Get? A Review of Machine Learning Approaches
and Error Bars Pp. 453-468
Anton Schwaighofer, Timon Schroeter, Sebastian
Mika and Gilles Blanchard
[PMID: 19519325 PubMed - indexed for MEDLINE]
Bayesian Modeling in Virtual High Throughput
Screening Pp. 469-483
Anthony E. Klon
[PMID: 19519326 PubMed - indexed for MEDLINE]
Virtual High Throughput Screening Using
Combined Random Forest and Flexible Docking Pp. 484-489
Dariusz Plewczynski, Marcin von Grotthuss, Leszek
Rychlewski and Krzysztof Ginalski
[PMID: 19519327 PubMed - indexed for MEDLINE]
The Applications of Machine Learning
Algorithms in the Modeling of Estrogen-Like Chemicals Pp.
490-496
Huanxiang Liu, Xiaojun Yao and Paola
Gramatica
[PMID: 19519328 PubMed - indexed for MEDLINE]
Recent Developments of In Silico
Predictions of Intestinal Absorption and Oral Bioavailability
Pp. 497-506
Tingjun Hou, Youyong Li, Wei Zhang and
Junmei Wang
[PMID: 19519329 PubMed - indexed for MEDLINE]
Feature Selection and Classification
Employing Hybrid Ant Colony Optimization/Random Forest Methodology
Pp. 507-513
Diwakar Patil, Rahul Raj, Prashant Shingade,
Bhaskar Kulkarni and Valadi K. Jayaraman
[PMID: 19519330 PubMed - indexed for MEDLINE]
Controlling Feature Selection in Random
Forests of Decision Trees Using a Genetic Algorithm: Classification
of Class I MHC Peptides Pp. 514-519
Loren Hansen, Ernestine A. Lee, Kevin Hestir,
Lewis T. Williams and David Farrelly
[PMID: 19519331 PubMed - indexed for MEDLINE]
Meet the Guest Editor Pp. 520
General Articles
Profiling Human Saliva Endogenous Peptidome via a
High Throughput MALDI-TOF-TOF Mass Spectrometry Pp.
521-531
Chun-Ming Huang and Wenhong Zhu
[PMID: 19519332 PubMed - indexed for MEDLINE]
High Throughput Heme Assay by Detection of Chemiluminescence
of Reconstituted Horseradish Peroxidase Pp. 532-535
Shigekazu Takahashi and Tatsuru Masuda
[PMID: 19519333 PubMed - indexed for MEDLINE]
Multicomponent One-Pot Reactions: Synthesis of Some
New 6-Oxopyrano[2,3-c]Isochromenes by Condensation
of Homophthalic Anhydride, Dialkyl Acetylenedicarboxylate,
and Isocyanides Pp. 536-542
Ali A. Mohammadi, Roya Akbarzadeh and
Hamed Rouhi
[PMID: 19519334 PubMed - indexed for MEDLINE]
Abstracts
Editorial: Machine Learning for Virtual Screening
(Part 2)
Note from the CCHTS Editor
This is the second of a two-part series on Machine Learning
for Virtual Screening. The Introduction that was included
in the first part of this special issue, written by the guest
editor Ovidiu Ivanciuc, is reprinted here.
Computer-assisted drug design is used to increase the chances
of finding valuable drug candidates, by applying a wide range
of computational methods, such as machine learning, structure-activity
relationships, quantitative structure-activity relationships,
molecular mechanics, quantum mechanics, molecular dynamics,
and drug-protein docking. Machine learning is an important
field of artificial intelligence, and includes a diversity
of methods and algorithms that extract rules and functions
from large datasets. The most important algorithms are linear
discriminant analysis, artificial neural networks, decision
trees, lazy learning, k-nearest neighbors, Bayesian
methods, Gaussian processes, support vector machines, and
kernel algorithms. This special issue presents a representative
selection of machine learning applications for the virtual
screening of chemical libraries.
Machine learning is a rich and dynamic field, with new methods
proposed constantly, which makes it difficult to estimate the
quality of predictions expected from a particular algorithm.
Schwaighofer et al. explore the theoretical and practical
aspects of estimating the confidence (error bars) of predictions
obtained with quantitative structure-activity relationships
based on three prevalent nonlinear regression methods, namely
support vector regression, Gaussian processes, and decision
trees. This practical aspect of estimating biological activities
is currently overlooked in many structure-activity models,
but the algorithms presented in this paper demonstrate an
efficient approach to computing confidence levels for activity
predictions.
Naïve Bayesian classifiers are robust and efficient algorithms
for the rapid virtual screening of large compound libraries.
Klon presents a substantial and comprehensive review of Bayesian
classifiers that are currently used in drug design and discovery.
Bayesian models have consistently been shown to be tolerant
of noisy training data, often outperforming more elaborate
machine learning algorithms, and may provide reliable predictions
even when trained with limited amounts of experimental data.
Alternatively, Bayesian classifiers have been used as an effective
post-processing technique to integrate sets of predictions
obtained with other machine learning methods.
Ligand-protein docking is an effective approach to selecting
promising inhibitors, but its main drawback is the large computation
time necessary to screen large chemical libraries. Plewczynski
et al. propose a hybrid method in which a fast machine
learning algorithm, random forest, is coupled with ligand-protein
docking to obtain a virtual screening procedure that demonstrates
in practical applications both speed and reliable predictions.
The random forest model is trained with predictions
obtained from ligand-protein docking and scoring, and thus
the virtual screening procedure may be applied even when
only a limited amount of experimental data is available.
Endocrine disrupting chemicals are adversely affecting human
and wildlife health through a variety of mechanisms, mainly
estrogen receptor-mediated mechanisms of toxicity. Liu, Yao,
and Gramatica present a broad overview of classification and
regression applications of machine learning in modeling of
estrogen-like chemicals. The comparative analysis of published
models shows that linear models are fast and easy to apply,
but nonlinear approaches, such as artificial neural networks
and support vector machines, provide better predictions. Effective
models to identify possible estrogens are valuable tools for
government regulators and for the chemical industry.
Among the absorption, distribution, metabolism, elimination,
and toxicity properties (ADMET), unfavorable oral bioavailability
is a major reason for rejecting drug candidates. Hou et
al. review the most important machine learning models
for the prediction of passive intestinal absorption and oral
bioavailability. The article also compares a traditional classification
method (recursive partitioning) with a more recent addition
to the machine learning algorithms (support vector machines),
showing that support vector machines give better predictions.
The influence of other training parameters, such as dataset
size, is also investigated.
Ant colony optimization is a metaheuristic algorithm proposed
by Marco Dorigo in 1992, and inspired by the behavior of ants
seeking a path between their colony and a source of food.
Kulkarni, Jayaraman and co-workers propose an original classification
algorithm that combines ant colony optimization with random
forest, thus exploring the search space to select a feature
subset with high prediction ability. The novel method was
tested successfully for predicting peptide binding affinity
for major histocompatibility complex (MHC) class I molecules.
Structural descriptor selection or feature selection is an
important component in developing a predictive machine learning
model for virtual screening. Hansen, Farrelly and co-workers
describe the GenForest procedure that performs feature selection
in random forests of decision trees with a genetic algorithm.
A genetic algorithm is a global optimization technique inspired
by the biological processes of mutation, selection, crossover,
and inheritance. GenForest was evaluated for several problems
proposed for the Comparative Evaluation of Prediction Algorithms
(CoEPrA, http://www.coepra.org) competition.
Ovidiu Ivanciuc
(Guest Editor)
Department of Biochemistry and Molecular Biology
University of Texas Medical Branch
301 University Boulevard
Galveston
TX 77555-0857
USA
E-mail: ivanciuc@gmail.com
[PMID: 19519325 PubMed - indexed for MEDLINE]
How Wrong Can We Get? A Review of Machine Learning Approaches
and Error Bars
Anton Schwaighofer, Timon Schroeter, Sebastian
Mika and Gilles Blanchard
A large number of different machine learning methods
can potentially be used for ligand-based virtual screening.
In our contribution, we focus on three specific nonlinear
methods, namely support vector regression, Gaussian process
models, and decision trees. For each of these methods, we
provide a short and intuitive introduction. In particular,
we will also discuss how confidence estimates (error bars)
can be obtained from these methods. We continue with important
aspects for model building and evaluation, such as methodologies
for model selection, evaluation, performance criteria, and
how the quality of error bar estimates can be verified. Besides
an introduction to the respective methods, we will also point
to available implementations, and discuss important issues
for the practical application.
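One simple way to check the quality of error bar estimates can be
sketched as follows: on held-out data, count how often the measured
value falls inside the predicted interval and compare that fraction
with what a Gaussian error model would imply. This is a minimal
sketch and not the authors' procedure; the function names and the
toy numbers are assumptions made for illustration.

```python
from math import erf, sqrt


def expected_coverage(z):
    """Fraction of a Gaussian distribution lying within +/- z standard deviations."""
    return erf(z / sqrt(2.0))


def empirical_coverage(y_true, y_pred, sigma, z=1.0):
    """Fraction of measured values inside the interval y_pred +/- z * sigma."""
    if not (len(y_true) == len(y_pred) == len(sigma)):
        raise ValueError("all inputs must have the same length")
    hits = sum(1 for t, p, s in zip(y_true, y_pred, sigma) if abs(t - p) <= z * s)
    return hits / len(y_true)


# Hypothetical held-out data: measured values, predictions, predicted std. dev.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.5, 2.9, 4.0]
sigma = [0.2, 0.2, 0.2, 0.2]

observed = empirical_coverage(y_true, y_pred, sigma, z=1.0)
nominal = expected_coverage(1.0)
print(f"observed {observed:.2f} vs nominal {nominal:.2f}")
```

An observed coverage far below the nominal value would suggest error
bars that are too narrow, and one far above it would suggest error
bars that are too wide. A realistic check needs many more than four
held-out compounds.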
[PMID: 19519326 PubMed - indexed for MEDLINE]
Bayesian Modeling in Virtual High Throughput Screening
Anthony E. Klon
Naïve Bayesian classifiers are a relatively recent
addition to the arsenal of tools available to computational
chemists. These classifiers fall into a class of algorithms
referred to broadly as machine learning algorithms. Bayesian
classifiers may be used in conjunction with classical modeling
techniques to assist in the rapid virtual screening of large
compound libraries in a systematic manner with a minimum of
human intervention. This approach allows computational scientists
to concentrate their efforts on their core strengths of model
building. Bayesian classifiers have an added advantage of
being able to handle a variety of numerical or binary data
such as physicochemical properties or molecular fingerprints,
making the addition of new parameters to existing models a
relatively straightforward process. As a result, during a
drug discovery project these classifiers can better evolve
with the needs of the projects from general models in the
lead finding stages to increasingly precise models in the
lead optimization stages that are of particular interest to
a specific medicinal chemistry team. Although other machine
learning algorithms abound, Bayesian classifiers have been
shown to compare favorably under most working conditions and
have been shown to be tolerant of noisy experimental data.
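To illustrate why adding new descriptors to such a model is
straightforward, here is a minimal sketch of a Bernoulli naïve
Bayesian classifier over binary fingerprints. Under the independence
assumption each bit contributes its own additive log-probability
term, so a new bit only adds one more term to the sum. The function
names, the Laplace smoothing constant, and the toy fingerprints are
assumptions for illustration and are not taken from the review.

```python
from math import log


def train_bernoulli_nb(fingerprints, labels, alpha=1.0):
    """Estimate a log prior and per-bit 'on' probabilities for each class."""
    n_bits = len(fingerprints[0])
    model = {}
    for cls in sorted(set(labels)):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        # Laplace-smoothed probability that each bit is set within this class
        p_on = [
            (sum(fp[j] for fp in rows) + alpha) / (len(rows) + 2.0 * alpha)
            for j in range(n_bits)
        ]
        model[cls] = (log(len(rows) / len(labels)), p_on)
    return model


def log_score(model, fp, cls):
    """Log prior plus one independent log-likelihood term per fingerprint bit."""
    log_prior, p_on = model[cls]
    return log_prior + sum(
        log(p) if bit else log(1.0 - p) for bit, p in zip(fp, p_on)
    )


def predict(model, fp):
    """Return the class with the highest log score."""
    return max(model, key=lambda cls: log_score(model, fp, cls))


# Hypothetical 4-bit fingerprints; label 1 = active, 0 = inactive
train_fps = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
             [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
train_labels = [1, 1, 1, 0, 0, 0]

nb_model = train_bernoulli_nb(train_fps, train_labels)
print(predict(nb_model, [1, 1, 0, 0]), predict(nb_model, [0, 0, 1, 1]))
```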
[PMID: 19519327 PubMed - indexed for MEDLINE]
Virtual High Throughput Screening Using Combined Random Forest
and Flexible Docking
Dariusz Plewczynski, Marcin von Grotthuss,
Leszek Rychlewski and Krzysztof Ginalski
We present here the random forest supervised machine
learning algorithm applied to flexible docking results from
five typical virtual high throughput screening (HTS) studies.
Our approach is aimed at: i) reducing the number of compounds
to be tested experimentally against the given protein target
and ii) extending results of flexible docking experiments
performed only on a subset of a chemical library in order
to select promising inhibitors from the whole dataset. The
random forest (RF) method is applied and tested here on compounds
from the MDL drug data report (MDDR). The recall values for
five selected diverse protein targets are over 90% and the
performance reaches 100%. This machine learning method combined
with flexible docking is capable of finding 60% of the active
compounds for most protein targets by docking only 10% of
screened ligands. Therefore, our in silico approach
is able to scan very large databases rapidly in order to predict
biological activity of small molecule inhibitors and provides
an effective alternative to more computationally demanding
methods in virtual HTS.
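A statement such as "60% of the active compounds found in 10% of the
screened ligands" corresponds to a recall measured on the top-ranked
fraction of a library. The sketch below shows one way such a number
could be computed from model scores. It is an illustration only: the
function name, the rounding rule for the selection size, and the toy
library are assumptions and do not reproduce the authors' evaluation.

```python
def recall_at_fraction(scores, is_active, fraction):
    """Share of all actives found among the top-scoring fraction of compounds."""
    n_total_active = sum(is_active)
    if n_total_active == 0:
        raise ValueError("no active compounds in the library")
    # Number of compounds selected: nearest integer, but at least one
    n_selected = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    found = sum(active for _, active in ranked[:n_selected])
    return found / n_total_active


# Hypothetical library of ten compounds with model scores and known activity
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_active = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

print(recall_at_fraction(scores, is_active, 0.3))
```

In this toy library the top 30% of the ranking contains two of the
three actives.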
[PMID: 19519328 PubMed - indexed for MEDLINE]
The Applications of Machine Learning Algorithms in the Modeling
of Estrogen-Like Chemicals
Huanxiang Liu, Xiaojun Yao and
Paola Gramatica
Increasing concern is being shown by the scientific community,
government regulators, and the public about endocrine-disrupting
chemicals that, in the environment, are adversely affecting
human and wildlife health through a variety of mechanisms,
mainly estrogen receptor-mediated mechanisms of toxicity.
Because of the large number of such chemicals in the environment,
there is a great need for an effective means of rapidly assessing
endocrine-disrupting activity in the toxicology assessment
process. When faced with the challenging task of screening
large libraries of molecules for biological activity, the
benefits of computational predictive models based on quantitative
structure-activity relationships to identify possible estrogens
become immediately obvious. Recently, in order to improve
the accuracy of prediction, some machine learning techniques
were introduced to build more effective predictive models.
In this review we will focus our attention on some recent
advances in the use of these methods in modeling estrogen-like
chemicals. The advantages and disadvantages of the machine
learning algorithms used in solving this problem, the importance
of the validation and performance assessment of the built
models as well as their applicability domains will be discussed.
[PMID: 19519329 PubMed - indexed for MEDLINE]
Recent Developments of In Silico Predictions of Intestinal
Absorption and Oral Bioavailability
Tingjun Hou, Youyong Li, Wei Zhang and
Junmei Wang
Among the absorption, distribution, metabolism, elimination,
and toxicity properties (ADMET), unfavorable oral bioavailability
is indeed an important reason for stopping further development
of the drug candidates. Thus, predictions of oral bioavailability
and bioavailability-related properties, especially intestinal
absorption, are areas in need of progress to aid pharmaceutical
drug development. In this article, we review recent developments
in the prediction of passive intestinal absorption and oral
bioavailability. The advances in the datasets used for model
building, the molecular descriptors, the prediction models,
and the statistical modeling techniques, are summarized. Furthermore,
we compared the performance of one machine learning method,
support vector machines (SVM), and one traditional classification
method, recursive partitioning (RP), on the predictions of
passive absorption. Our comparisons demonstrate that the complex
machine learning method could give better predictions than
the traditional approach. Finally, we discuss the current challenges
that remain to be addressed.
[PMID: 19519330 PubMed - indexed for MEDLINE]
Feature Selection and Classification Employing Hybrid Ant
Colony Optimization/Random Forest Methodology
Diwakar Patil, Rahul Raj, Prashant Shingade,
Bhaskar Kulkarni and Valadi K. Jayaraman
Accurate classification of instances depends on identification
and removal of redundant features. Classification of data
having high dimensionality is usually performed in conjunction
with an appropriate feature selection method. Feature selection
enables identification of the most informative feature subset
from the enormously vast search space that can accurately
classify the given data. We propose an ant colony optimization
(ACO)/random forest based hybrid filter-wrapper search technique,
which traverses the search space and selects a feature subset
with high classifying ability. We evaluate the performance
of our algorithm on four widely studied CoEPrA (Comparative
Evaluation of Prediction Algorithms, http://coepra.org) datasets.
The performance of the software ants mediated hybrid filter/wrapper
approach compares well with the available competition results.
Thus, the proposed Ant Colony Optimization based technique
can effectively find small feature subsets capable of classifying
with a very good accuracy and can be employed for feature
subset selection with a high level of confidence.
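As a rough illustration of how software ants can search for a feature
subset, the sketch below keeps one pheromone value per feature, lets
each ant sample a subset with probability proportional to pheromone,
and reinforces the features of the best subset found so far. The
scoring function is a hypothetical stand-in for a random forest
evaluation, and every name and parameter value here is an assumption.
It does not reproduce the hybrid filter-wrapper scheme of the article.

```python
import random


def aco_select(n_features, subset_size, score_fn, n_ants=10, n_iter=20,
               evaporation=0.1, seed=0):
    """Search for a high-scoring feature subset with a simple ant colony loop."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_score = None, float("-inf")
    for _iteration in range(n_iter):
        for _ant in range(n_ants):
            remaining = list(range(n_features))
            subset = []
            # Each ant picks features one at a time, weighted by pheromone
            for _step in range(subset_size):
                weights = [pheromone[j] for j in remaining]
                pick = rng.choices(remaining, weights=weights, k=1)[0]
                remaining.remove(pick)
                subset.append(pick)
            score = score_fn(subset)
            if score > best_score:
                best_subset, best_score = sorted(subset), score
        # Evaporate all trails, then reinforce the best subset found so far
        pheromone = [(1.0 - evaporation) * p for p in pheromone]
        for j in best_subset:
            pheromone[j] += max(best_score, 0.0)
    return best_subset, best_score


def toy_score(subset):
    """Hypothetical score: number of informative features (0 and 2) selected."""
    return len(set(subset) & {0, 2})


subset, score = aco_select(n_features=6, subset_size=2, score_fn=toy_score)
print(subset, score)
```

In practice the score would come from a classifier evaluated on
held-out data, which is far more expensive than this toy function.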
[PMID: 19519331 PubMed - indexed for MEDLINE]
Controlling Feature Selection in Random Forests of Decision
Trees Using a Genetic Algorithm: Classification of Class I
MHC Peptides
Loren Hansen, Ernestine A. Lee, Kevin Hestir,
Lewis T. Williams and David Farrelly
Feature selection is an important challenge in many classification
problems, especially if the number of features greatly exceeds
the number of examples available. We have developed a procedure
- GenForest - which controls feature selection in
random forests of decision trees by using a genetic algorithm.
This approach was tested through our entry into the Comparative
Evaluation of Prediction Algorithms 2006 (CoEPrA) competition
(accessible online at: http://www.coepra.org). CoEPrA was
a modeling competition organized to provide objective testing
of various classification and regression algorithms via
the process of blind prediction. In the competition GenForest
ranked 10/23, 5/16 and 9/16 on CoEPrA classification problems
1, 3 and 4, respectively, which involved the classification
of type I MHC nonapeptides, i.e., peptides containing nine amino
acids. These problems each involved the classification of
different sets of nonapeptides. Associated with each amino
acid was a set of 643 features for a total of 5787 features
per peptide. The method, its application to the CoEPrA datasets,
and its performance in the competition are described.
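The general idea of steering feature selection with a genetic
algorithm can be sketched as follows: each individual is a boolean
mask over the features (9 x 643 = 5787 entries in the setting
described above), and selection, crossover, and mutation evolve the
masks toward higher fitness. This is not the GenForest procedure.
The fitness function is a hypothetical stand-in for a random forest
evaluation, and all names and parameter values are assumptions.

```python
import random


def ga_select(n_features, fitness_fn, pop_size=20, n_gen=30,
              mutation_rate=0.05, seed=0):
    """Evolve boolean feature masks and return the fittest one found."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _generation in range(n_gen):
        ranked = sorted(pop, key=fitness_fn, reverse=True)
        parents = ranked[: pop_size // 2]        # truncation selection
        children = [ranked[0][:]]                # elitism: keep the best mask
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Flip each bit independently with a small probability
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness_fn)


INFORMATIVE = (1, 4)  # hypothetical informative feature indices


def toy_fitness(mask):
    """Reward informative features and penalize every selected feature."""
    return 2.0 * sum(mask[j] for j in INFORMATIVE) - 0.5 * sum(mask)


best_mask = ga_select(n_features=8, fitness_fn=toy_fitness)
print([j for j, keep in enumerate(best_mask) if keep], toy_fitness(best_mask))
```

Because the best mask is copied unchanged into each new generation,
the best fitness in the population never decreases.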
[PMID: 19519332 PubMed - indexed for MEDLINE]
Profiling Human Saliva Endogenous Peptidome via a
High Throughput MALDI-TOF-TOF Mass Spectrometry
Chun-Ming Huang and Wenhong
Zhu
Establishment of a saliva protein/peptide signature will
provide important information for clinical diagnostics and
prognosis of human disease. We digested human whole saliva
with trypsin to create a tryptic digest salivary peptidome.
Proteins/peptides were subsequently identified by high throughput
tandem mass spectrometry in conjunction with database searching.
Sixty-three saliva peptides corresponding to twenty-two saliva
proteins were identified. Thirty of sixty-three saliva peptides
with non-specific tryptic cleavage sites were derived from
proline-rich proteins, mucin 7, statherin and collagen. Several
peptides derived from proline-rich proteins exhibit proline
(Pro) - glutamine (Gln) C-termini (-PQ C-termini). Seven peptides
with -PQ C-termini were identified in undigested whole saliva,
suggesting that peptides with -PQ C-termini exist endogenously
in human saliva. Peptides with -PQ C-termini are known to
bind oral bacteria and exhibit properties characteristic of
innate-immunity peptides. Thus, a saliva peptidome containing
peptides with -PQ C-termini, as presented here, may reinforce
the development of innate-immunity-related disease monitoring
using non-invasive saliva samples and mass spectrometry-based
techniques.
[PMID: 19519333 PubMed - indexed for MEDLINE]
High Throughput Heme Assay by Detection of Chemiluminescence
of Reconstituted Horseradish Peroxidase
Shigekazu Takahashi and Tatsuru
Masuda
In living organisms, heme is an essential molecule for
various biological functions. Recent studies also suggest
that heme functions as an organelle-derived signal that regulates
fundamental cell processes. Furthermore, estimation of heme
is widely used for studying various blood disorders. In this
regard, development of a rapid, sensitive, and high throughput
heme assay has been sought. The most frequently used method
of measuring heme, by pyridine hemochrome, is time-, labor-, and
material-intensive, and is therefore of limited utility
for large-scale, high throughput analysis. Recently, we reported
an alternative method that is sensitive and specific to heme,
which is based on the ability of horseradish peroxidase (HRP)
apo-enzyme to reconstitute with heme to form an active holo-enzyme.
Here, we developed a high throughput heme assay by performing
reactions on a multi-well plate with highly sensitive chemiluminescence
detection reagents. Detection of chemiluminescence in a charge-coupled
device (CCD)-based gel doc apparatus enables simultaneous
measurement of multiple samples. Furthermore, the high sensitivity
of this assay allowed a direct measurement of heme in solvent
extracts after dilution. This assay is sensitive, quick, provides
a large dynamic range, and is well suited for large-scale
analysis of heme extracted from minute amounts of sample.
[PMID: 19519334 PubMed - indexed for MEDLINE]
Multicomponent One-Pot Reactions: Synthesis of Some
New 6-Oxopyrano[2,3-c]Isochromenes by Condensation
of Homophthalic Anhydride, Dialkyl Acetylenedicarboxylate,
and Isocyanides
Ali A. Mohammadi, Roya Akbarzadeh and
Hamed Rouhi
A novel three-component, one-pot condensation of the zwitterion
generated from dialkyl acetylenedicarboxylate and isocyanides
with homophthalic anhydride is described. The reaction affords
new 6-oxopyrano[2,3-c]isochromenes in good yield. Isochromenes
have been reported to possess diverse biological activities
such as antibacterial, antifungal, antiinflammatory, and antiangiogenic
effects. Moreover, these important compounds are found in
various natural products.