Operons, groups of genes that are organized together and work in a
coordinated way, are well established in microbes. Recent developments in genetics,
biochemistry, and bioinformatics have revealed similar gene arrangements in plants.
Here we aim to develop a tool that detects and identifies
biosynthetic gene clusters (BGCs) in any input plant genome. With this tool, we
intend to match or exceed the performance of existing tools for BGC
prediction, such as the popular plantiSMASH. The prediction models were developed
with the machine learning tool WEKA, using physicochemical properties as the data
set to distinguish terpene synthases from non-terpene synthases. A set of ten
physicochemical properties was selected, and their values were predicted for each of
the 159 proteins (terpene synthases and non-terpene synthases). Employing the random
forest and SMO classifiers, we obtained a promising accuracy of
over 90 percent under a 66 percent percentage split (66 percent of the data used for
training, the remainder for testing). Accurate prediction of BGCs
in plants, especially in major food crops such as rice, wheat, and corn, could revolutionize
farming and nutrition for the better.
Keywords: Algorithm, BGC, Mining, plantiSMASH, Random forest, SMO, WEKA.
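
The 66 percent percentage split mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' WEKA workflow: the function names `percentage_split` and `accuracy`, the random seed, the placeholder protein identifiers, and the rounding rule are all assumptions made for illustration. Only the figures of 159 proteins and 66 percent come from the text.

```python
import random


def percentage_split(instances, train_percent=66.0, seed=1):
    """Shuffle the instances and split them into (train, test) lists.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    not a reproduction of WEKA's internal behaviour.
    """
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be strictly between 0 and 100")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Placeholder identifiers standing in for the 159 proteins (assumed names).
protein_ids = [f"protein_{i:03d}" for i in range(159)]
train_ids, test_ids = percentage_split(protein_ids, train_percent=66.0)
print(len(train_ids), len(test_ids))
```

Under these assumptions, the split leaves 105 proteins for training and 54 for testing. A classifier trained on the first group would then be scored with `accuracy` on the second; WEKA's own shuffling and rounding may give a slightly different partition.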