Title: Early Prediction of Malignant Mesothelioma: An Approach Towards Non-invasive Method
Volume: 16
Issue: 10
Author(s): Shakir Shabbir, Muhammad Shahzad Asif, Talha Mahboob Alam and Zeeshan Ramzan*
Affiliation:
- Department of Computer Science & Engineering, Faculty of Electrical Engineering, The University of Engineering and Technology, Lahore, Pakistan
Keywords:
Data reduction, mesothelioma, gradient boosted decision tree, machine learning, malignant, biopsy.
Abstract:
Background: Malignant Mesothelioma (MM) is a rare but aggressive tumor that arises in the
lungs. Costly imaging and laboratory resources, namely X-ray imaging, Magnetic Resonance
Imaging (MRI), Positron Emission Tomography (PET) scans, biopsies, and blood tests, are commonly
used for the diagnosis of MM. These diagnostic measures are expensive and unavailable
in remote areas, and some of them, namely biopsy and cytology of pleural fluid, are also
very painful for the patient.
Objective: In this study, we propose a diagnostic model for the early identification of MM using machine
learning techniques. We explored the health records of 324 Turkish patients who showed symptoms
related to MM. The patient data include socio-economic, geographical, and clinical features.
Methods: Different feature selection methods have been employed for the selection of significant features.
To overcome the data imbalance problem, various data-level resampling techniques have been
utilized to obtain efficient results. The Gradient Boosted Decision Tree (GBDT) method has been used
to develop the diagnostic model. The performance of the GBDT model is also compared with traditional
machine learning algorithms.
Results: Our model outperformed the other models on both balanced and imbalanced data. The results
show that undersampling techniques were outperformed by the imbalanced data without
resampling, based on accuracy and Receiver Operating Characteristic (ROC) value. Conversely,
oversampling techniques outperformed both undersampling and the imbalanced data
based on accuracy and ROC. All classifiers employed in this study achieved good results with three
feature selection methods (OneR, information gain, and Relief-F), but the results of the other two
methods (gain ratio and correlation) were not entirely promising. Finally, the combination of
Synthetic Minority Oversampling Technique (SMOTE) and OneR with GBDT gave the
most favorable results based on accuracy, F-measure, and ROC.
Conclusion: The diagnostic model has been deployed to assist doctors, patients, medical practitioners,
and other healthcare professionals in the early diagnosis and better treatment of MM.
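
To make the two preprocessing steps named in the abstract concrete, the following is a minimal sketch of SMOTE-style oversampling followed by OneR-style feature ranking. It is not the authors' implementation: the function names, the equal-width binning used for OneR, the parameter values, and the toy data are all assumptions chosen for illustration, and the GBDT training step that would follow is omitted.

```python
import random
from collections import Counter


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def one_r_accuracy(column, labels, bins=3):
    """Score one feature OneR-style: discretise it into equal-width bins
    (a simplification), predict the majority class in each bin, and
    return the resulting training accuracy."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    votes = {}
    for value, label in zip(column, labels):
        b = min(int((value - lo) / width), bins - 1)
        votes.setdefault(b, Counter())[label] += 1
    correct = sum(counter.most_common(1)[0][1] for counter in votes.values())
    return correct / len(labels)


def rank_features(rows, labels):
    """Return feature indices sorted from highest to lowest OneR accuracy."""
    n_features = len(rows[0])
    scores = [one_r_accuracy([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Hypothetical toy data (not from the study): feature 0 separates the
# classes, feature 1 does not. Label 0 is the majority class.
majority = [[1.0, 5.0], [1.2, 1.0], [0.8, 3.0], [1.1, 4.0], [0.9, 2.0], [1.3, 6.0]]
minority = [[3.0, 2.0], [3.2, 5.0], [2.9, 4.0]]

# Oversample the minority class until both classes have the same size.
new_rows = smote(minority, n_new=len(majority) - len(minority), k=2)
rows = majority + minority + new_rows
labels = [0] * len(majority) + [1] * (len(minority) + len(new_rows))

# Rank features on the balanced data; a classifier such as GBDT would then
# be trained on the top-ranked features.
ranking = rank_features(rows, labels)
print(ranking)
```

In this sketch the ranking is computed after oversampling; whether feature selection precedes or follows resampling in the study is not stated in the abstract, so the ordering here is also an assumption.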