Access to safe, clean water is vital to the health of humans and all other
species, and to the sustainability of the environment. With the emergence of
advanced technologies such as machine learning, predictive models can contribute
significantly to assessing and managing water quality. This study proposes a
methodology that predicts water quality using several
machine learning classifiers on a dataset comprising diverse parameters, such as pH
levels, dissolved oxygen, turbidity, and other pollutants, collected from multiple water
sources. The data were first preprocessed to remove missing values and outliers.
Feature engineering was then used to identify the parameters most relevant to
water quality. Several popular machine learning classifiers, including
Random Forest, Support Vector Machines, Decision Trees, and XGBoost, were
evaluated and compared on their ability to predict water quality. The trained
models were validated and tested using cross-validation to assess
generalizability and robustness. The results show that the proposed method
predicts water quality levels accurately. XGBoost, in particular, outperformed
the other classifiers, achieving high accuracy with minimal overfitting.
Additionally, feature importance analysis revealed key factors influencing water
quality, providing valuable insights for policymakers and environmentalists.
Keywords: Cross-validation, Machine learning classifiers, ROC curve, Water quality assessment.
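
As an illustration of the cross-validation step described in the abstract, the following is a minimal sketch in plain Python. It is not the study's implementation: the data are synthetic, the labelling rule is invented purely for illustration, and a one-level decision tree (a decision stump) stands in for the classifiers evaluated in the study. All function names are hypothetical.

```python
import random


def make_synthetic_samples(n, seed=0):
    """Generate hypothetical (pH, dissolved oxygen, turbidity) rows with a 0/1 label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ph = rng.uniform(5.0, 9.5)
        oxygen = rng.uniform(2.0, 12.0)     # assumed unit: mg/L
        turbidity = rng.uniform(0.0, 10.0)  # assumed unit: NTU
        # Illustrative labelling rule only; not a regulatory standard.
        label = int(6.5 <= ph <= 8.5 and oxygen >= 5.0 and turbidity <= 5.0)
        rows.append(((ph, oxygen, turbidity), label))
    return rows


def fit_stump(train):
    """Pick the (feature, threshold, label-if-above) with the best training accuracy."""
    best = None  # (accuracy, feature_index, threshold, label_if_above)
    n_features = len(train[0][0])
    for j in range(n_features):
        for threshold in sorted({x[j] for x, _ in train}):
            for above in (0, 1):
                correct = sum(
                    (above if x[j] >= threshold else 1 - above) == y
                    for x, y in train
                )
                accuracy = correct / len(train)
                if best is None or accuracy > best[0]:
                    best = (accuracy, j, threshold, above)
    return best[1:]


def predict_stump(model, x):
    """Predict a 0/1 label for one feature tuple."""
    j, threshold, above = model
    return above if x[j] >= threshold else 1 - above


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k=5, seed=0):
    """Return the held-out accuracy of the stump on each of the k folds."""
    scores = []
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train = [rows[i] for i in range(len(rows)) if i not in held_out]
        test = [rows[i] for i in test_idx]
        model = fit_stump(train)  # fit on k-1 folds only
        hits = sum(predict_stump(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return scores


if __name__ == "__main__":
    fold_scores = cross_validate(make_synthetic_samples(200), k=5)
    print("per-fold accuracy:", [round(s, 3) for s in fold_scores])
    print("mean accuracy:", round(sum(fold_scores) / len(fold_scores), 3))
```

Each model is fitted only on the folds it is not scored on, so the spread of the per-fold accuracies gives a rough indication of how well a classifier generalizes and whether it overfits.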