Water Parameter Estimation Application U
Department of Computer Science, University of Botswana
Private Bag UB 0022, Gaborone, Botswana
Swift Limited Company
P.O. Box 1170, Blantyre, Malawi
Water & Sanitation Department, Mzuzu University
Private Bag 201, Mzuzu, Malawi
Abstract—About 61.7% of Malawi’s population relies on groundwater as a drinking water source. Particularly in rural settings, groundwater provides almost all water needs. In many developing countries, however, there is a need for more routine assessments of groundwater quality to ensure safe drinking water. For example, in Malawi, groundwater and surface waters often exceed threshold parametric values set by the Malawi Bureau of Standards (MBS) and the World Health Organization (WHO). The current process of determining water parameters is tedious and expensive, requiring specialised equipment and repeated on-site visits for data collection. This paper reports on the development of a mobile application that estimates water parametric values by deducing the relationships between the parameters. The study used a dataset of 64 samples with eight features. We used correlation matrices and feature importance scores to identify the relationships among the variables and built models to predict the parameters. The usability evaluation results show that the application is useful, practical, easy to use, learnable, and satisfactory.
I. INTRODUCTION
Globally, groundwater serves a significant percentage of the world’s population. Approximately 97 per cent of the world’s freshwater exists as groundwater. As such, most regions regard groundwater as one of the most reliable sources of safe water, free from biological contaminants. Most countries, especially developing ones, opt for groundwater as an economically viable water source because it does not require expensive treatment compared to surface water. This is because groundwater does not contain biological contaminants [1]. Malawi faces many water-related challenges that require special consideration if the country is to mitigate or completely eradicate them. Malawi’s poor water, sanitation and hygiene cost the government US$57 million annually, or 1.1% of its GDP, due to health costs and productivity losses [2]. To ensure that people access safe drinking water that meets the requirements established by the Malawi Bureau of Standards (MBS) and the World Health Organization (WHO), water-supplying organisations proactively employ water monitoring techniques and record significant water parameters such as total dissolved solids (TDS), total hardness (as calcium carbonate), manganese, and calcium ions.
The current process of monitoring and determining water parametric values in Malawi is tedious and expensive; it requires specialised equipment and repeated on-site visits for data collection [3], [4]. When funds or equipment are unavailable, which is likely, the result is numerous missing values in the water quality database. Even when the necessary instruments for such analysis are present, data organisation is a problem. For example, some data are lost when relayed from one level to another or when shared across sectors among stakeholders. Because Malawi and other low- and middle-income countries (LMICs) operate within a minimal budget, gaps often appear in the database because most data is not collected [4]. This inevitably affects the overall management of the available water resources because decisions are informed by guesswork rather than by accurate data.
The convergence of advanced technology and environmental science provides opportunities for addressing these challenges [1], [5], [6]. This study presents a technique for estimating water parametric values using statistical data. By leveraging the power of statistical learning, this research endeavours to enhance the understanding of water quality dynamics in Malawi and contribute to formulating data-driven strategies for sustainable water management. By meticulously examining statistical tools and methodologies and their application to Malawi’s
https://google.academia.edu/JournalofComputerScience 1 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 22, No. 4, July-August 2024
unique water needs and environmental context, this study illuminates a path towards cheaper, more effective, and time-saving water parameter estimation, with implications for broader global contexts facing similar challenges.
II. RELATED WORK
High water quality is essential for human health, environmental sustainability, and economic growth. Monitoring and predicting water quality parameters is crucial for protecting aquatic ecosystems, ensuring safe drinking water, and managing water resources effectively [4], [7]. However, water quality is influenced by various natural and human factors. Parameters such as temperature, pH, electrical conductivity, dissolved oxygen, alkalinity, calcium, sodium, potassium, magnesium, chloride, nitrite, and phosphate are critical. Traditional water quality monitoring methods rely on physical sensors and lab analysis, which can be costly, time-consuming, and limited in scope [1], [8], [9], [10]. In other cases, satellite remote sensing-based techniques measure active WQI, such as Chl-a, SS, coloured dissolved organic matter (CDOM), and turbidity, over broad areas and at regular intervals. Nonetheless, this approach has limitations and may not work well in all environments. Above all, the use of satellites is an expensive undertaking. Therefore, this study employs predictive statistical modelling to overcome some of the highlighted challenges.
Statistical learning (SL) methods, including support vector machine and decision tree algorithms, have been well developed for nonlinear regression analysis in recent years [18]. These methods have proven effective in both small- and large-scale cases, attracting the attention of the environmental and geophysical modelling community. [11], [12], [13], [14] demonstrated that water parametric values can be calculated from the relationships among the features in a dataset. For example, [13] observed that the concentration of TDS can be estimated easily from the EC value. This study used correlation and feature importance to identify relationships and select features for prediction from a custom dataset. In addition to these applications, some studies have utilised both SL and data assimilation (DA) methods to achieve accurate and efficient modelling. [15] used random forest regression to build a relationship between remote sensing data and field observation data to estimate dryland surface indicators. [16] used an artificial neural network to imitate the local ensemble transform Kalman filter and enhance prediction efficiency but did not improve the DA algorithm.
[17] used the extended Kalman filter as a substitute for the backpropagation training of deep belief networks in infrastructure sustainability analysis. The findings of these studies imply that incorporating SL methods into sequential DA is a promising means of achieving more accurate time-varying parameter tracking. In their 2020 study, [18] comprehensively assessed various statistical modelling methods for both groundwater- and surface-water-level prediction scenarios. Additionally, they explored practical applications of data-driven modelling, including feature generation, feature selection, heterogeneous data fusion, hyperparameter tuning, and model evaluation. [5] assessed groundwater quality parameters in Majmaah City, Saudi Arabia, using multivariate statistical approaches such as Principal Component Analysis (PCA) and Factor Analysis (FA). Geostatistical techniques and GPS data were employed to characterise water quality parameters in 2D, presented as contour maps, and in 3D representations [19].
Regarding the explainability, presentation of results, and deployability of models, [20] and [21] argue that most developed models are left unused at the conceptual stage due to a lack of consideration for deployment. These studies propose deploying models using explainable, user-friendly, and affordable technologies. Guided by the Technology Acceptance Model, an Android mobile application was developed to facilitate the model’s smooth implementation and adoption.
III. METHODOLOGY AND RESULTS
A. Dataset and feature selection
The dataset, which contains 64 samples and nine features (Ca, TDS, EC, Total Hardness, Turbidity, Total coliform, SO4, pH, and Streptococcus), was compiled by the Central Region Water Board (CRWB) from 64 boreholes in Dowa district, Malawi. The CRWB is a parastatal organisation founded to provide Malawi’s central region with potable water for residential, commercial, and industrial use. For each parameter to be estimated, we selected the features that correlate with it. The standard practice for corroborating or rejecting a theory based on correlation is to use more than one method of determining the correlation. We therefore used two methods to identify and ascertain correlations among the dataset’s features. The correlation matrices showed strong correlations between TDS and EC and between total hardness and Ca. Therefore, the study aimed to predict these four features. Figure 1 illustrates the correlation matrix of the dataset, calculated using the Pearson correlation coefficient method. The Pearson correlation coefficient determines the strength and direction of the linear relationship between two variables. The correlation values range from -1 to +1, where +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. One limitation of the Pearson correlation coefficient is its inability to capture non-linear relationships among variables. The distance correlation method was employed to detect non-linear relationships in the data. The method is versatile because it can be applied effectively to both linear and non-linear data sets. One key advantage of this metric is that it does not make any assumptions regarding the normality of the input vectors, thereby eliminating any potential biases in the results. The results of this metric range from 0 to 2, with 0 indicating a perfect positive correlation and 2 indicating a perfect negative correlation. Figure 2 presents a correlation matrix calculated using the distance correlation. Additionally, we employed feature importance to identify the features that are essential in predicting a feature of concern. Understanding the importance of each feature is crucial for gaining insights into the model’s decision-making process and identifying key factors influencing its predictions. The feature importance score is a metric that assesses a feature’s contribution to a model’s predictive performance. A higher score indicates more substantial predictive power, whereas a lower score suggests relatively less influence on predictions.
... the imputation method, which involved filling in missing values with the mean of the corresponding column. We used the 5th to 95th percentile range to remove outliers from the dataset.
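The data preparation and correlation steps described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the percentile rule is assumed to use linear interpolation between ranks, and the 0-to-2 measure is assumed to be the correlation distance 1 - r, which is one reading consistent with the stated range.

```python
import math


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def percentile(values, q):
    """Percentile q (0-100), linearly interpolated between neighbouring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def within_percentiles(column, low=5, high=95):
    """Mask that is True for values inside the [low, high] percentile band."""
    lo_cut, hi_cut = percentile(column, low), percentile(column, high)
    return [lo_cut <= v <= hi_cut for v in column]


def pearson(x, y):
    """Pearson correlation coefficient r, in the range -1 to +1."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Assumed 0-to-2 measure: 0 for r = +1, 1 for r = 0, 2 for r = -1."""
    return 1.0 - pearson(x, y)


if __name__ == "__main__":
    # Hypothetical EC readings with one missing value, and matching TDS values.
    ec = impute_mean([100.0, 200.0, None, 400.0, 500.0])
    tds = [65.0, 130.0, 195.0, 260.0, 325.0]
    print("Pearson r(EC, TDS):", pearson(ec, tds))
    print("Correlation distance:", correlation_distance(ec, tds))
    print("Keep mask:", within_percentiles([1, 2, 3, 4, 100]))
```

Applying the keep mask to whole rows, rather than to single columns, would keep the features aligned across samples; with only 64 samples, how many rows such a filter removes is worth checking.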
C. Data Normalisation
Statistical models may produce misleading results when a
dataset contains features that vary significantly in magnitude
(scale). This is because features with larger magnitudes
dominate the learning process [24]. Normalisation aims to
change the dataset’s feature values to a standard scale without
distorting differences in the ranges of values or losing
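information. We employed the Min-Max Scaling method to scale the features to a range between 0 and 1. For a given feature 𝑥, the scaled value 𝑥′ in the range [min_new, max_new] is computed as follows:
$$x' = \mathrm{min}_{new} + \frac{x - \min(x)}{\max(x) - \min(x)}\,(\mathrm{max}_{new} - \mathrm{min}_{new})$$
where min(𝑥) and max(𝑥) are the smallest and largest observed values of the feature.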
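As an illustration, Min-Max scaling of a single feature can be sketched in a few lines of Python. The function name, its default arguments, and the handling of a constant feature are assumptions made for this example, not details taken from the study.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a feature so its values span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Assumption: a constant feature carries no scale information,
        # so every value is mapped to new_min.
        return [new_min for _ in values]
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]


if __name__ == "__main__":
    # Hypothetical EC readings rescaled to the range [0, 1].
    print(min_max_scale([10.0, 20.0, 30.0]))
```

The minimum and maximum would normally be taken from the training data and reused unchanged when scaling new samples, so that the application scales its inputs the same way the models saw them during training.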
The following metrics were used to evaluate the performance of regression models:
1) Mean Absolute Error: The Mean Absolute Error (MAE) is a metric used to measure the average absolute difference between the actual and predicted values. A lower MAE suggests that the model's predictions are closer to the actual values.
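The evaluation metrics reported in this paper (MAE, MSE, RMSE, R2 Score, Adjusted R2, and MPE) can be computed from paired actual and predicted values as in the sketch below. It uses the standard textbook definitions and assumes that MPE denotes the mean percentage error; the function name and the `n_features` parameter are hypothetical, and the paper's own implementation may differ.

```python
import math


def regression_metrics(actual, predicted, n_features=1):
    """Return common regression metrics for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination

    # Adjusted R2 penalises the number of predictors (n_features).
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    # Assumed definition: mean percentage error (actual values must be non-zero).
    mpe = 100.0 / n * sum(e / a for e, a in zip(errors, actual))

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "AdjR2": adj_r2, "MPE": mpe}


if __name__ == "__main__":
    # Hypothetical TDS measurements and model estimates.
    for name, value in regression_metrics([3.0, 5.0, 7.0, 9.0],
                                          [2.0, 5.0, 8.0, 9.0]).items():
        print(f"{name}: {value:.4f}")
```

Because MAE, MSE, and RMSE are expressed in the units of the target (or its square), they are best compared across models for the same parameter, whereas R2 and Adjusted R2 are unitless.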
Table I Feature Importance
Total Total
Ca TDS EC Turbidity SO4 pH Streptococcus
Hardness coliform
Ca 0.00 0.01 0.81 0.21 0.02 0.10 0.06 0.07
Total Hardness 0.80 0.00 0.00 0.00 0.05 0.05 0.07 0.05
Total coliform 0.01 0.00 0.00 0.00 0.36 0.03 0.03 0.38
The application is difficult to use despite a comprehensive guide 76%(16) 14%(3) 4%(1) 4%(1) 1.38 1.06
The application produces accurate results. 14%(3) 9%(2) 76%(16) 4.61 4.15
Usefulness 1 2 3 4 5
The application makes things easier to accomplish 4%(1) 19%(4) 4%(1) 71%(15) 4.42 4.01
The application does everything I would expect it to do 9%(2) 9%(2) 19%(4) 61%(13) 4.3 4.80
The application saves me time when I use it. 9%(2) 4%(1) 9%(2) 76%(16) 4.52 4.10
Ease of use 1 2 3 4 5
The application is easy to use 4%(1) 4%(1) 4%(1) 85%(18) 4.71 4.25
The application is simple to use 4%(1) 4%(1) 4%(1) 85%(18) 4.71 4.25
The system requires a few steps to accomplish what I want to do. 4%(1) 4%(1) 90%(19) 4.61 3.85
Learnability 1 2 3 4 5
The application is easy to remember how to use 4%(1) 4%(1) 9%(2) 80%(17) 4.66 4.20
The application is easy to learn. 14%(3) 4%(1) 80%(17) 4.52 4.12
I learnt to use the application quickly. 9%(2) 4%(1) 4%(1) 80%(17) 4.57 4.15
Satisfaction 1 2 3 4 5
I am satisfied with the application. 4%(1) 4%(1) 4%(1) 85%(18) 4.71 4.25
The application works the way I expected 14%(3) 85%(18) 4.85 4.34
I would recommend the application to other users. 9%(2) 4%(1) 85%(18) 4.76 4.27
degrees of agreement. The usability evaluation results in Table 3 suggest that most participants responded positively to statements affirming the application’s usability. The results indicate the application is a fitting solution for water parameter estimation.
IV. DISCUSSION
This study represents a significant advancement in the field of water quality monitoring. We aimed to provide a convenient and accessible tool for professionals and lay users to assess water parameters with reasonable accuracy and cost. We developed and optimised regression models to predict water parameters using a dataset encompassing various water sources and conditions. The performance of the models was evaluated using the MAE, RMSE, MSE, R2 Score, Adjusted R2, and MPE. The low values of MAE, MSE, and RMSE indicate that the models predict values that closely align with the actual measurements, signifying the accuracy of the estimation. The high R2 Score and Adjusted R2 values suggest a strong correlation between the predicted and observed values, further affirming the reliability of the models.
Furthermore, the evaluation of the mobile application by twenty-one participants yielded promising results regarding its usability and effectiveness in real-world scenarios. Participants reported ease of use and satisfaction with the application’s interface and functionality. Moreover, the application provided rapid and convenient access to water parameter estimates, which could prove invaluable for fieldwork, environmental monitoring, and decision-making processes. However, it is essential to acknowledge certain limitations and areas for future improvement. While the developed models demonstrate robust performance overall, there may be specific water conditions or parameters for which they are less accurate. Further refinement and validation of the models using diverse datasets could enhance their generalizability and reliability across different contexts.
Additionally, the mobile application’s evaluation sample was relatively small, consisting of twenty-one participants. While this sample provided valuable insights into initial user perceptions and experiences, a more extensive and diverse participant pool would offer a more comprehensive assessment of the application’s usability and effectiveness.
V. CONCLUSION
In conclusion, this study presents the development of a water parameter estimation application using statistical learning methods tailored to address the specific challenges of water quality monitoring in Malawi. The application, designed for environments where data is scarce and traditional equipment is costly, successfully predicts water parameters with high accuracy, as demonstrated by the low error rates and high correlation scores achieved in model evaluations. The positive usability feedback from participants further validates the application's practicality and effectiveness in real-world scenarios. However, there remains potential for future improvements, including expanding the model's adaptability to diverse water conditions and broadening the evaluation sample size for more comprehensive feedback. Overall, this research contributes significantly to the field of water quality management by providing a cost-effective, efficient, and user-friendly solution with broader implications for similar contexts globally.
REFERENCES
[1] M. Al-Mukhtar and F. Al-Yaseen, “Modeling Water Quality Parameters Using Data-Driven Models, a Case Study Abu-Ziriq Marsh in South of Iraq,” Hydrology, vol. 6, no. 1, p. 24, Mar. 2019, doi: 10.3390/hydrology6010024.
[2] UNICEF, “Malawi Annual Country Report,” Ctry. Strateg. Plan 2019-2023, 2021.
[3] C. L. Chidammodzi and V. S. Muhandiki, “Water resources management and Integrated Water Resources Management implementation in Malawi: Status and implications for lake basin management,” Lakes Reserv. Sci. Policy Manag. Sustain. Use, vol. 22, no. 2, pp. 101–114, 2017, doi: 10.1111/lre.12170.
[4] C. Mussa and J. F. Kamoto, “Groundwater Quality Assessment in Urban Areas of Malawi: A Case of Area 25 in Lilongwe,” J. Environ. Public Health, vol. 2023, no. 1, p. 6974966, 2023, doi: 10.1155/2023/6974966.
[5] S. S. Ahmed, “Assessment of Groundwater Quality Parameters Using Multivariate Statistics- A Case Study of Majmaah, KSA,” Int. J. Environ. Monit. Anal., vol. 5, no. 2, Art. no. 2, Mar. 2017, doi: 10.11648/j.ijema.20170502.13.
[6] M. Azrour, J. Mabrouki, G. Fattah, A. Guezzaz, and F. Aziz, “Machine learning algorithms for efficient water quality prediction,” Model. Earth Syst. Environ., vol. 8, no. 2, pp. 2793–2801, 2022.
[7] UNICEF Malawi, “Water, sanitation and hygiene | UNICEF Malawi.” Accessed: Jun. 11, 2024. [Online]. Available: https://www.unicef.org/malawi/water-sanitation-and-hygiene
[8] R. Barzegar, A. A. Moghaddam, J. Adamowski, and B. Ozga-Zielinski, “Multi-step water quality forecasting using a boosting ensemble multi-wavelet extreme learning machine model,” Stoch. Environ. Res. Risk Assess., vol. 32, no. 3, pp. 1–15, Mar. 2018, doi: 10.1007/s00477-017-1394-z.
[9] M. H. Gholizadeh, A. M. Melesse, and L. N. Reddi, “A Comprehensive Review on Water Quality Parameters Estimation Using Remote Sensing Techniques,” Sensors, vol. 16, no. 8, p. 1298, Aug. 2016, doi: 10.3390/s16081298.
[10] Godson Ebenezer Adjovu, H. Stephen, and Sajjad Ahmad, “A Machine Learning Approach for the Estimation of Total Dissolved Solids Concentration in Lake Mead Using Electrical Conductivity and Temperature,” Water, 2023, doi: 10.3390/w15132439.
[11] E. Brown, M. Skougstad, and M. Fishman, “Methods for collection and analysis of water samples USGS Water-Supply Pap. 1454,” 1960.
[12] J. D. Hem, Study and interpretation of the chemical characteristics of natural water, vol. 2254. Department of the Interior, US Geological Survey, 1985.
[13] A. Rusydi, “Correlation between conductivity and total dissolved solid in various type of water: a review, IOP conference series: Earth and environmental science,” IOP Publ. Doi, vol. 10, pp. 1755–1315, 2018.
[14] N. Walton, “Electrical conductivity and total dissolved solids—what is their precise relationship?,” Desalination, vol. 72, no. 3, pp. 275–292, 1989.
[15] Lingcheng Li et al., “A machine learning approach targeting parameter estimation for plant functional type coexistence modeling using ELM-FATES (v2.0),” Geosci. Model Dev., 2023, doi: 10.5194/gmd-16-4017-2023.
[16] D. Carvajal-Patiño and R. Ramos-Pollán, “Synthetic data generation with deep generative models to enhance predictive tasks in trading strategies,” Res. Int. Bus. Finance, vol. 62, p. 101747, 2022, doi: https://doi.org/10.1016/j.ribaf.2022.101747.
[17] J. Wei, J. Zhao, X. Lei, Z. Zhang, and H. Wang, “Statistical-Learning-Based Ensemble Data Assimilation Methods for Parameter Estimation in Hydrodynamic Models,” Mar. 29, 2022, Rochester, NY: 4069683. doi: 10.2139/ssrn.4069683.
[18] K. Kenda, J. Peternelj, N. Mellios, D. Kofinas, M. Čerin, and J. Rožanec, “Usage of statistical modeling techniques in surface and groundwater level prediction,” J. Water Supply Res. Technol.-Aqua, vol. 69, no. 3, pp. 248–265, Apr. 2020, doi: 10.2166/aqua.2020.143.
[19] B. Khalil, T. B. M. J. Ouarda, and A. St-Hilaire, “Estimation of water quality characteristics at ungauged sites using artificial neural networks and canonical correlation analysis,” J. Hydrol., vol. 405, no. 3, pp. 277–287, Aug. 2011, doi: 10.1016/j.jhydrol.2011.05.024.
[20] N. Feldkamp, “Data Farming Output Analysis Using Explainable AI,” in Proceedings of the Winter Simulation Conference, in WSC ’21. Phoenix, Arizona: IEEE Press, 2022.
[21] C. Nyasulu and W. Dominic Chawinga, “Using the decomposed theory of planned behaviour to understand university students’ adoption of WhatsApp in learning,” E-Learn. Digit. Media, vol. 16, no. 5, pp. 413–429, Sep. 2019, doi: 10.1177/2042753019835906.
[22] T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, B. Mphago, and O. Tabona, “A survey on missing data in machine learning,” J. Big Data, vol. 8, no. 1, pp. 1–37, 2021.
[23] R. Ahn, S. Supakkul, L. Zhao, K. Kolluri, T. Hill, and L. Chung, “A Goal-Oriented Approach for Preparing a Machine-Learning Dataset to Support Business Problem Validation,” in 2021 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 2021, pp. 282–289. doi: 10.1109/DASC-PICom-CBDCom-CyberSciTech52372.2021.00057.
[24] M. Khadr and M. Elshemy, “Data-driven modeling for water quality prediction case study: The drains system associated with Manzala Lake, Egypt,” Ain Shams Eng. J., vol. 8, no. 4, pp. 549–557, Dec. 2017, doi: 10.1016/j.asej.2016.08.004.