
Gesicho et al. BMC Med Inform Decis Mak (2020) 20:293
https://doi.org/10.1186/s12911-020-01315-7

RESEARCH ARTICLE  Open Access

Data cleaning process for HIV-indicator data extracted from DHIS2 national reporting system: a case study of Kenya

Milka Bochere Gesicho1,4*, Martin Chieng Were2,4 and Ankica Babic1,3

Abstract

Background: The District Health Information Software-2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic, and transparent data cleaning approaches form a core component of preparing DHIS2 data for analyses. Unfortunately, there is a paucity of exhaustive and systematic descriptions of data cleaning processes employed on DHIS2-based data. The aim of this study was to report on the methods and results of a systematic and replicable data cleaning approach applied on HIV data gathered within DHIS2 from 2011 to 2018 in Kenya, for secondary analyses.

Methods: Six programmatic area reports containing HIV indicators were extracted from DHIS2 for all care facilities in all counties in Kenya from 2011 to 2018. Data variables extracted included reporting rate, reporting timeliness, and HIV-indicator data elements per facility per year. 93,179 facility-records from 11,446 health facilities were extracted for the years 2011 to 2018. Van den Broeck et al.'s framework, involving repeated cycles of a three-phase process (data screening, data diagnosis and data treatment), was employed semi-automatically within a generic five-step data-cleaning sequence, which was developed and applied in cleaning the extracted data. Various quality issues were identified, and Friedman analysis of variance was conducted to examine differences in the distribution of records with selected issues across the eight years.

Results: Facility-records with no data accounted for 50.23% and were removed. Of the remaining, 0.03% had reporting rates above 100%. Of facility-records with reporting data, 0.66% and 0.46% were retained for the voluntary medical male circumcision and blood safety programmatic area reports respectively, given that few facilities submitted data or offered these services. The distribution of facility-records with selected quality issues varied significantly by programmatic area (p < 0.001). The final clean dataset obtained was suitable for subsequent secondary analyses.

Conclusions: Comprehensive, systematic, and transparent reporting of the cleaning process is important for the validity of research studies as well as for data utilization. The semi-automatic procedures used resulted in improved data quality for use in secondary analyses, which could not be achieved by automated procedures alone.

Keywords: Data cleaning, DHIS2, HIV-indicators, Data management

*Correspondence: [email protected]
1 Department of Information Science and Media Studies, University of Bergen, Bergen, Norway
Full list of author information is available at the end of the article

Background
Routine health information systems (RHIS) have been implemented in health facilities in many low- and middle-income countries (LMICs) for purposes such as facilitating data collection, management and utilization [1]. In order to ensure effectiveness of HIV


programs, accurate, complete and timely monitoring and evaluation (M&E) data generated within these systems are paramount in decision-making such as resource allocation and advocacy [2]. Monitoring and Evaluation (M&E) plays a key role in the planning of any national health program. De Lay et al. defined M&E as "acquiring, analyzing and making use of relevant, accurate, timely and affordable information from multiple sources for the purpose of program improvement [2]." In order to provide strategic information needed for M&E activities in low- and middle-income countries (LMICs), reporting indicators have been highly advocated for use across many disease domains, with HIV indicators among the most common ones reported to national-level facilities in many countries [3–5]. As such, health facilities use pre-defined HIV-indicator forms to collect routine HIV-indicator data on various services provided within the facility, which are submitted to the national level [6].

Over the years, national-level data aggregation systems, such as the District Health Information Software 2 (DHIS2) [7], have been widely adopted for use in collecting, aggregating and analyzing indicator data. DHIS2 has been implemented in over 40 LMICs, with the health indicator data reported within the system used for national- and regional-level health-related decision-making, advocacy, and M&E [8]. Massive amounts of data have been collected within health information systems such as DHIS2 over the past several years, thus providing opportunities for secondary analyses [9]. However, these analyses can only be adequately conducted if the data extracted from systems such as DHIS2 are of high quality suitable for analyses [10].

Furthermore, data within health information systems such as DHIS2 are only as good as their quality, as this is salient for decision-making. As such, various approaches have been implemented within systems like DHIS2 to improve data quality. Some of these approaches include: (a) validation during data entry, to ensure data are captured using the right formats and within pre-defined ranges and constraints; (b) user-defined validation rules; (c) automated outlier analysis functions, such as standard deviation outlier analysis (identifies data values that are numerically extreme relative to the rest of the data) and minimum and maximum based outlier analysis (identifies data values outside the pre-set maximum and minimum values); and (d) automated calculation and reporting of data coverage and completeness [11]. The WHO data quality tool has also been incorporated with DHIS2 to identify errors within the data in order to determine the next appropriate action [12]. Given that this tool is a relatively new addition to the DHIS2 applications, it is still being progressively improved and implemented in countries using DHIS2 [13].

Despite data quality approaches having been implemented within DHIS2, data quality issues remain a thorny problem, with some of the issues emanating from the facility level [14]. Real-life data like those found in DHIS2 are often "dirty," with issues such as incomplete, inconsistent, and duplicated data [15]. Failure to detect data quality issues and to clean these data can lead to inaccurate analysis outcomes [13]. Various studies have extracted data from DHIS2 for analyses [16–20]. Nonetheless, few studies attempt to explicitly disclose the data cleaning strategies used, the resulting errors identified and the action taken [16–18]. In addition, some of these studies largely fail to exhaustively and systematically describe the steps used in data cleaning of the DHIS2 data before analyses are done [19, 20].

Ideally, data cleaning should be done systematically, and good data cleaning practice requires transparency and proper documentation of all procedures taken to clean the data [21, 22]. A closer and systematic look into data cleaning approaches, and a clear outlining of the distribution or characteristics of data quality issues encountered in DHIS2, could be instructive in informing approaches to further ensure higher quality data for analyses and decision-making. Further, employment of additional data cleaning steps will ensure that good quality data are available from the widely deployed DHIS2 system for use in accurate decision-making and knowledge generation.

In this study, data cleaning is approached as a process aimed at improving the quality of data for purposes of secondary analyses [21]. Data quality is a complex multidimensional concept. Wang and Strong categorized these dimensions as: intrinsic data quality, contextual data quality, and representational and accessibility data quality [23]. Intrinsic data quality focuses on features that are inherent to the data itself, such as accuracy [23]. Contextual data quality focuses on features that are relevant in the context of the task for data use, such as value-added, appropriate amount of data, and relevancy [23]. Representational and accessibility data quality highlights features that are salient within the role of the system, such as interpretability, representational consistency, and accessibility [23]. Given that data quality can be subjective and dependent on context, various studies have specified context in relation to data quality [24–26]. Bolchini et al. specify context by tailoring data that are relevant for a given particular use case [27]. Bolchini et al. further posit that the process of separating noise (information not relevant to a specific task) to obtain only useful information is not an easy task [27]. In this study, data cleaning is approached from a contextual standpoint, with the

intention of retaining only relevant data for subsequent secondary analyses.

Therefore, the aim of this study is to report on the method and results of a systematic and replicable data cleaning approach employed on routine HIV-indicator data reports gathered within DHIS2 from 2011 to 2018 (an 8-year period), to be used for subsequent secondary analyses, using Kenya as a reference country case. This approach has specific applicability to the broadly implemented DHIS2 national reporting system. Our approach is guided by a conceptual data-cleaning framework, with a focus on uncovering data quality issues often missed by existing automated approaches. From our evaluation, we provide recommendations on extracting and cleaning data for analyses from DHIS2, which could be of benefit to M&E teams within Ministries of Health and to researchers, to ensure high quality data for analyses and decision-making.

Methods

Data cleaning and data quality assessment approaches
Data cleaning is defined as "the process used to determine inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omissions" [28]. Data cleaning is essential to transform raw data into quality data for purposes such as analyses and data mining [29]. It is also an integral step in the knowledge discovery in data (KDD) process [30].

There exist various issues within data that necessitate cleaning in order to improve quality [31–33]. An extensive body of work exists on how to clean data. The approaches that can be employed include quantitative and qualitative methods. Quantitative approaches employ statistical methods and are largely used to detect outliers [34–36]. On the other hand, qualitative techniques use patterns, constraints, and rules to detect errors [37]. These approaches can be applied within automated data cleaning tools such as ARKTOS, AJAX, FraQL, Potter's Wheel and IntelliClean [33, 37, 38].

In addition, there are a number of frameworks used in the assessment of data quality in health information systems, which can be utilized by countries with DHIS2. The Data Quality Review (DQR) tool, developed in collaboration with WHO, Global Fund, Gavi and USAID/MEASURE Evaluation, provides a standardized approach that aims at facilitating regular data quality checks [39]. Other tools for routine data quality assessments include the MEASURE Evaluation Routine Data Quality Assessment Tool (RDQA) [40] and the WHO/IVB Immunization Data Quality Self-Assessment (DQS) [41].

Some of the data quality categories (intrinsic, contextual, representational and accessibility) [23] have been used in cleaning approaches as well as in the data quality frameworks developed. A closer examination of the aforementioned approaches reveals a focus on assessing intrinsic data quality aspects, which can be categorized further into syntactic quality (conformance to database rules) and semantic quality (correspondence or mapping to external phenomena) [42].

Moreover, while tools and approaches exist for data quality assessments as well as data cleaning, concerted efforts have been placed on assessment of health information system data quality [39, 40], as opposed to cleaning approaches for secondary analyses, which are largely dependent on the context for data use [24]. Wang and Strong posited the need for considering data quality with respect to the context of the tasks, which can be a challenge as tasks and contexts vary by user needs [23]. Therefore, specifying the task and the relevant features for the task can be employed for contextual data quality [23, 43].

With this in mind and based on our knowledge, no standard consensus-based approach exists to ensure that replicable and rigorous data cleaning approaches and documentation are applied on extracted DHIS2 data to be used in secondary analyses. As such, ad hoc data cleaning approaches have been employed for the extracted data prior to analyses [16–18]. Moreover, whereas some studies provide brief documentation of the data cleaning procedures used [19], others lack documentation, leaving the data cleaning approaches used undisclosed and behind-the-scenes [20]. Failure to disclose the approaches used makes it difficult to replicate data cleaning procedures, and to ensure that all types of anomalies are systematically addressed prior to use of data for analysis and decision-making. Furthermore, the approach used in data extraction and cleaning affects the analysis results [21]. Oftentimes, specific approaches are applied based on the data set and the aims of the cleaning exercise [10, 44, 45]. Dziadkowiec et al. used Khan's framework to clean data extracted from the relational database of an Electronic Health Record (EHR) [10]. In their approach, intrinsic data quality was, in our view, considered in data cleaning with a focus on syntactic quality issues (such as conforming to integrity rules). Miao et al. proposed a data cleaning framework for activities that involve secondary analysis of an EHR [45], which in our view considered intrinsic data quality with a focus on semantic quality (such as completeness and accuracy). Savik et al. approached data cleaning, in our view, from a contextual perspective, which entailed preparing a dataset that is appropriate for the intended analysis [44].

In this study, we approach data cleaning from a contextual perspective, whereby only data fit for subsequent

analyses is retained. Based on our data set, our study's data cleaning approach was informed by a conceptual data-cleaning framework proposed by Van den Broeck et al. [21]. Van den Broeck et al.'s framework was used because it provides a deliberate and systematic data cleaning guideline that is amenable to being tailored towards cleaning data extracted from DHIS2. This framework presents data cleaning as a three-phase process involving repeated cycles of data screening, data diagnosis, and data editing of suspected data abnormalities. The screening process involves identification of lacking or excess data, outliers and inconsistencies, and strange patterns [21]. Diagnosis involves determination of errors or missing data and any true extremes and true normals [21]. Editing involves correction or deletion of any identified errors [21]. The various phases in Van den Broeck et al.'s framework have also been applied in various settings [46, 47]. Human-driven approaches complemented by automatic approaches were also used in the various data cleaning phases in this study. Human involvement in data cleaning has also been advocated in other studies [35].

Study setting
This study was conducted in Kenya, a country in East Africa. Kenya adopted DHIS2 for use for its national reporting in 2011 [7]. The country has 47 administrative counties, and all the counties report a range of healthcare indicator data from care facilities and settings into the DHIS2 system. For the purposes of this study, we focused specifically on HIV-indicator data reported within Kenya's DHIS2 system, given that these are the most comprehensively reported set of indicators in the system.

Kenya's DHIS2 has enabled various quality mechanisms to deal with HIV data. Some of these include data validation rules, outlier analysis, and minimum and maximum ranges, which have been implemented at the point of data entry. The DHIS2 data quality tool is also an application that was included in DHIS2 to supplement the in-built data quality mechanisms [12]. Nonetheless, it was not actively in use during our study period 2011–2018. The quality mechanisms as well as the DHIS2 quality tool consider intrinsic data quality aspects.

Data cleaning process
Adapting Van den Broeck et al.'s framework, a step-by-step approach was used during extraction and cleaning of the data from DHIS2. These steps are generic and can be replicated by others conducting robust data cleaning on DHIS2 for analyses. These steps are outlined below:

Step 1—Outline the analyses or evaluation questions: Prior to applying Van den Broeck et al.'s conceptual framework, it is important to identify the exact evaluations or analyses to be conducted, as this helps define the data cleaning exercise.

Step 2—Description of data and study variables: This step is important for defining the needed data elements that will be used for the evaluation data set.

Step 3—Create the data set: This step involves identifying the data needed and extracting data from relevant databases to generate the final data set. Oftentimes, development of this database might require combining data from different sources.

Step 4—Apply the framework for data cleaning: During this step, the three data cleaning phases (screening, diagnosis, and treatment) in Van den Broeck et al.'s framework are applied on the data set created.

Step 5—Analyze the data: This step provides a summary of the data quality issues discovered, the eliminated data after the treatment exercise, and the retained final data set on which analyses can then be done.

Application of data cleaning process: Kenya HIV-indicator reporting case example
In this section, we present the application of the data cleaning sequence above using Kenya as a case example. It is worth noting that in this study, the terms 'programmatic area report' and 'report' are used interchangeably, as they carry the same meaning given that a report represents a programmatic area and contains a number of indicators.

Step 1: Outline the analyses or evaluation questions and goals
For this reference case, DHIS2 data had to undergo the data cleaning process prior to use of the data for an evaluation question on 'Performance of health facilities at meeting the completeness and timeliness facility reporting requirements by the Kenyan Ministry of Health (MoH)'. The goal was to identify the best performing and poor performing health facilities at reporting within the country, based on completeness and timeliness in submitting their reports into DHIS2.

This study only attempts to clean the data for subsequent analyses. Thus, the actual analyses and evaluation will be conducted using the final clean data in a separate study.

Step 2: Description of data and study variables
HIV-indicator data in Kenya are reported into DHIS2 on a monthly basis by facilities offering HIV services using

the MOH-mandated form called "MOH 731—Comprehensive HIV/AIDS Facility Reporting Form" (MOH731). As of 2011–2018, MOH731 consisted of six programmatic areas representing six independent reports containing HIV-indicators to be reported [see Additional file 1]. The six reports and the number of indicators reported in each include: (1) HIV Counselling and Testing (HCT)—14 indicators; (2) Prevention of Mother-to-Child Transmission (PMTCT)—40 indicators; (3) Care and Treatment (CrT)—65 indicators; (4) Voluntary Medical Male Circumcision (VMMC)—13 indicators; (5) Post-Exposure Prophylaxis (PEP)—14 indicators; and (6) Blood Safety (BS)—3 indicators.

Each facility offering HIV services is expected to submit reports with indicators every month, based on the type(s) of services offered by that facility. The monthly due date for all reports is defined by the MoH, as is the expected number of reports per facility.

For our use case, we wanted to create a data set for secondary analyses, which was to determine performance of facilities at meeting the MoH reporting requirements (facility reporting completeness and timeliness of reporting); hence, we retained only facilities offering services for any of the six programmatic areas. Completeness in reporting by facilities within Kenya's DHIS2 is measured as a continuous variable ranging from 0% to 100% and identified within the system by a variable called 'Reporting Rate (RR)'. The percentage RR is calculated automatically within DHIS2 as the actual number of reports submitted by each facility into DHIS2 divided by the expected number of reports from the facility, multiplied by 100 (Percentage RR = actual number of submitted reports / expected number of reports * 100). Given that MOH731 reports should be submitted by facilities on a monthly routine, the expected number of monthly reports per programmatic area per year is 12 (one report expected per month). It should be noted that this Reporting Rate calculation only looks at report submission and not the content within the reports. Given that facilities offering any of the HIV services are required to submit the full MOH731 form containing six programmatic area reports, zero (0) cases are reported for indicators where services are not provided, which appear as blank reports in DHIS2. As such, a report may be submitted as blank or have missing indicators but will be counted as complete (facility reporting completeness) simply because it was submitted. Timeliness is calculated based on whether the reports were submitted by the 15th day of the reporting month, as set by the MoH. Timeliness is represented in DHIS2 as 'Reporting Rate on Time (RRT)' and is also calculated automatically. The percentage RRT for a facility is measured as the actual number of reports submitted on time by the facility divided by the expected number of reports, multiplied by 100 (Percentage RRT = actual number of reports submitted on time / expected number of reports * 100). Annual reports were therefore generated from DHIS2 consisting of percentage Reporting Rate and Reporting Rate on Time, which were extracted per facility, per year.

Step 3: Create the data set
After obtaining Institutional Review and Ethics Committee (IREC) approval for this work, we set out to create our database from three data sources as outlined below:

(1) Data Extracted from DHIS2: Two sets of data were extracted from DHIS2 to Microsoft Office Excel (version 2016). For the first data set, we extracted variables from DHIS2 for all HIV programmatic area reports submitted from all health facilities in all 47 counties in Kenya between the years 2011 and 2018, with variables grouped by year. Variables extracted from DHIS2 by year included: facility name, programmatic area report (e.g. Blood Safety), expected number of reports, actual number of submitted reports, actual number of reports submitted on time, cumulative Reporting Rate by year (calculated automatically in DHIS2) and cumulative Reporting Rate on Time by year (calculated automatically in DHIS2) [see Additional file 2]. The extracted data for Reporting Rate and Reporting Rate on Time constituted the annual reports in the six programmatic areas for the years 2011–2018, for the respective health facilities.

For the second data set, we extracted the HIV-indicator data elements submitted within each annual programmatic area report by the health facilities for all six programmatic areas for every year under evaluation [see Additional file 1]. The annual report contained cumulative HIV-indicator data elements gathered in each programmatic area per facility, per year.

In addition, extracting the aforementioned datasets from 2011 to 2018 resulted in repeated occurrences of the facility variable in the different years. For example, facilities registered in DHIS2 in 2011 will appear in subsequent years, resulting in eight occurrences within the 8 years (2011–2018) per programmatic area report (e.g. Blood Safety). This resulted in a facility containing the following variables per row: facility name, year, percentage Reporting Rate, and percentage Reporting Rate on Time for the six programmatic area reports. In this study, the facility data per row was referred to as a 'facility record'.
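The RR and RRT formulas above can be sketched as follows (a minimal illustration; the function and variable names are our own and not part of DHIS2, which computes these rates internally):

```python
def reporting_rates(submitted, submitted_on_time, expected=12):
    """Percentage Reporting Rate (RR) and Reporting Rate on Time (RRT).

    expected defaults to 12, the number of monthly MOH731 reports
    expected per programmatic area per year.
    """
    rr = submitted / expected * 100
    rrt = submitted_on_time / expected * 100
    return rr, rrt

# A facility that submitted 10 of 12 expected reports, 9 of them on time:
rr, rrt = reporting_rates(10, 9)
print(round(rr, 2), rrt)  # 83.33 75.0
```

Note that, as the text points out, both rates count submitted reports only; a blank report still raises RR, which is why the completion variable introduced in Step 4 below is needed.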

(2) Facility Information: We augmented the DHIS2 data with detailed facility information derived from the Kenya Master Facility List (KMFL). This information included facility level (II–VI), facility type (such as dispensary, health center, medical clinic) and facility ownership (such as private practice, MoH-owned, owned by a non-governmental organization).

(3) Electronic Medical Record Status: We used the Kenya Health Information Systems (KeHIMS) list, which contains electronic medical records (EMR) implemented in health facilities in Kenya, to incorporate information on whether the facility had an EMR or not.

Information from these three sources was merged into a single data set as outlined in Fig. 1.

Fig. 1 Creation of the evaluation data set

Step 4: Application of the framework for data cleaning
Figure 2 outlines the iterative cleaning process we applied adapting Van den Broeck et al.'s framework. Data cleaning involved repeated cycles of screening, diagnosis, and treatment of suspected data abnormalities, with each cycle resulting in a new data set. Details of the data cleaning process are outlined in Fig. 2.

a) Screening phase
During the screening phase, five types of oddities need to be distinguished, namely: lack or excess of data; outliers (data falling outside the expected range); erroneous inliers; strange patterns in distributions; and unexpected analysis results [21].

For determining errors, we used Reporting Rate and Reporting Rate on Time as key evaluation variables. Reporting Rate by itself only gives a sense of the proportion of expected reports submitted but does not evaluate whether exact HIV-indicator data elements are included within each report. To evaluate completion of HIV-indicator data elements within each of the programmatic area reports that were submitted, we created a new variable named 'Cumulative Percent Completion (CPC)'. Using the annual report extracted for HIV-indicator data elements per facility, Cumulative Percent Completion was calculated by counting the number of non-blank values and dividing this by the total number of indicators for each programmatic area. As such, if a facility has

reported on 10 out of 40 indicators in an annual report, it will have 25 percent completion. Therefore, Cumulative Percent Completion provides an aggregate annual summary of the proportion of expected indicator values that are completed within submitted reports. The results for Cumulative Percent Completion were then included as variables in the facility-records described in step 3, section 1. This resulted in a facility-record containing the following variables per row: facility name, year, percentage Reporting Rate, percentage Reporting Rate on Time and Cumulative Percent Completion for the six programmatic areas.
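The CPC calculation described above can be sketched as follows (a minimal illustration; the function name and the convention that a blank value is `None` or an empty string are our own assumptions, not the authors'):

```python
def cumulative_percent_completion(indicator_values, total_indicators):
    """CPC: share of a programmatic area's expected indicators that have
    non-blank values in the annual report, expressed as a percentage."""
    non_blank = sum(1 for v in indicator_values if v not in (None, ""))
    return non_blank / total_indicators * 100

# A facility reporting 10 of the 40 PMTCT indicators scores 25% completion:
values = [5] * 10 + [None] * 30
print(cumulative_percent_completion(values, total_indicators=40))  # 25.0
```

A reported zero counts as non-blank here, which matches the distinction the text draws between blank reports and reports containing zero (0) cases.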

Fig. 2 Repeated cycles of data cleaning

b) Diagnostic phase
The diagnostic phase enables clarification of the true nature of the worrisome data points, patterns, and statistics. Van den Broeck et al. posit the possible diagnoses for each data point as: erroneous, true extreme, true normal or idiopathic (no diagnosis found, but data still suspected of having errors) [21]. We used a combination of Reporting Rate, Reporting Rate on Time and Cumulative Percent Completion to detect various types of situations (errors or no errors) for each facility per annual report (Table 1). Using the combination of Cumulative Percent Completion, Reporting Rate, and Reporting Rate on Time, we were able to categorize the various types of situations to be used in diagnosis for every year a facility reported into DHIS2 (Table 1). In this table, "0" represents a situation where the percentage is zero; "X" represents a situation where the percentage is above zero; and "> 100%" represents a situation where the percentage is more than 100. These data points

Table 1 Categorization of the various situations within DHIS2 and actions taken

Situation | CPC | RR | RRT | Diagnosis | Action
A | 0 | 0 | 0 | Nothing was reported by the facility during this period, signifying that the facility does not report to DHIS2. This could be a true normal | Facility records excluded
B | 0 | X | X | Submitted reports might be on time, but are empty. Can result from programs wanting to have a full MOH 731 submission even though they do not offer services in all 6 programmatic areas, hence submitting empty reports for non-required programmatic areas (report is useless to a decision-maker as it is empty) | Facility records excluded
C | 0 | X | 0 | Submitted reports are empty and not on time (report is useless to a decision-maker as it is empty and not on time) | Facility records excluded
D | X | 0 | 0 | No values present for RR and RRT; however, the reports are not empty | Facility records excluded
E | X | > 100% | X | Erroneous records, as percentage RR cannot go beyond 100; this is not logically possible | Facility records excluded
F | X | > 100% | > 100% | Erroneous records; percentage RR and RRT cannot go beyond 100 as this is not logically possible | Facility records excluded
G | X | X | X | Reports submitted on time with relevant indicators included. Ideal situation | Facility records included
H | X | X | 0 | Submitted reports with data elements in them, but not submitted in a timely manner | Facility records included

CPC cumulative percent completion; RR reporting rate; RRT reporting rate on time
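The decision rules in Table 1 can be expressed as a small classification routine. The sketch below is a minimal illustration, not the tooling used in the study; the function name and the encoding of percentages as plain numbers are assumptions:

```python
def classify_situation(cpc, rr, rrt):
    """Map a facility-year record to a Table 1 situation (A-H).

    cpc, rr, rrt: percentages for Cumulative Percent Completion,
    Reporting Rate and Reporting Rate on Time.
    """
    if cpc == 0:
        if rr == 0 and rrt == 0:
            return "A"  # nothing reported at all
        if rr > 0 and rrt > 0:
            return "B"  # reports submitted (possibly on time) but empty
        if rr > 0 and rrt == 0:
            return "C"  # empty reports, not on time
    else:
        if rr == 0 and rrt == 0:
            return "D"  # data present, but no RR/RRT recorded
        if rr > 100 and rrt > 100:
            return "F"  # RR and RRT both logically impossible
        if rr > 100:
            return "E"  # RR logically impossible
        if rrt > 0:
            return "G"  # ideal: data present, complete and on time
        return "H"  # data present and complete, but not on time
    return None  # combination not covered by Table 1


# Situations A-F are excluded; only G and H are retained.
keep = classify_situation(80, 95, 90) in {"G", "H"}
```

Records classified as A through F (and duplicates) are the ones excluded in the treatment phase; G and H are carried into the clean data set.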
Gesicho et al. BMC Med Inform Decis Mak (2020) 20:293 Page 8 of 15

were considered as erroneous records, as the percentage reporting rate cannot go beyond 100; this is not logically possible. Based on the values of each of the three variables, it was possible to diagnose the various issues within DHIS2 (Diagnosis column).

For each programmatic area report (e.g. Blood Safety) we categorized facilities by year and variables. All health facilities with an average Cumulative Percent Completion, Reporting Rate, and Reporting Rate on Time of zero (0) across all reports were identified as not having reported for the year and were henceforth excluded, as demonstrated by the examples of Facility A and Facility B in Table 2.

Beyond categorization of the various situations by report type, facility and year as defined above, errors related to duplicates were also identified using two scenarios. In the first scenario, health facilities had similar attributes (year, name and county) but different data for Reporting Rate and Reporting Rate on Time. In the second scenario, health facilities had similar attributes (year, name and county) and identical data for Reporting Rate and Reporting Rate on Time.

c Treatment phase
This is the final stage after screening and diagnosis, and entails deciding on the action to take for the problematic records identified. Van den Broeck et al. limit the action points to correcting, deleting or leaving unchanged [21]. Based on the diagnosis illustrated in Table 1, facility-records in situations A-F were deleted and hence excluded from the study. Duplicates identified in the two scenarios were also excluded: for duplicates with similar attributes (year, name and county) but different Reporting Rate and Reporting Rate on Time data, all entries were deleted; for duplicates with similar attributes and identical Reporting Rate and Reporting Rate on Time data, only one entry was deleted. Only reports in situations G and H were considered ideal for the final clean data set.

Step 5: Data analysis
The facility-records were then disaggregated to form six individual data sets representing each of the programmatic areas, containing the following attributes: facility name, year, Cumulative Percent Completion, percentage Reporting Rate and percentage Reporting Rate on Time, as well as the augmented data on facility information and EMR status. The disaggregation was necessary because facilities offer different services and do not necessarily report indicators for all the programmatic areas. SPSS was used to analyze the data using frequency distributions and cross tabulations in order to screen for duplication and outliers. Individual health facilities with more than eight annual reports for a specific programmatic area were identified as duplicates, since the maximum number of annual reports per programmatic area for an individual facility is eight, given that data were extracted over an eight-year period. From the cross tabulations, percentage Reporting Rate and percentage Reporting Rate on Time values above 100% were identified as erroneous records.

After the multiple iterations of data cleaning as per Fig. 2, where erroneous data were removed by situation type (identified in Table 1), a final clean data set was available and brought forward to be used in a separate study for subsequent secondary analyses (which include answering the evaluation question in Step 1). At the end of the data cleaning exercise, we determined the percentage distribution of the various situation types that resulted in the final data set. The percentages were calculated by dividing the number of facility-records in each situation type by the total facility-records in each programmatic area respectively, then multiplying by 100. As such, only data sets disaggregated into the six programmatic areas were included in the analysis. Using this analysis and the descriptions from Table 1, we selected situation B and situation D in order to determine whether there is a difference in the distribution of facility records containing the selected situation types in the six programmatic areas across the 8 years (2011-2018).

This will enable comparing the distribution of facility records by programmatic area categorized by situation B

Table 2 Example of sectional illustration of first data set containing facility records

Year | Organisation unit | CPC-HCT | RR-HCT | RRT-HCT | CPC-BS | RR-BS | RRT-BS | ** | Avg-CPC | Avg-RR | Avg-RRT
2016 | Facility A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2016 | Facility B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2017 | Facility C | 10 | 90 | 80 | 100 | 90 | 80 | 0 | 50 | 60 | 50

CPC cumulative percentage completion; RR-HCT reporting rate, HIV counselling and testing; RRT reporting rate on time; BS blood safety; Avg average; ** remaining four reports with the same variable sequence
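The screening rule illustrated by Facility A and Facility B in Table 2 (exclude a facility-year when the average CPC, RR and RRT across all six reports are zero) can be sketched as follows. The records and field names are hypothetical, and only two of the six report triples carry data, mirroring the table's layout:

```python
# Each facility-year row carries (CPC, RR, RRT) triples for the six
# programmatic area reports, as in Table 2.
rows = [
    {"year": 2016, "facility": "Facility A", "metrics": [(0, 0, 0)] * 6},
    {"year": 2017, "facility": "Facility C",
     "metrics": [(10, 90, 80), (100, 90, 80)] + [(0, 0, 0)] * 4},
]

def averages(metrics):
    """Average CPC, RR and RRT across the six reports."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))

def reported_in_year(row):
    # A facility-year is excluded when the average CPC, RR and RRT
    # are all zero (the facility did not report that year).
    return any(v > 0 for v in averages(row["metrics"]))

kept = [r for r in rows if reported_in_year(r)]
```

Applied to the rows above, only Facility C survives the screen; the all-zero Facility A row is dropped, as in Table 2.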

and situation D. The data contains related samples and is not normally distributed. Therefore, a Friedman analysis of variance (ANOVA) was conducted to examine whether there is a difference in the distribution of facility reports by programmatic area across all years, N = 8 (2011-2018), for the selected situation types. The variables analyzed include year, situation type and programmatic area, and the unit of analysis is the number of records in each situation type for a programmatic area. The distribution of facility-records was measured in all six programmatic areas across the eight years and categorized by situation type. Wilcoxon Signed Rank Tests were carried out as post hoc tests to compare the significance of differences in facility report distribution within the programmatic areas.

Below, we report on findings from the iterative data cleaning exercise and the resulting clean data set. The results further illustrate the value of the data cleaning exercise.

Results
Figure 3 reports the facility records at each cycle of the data cleaning process and the number (proportion) of excluded facility-records representing data with errors at each cycle.

The proportion of the resultant dataset after removal of the various types of errors from the facility records is represented in Table 3. A breakdown of reporting by facilities, in descending order based on facility records retained after cleaning in dataset 4, is as follows: 93.98% were retained for HIV Counselling and Testing (HCT), 83.65% for Prevention of Mother to Child Transmission (PMTCT), 43.79% for Care and Treatment (CrT), 22.10% for Post Exposure Prophylaxis (PEP), 0.66% for Voluntary Medical Male Circumcision (VMMC), and 0.46% for Blood Safety (BS).

Situations where data were present in reports but no values were recorded for Reporting Rate and Reporting Rate on Time (situation D), and scenarios with empty reports (situation B), were analyzed (Fig. 4) in order to examine whether there are differences in the distribution of facility records by programmatic area across the eight years, categorized by situation type. Most facilities submitted empty PEP reports (18.04%) based on data set 4, as shown in Fig. 4.

Overall, Friedman test results for the distribution of records with situation B and situation D in the various programmatic areas reveal statistically significant differences in facility record distribution (p = 0.001) across the eight years. Specific mean rank results categorized by error type are described in the subsequent paragraphs.

Friedman test results for empty reports (situation B) reveal that PEP had the highest mean rank of 6.00 compared to the other programmatic areas: CT (3.50), PMTCT (4.88), CrT (2.00), VMMC (3.00) and BS (1.63). Post hoc tests presented in Table 4 also reveal that PEP had a higher distribution of facility records in situation B (0XX) in all eight years.

Friedman test results for the distribution of records with situation D (X00) reveal that PMTCT and CrT had the highest mean ranks of 5.88 and 5.13 respectively, compared to the other programmatic areas: CT (3.00), VMMC (3.06), PEP (2.88) and BS (1.06). Post hoc tests presented in Table 5 reveal that PMTCT and CrT had a higher distribution of facility records in situation D (X00) in all 8 years.

Discussion
Systematic data cleaning approaches are salient in identifying and sorting out issues within the data, resulting in a clean data set that can be used for analyses and decision-making [21]. This study presents the methods and results of a systematic and replicable data cleaning approach employed on routine HIV-indicator data reports in preparation for secondary analyses.

For data stored in DHIS2, this study assumed that the inbuilt data quality mechanisms dealt with the predefined syntactical data quality aspects such as validation rules. As such, a contextual approach to data cleaning was employed on data extracted from DHIS2, with the aim of distinguishing noise (data that are not relevant for the intended use, or of poor quality) from relevant data, as presented by the various situations in Table 1. As demonstrated in this study, identifying various issues within the data may require a human-driven approach, as inbuilt data quality checking mechanisms within systems may not have the benefit of particular domain knowledge. Furthermore, these human-augmented processes also facilitated diagnosis of different issues which would otherwise have gone unidentified. For instance, our domain knowledge about health facility HIV reporting enabled us to identify the various situations described in Table 1. This entailed examining more than one column at a time of manually integrated databases and using the domain knowledge in making decisions on actions to take on the data set (treatment phase). Similarly, Maina et al. also used domain knowledge on maternal and child health programmes in adjusting for incomplete reporting [48]. In addition, descriptive statistics such as cross tabulations and frequency counts complemented the human-driven processes in order to identify issues within the data such as erroneous records (screening phase).

The use of Cumulative Percent Completion (CPC) in this study facilitated screening and diagnosis of problematic issues highlighted in similar studies that are consistent with our findings. These include identifying and dealing with non-reporting facilities (situation A), and

Fig. 3 Data cleaning process

non-service providing facilities (situations B and C) in a data set [19, 48]. This comes about because some of the extracted reports contain blanks, as DHIS2 is unable to record zeros, as identified in other studies [16-19, 49]. As such, DHIS2 is unable to distinguish between missing values and true zero values. Therefore, facilities containing such

Table 3 Proportion of facility records (2011-2018) by programmatic area in the various situations based on facility records in dataset 4 (n = 42,007)

Situation | HCT (%) | PMTCT (%) | CrT (%) | VMMC (%) | PEP (%) | BS (%)
B (0XX) | 2.68 | 6.15 | 1.32 | 2.81 | 18.04 | 1.70
C (0X0) | 0.75 | 0.75 | 0.32 | 1.13 | 0.76 | 0.19
D (X00) | 0.66 | 1.97 | 1.66 | 0.78 | 0.71 | 0.09
G (XXX) | 92.44 | 81.52 | 42.60 | 0.63 | 21.82 | 0.45
H (XX0) | 1.57 | 2.13 | 1.20 | 0.03 | 0.28 | 0.01
Duplicates | 0.02 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00
Total facility records (based on data set 4) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Total facility records removed | 6.02 | 16.35 | 56.21 | 99.34 | 77.90 | 99.54
Total facility records retained | 93.98 | 83.65 | 43.79 | 0.66 | 22.10 | 0.46

Situation: a detailed explanation of the various reporting situations within DHIS2 can be found in Table 1
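The percentages in Table 3 follow the computation described in Step 5: the count of facility-records in each situation type divided by the total facility-records for the programmatic area, multiplied by 100. A minimal sketch with hypothetical counts (the label values are illustrative, not the study's actual tallies):

```python
from collections import Counter

# Hypothetical situation labels for one programmatic area's facility-records
labels = ["G"] * 92 + ["H"] * 2 + ["B"] * 3 + ["C"] + ["D"] + ["duplicate"]

counts = Counter(labels)
total = len(labels)

# Percentage distribution per situation type, as in Table 3
distribution = {s: round(100 * c / total, 2) for s, c in counts.items()}

# Only situations G and H are carried into the clean data set
retained = round(100 * (counts["G"] + counts["H"]) / total, 2)
removed = round(100 - retained, 2)
```

With these hypothetical labels, 94% of the records would be retained and 6% removed, the same style of breakdown reported in the last two rows of Table 3.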

Fig. 4 Distribution of facility records based on situation B (empty reports) and situation D against programmatic area

records either are assumed not to be providing the particular service in question or are non-reporting facilities (providing services but not reporting, or not expected to provide reports).

In most cases, such records are excluded from the analyses [19, 48], as was the approach applied in this study. Furthermore, non-service providing facilities were excluded on the basis that they may produce inaccurate analyses for the evaluation question described in Step 1: analyses may portray facilities as having good performance in facility reporting completeness and timeliness, and hence give a wrong impression, when no services were provided in a particular programmatic area (situations B and C). As such, even though a report was submitted on time by a facility, it will not be of benefit to a decision-maker, as the report has no indicators (it is empty). Nonetheless, it is worth noting that reporting facilities considered to be providing HIV services but with zero percent timeliness were retained, as these records were necessary for the subsequent analyses.

Maiga et al. posit that non-reporting facilities are often assumed not to be providing any services, given that reporting rates are often ignored in analyses [13]. With this in mind, this study considered various factors prior to exclusion of non-reporting facility records. These include identifying whether there were any successful report submissions in the entire year, and whether the submitted reports contained any data in the entire year. Therefore, facilities with records that did not meet these criteria (situations A, B, and C) were considered non-service providing in the respective programmatic areas.

Further still, another finding consistent with similar studies is that of identifying and dealing with incomplete reporting, which can be viewed from various perspectives. This can include a situation where a report for a provided service has been successfully submitted but is incomplete [17, 19, 48], or missing reports (expected reports have not been submitted consistently for all 12 months), making it difficult to identify whether services were provided or not in months where

Table 4 Results for Wilcoxon signed rank test for distribution of records in situation B

Situation B, empty reports (0XX)

Pairwise comparison by programmatic area | P value | Z value | Distribution of records in situation B based on pairwise comparison
PMTCT—HCT | 0.012 | −2.521 | Higher in PMTCT for 8 years
CrT—HCT | 0.036 | −2.100 | Lower in CrT for 6 years
PEP—HCT | 0.012 | −2.521 | Higher in PEP for 8 years
BS—HCT | 0.012 | −2.524 | Lower in BS for 8 years
CrT—PMTCT | 0.017 | −2.521 | Lower in CrT for 7 years
VMMC—PMTCT | 0.012 | −2.521 | Lower in VMMC for 8 years
PEP—PMTCT | 0.012 | −2.521 | Higher in PEP for 8 years
BS—PMTCT | 0.012 | −2.524 | Lower in BS for 8 years
VMMC—CrT | 0.050 | −1.960 | Higher in VMMC for 6 years
PEP—CrT | 0.012 | −2.521 | Higher in PEP for 8 years
PEP—VMMC | 0.012 | −2.521 | Higher in PEP for 8 years
BS—VMMC | 0.012 | −2.524 | Lower in BS for 8 years
BS—PEP | 0.012 | −2.521 | Lower in BS for 8 years

PMTCT prevention of mother to child transmission; HCT HIV counselling and testing; PEP post-exposure prophylaxis; BS blood safety; CrT care and treatment; VMMC voluntary medical male circumcision

Table 5 Results for Wilcoxon signed rank test for distribution of facility records in situation D (X00)

Situation D (X00)

Pairwise comparison by programmatic area | P value | Z value | Distribution of records in situation D based on pairwise comparison
PMTCT—HCT | 0.012 | −2.521 | Higher in PMTCT for 8 years
CrT—HCT | 0.012 | −2.521 | Higher in CrT for 8 years
BS—HCT | 0.012 | −2.524 | Lower in BS for 8 years
VMMC—PMTCT | 0.012 | −2.521 | Lower in VMMC for 8 years
PEP—PMTCT | 0.012 | −2.521 | Lower in PEP for 8 years
BS—PMTCT | 0.012 | −2.521 | Lower in BS for 8 years
VMMC—CrT | 0.012 | −2.524 | Lower in VMMC for 8 years
PEP—CrT | 0.012 | −2.527 | Lower in PEP for 8 years
BS—CrT | 0.012 | −2.524 | Lower in BS for 8 years
BS—VMMC | 0.018 | −2.375 | Lower in BS for 8 years
BS—PEP | 0.012 | −2.524 | Lower in BS for 8 years

PMTCT prevention of mother to child transmission; HCT HIV counselling and testing; CrT care and treatment; PEP post-exposure prophylaxis; BS blood safety; VMMC voluntary medical male circumcision
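The Friedman test behind Tables 4 and 5 ranks the programmatic areas within each year and compares rank sums across the eight years. The authors ran it in SPSS; the standard-library sketch below is only an illustration of the statistic and the mean ranks (it does not compute the p value):

```python
def friedman(blocks):
    """Friedman chi-square statistic, with average ranks for ties.

    blocks: N related samples (e.g. years), each a list of k values
    (e.g. record counts per programmatic area).
    Returns (statistic, mean rank per group).
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        # Rank the k values within this block, averaging tied ranks
        order = sorted(range(k), key=lambda j: block[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and block[order[j + 1]] == block[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for t in range(i, j + 1):
                rank_sums[order[t]] += avg_rank
            i = j + 1
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return stat, [r / n for r in rank_sums]
```

For example, two years in which the same group always has the most records give that group the top mean rank and the maximum statistic for N = 2, k = 3; the mean ranks returned correspond to the mean-rank values quoted in the Results.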

reports were missing [48]. Whereas some studies retain these facility records, others opt to make adjustments for incomplete reporting. Maiga et al. posit that these adjustments need to be made in a transparent manner when creating the new data set, with no modifications made to the underlying reported data [13]. In this study, all facility records were included (situations G and H) irrespective of incomplete reporting, which was similar to the approach taken by Thawer et al. [19]. On the other hand, Maina et al. opted to adjust for incomplete reporting, except where missing reports were considered an indication that no services were provided [48]. Furthermore, a number of studies in DHIS2 have identified duplicate records [16, 18, 19], with removal or exclusion as the common action undertaken to prepare the data set for analyses. These findings thus demonstrate duplication as a prevalent issue within DHIS2 [16, 18, 19, 49].
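The two duplicate scenarios handled in the treatment phase (drop all entries when duplicates conflict on Reporting Rate or Reporting Rate on Time; keep a single entry when they are identical) can be sketched as follows, using hypothetical facility records:

```python
from collections import defaultdict

# Hypothetical records; duplicates share (year, name, county)
records = [
    {"year": 2016, "name": "Clinic X", "county": "Nairobi", "rr": 90, "rrt": 80},
    {"year": 2016, "name": "Clinic X", "county": "Nairobi", "rr": 90, "rrt": 80},
    {"year": 2016, "name": "Clinic Y", "county": "Kisumu", "rr": 70, "rrt": 60},
    {"year": 2016, "name": "Clinic Y", "county": "Kisumu", "rr": 85, "rrt": 50},
    {"year": 2016, "name": "Clinic Z", "county": "Nakuru", "rr": 75, "rrt": 70},
]

groups = defaultdict(list)
for r in records:
    groups[(r["year"], r["name"], r["county"])].append(r)

cleaned = []
for grp in groups.values():
    values = {(r["rr"], r["rrt"]) for r in grp}
    if len(grp) == 1:
        cleaned.append(grp[0])   # no duplicate
    elif len(values) == 1:
        cleaned.append(grp[0])   # identical duplicates: retain one entry
    # conflicting duplicates (len(values) > 1): exclude all entries
```

Here Clinic X's identical duplicates collapse to one record, Clinic Y's conflicting duplicates are excluded entirely, and the singleton Clinic Z passes through unchanged.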

Whereas studies using DHIS2 data have found it necessary to clean the extracted data prior to analyses [16, 18, 19], transparent and systematic approaches are still lacking in the literature [20]. Given that the contexts in which data are used vary, there is no one-size-fits-all solution to data cleaning, considering the many existing approaches as well as the subjective component of data quality [25, 26]. As such, transparent and systematic documentation of procedures is valuable, as it also increases validity in research [21]. Moreover, existing literature advocates the need for clear and transparent descriptions of data set creation and data cleaning methods [9, 21, 22]. Therefore, the generic five-step approach developed in this study is a step in the right direction, as it provides a systematic sequence that can be adopted for cleaning data extracted from DHIS2.

In addition, the statistical analyses employed, such as non-parametric tests, provide an overview of the distribution of facility records containing quality issues within the various programmatic areas, flagging where further investigation is necessary. These statistics also provided a picture of the most reported programmatic areas whose reports contain data. Moreover, as revealed in the screening, diagnosis and treatment phases presented in this paper, the data cleaning process can be time consuming. Real-world data such as the DHIS2 data, and merged real-world data sets as shown in this paper, may be noisy, inconsistent and incomplete. In the treatment stage, we presented the actions taken to ensure that only meaningful data were included for subsequent analysis. Data cleaning also resulted in a smaller data set than the original, as demonstrated in the results [29]. As such, the final clean data set obtained in this study is more suitable for its intended use than in its original form.

A limitation of this study was the inability to determine the causes of some of the issues encountered. Whereas quality issues are in part attributed to insufficient skills or data entry errors committed at the facility level [14], some of the issues encountered in our findings (such as duplication, and situations E and F) are assumed to stem from within the system. Nonetheless, there is need for further investigation of their causes. In addition, given that situation D was identified as a result of merging two data sets extracted from DHIS2, it was expected that if reports contain indicator data, then their respective Reporting Rate and Reporting Rate on Time should be recorded. It was, however, not possible within the confines of this study to identify the cause of situation D, so further investigation is also required. In addition, there are limitations to human-augmented procedures, as to err is human, especially when dealing with extremely large data sets, as posited by other studies [24]. Moreover, data cleaning for large data sets can be time consuming. Nonetheless, identifying and understanding issues within the data using a human-driven approach provides a better perspective prior to developing automatic procedures, which can then detect the identified issues. Therefore, there is a need to develop automated procedures or tools for detecting and handling the different situation types in Table 1.

DHIS2 incorporates a quality tool, which uses a similar concept to that used in calculating Cumulative Percent Completion in this study, to flag facilities with more than 10 percent zero or missing values in the annual report [12]. Based on this, we recommend that facilities with 100 percent zero or missing values also be flagged in the annual report in order to identify empty reports, as well as situations where Reporting Rate on Time is zero in the annual report. Further still, automated statistical procedures can be developed within the system to perform various analyses, such as calculating the number of empty reports submitted by a facility over a chosen period of time, per programmatic area. This could have beneficial practical implications, such as enabling decision-makers to understand the frequency of provision of certain services among the six programmatic areas within a particular period among health facilities. We also recommend that measures be established within DHIS2 implementations to ensure that cases reported as zero appear in DHIS2.

Such findings could be used to improve the quality of reporting. Automatic procedures should also be accompanied by data visualizations and analyses, integrated within the iterative process, in order to provide insights [35]. In addition, user engagement in the development of automatic procedures, and actively training users in identifying and discovering various issues within the data, may contribute to better data quality [35, 37].

Conclusion
Comprehensive, transparent and systematic reporting of the cleaning process is important for the validity of research studies [21]. The data cleaning described in this article was semi-automatic. It complemented the automatic procedures and resulted in improved data quality for use in secondary analyses, which could not have been secured by the automated procedures alone. In addition, to our knowledge, this was the first systematic attempt to transparently report on data cleaning procedures developed and applied to HIV-indicator data reporting in DHIS2 in Kenya. Furthermore, more robust and systematic data cleaning processes should be integrated into the current inbuilt DHIS2 data quality mechanisms to ensure the highest quality data.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12911-020-01315-7.

Additional file 1. Programmatic areas (reports) with respective indicators as per the MOH 731 Comprehensive HIV/AIDS Facility Reporting Form extracted from DHIS2.

Additional file 2. Facility report submission data extracted from DHIS2.

Abbreviations
BS: Blood safety; CPC: Cumulative percent completion; CrT: Care and treatment; DHIS2: District Health Information System Version 2; EMR: Electronic medical record; HIV: Human immunodeficiency virus; HCT: HIV counselling and testing; KeHMS: Kenya Health Management System; KMFL: Kenya Master Facility List; LMICs: Low- and middle-income countries; MOH: Ministry of Health; NGO: Non-Governmental Organization; PEP: Post-exposure prophylaxis; PMTCT: Prevention of mother to child transmission; RHIS: Routine health information systems; RR: Reporting rate; RRT: Reporting rate on time; VMMC: Voluntary Medical Male Circumcision.

Acknowledgements
Not applicable.

Disclaimer
The findings and conclusions in this report are those of the authors and do not represent the official position of the Ministry of Health in Kenya.

Authors' contributions
MG, AB, and MW designed the study. AB and MW supervised the study. MG and AB analyzed the data. MG wrote the final manuscript. All authors discussed the results and reviewed the final manuscript. All authors read and approved the final manuscript.

Funding
This work was supported in part by the NORHED program (Norad: Project QZA-0484). The content is solely the responsibility of the authors and does not represent the official views of the Norwegian Agency for Development Cooperation.

Availability of data and materials
The data sets generated during the current study are available in the national District Health Information Software 2 online database, https://hiskenya.org/.

Ethics approval
Ethical approval for this study was obtained from the Institutional Review and Ethics Committee (IREC), Moi University/Moi Teaching and Referral Hospital (Reference: IREC/2019/78).

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Author details
1 Department of Information Science and Media Studies, University of Bergen, Bergen, Norway. 2 Vanderbilt University Medical Center, Nashville, USA. 3 Department of Biomedical Engineering, Linköping University, Linköping, Sweden. 4 Institute of Biomedical Informatics, Moi University, Eldoret, Kenya.

Received: 7 April 2020   Accepted: 4 November 2020

References
1. Hotchkiss DR, Diana ML, Foreit KGF. How can routine health information systems improve health systems functioning in low- and middle-income countries? Assessing the evidence base. Adv Health Care Manag. 2012;12:25-58.
2. De Lay PR, Massoud N, Carael M, Stanecki KA. Strategic information for HIV programmes. In: The HIV pandemic: local and global implications. Oxford Scholarship Online; 2007. p. 146.
3. Beck EJ, Mays N, Whiteside A, Zuniga JM. The HIV pandemic: local and global implications. Oxford: Oxford University Press; 2009. p. 1-840.
4. Granich R, Gupta S, Hall I, Aberle-Grasse J, Hader S, Mermin J. Status and methodology of publicly available national HIV care continua and 90-90-90 targets: a systematic review. PLoS Med. 2017;14:e1002253.
5. Peersman G, Rugg D, Erkkola T, Kirwango E, Yang J. Are the investments in monitoring and evaluation systems paying off? J Acquir Immune Defic Syndr. 2009;52(Suppl 2):S87-96.
6. Kariuki JM, Manders E-J, Richards J, Oluoch T, Kimanga D, Wanyee S, et al. Automating indicator data reporting from health facility EMR to a national aggregate data system in Kenya: an interoperability field-test using OpenMRS and DHIS2. Online J Public Health Inform. 2016;8:e188.
7. Karuri J, Waiganjo P, Orwa D, Manya A. DHIS2: the tool to improve health data demand and use in Kenya. J Health Inform Dev Ctries. 2014;8:38-60.
8. Dehnavieh R, Haghdoost AA, Khosravi A, Hoseinabadi F, Rahimi H, Poursheikhali A, et al. The District Health Information System (DHIS2): a literature review and meta-synthesis of its strengths and operational challenges based on the experiences of 11 countries. Health Inf Manag. 2019;48:62-75.
9. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Med. 2015;12:e1001885.
10. Dziadkowiec O, Callahan T, Ozkaynak M, Reeder B, Welton J. Using a data quality framework to clean data extracted from the electronic health record: a case study. eGEMs. 2016;4(1):11.
11. DHIS2 Documentation Team. Control data quality. DHIS2 user manual. 2020. https://docs.dhis2.org/2.31/en/user/html/dhis2_user_manual_en_full.html#control_data_quality. Accessed 10 Oct 2020.
12. Haugen JÅ, Hjemås G, Poppe O. Manual for the DHIS2 quality tool. Understanding the basics of improving data quality. 2017. https://ssb.brage.unit.no/ssb-xmlui/handle/11250/2460843. Accessed 30 Jan 2020.
13. Maïga A, Jiwani SS, Mutua MK, Porth TA, Taylor CM, Asiki G, et al. Generating statistics from health facility data: the state of routine health information systems in Eastern and Southern Africa. BMJ Glob Health. 2019;4:e001849.
14. Gloyd S, Wagenaar BH, Woelk GB, Kalibala S. Opportunities and challenges in conducting secondary analysis of HIV programmes using data from routine health information systems and personal health information. J Int AIDS Soc. 2016;19(Suppl 4):1-6.
15. Fan W, Geerts F. Foundations of data quality management. Synth Lect Data Manag. 2012;4:1-217.
16. Githinji S, Oyando R, Malinga J, Ejersa W, Soti D, Rono J, et al. Completeness of malaria indicator data reporting via the District Health Information Software 2 in Kenya, 2011-2015. Malar J. 2017;16:1-11.
17. Wilhelm JA, Qiu M, Paina L, Colantuoni E, Mukuru M, Ssengooba F, et al. The impact of PEPFAR transition on HIV service delivery at health facilities in Uganda. PLoS ONE. 2019;14:e0223426.
18. Maina JK, Macharia PM, Ouma PO, Snow RW, Okiro EA. Coverage of routine reporting on malaria parasitological testing in Kenya, 2015-2016. Glob Health Action. 2017;10:1413266.
19. Thawer SG, Chacky F, Runge M, Reaves E, Mandike R, Lazaro S, et al. Sub-national stratification of malaria risk in mainland Tanzania: a simplified assembly of survey and routine data. Malar J. 2020;19:177.
20. Shikuku DN, Muganda M, Amunga SO, Obwanda EO, Muga A, Matete T, et al. Door-to-door immunization strategy for improving access and utilization of immunization services in hard-to-reach areas: a case of Migori County, Kenya. BMC Public Health. 2019;19:1-11.
21. Van den Broeck J, Cunningham SA, Eeckels R, Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005;2:966-70.
22. Leahey E, Entwisle B, Einaudi P. Diversity in everyday research practice: the case of data editing. Sociol Methods Res. 2003;32:64-89.
23. Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst. 1996;12:5-33.

24. Langouri MA, Zheng Z, Chiang F, Golab L, Szlichta J. Contextual data cleaning. In: 2018 IEEE 34th International Conference on Data Engineering Workshops. 2018. p. 21–4.
25. Strong DM, Lee YW, Wang RY. Data quality in context. Commun ACM. 1997;40:103–10.
26. Bertossi L, Rizzolo F, Jiang L. Data quality is context dependent. In: Lecture notes in business information processing. 2011. p. 52–67.
27. Bolchini C, Curino CA, Orsi G, Quintarelli E, Rossato R, Schreiber FA, et al. And what can context do for data? Commun ACM. 2009;52:136–40.
28. Chapman AD. Principles and methods of data cleaning: primary species data. 1st ed. Report for the Global Biodiversity Information Facility. GBIF; 2005.
29. Zhang S, Zhang C, Yang Q. Data preparation for data mining. Appl Artif Intell. 2003;17:375–81.
30. Fayyad U, Piatetsky-Shapiro G, Smyth P. Knowledge discovery and data mining: towards a unifying framework. 1996.
31. Oliveira P, Rodrigues F, Galhardas H. A taxonomy of data quality problems. In: 2nd International Workshop on Data and Information Quality. 2005. p. 219.
32. Li L, Peng T, Kennedy J. A rule based taxonomy of dirty data. GSTF Int J Comput. 2011. https://doi.org/10.5176/2010-2283_1.2.52.
33. Müller H, Freytag J-C. Problems, methods, and challenges in comprehensive data cleansing. Technical Report HUB-IB-164, Humboldt University, Berlin. 2003. p. 1–23.
34. Seheult AH, Green PJ, Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. J R Stat Soc Ser A Stat Soc. 1989;152:133.
35. Hellerstein JM. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe; 2008.
36. Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64:402–6.
37. Chu X, Ilyas IF, Krishnan S, Wang J. Data cleaning: overview and emerging challenges. In: Proceedings of the ACM SIGMOD international conference on management of data. New York: ACM Press; 2016. p. 2201–6.
38. Vassiliadis P, Vagena Z, Skiadopoulos S, Karayannidis N, Sellis T. Arktos: a tool for data cleaning and transformation in data warehouse environments. IEEE Data Eng Bull. 2000;23.
39. WHO. Data Quality Review (DQR) Toolkit. World Health Organization; 2019. who.int/healthinfo/tools_data_analysis/en/. Accessed 5 Mar 2020.
40. MEASURE Evaluation. Routine Data Quality Assessment (RDQA) User Manual. 2015. https://www.measureevaluation.org/resources/tools/data-quality/rdqa-guidelines-2015. Accessed 23 Nov 2018.
41. World Health Organization. The immunization data quality self-assessment (DQS) tool. World Health Organization; 2005. www.who.int/vaccines-documents/. Accessed 6 Aug 2020.
42. Shanks G, Corbitt B. Understanding data quality: social and cultural aspects. In: 10th Australasian conference on information systems. 1999. p. 785–97.
43. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20:144–51.
44. Savik K, Fan Q, Bliss D, Harms S. Preparing a large data set for analysis: using the minimum data set to study perineal dermatitis. J Adv Nurs. 2005;52(4):399–409.
45. Miao Z, Sathyanarayanan S, Fong E, Paiva W, Delen D. An assessment and cleaning framework for electronic health records data. In: Industrial and systems engineering research conference. 2018.
46. Kulkarni DK. Interpretation and display of research results. Indian J Anaesth. 2016;60:657–61.
47. Luo W, Gallagher M, Loveday B, Ballantyne S, Connor JP, Wiles J. Detecting contaminated birthdates using generalized additive models. BMC Bioinform. 2014;12(15):1–9.
48. Maina I, Wanjal P, Soti D, Kipruto H, Droti B, Boerma T. Using health-facility data to assess subnational coverage of maternal and child health indicators, Kenya. Bull World Health Organ. 2017;95(10):683–94.
49. Bhattacharya AA, Umar N, Audu A, Allen E, Schellenberg JRM, Marchant T. Quality of routine facility data for monitoring priority maternal and newborn indicators in DHIS2: a case study from Gombe State, Nigeria. PLoS ONE. 2019;14:e0211265.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
