Handling of Missing Values

The document describes a 5-phase algorithm for data imputation, focusing on handling missing values represented by '?'. It involves scanning through rows to find similar rows based on matching column values, and then replacing missing values with the majority value from similar rows or the overall dataset. The algorithm emphasizes the importance of counting occurrences to determine the most frequent value for imputation.

Uploaded by

ambuj.kumar

This is an N-phase algorithm (N = 5 is usually sufficient, but it can be increased if needed).

Each phase consists of the following steps.

Scan through all rows.


for (i = 0 to numberOfRows - 1) do
{
    for (j = 0 to numberOfColumns - 1) do {
        if (data[i][j] is '?') then {
            Step 1: Find similar rows
            - Scan through all rows of the dataset EXCEPT row 'i'.
            - If at least half of the column values that are NOT '?' match those of
              row 'i', and the row's value in column 'j' is NOT '?', then add that
              row to similarRows.

            Step 2 (a): If similarRows is an empty set, then

            Scan through all rows except row 'i' and look at the value in column 'j'.
            If it is non-empty (not '?'), increment the count for that value.
            After all rows have been scanned, choose the value with the highest count.
            Replace data[i][j] with that majority value.

            Step 2 (b): If similarRows is a non-empty set, then

            Scan through all rows in similarRows and count the values in column 'j'.
            Choose the value that occurs most often and replace data[i][j]
            with that majority value.
        }
    }
}
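
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation, and it rests on several assumptions that the description leaves open: the dataset is a list of lists of strings, "at least half" is measured against the columns of row i that are not '?', ties in the majority vote go to the value seen first, and a cell is left as '?' when no other row has a value in that column. The names MISSING, is_similar, majority, impute_phase and impute are hypothetical.

```python
from collections import Counter

MISSING = "?"  # marker for a missing value, as in the description


def is_similar(target, other, j):
    """Assumed reading of Step 1: `other` has a value in column j and agrees
    with `target` on at least half of the columns where `target` is known."""
    if other[j] == MISSING:
        return False
    known = [k for k, v in enumerate(target) if v != MISSING]
    if not known:
        return False
    matches = sum(1 for k in known if other[k] == target[k])
    return 2 * matches >= len(known)


def majority(values):
    """Most frequent value, or None if `values` is empty (ties: first seen)."""
    counts = Counter(values)
    return counts.most_common(1)[0][0] if counts else None


def impute_phase(data):
    """One phase: visit every cell and fill each '?' in place."""
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value != MISSING:
                continue
            others = [r for k, r in enumerate(data) if k != i]
            similar = [r for r in others if is_similar(row, r, j)]
            # Step 2 (b) uses the similar rows; Step 2 (a) falls back to all other rows.
            pool = similar if similar else others
            fill = majority([r[j] for r in pool if r[j] != MISSING])
            if fill is not None:
                row[j] = fill
    return data


def impute(data, phases=5):
    """Run the phase N times (N = 5 by default, as suggested in the text)."""
    for _ in range(phases):
        impute_phase(data)
    return data
```

For example, given the rows ["a", "x", "1"], ["a", "x", "?"], ["b", "y", "2"], ["b", "y", "2"] and ["?", "y", "2"], the second row's missing value is taken from the first row (its only similar row), and the last row's missing value is taken from the two "b" rows.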
