It is an N-phase algorithm (N = 5 is usually sufficient, but it can be increased if needed).
Each phase consists of the following steps.
Scan through all rows.
for (i = 0 to numberOfRows - 1) do {
    for (j = 0 to numberOfColumns - 1) do {
        if (data[i][j] is ‘?’) then {
            Step 1: Find similar rows.
                - Scan through all rows of the dataset EXCEPT row ‘i’.
                - If at least half of the column values (those that are NOT ‘?’) match
                  row ‘i’, and the row's value in column ‘j’ is NOT ‘?’, then include
                  that row in similarRows.
            Step 2 (a): If similarRows is empty, then
                Scan through all rows except row ‘i’ and look at the value in column ‘j’.
                If it is not ‘?’, increment the count for that value.
                After all rows have been scanned, choose the value with the highest
                count and replace data[i][j] with that majority value.
            Step 2 (b): If similarRows is non-empty, then
                Scan through all rows in similarRows and count the values in column ‘j’.
                Choose the value that occurs most often and replace data[i][j]
                with that majority value.
        }
    }
}
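
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. The names MISSING, is_similar, majority and impute are hypothetical. The sketch assumes that "half" is measured against the columns of row i that are not ‘?’, that ties in the majority vote go to the value seen first, that cells are filled in place (so later cells see earlier fills), and that a cell stays ‘?’ when no other row has a value in that column.

```python
from collections import Counter

MISSING = "?"  # marker for a missing value, as in the pseudocode


def is_similar(row, other, j):
    """Step 1 test (assumed reading): `other` has a known value in column j
    and agrees with `row` on at least half of the columns where `row` is known."""
    if other[j] == MISSING:
        return False
    known = [c for c, v in enumerate(row) if v != MISSING]
    matches = sum(1 for c in known if other[c] == row[c])
    return 2 * matches >= len(known)


def majority(values):
    """Return the most frequent value, or None if there are no values."""
    counts = Counter(values)
    return counts.most_common(1)[0][0] if counts else None


def impute(data, phases=5):
    """Fill MISSING cells in place, repeating the scan up to `phases` times."""
    for _ in range(phases):
        for i, row in enumerate(data):
            for j in range(len(row)):
                if row[j] != MISSING:
                    continue
                others = [r for k, r in enumerate(data) if k != i]
                similar = [r for r in others if is_similar(row, r, j)]
                # Step 2(b) votes among similar rows; Step 2(a) falls back
                # to all other rows when no similar row exists.
                pool = similar if similar else others
                value = majority(r[j] for r in pool if r[j] != MISSING)
                if value is not None:
                    row[j] = value
        # Assumption: stop early once nothing is left to fill.
        if all(v != MISSING for r in data for v in r):
            break
    return data


sample = [
    ["a", "x", "1"],
    ["a", "x", "?"],
    ["b", "y", "2"],
    ["b", "y", "2"],
    ["b", "z", "2"],
]
print(impute(sample)[1])  # ['a', 'x', '1']
```

In the sample, the only row similar to the second row is the first one, so the missing cell takes the value "1" even though "2" is the most common value in that column overall. That is the difference between Step 2 (b) and the fallback in Step 2 (a).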