Lecture 5
Cleaning Data
In this lecture, first you will learn how to deal with inaccurate data, how to remove empty rows,
and how to remove duplicated data. Next, you will learn how to change the case of text, how to
change date formatting, and how to trim whitespace from data. Finally, you will learn how to use
the Flash Fill feature and functions in Excel to help clean data.
Software Used in this Lab
Excel Desktop Version
Dataset Used in this Lab
The dataset used in this lab comes from the following source:
https://dataplatform.cloud.ibm.com/exchange/public/entry/view/f8ccaf607372882403a37d9019b3abf4
This dataset is published by IBM, and includes fictitious customer demographics and sales data.
The second dataset used in this lab comes from the following source:
https://www.kaggle.com/sudalairajkumar/indian-startup-funding
Acknowledgement and thanks also go to https://trak.in, who were generous enough to share
the data publicly for free.
We are using modified subsets of these datasets for the lab, so to follow the lab instructions
successfully please use the datasets provided with the lab, rather than the datasets from their
original sources.
The third dataset used in this lab is an internal dataset.
Development Sample.
*Data is for validation purposes only.
This dataset includes fictitious customer demographics and sales data. You can use it to analyze
customer demographics, such as age, gender, income, and location, and then combine that data
with sales data to examine trends for product categories, transaction types, and product
popularity.
Objectives
After completing this lab, you will be able to:
Understand how to deal with irrelevant or inaccurate data
Remove empty rows and duplicated data
Change text case and date formatting
Trim whitespace from data
Use Flash Fill and functions to clean data
Exercise 1: Removing Duplicated, Irrelevant or Inaccurate Data
In this exercise, you will learn how to deal with inaccurate data, how to remove empty rows, and
how to remove duplicated data.
Task A: Check spelling
1. Download the file Customer_demographics_and_sales1.xlsx and open it using Excel.
2. Select column L (CREDITCARD_TYPE), then click the Review tab and select Spelling.
3. Click the correct suggestion to change the spelling.
a. Note: Do not change the spelling of ‘jcb’ during the spell check. We will need ‘jcb’
for Exercise 1 Task D.
4. Close the Spelling pane.
Task B: Remove empty rows
1. Press CTRL+HOME, then press CTRL+SHIFT+END to select the whole datasheet.
2. On the Data tab, click Filter.
3. Press CTRL+HOME, click the filter arrow in the CUST_NAME column, and then click Filter.
4. Click the Select All checkbox to deselect all of them. Then select just Blanks, then OK.
5. Select the first row, then press CTRL+SHIFT+END to select all rows.
6. Right-click the selected rows and then click Delete Rows.
7. Finally, on the Data tab, click Clear, then click Filter.
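For readers who want to see the same idea outside Excel, the following is a minimal Python sketch of removing empty rows. The sample rows are hypothetical and stand in for the workbook data; only the column names CUST_NAME and ORDER_ID come from this lab. Note one difference: the Excel steps filter on blanks in CUST_NAME only, while this sketch treats a row as empty only when every cell is blank.

```python
# Hypothetical sample rows; the real workbook has many more columns.
rows = [
    {"CUST_NAME": "Allen Perl", "ORDER_ID": "1001"},
    {"CUST_NAME": "", "ORDER_ID": ""},
    {"CUST_NAME": "Dana Cole", "ORDER_ID": "1002"},
]

def is_blank(row):
    """Return True when every cell in the row is empty or only whitespace."""
    return all(str(value).strip() == "" for value in row.values())

# Keep only the rows that contain at least one non-blank cell.
cleaned = [row for row in rows if not is_blank(row)]
```

Checking every cell is the more cautious choice, because a row with a missing name but a valid order would otherwise be deleted along with the truly empty rows.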
Task C: Remove duplicate rows
1. Select column T (ORDER_ID), since ORDER_ID values should be unique.
2. On the Home tab, click Conditional Formatting > Highlight Cells Rules > Duplicate Values,
and then click OK.
3. Select the whole datasheet (press CTRL+HOME, then CTRL+SHIFT+END).
4. On the Data tab, click Remove Duplicates.
5. In the Remove Duplicates dialog box, ensure that Select all columns is checked and
that My data has headers is also checked, then click OK.
6. In the pop-up box informing you how many duplicate values were found and removed,
click OK.
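The logic behind Remove Duplicates with all columns selected can be sketched in Python as follows. This is an illustration under assumptions, not a description of how Excel works internally: the rows are hypothetical, and each row is modelled as a tuple of its cell values.

```python
# Hypothetical rows: (ORDER_ID, CUST_NAME, CREDITCARD_TYPE).
rows = [
    ("1001", "Allen Perl", "VISA"),
    ("1002", "Dana Cole", "jcb"),
    ("1001", "Allen Perl", "VISA"),  # exact duplicate of the first row
]

seen = set()
unique_rows = []
for row in rows:
    # A row counts as a duplicate only if every column matches an earlier row.
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)
```

As in the Excel steps, the first occurrence of each row is kept and later copies are dropped.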
Task D: Use Find & Replace to correct misspelling
1. On the Home tab, click Find & Select.
2. Click Find. In Find what, type jcb, and click Find All.
3. Click Replace.
4. In Replace with, type JCB, click Replace All, and then click the Close icon.
5. On the Home tab, click Conditional Formatting > Clear Rules > Clear Rules from Entire
Sheet.
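A rough Python equivalent of this Find & Replace step is shown below. The list of card types is hypothetical sample data. The comparison ignores case, on the assumption that Match case is left unchecked in the Excel dialog.

```python
# Hypothetical values from the CREDITCARD_TYPE column.
card_types = ["VISA", "jcb", "Mastercard", "jcb"]

# Replace any cell whose whole content is "jcb" (in any letter case) with "JCB".
fixed = ["JCB" if value.strip().lower() == "jcb" else value for value in card_types]
```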
Exercise 2: Dealing with Inconsistencies in Data
In this exercise, you will learn how to change the case of text, how to change date formatting,
and how to trim whitespace from data.
Task A: Use the PROPER function to change text from upper case to proper case
1. Select row 2, then right-click it and choose Insert Rows.
2. In cell A2, type =PROPER(A1) and press Enter.
3. Hover over the bottom-right corner of cell A2, and drag the Fill Handle across to the last
column.
4. If dragging across is too difficult with the mouse, select the cells in row 2
using SHIFT+RIGHT ARROW, press F2 to return the cursor focus to cell A2, then
hold CTRL while you press Enter.
5. Select row 2, then press CTRL+C.
6. Select row 1, right-click and choose Paste Options > Values.
7. Select row 2, right-click it and choose Delete Rows.
Task B: Use the UPPER function to change text from proper case to upper case
1. Select column AG (Generation). Then right-click and choose Insert Columns. In cell AG1,
type Generation.
2. In cell AG2, type =UPPER(AH2) and press Enter.
3. Hover over the bottom-right corner of cell AG2 and double-click the Fill Handle.
4. Select column AG, then press CTRL+C.
5. Select column AH, right-click and choose Paste Options > Values.
6. Select column AG, right-click it and choose Delete Columns.
Task C: Use the LOWER function to change text from proper case to lower case
1. Select column AC (T_Type). Then right-click and choose Insert Columns. In cell AC1,
type T_Type.
2. In cell AC2, type =LOWER(AD2) and press Enter.
3. Hover over the bottom-right corner of cell AC2 and double-click the Fill Handle.
4. Select column AC, then press CTRL+C.
5. Select column AD, right-click and choose Paste Options > Values.
6. Select column AC, right-click it and choose Delete Columns.
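The three case functions used in Tasks A to C have close counterparts in Python's string methods. The sketch below is only an analogy: str.title() behaves like PROPER for simple headers such as these, but the two are not guaranteed to agree on every input.

```python
# PROPER: capitalize the first letter of each word, lower-case the rest.
proper = "CUST_NAME".title()

# UPPER: convert every letter to upper case.
upper = "Baby_Boomers".upper()

# LOWER: convert every letter to lower case.
lower = "Complete".lower()
```

Here the underscore acts as a word boundary, which is why a header like CUST_NAME becomes Cust_Name.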
Task D: Change date formatting
1. Select column Z (Order_Ship_Date).
2. On the Home tab, in the Number group, click Number Format > More Number Formats.
3. In the Category list, select Date.
4. In the Format Cells box, under Locale, select English (United States).
5. Under Type, select Wednesday, March 14, 2012 and click OK.
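Changing the date format changes only how a date is displayed, not the underlying value. The Python sketch below illustrates the same idea: one date value rendered in the long format chosen above. It assumes Python's default (English) locale for day and month names.

```python
from datetime import date

# The stored value is a date; the format only controls how it is shown.
ship_date = date(2012, 3, 14)

# Build "Weekday, Month D, YYYY" without zero-padding the day number.
long_format = f"{ship_date:%A, %B} {ship_date.day}, {ship_date.year}"

# The same value in a short numeric format, for comparison.
short_format = ship_date.strftime("%m/%d/%Y")
```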
Task E: Use Find & Replace to trim whitespace
1. Press CTRL+HOME.
2. Select all the data using CTRL+SHIFT+END.
3. On the Home tab, click Find & Select, then Replace.
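4. In Find what, type two spaces. In Replace with, type one space.
5. Click Find All, then click Replace All. Repeat Replace All until no further matches are
found, since longer runs of spaces need more than one pass.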
6. Click the Close icon.
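A single pass of replacing two spaces with one can still leave double spaces behind when a cell contains three or more spaces in a row. The Python sketch below demonstrates this with a hypothetical value and shows one way to repeat the replacement until none remain.

```python
def collapse_spaces(text):
    """Replace runs of spaces with a single space and trim both ends."""
    while "  " in text:                # keep going while any double space remains
        text = text.replace("  ", " ")
    return text.strip()                # also drop leading and trailing spaces

messy = "Allen" + " " * 3 + "Perl"     # three spaces between the names

one_pass = messy.replace("  ", " ")    # still contains a double space
clean = collapse_spaces(messy)
```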
Exercise 3: More Excel Features for Cleaning Data
In this exercise, you will learn how to use the Flash Fill feature and functions in Excel to help clean
data.
Task A: Use the Flash Fill feature to clean data
1. Select column A (Cust_Name), right-click and choose Insert Columns.
2. In cell A1 type Customer_Name and press Enter.
3. In cell A2, type Mr. Allen Perl and press Enter.
4. Select column A (Customer_Name), then on the Data tab, click Flash Fill.
5. Click Undo to undo this step.
If you are using the desktop version of Excel, you could use the ‘Text to Columns’ feature to
perform this next task (see the corresponding topic video for instructions).
If you are using ‘Excel for the web’ (the online version of Excel), the ‘Text to Columns’ feature
is not available, but you can achieve the same results using functions, as shown in the steps
below.
Task B: Use LEFT, RIGHT, LEN, and SEARCH functions to clean data
1. Select column A (Cust_Name), right-click and choose Insert Columns.
2. Select column A again, right-click and choose Insert Columns.
3. In cell A1, type Customer_Firstname and in cell B1, type Customer_Lastname.
4. Click C1, then on the Home tab, click Format Painter, then drag across to A1 and B1.
5. Double-click the divider between columns A and B.
6. In cell A2, type =LEFT(C2,SEARCH(" ",C2,1)-1) and press Enter. Subtracting 1 keeps the
space itself out of the first name.
7. In cell B2, type =RIGHT(C2,LEN(C2)-SEARCH(" ",C2,1)) and press Enter.
8. Double-click the Fill Handle on cell A2.
9. Double-click the Fill Handle on cell B2.
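The split performed in this task finds the first space and takes the text on either side of it. A minimal Python sketch of that logic follows. The function name split_name is hypothetical, and this version drops the separating space from both parts. Python positions are 0-based, whereas SEARCH in Excel counts from 1.

```python
def split_name(full_name):
    """Split a name at its first space into (first, rest)."""
    space_at = full_name.find(" ")     # 0-based position, or -1 if no space
    if space_at == -1:
        return full_name, ""           # single-word name: nothing to split off
    first = full_name[:space_at]       # like LEFT: characters before the space
    rest = full_name[space_at + 1:]    # like RIGHT: characters after the space
    return first, rest
```

For example, split_name("Allen Perl") returns ("Allen", "Perl"). A name with no space is returned unchanged here, whereas SEARCH in Excel returns a #VALUE! error when the text is not found.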
Filtering and Sorting Data
Objectives
After completing this lab, you will be able to:
Use the Filter and Sort tools
Use IF, IFS, COUNTIF, SUMIF, and SUMIFS functions for data analysis
Exercise 4: Filtering and Sorting Data
In this exercise, you will learn how to use the Filter and Sort tools in Excel to filter and sort our
data to enable us to control what information is displayed, and how it is displayed in our
worksheets.
Task A: Filtering data
To use Auto Filters to filter data:
1. Download the file Customer_demographics_and_sales2.xlsx. Upload and open it using
Excel for the web.
2. Select any cell in the data, and click the Data tab, then click Filter.
3. Click the filter drop-down in column AG (Purchase_Status), and select Filter….
4. In the list, only select Frequent and click OK.
5. Click the filter drop-down in the column AG, and click Clear Filter From
“Purchase_Status”.
6. Click the filter drop-down in column AE (T_Type), and select Filter….
7. In the list, only select Cancelled and click OK.
8. Click the filter drop-down in column AF (Purchase_Touchpoint), and select Filter….
9. In the list, only select Desktop and click OK.
10. On the Data tab, click Clear.
To use Custom Filters to filter data:
1. Click the filter drop-down in column AD (Order_Value), then Number Filters > Top 10….
2. Change the value from 10 to 50 and click OK.
3. Click the filter drop-down in the column AD, and click Clear Filter From “Order_Value”.
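Both kinds of filter can be expressed in a few lines of Python. In this sketch the records are hypothetical, the status "Occasional" is an invented value, and top_n is a hypothetical helper. Unlike Excel, it does not include extra rows that tie with the last kept value.

```python
# Hypothetical records using two column names from the workbook.
orders = [
    {"Order_Value": 120, "Purchase_Status": "Frequent"},
    {"Order_Value": 480, "Purchase_Status": "Occasional"},
    {"Order_Value": 310, "Purchase_Status": "Frequent"},
    {"Order_Value": 75, "Purchase_Status": "Occasional"},
]

# Auto Filter: keep only rows whose status equals the selected value.
frequent = [o for o in orders if o["Purchase_Status"] == "Frequent"]

def top_n(records, key, n):
    """Return the n records with the largest values for the given key."""
    return sorted(records, key=lambda r: r[key], reverse=True)[:n]

# Custom Filter: the "Top N" items by Order_Value (N = 2 for this small sample).
top_two = top_n(orders, "Order_Value", 2)
```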
Task B: Sorting data
1. On the Data tab, click Custom Sort to open the Custom Sort dialog box.
2. In the Sort By row, click the Column drop-down and select Order_Ship_Date.
3. In the Sort By row, click the Order drop-down and select Sort Ascending.
4. Click Add.
5. In the Then By row, click the Column drop-down and select Order_Value.
6. In the Then By row, click the Order drop-down and select Sort Descending.
7. Click OK.
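A two-level sort like this one orders rows by the first key and uses the second key only to break ties. The Python sketch below reproduces that behaviour on hypothetical records. It relies on Python's sort being stable, so it sorts by the secondary key first and the primary key second.

```python
from datetime import date

# Hypothetical records; two of them share the same ship date.
orders = [
    {"Order_Ship_Date": date(2012, 3, 14), "Order_Value": 50},
    {"Order_Ship_Date": date(2012, 3, 12), "Order_Value": 300},
    {"Order_Ship_Date": date(2012, 3, 14), "Order_Value": 200},
]

# Then By: Order_Value, descending (applied first because the sort is stable).
by_value_desc = sorted(orders, key=lambda o: o["Order_Value"], reverse=True)

# Sort By: Order_Ship_Date, ascending (ties keep the order from the previous step).
ordered = sorted(by_value_desc, key=lambda o: o["Order_Ship_Date"])
```

The earliest ship date comes first, and the two orders shipped on the same day appear with the larger order value ahead of the smaller one.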
Exercise 5: Useful Functions for Data Analysis
In this exercise, you will learn how to use some of the most common functions a Data Analyst
might use, namely IF, IFS, COUNTIF, SUMIF, and SUMIFS.
Task A: Use of IF to apply one condition
1. Select column AF, right-click it, and choose Insert.
2. In cell AF1, type Complete?.
3. In cell AF2, type =IF(AE2="Complete","Yes","No") and press Enter.
4. Double-click the Fill Handle of AF2 to copy down the column.
Task B: Use of Nested IF to apply multiple conditions
1. Select column AE, right-click it, and choose Insert.
2. In cell AE1, type Order Size (IF).
3. In cell AE2, type =IF(AD2>300,"Large",IF(AD2>100,"Medium",IF(AD2>0,"Small"))) and
press Enter.
4. Double-click the Fill Handle of AE2 to copy down the column.
Task C: Use of IFS to apply multiple conditions (alternative of Nested IF)
1. Select column AE, right-click it, and choose Insert.
2. In cell AE1, type Order Size (IFS).
3. In cell AE2, type =IFS(AD2>300,"Large",AD2>100,"Medium",AD2>0,"Small") and
press Enter.
4. Double-click the Fill Handle of AE2 to copy down the column.
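The conditions used in Tasks A to C map directly onto if/elif chains. The following Python sketch uses the same thresholds as the formulas above. The function names are hypothetical, and the text comparison ignores case because Excel's = operator does too.

```python
def is_complete(t_type):
    """Single condition, like IF: Yes when the type is Complete, otherwise No."""
    return "Yes" if t_type.lower() == "complete" else "No"

def order_size(order_value):
    """Several conditions tested in order, like nested IF or IFS."""
    if order_value > 300:
        return "Large"
    elif order_value > 100:
        return "Medium"
    elif order_value > 0:
        return "Small"
    return None  # no condition matched (zero or negative values)
```

The order of the tests matters: the first condition that is true wins, so a value of 350 is labelled Large even though it is also greater than 100. When no condition matches, the nested IF formula returns FALSE and the IFS formula returns an #N/A error. This sketch returns None in that case.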
Task D: Use of COUNTIF to count the number of cells that meet a specified criterion
1. Select cell BX2 and type count VISA card.
2. Select cell BY2 and type =COUNTIF(N2:N195,"VISA") and press Enter.
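COUNTIF counts the cells in a range that match a criterion. A minimal Python sketch of the same count is shown below, using a hypothetical list of card types. The comparison ignores case, as COUNTIF does for text criteria.

```python
# Hypothetical values from the credit card type column.
card_types = ["VISA", "JCB", "visa", "Mastercard", "VISA"]

# Count the entries equal to "VISA", ignoring letter case.
visa_count = sum(1 for card in card_types if card.lower() == "visa")
```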
Task E: Use of SUMIF function to sum the values within a specified range that meet a specified
criterion
1. Select cell BX3 and type sum Large order.
2. Select cell BY3 and type =SUMIF(AE2:AE195,"Large",AD2:AD195) and press Enter.
3. Syntax: =SUMIF(range, criteria, [sum_range]).
Task F: Use of SUMIFS function to sum the values within a specified range that meet multiple
specified criteria
1. Select cell BX4 and type sum Large order with Baby Gen.
2. Select cell BY4 and type =SUMIFS(AD2:AD195,AE2:AE195,"Large",AL2:AL195,"*BABY_BOOMERS*")
and press Enter.
3. Syntax: =SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2, criteria2], …).
Unlike in SUMIF, sum_range comes first and is required.
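SUMIF and SUMIFS add up one column for the rows that satisfy one or several conditions. The Python sketch below mirrors Tasks E and F on hypothetical records. The key names Order_Size and Generation and the value GEN_X are assumptions made for illustration. The asterisks in the criterion *BABY_BOOMERS* are wildcards, so the second test is a "contains" check.

```python
# Hypothetical records; the last one has stray spaces around the generation.
orders = [
    {"Order_Value": 350, "Order_Size": "Large", "Generation": "BABY_BOOMERS"},
    {"Order_Value": 400, "Order_Size": "Large", "Generation": "GEN_X"},
    {"Order_Value": 150, "Order_Size": "Medium", "Generation": "BABY_BOOMERS"},
    {"Order_Value": 500, "Order_Size": "Large", "Generation": " BABY_BOOMERS "},
]

# SUMIF: total order value where the size is Large.
large_total = sum(o["Order_Value"] for o in orders if o["Order_Size"] == "Large")

# SUMIFS: total order value where the size is Large AND the generation
# contains BABY_BOOMERS anywhere in the text (the effect of the * wildcards).
large_boomer_total = sum(
    o["Order_Value"]
    for o in orders
    if o["Order_Size"] == "Large" and "BABY_BOOMERS" in o["Generation"].upper()
)
```

Because of the wildcards, the record with stray spaces around the generation still counts toward the second total.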