Data Analytics
Data Cleaning and Formulas
WELCOME TO GA
GENERAL ASSEMBLY
Our Learning Goals
In this lesson, we’ll:
● Apply data cleaning best practices,
including working with NULLs.
● Experiment with common Excel formulas.
2 | © 2023 General Assembly
Where We Are in the DA Workflow
Wrangle/Prepare:
Clean and prepare
relevant data.
Discussion:
Getting to Know Your Data
The Superstore regional sales director from the central U.S. region has reached
out to you with a request:
We are seeing a high volume of returns. Can you dig into what might be
causing this?
What should we look into first?
Partner Exercise:
5 Minutes
Let’s Find Out!
1. Download Lesson 02_Superstore_workbook and examine the “Orders” and
“Returns” sheets.
2. Discuss with a partner which data points we should examine to determine
why return volume has increased.
3. Then, discuss where we’ll need to dig in to explain the higher volume of
returns.
Data Cleaning and Formulas
Importing Data for Excel
Best Practices
Importing Data | Getting Your Sandbox Ready
Data Set Best Practices | Resave
If you plan to analyze data in Excel,
always convert .CSV files to .XLSX
right away.
● Go to File >> Save As and choose the .XLSX file type.
But why?
CSV (comma-separated values) is plain text that stores
only the cell values of a single sheet, while XLSX is a
workbook format (zipped XML, not plain text) that holds
both content and formatting, including formulas, for all
of the worksheets.
Data Set Best Practices | Rename
● Rename the sheet that contains the data
“Raw Data.”
● Make a copy of that sheet by right-clicking
on the sheet’s tab and choosing “Move or Copy.”
● In the window that appears, check the box
next to “Create a copy.”
● Hit “OK” and rename the copied sheet
“Clean Data.”
Computers Out:
Data Set Best Practices
Let’s do this together! Open up the lesson
workbook and...
1. Document ALL of the steps you take in
your analysis.
2. Create a working summary sheet that
includes the following:
a. A directory of other sheets.
b. An explanation of analysis.
c. A short summary of your results.
Be sure to update this sheet regularly!
Data Cleaning and Formulas
Strategies for Cleaning &
Preparing Your Data
Data Cleaning
Data cleaning is the process of assembling data into a
usable format for analysis.
Common data cleaning actions include:
● Reformatting dates so that Excel recognizes them as dates.
● Extracting day/hour/month/year from a date to aggregate
by those categories.
● Removing duplicate values or rows.
● Combining data sources into one table.
● Concatenating or separating data.
NULLs
A NULL value is any missing value in your data.
One common way of conceptualizing a NULL
value is thinking of it as “empty” — not 0, not the
word “NULL,” just empty!
Four Primary Strategies for Handling NULLs
1. Find missing values (using reference
resources).
2. Ignore them (some may have meaning).
3. Impute values (e.g., median or zeros).
4. Delete them (only with caution).
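For readers who want to see the "impute" strategy outside of Excel, here is a minimal Python sketch. It assumes missing values are represented as None, and the profit numbers are hypothetical, not taken from the Superstore workbook. Note that a real 0 is kept as data; only the empty entries are filled.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        # Nothing to compute a median from, so leave the data unchanged.
        return list(values)
    fill = median(known)
    return [fill if v is None else v for v in values]

# Hypothetical profit column with two missing entries.
profits = [12.5, None, 40.0, 7.5, None]
print(impute_median(profits))
```

Whether the median is a sensible fill value depends on the column; this only illustrates the mechanics.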
Discussion:
2 minutes
What to Do With Blank Cells
Take a look at your profit value for
Row 2.
Should this be 0?
Share your answer and reasoning
with the class.
Finding and Replacing Blanks
● In the “Home” menu, choose the “Find & Select”
button.
● Click “Go to Special…”
● Select the “Blanks” radio button and hit OK.
● Don’t click anything! Just type a “-” and then hold
down the Control key (same for Mac users).
● While holding Control, tap “Enter,” and all of the blank
cells should now be filled in with dashes.
Text to Columns
What if we hypothesized that there might be a difference in sales or profit
between states?
Right now, we can’t complete that analysis because city and
state are lumped together. We can fix this, however, using
Excel’s “Text to Columns” feature!
Text to Columns | Step by Step
Step 1: Right-click on the column to the right of the “city_state” column (it should be
“sub_region”) and choose “Insert” to insert a new blank column to the right of
“city_state.”
Step 2: Click on the “city_state” header to select the entire “city_state” column.
Step 3: Select the “Text to Columns” button in the “Data” menu on the ribbon.
Text to Columns | Step by Step
Step 4: Choose “Delimited.” Then, click “Next”
and check off “Comma.” Click “Finish.”
Step 5: If Excel warns that it will
replace data, hit “OK.”
Step 6: Rename the “city_state” column to
just “city,” and the second column to “state.”
Text to Columns | Trimming the Spaces
Oh no! The space transferred over with the state name.
Let’s clean this up:
Step 1: Insert another column to the right of the state column;
name this new column “state_trimmed.”
Step 2: Use the TRIM function to take out the extra space in front.
=TRIM(text)
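The split-then-trim idea can also be sketched in Python. This is an illustration only: it assumes each value contains exactly one comma, the sample value is hypothetical, and excel_trim is a made-up helper that approximates TRIM (removing leading and trailing spaces and collapsing repeated spaces between words).

```python
def split_city_state(city_state):
    """Split 'City, State' at the comma, as a comma delimiter would."""
    city, _, state = city_state.partition(",")
    return city, state

def excel_trim(text):
    """Approximate TRIM: strip outer spaces, collapse inner runs of spaces."""
    return " ".join(part for part in text.split(" ") if part)

# Hypothetical value; the leading space on the state survives the split.
city, state = split_city_state("Chicago, Illinois")
print(repr(state))
print(repr(excel_trim(state)))
```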
Discussion:
2 minutes
Checking for Duplicates
Finally, let’s check for duplicates!
What would be an indicator of a duplicate in our data set?
Checking for Duplicates | Step by Step
Step 1: Click on “Remove Duplicates”
from the “Data” menu in the ribbon.
Step 2: Uncheck “Select All.” Then,
check off ONLY “order_info_id”
(Column A), “order_id_number”
(Column B) and “product_id” (Column
F) before clicking “OK.”
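To make the logic behind these steps concrete, here is a small Python sketch that treats a row as a duplicate when the three checked columns all match an earlier row, and keeps the first occurrence. The rows below are hypothetical sample data, and remove_duplicates is an illustrative helper name.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row for each combination of key column values."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# Hypothetical rows: the second row repeats the first row's key values.
rows = [
    {"order_info_id": "CA-2021", "order_id_number": 1001, "product_id": "P-1"},
    {"order_info_id": "CA-2021", "order_id_number": 1001, "product_id": "P-1"},
    {"order_info_id": "CA-2021", "order_id_number": 1001, "product_id": "P-2"},
]
keys = ["order_info_id", "order_id_number", "product_id"]
print(len(remove_duplicates(rows, keys)))
```

Two products in the same order are not duplicates, which is why product_id is part of the key.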
Data Cleaning and Formulas
Asking the Right Questions
(of Your Data)
Asking the Right Questions
What insights about returns
can be gained from the
Superstore data set?
Hmm, this question is
really broad. Let me
explore the data set
first.
Exploratory Analysis | Definition
In a nutshell, exploratory analysis means “getting
to know” a data set, which can include:
● Reviewing columns’ names.
● Obtaining aggregate metrics for number
columns (average, sum, min, max, etc.).
● Creating PivotTables to view the unique
values that can appear in a given text column.
● Crafting preliminary visualizations.
Exploratory Analysis | Best Practices
As part of an exploratory analysis, you should
ALWAYS determine:
● The number of rows in the data set.
○ What each row represents in the data set
(each row is a unique what?).
● The number of columns in the data set.
○ What each column represents and how that
data was collected. Try getting a data
dictionary!
Computers Out:
5 minutes
Getting to Know the Superstore Data Set
Take five minutes to explore the columns in the
Superstore data set and consider the following:
● How was the data for each column collected?
● What are the units of each column?
From Questions to Hypotheses
Start by asking yourself…
● What fields can I COMBINE to find interesting
insights?
● What ACTIONS can someone take as a result
of my charts and analyses?
From Questions to Hypotheses | Examples
Good example: If we look at profit and ship mode together, we might
discover that certain ship modes are consistently associated with lower
profits. Result/action: We might recommend that Superstore stop offering
those ship modes to customers in order to boost profits.
Bad example: Sales and order_id. We can get the average dollar amount
per item in an order_id; for example, the average cost of a product in order
123 was $15. But that doesn’t really lead to many useful insights for the
store. An aggregate of the average order amount across all orders or
particular categories might be more useful.
Discussion:
5 minutes
Formulating Superstore Hypotheses
Let’s brainstorm questions we can ask about the Superstore data set together.
What might be some interesting variables to combine to gain meaningful
insights?
Formulate them into a hypothesis and share it with your class.
Partner Exercise:
10 minutes
Formulating Superstore Hypotheses
Let’s revisit the business problem from earlier: We are seeing a high volume
of returns. Now that you’ve identified the data points you need, open the
lesson worksheet and work with your partner to:
1. Identify the questions you can ask to help gain interesting insights from the data.
2. Then, formulate your questions into a hypothesis. Here’s an example:
“If we compare the shipping cost and the order priority, we might find that high
shipping costs for low-priority orders frequently lead to returns.”
3. List it out in your worksheet.
4. Be prepared to share your work with your class.
Data Cleaning and Formulas
Introduction to Excel Functions
Data Referencing
Referencing, in its basic form,
means pulling the value of one
cell into another cell.
A2 references A1
Cell Referencing
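An absolute reference is a fixed (locked) location in a worksheet.
A dollar sign ($) locks the column letter or row number that follows it:
● Relative: A1 (column and row both shift when the formula is copied).
● Mixed, column only: $A1 (column is locked; row shifts).
● Mixed, row only: A$1 (row is locked; column shifts).
● Absolute: $A$1 (column and row are both locked).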
What Is a Formula in Excel?
A formula is an expression that
calculates the value of a cell.
Functions are predefined formulas
that are already available in Excel.
Navigating Formulas and Functions in Excel
[Screenshot of the Excel window with these areas labeled: Current Menu, Ribbon, Cell Name, Formula Bar]
The Anatomy of an Excel Function
=LEFT(A2,4)
● All functions start with the equals (=) sign.
● The name of the function (here, LEFT).
● The arguments (inside the parentheses) that the function
requires. Arguments are separated by commas.
Finding the Right Function
The typical workflow used by data analysts is:
Step 1: Google the task you are trying to accomplish.
Step 2: Find the name of the function (or functions!)
you need in the search results.
Step 3: Go to the Microsoft Excel documentation to
learn how to implement the function and see
examples.
Finding the Right Function | Google It
If you didn’t already know the
function for extracting months from
dates in Excel, here is an example of
how you’d phrase your Google
search:
“How to extract month from date
in Excel.”
Finding the Right Function | MS Documentation
Type TEXT function into
the search box. One of the
first results should be the
page for the TEXT
function.
Discussion:
1 minute
Finding the Right Function | Arguments
How many arguments does the TEXT
function require?
Finding the Right Function | Arguments
Great! So we know that our function takes this form:
=TEXT(argument1, argument2)
Now, let’s figure out what argument1 and argument2 are.
TEXT Function | First Argument
=TEXT(argument1, argument2)
The MS documentation tells us that the first
argument is “Value you want to format.”
So, what is it that we want to format?
We want to format each date in the
order_date column! To do so, we
need to start with the first order date.
Then, we can drag the formula down
to calculate the rest. Thus, the first
argument of our function will be C2.
TEXT Function | Second Argument
=TEXT(argument1, argument2)
According to the MS documentation, the second argument is
“Format code you want to apply.” We need to figure out what
these format codes are.
Scroll down on the page. Do you see a section that might
give us more details? Call out when you find it!
Getting the Info We Need
Select “Dates” from the drop-down menu.
To get the full name of the month,
we need to use “mmmm.”
Computers Out:
Our First Cleaning Function
Are you ready to clean some data? Let’s get to it!
1. Open up your orders worksheet and add an "order_month" column to the
right of "order_date."
2. Apply one of the functions to the Superstore data set:
● =TEXT(C2, "mmmm")
● =TEXT([@[order_date]],"mmmm")
Best practice reminder: Put all formulas to the right side of your data set; don’t mix
them in with the raw data.
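As a cross-check on what the “mmmm” format code produces, the same idea can be sketched in Python. This is an analogy rather than an Excel feature: order_month is an illustrative helper, the date is hypothetical, and the English month name assumes Python's default locale.

```python
from datetime import date

def order_month(order_date):
    """Return the full month name, similar in spirit to TEXT(date, "mmmm")."""
    return order_date.strftime("%B")

# Hypothetical order date.
print(order_month(date(2023, 3, 14)))
```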
Guided Walk-Through:
Data Cleaning With COUNTIF
COUNTIF is another useful function for data cleaning. It can be used to:
● Count the number of cells in a range that contain specific data.
● Tell us whether or not a single cell meets a condition.
When there is a single cell in the COUNTIF range, the maximum that can be
returned is 1 and the minimum that can be returned is 0.
Syntax:
=COUNTIF(range, condition)
Guided Walk-Through:
Data Cleaning with COUNTIF | Let’s Try It!
Let’s use COUNTIF to return a 1 or 0 to help us figure out whether or not a
discount is more than our imposed limit of 30%.
1. Open up the “Orders” sheet.
2. Insert a column to the right of the “Discount” column called
“discount_over_30.”
3. Enter =COUNTIF(Q2, ">0.3").
We can now SUM this column to find out the number of orders that were
discounted more than 30%.
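The flag-then-sum pattern can be sketched in Python as follows. The sketch assumes discounts are stored as fractions (0.3 for 30%), treats a blank (None) as not over the limit, and uses a strict greater-than comparison to match the "more than 30%" wording. The discount values are hypothetical.

```python
def discount_over_limit(discount, limit=0.30):
    """Return 1 if the discount is strictly above the limit, else 0."""
    if discount is None:
        return 0  # a blank cell does not meet the condition
    return 1 if discount > limit else 0

# Hypothetical discount column.
discounts = [0.0, 0.3, 0.45, None, 0.8]
flags = [discount_over_limit(d) for d in discounts]
print(flags, sum(flags))
```

Because every flag is 0 or 1, summing the column counts the rows that meet the condition.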
Guided Walk-Through:
So, What’s Really Going on With Returns? Part 1
To dive deeper into why Superstore is seeing a high volume of returns, we need
to take a closer look at orders, profit, and sales as well as individual customers.
It’s a lot to look at! But don’t worry, we’ll do this together, step by step. First, let’s
find out if some days of the week see higher volumes in sales and returns.
1. To extract the day of the week from the order_date, write out:
=TEXT(C2, "dddd") OR =TEXT([@[order_date]],"dddd")
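Once each row has a day name, the next step is to count rows per day. A rough Python sketch of that idea is below; the dates are hypothetical, the helper names are illustrative, and the English day names assume Python's default locale. In Excel, a PivotTable on the new column would play the same role.

```python
from collections import Counter
from datetime import date

def weekday_name(d):
    """Return the full day name, similar in spirit to TEXT(date, "dddd")."""
    return d.strftime("%A")

def orders_per_weekday(order_dates):
    """Count how many order dates fall on each day of the week."""
    return Counter(weekday_name(d) for d in order_dates)

# Hypothetical order dates.
dates = [date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 9)]
print(orders_per_weekday(dates))
```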
Guided Walk-Through:
So, What’s Really Going on With Returns? Part 2
2. Does profit margin affect whether or not something gets returned? To find
out, calculate the profit margin (profit divided by sales) per row. Insert a
new column next to profit (Column N) and enter:
=N2/M2 or =[@profit]/[@sales]
Next, we will wrap the formula in IFERROR to help us deal with NULLs in the
data set. A blank or zero sales value makes the division return a #DIV/0!
error; IFERROR replaces that error with an empty cell.
=IFERROR(N2/M2,"") or =IFERROR([@profit]/[@sales],"")
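The "try the calculation, fall back to an empty value" pattern looks like this as a Python sketch. It is an analogy, not a description of how Excel works internally: profit_margin is an illustrative helper, the numbers are hypothetical, and missing values are assumed to be None. One difference worth noting is that Excel treats a blank cell as 0 in arithmetic, while this sketch returns "" whenever either input is missing.

```python
def profit_margin(profit, sales):
    """Return profit / sales, or "" when the division cannot be done."""
    try:
        return profit / sales
    except (TypeError, ZeroDivisionError):
        # TypeError covers missing (None) inputs; ZeroDivisionError covers sales of 0.
        return ""

print(profit_margin(25.0, 100.0))
print(repr(profit_margin(10.0, 0)))
print(repr(profit_margin(10.0, None)))
```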
Guided Walk-Through:
So, What’s Really Going on With Returns? Part 3
3. Now, let’s look at individual customers to see if some customers return more
than others. You need to concatenate the order_info_id and the
order_id_number with a dash in between them to create a single order_id column.
Write out:
=[@order_info_id]&"-"&[@order_id_number] OR =A2&"-"&B2
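For comparison, the same concatenation in Python is a one-line string format. The helper name and the sample values are hypothetical; the real workbook's ID formats may differ.

```python
def make_order_id(order_info_id, order_id_number):
    """Join the two ID parts with a dash, like =A2&"-"&B2."""
    return f"{order_info_id}-{order_id_number}"

# Hypothetical ID parts.
print(make_order_id("CA-2021", 100123))
```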
Guided Walk-Through:
So, What’s Really Going on With Returns? Part 4
4. Finally, let’s decipher sales volume! To help us categorize our sales without
relying on the exact dollar amount, we’ll categorize sale amounts above $500 as
“High” and amounts of $500 or less as “Low.”
=IF([@sales] > 500, "High", "Low") OR
=IF(M2>500, "High", "Low")
Now that you have sales categorized, does it make a difference to returns?
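The IF logic above can be sketched in Python to make the boundary case explicit: with a strict greater-than test, a sale of exactly $500 lands in "Low." The helper name and sample amounts are hypothetical.

```python
def sales_category(sales, threshold=500):
    """Label a sale "High" if it is strictly above the threshold, else "Low"."""
    return "High" if sales > threshold else "Low"

# Hypothetical sale amounts, including the boundary value.
for amount in (120, 500, 500.01):
    print(amount, sales_category(amount))
```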
Data Cleaning and Formulas
Wrapping Up
Recap
Today in class, we...
● Applied data cleaning
best practices, including
working with NULLs.
● Conducted exploratory
analyses.
● Experimented with
common Excel formulas.
Looking Ahead
Up next: Referencing and Lookups
Additional Resources
● Excel File Setup for Analysis