From Lec 06:
Structure
• Multiple Files
• More File Formats
Scope and Temporality
Faithfulness (and Missing Values)
• Demo: Mauna Loa CO2
JSON: JavaScript Object Notation
A less common file format.
● Very similar to Python dictionaries
● Strict formatting and "quoting" address some of the issues present in CSV/TSV
● Self-documenting: Can save metadata (data
about the data) along with records in the
same file
To read a JSON file, use the pd.read_json() function, which works for most simple JSON files.
You will dive deeper into exactly how a JSON file can be structured in today's notebook.
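A minimal sketch of that call, assuming a simple records-oriented file named data.json (the file name and its contents are hypothetical):

import pandas as pd

# Works when the JSON file is a flat list of records, e.g. [{"name": "A", "count": 1}, ...]
df = pd.read_json("data.json")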
JSON Example: Berkeley COVID cases by day
Issues
● Not rectangular
● Each record can have different fields
● Nesting means records can contain tables – complicated
Reading a JSON into pandas often
requires some EDA.
JSON File
1. JSON (JavaScript Object Notation) is a lightweight data-interchange format that
machines can parse and generate easily.
2. Use: data storage and exchange between a server and a web application, as well as configuration files and data serialization.
3. Syntax: JSON data is represented as key-value pairs.
Keys are strings enclosed in double quotes ("), and values can be strings, numbers,
objects, arrays, Boolean values (true or false), null, or nested JSON objects.
4. File extension: .json
JSON File: Example
{
  "name": "John",
  "age": 30,
  "isStudent": false,
  "courses": ["Math", "Science"],
  "address": {
    "street": "123 Main St",
    "city": "Cityville"
  }
}

• Most programming languages have libraries or built-in support for parsing and generating JSON data.
• Compact and Efficient: JSON is relatively compact and efficient for data transmission and storage, making it suitable for various use cases, including mobile applications.
• Common Use Cases: JSON is used in a wide range of applications, including web development (for AJAX requests and data storage), configuration files (e.g., package.json in Node.js projects), and as an interchange format in APIs.
• Support for Nested Data: JSON allows for nested data structures, which can represent complex relationships and hierarchies.
JSON File: Reading
import json
with open(covid_file, "rb") as f:
covid_json = json.load(f)
1. Check the type: type(covid_json)
2. List the keys of the dictionary:
covid_json.keys()
Output: dict_keys(['meta', 'data'])
3. Go one level deeper:
covid_json['meta'].keys()
Output: dict_keys(['view'])
JSON File: Reading
covid_json['meta']['view'].keys()
Output:
dict_keys(['id', 'name', 'assetType', 'attribution',
'averageRating', 'category', 'createdAt', 'description',
'displayType', 'downloadCount', 'hideFromCatalog',
'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid',
'provenance', 'publicationAppendEnabled', 'publicationDate',
'publicationGroup', 'publicationStage', 'rowsUpdatedAt',
'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount',
'viewLastModified', 'viewType', 'approvals', 'clientContext',
'columns', 'grants', 'metadata', 'owner', 'query', 'rights',
'tableAuthor', 'tags', 'flags'])
JSON File: Reading [1/2]
covid_json['meta']['view']['columns']
{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data',
'fieldName': ':sid', 'position': 0, 'renderTypeName':
'meta_data', 'format': {}, 'flags': ['hidden']}
JSON File: Reading [2/2]
covid_json['meta']['view']['columns']
{'id': 542388893, 'name': 'New Cases', 'dataTypeName': 'number',
'description': 'Total number of new cases reported by date created in
CalREDIE. ', 'fieldName': 'bklhj_newcases', 'position': 2,
'renderTypeName': 'number', 'tableColumnId': 98765830, 'cachedContents':
{'non_null': '1387', 'largest': '326', 'null': '0', 'top': [{'item':
'0', 'count': '144'}, {'item': '1', 'count': '99'}, {'item': '2',
'count': '88'}, {'item': '4', 'count': '87'}, {'item': '3', 'count':
'86'}, {'item': '5', 'count': '65'}, {'item': '6', 'count': '62'},
{'item': '7', 'count': '54'}, {'item': '8', 'count': '45'}, {'item':
'11', 'count': '40'}, {'item': '9', 'count': '40'}, {'item': '12',
'count': '36'}, {'item': '13', 'count': '34'}, {'item': '10', 'count':
'34'}, {'item': '16', 'count': '24'}, {'item': '17', 'count': '23'},
{'item': '14', 'count': '23'}, {'item': '19', 'count': '22'}, {'item':
'18', 'count': '21'}, {'item': '15', 'count': '21'}], 'smallest': '0',
'count': '1387', 'cardinality': '114'}, 'format': {}}
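Once the structure is understood, the nested JSON can be assembled into a DataFrame. A minimal sketch, assuming (as this dataset's metadata suggests) that covid_json['data'] holds the records and that the column names live under covid_json['meta']['view']['columns']:

import pandas as pd

# Records live in 'data'; the column descriptions under meta -> view -> columns
# each carry a 'name' field we can use as the header.
covid = pd.DataFrame(
    covid_json['data'],
    columns=[c['name'] for c in covid_json['meta']['view']['columns']])
covid.head()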
Example: Calls data
● Looks like there are three columns with dates/times: EVENTDT, EVENTTM,
and InDbDate.
● Most likely, EVENTDT stands for the date when the event took place
● EVENTTM stands for the time of day the event took place (in 24-hr format)
● InDbDate is the date this call was recorded in the database.
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
calls["EVENTDT"].dt.month
calls["EVENTDT"].dt.dayofweek
Structure
• Multiple Files
• More File Formats
Scope and Temporality
Faithfulness (and Missing Values)
• Example: Mauna Loa CO2
Aside: An update to the Mauna Loa Dataset
https://gml.noaa.gov/ccgg/trends/data.html
What Are Our Variable Feature Types?
EDA step:
Understand what each record and each feature represents.
First, read the file description:
● All measurement variables (average, interpolated, trend) are monthly mean CO2 mole fractions
○ i.e., monthly average CO2 in ppm (parts per million)
○ Computed from daily means
● #days: number of daily means in a month (i.e., the number of days the equipment worked)
What variables define the first three columns?
● Year, month, and date in decimal
The Search for the Missing Values
EDA step:
Hypothesize why these values were missing,
then use that knowledge to decide whether to
drop or impute missing values
From the file description:
● -99.99: missing value for the monthly average (Avg)
● -1: missing value for the # days that the equipment was in operation that month
Which approach?
• Drop missing values
• Keep missing values as NaN
• Impute
How should we address the
missing Avg data?
Summary: Dealing with Missing Values
Mauna Loa Observatory CO2 levels (NOAA)
-99.99: missing value for the monthly average (Avg)
Option A: Drop those records
Option B: Keep missing values as NaN
Option C: Impute using the interpolated column Int
All three are probably fine since there are few missing values, but we chose Option C based on our EDA.
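A minimal sketch of Option C, assuming the data has been loaded into a DataFrame named co2 with columns Avg (monthly average, using -99.99 as the missing-value sentinel) and Int (interpolated value); the DataFrame name and exact column names are assumptions:

import numpy as np

# Replace the -99.99 sentinel in Avg with the interpolated value from Int
# (real monthly averages are always positive, so "< 0" only matches the sentinel)
co2["Avg"] = np.where(co2["Avg"] < 0, co2["Int"], co2["Avg"])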
With numeric data, you generally wrangle as
you do EDA.
With text data, wrangling is upfront and
requires new tools: Python string manipulation
and regular expressions.
Txt File
● Note: the Mauna Loa CO2 data is a .txt file
● Use the same pd.read_csv function to read the file (see the sketch after this list)
● Use the skiprows parameter to skip the header rows
● Use sep=r'\s+' as the delimiter for runs of whitespace (stay tuned for regex)
In this example:
● Visualize the monthly average CO2 concentration using sns.lineplot to check for missing values
● Verify that all records are listed correctly using .shape
● Check the distribution of #days using sns.displot
● Check the connection between missingness and the year of the recording using sns.scatterplot
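A minimal reading-and-checking sketch along these lines; the file name, the skiprows count, and the column names are assumptions to adjust to the actual file (the columns follow the file description: year, month, decimal date, average, interpolated, trend, #days):

import pandas as pd
import seaborn as sns

co2 = pd.read_csv(
    "co2_mm_mlo.txt",              # assumed file name
    header=None, skiprows=72,      # assumed number of header/comment rows
    sep=r"\s+",                    # delimiter: one or more whitespace characters
    names=["Yr", "Mo", "DecDate", "Avg", "Int", "Trend", "Days"])

co2.shape                                      # verify all records were read
sns.lineplot(x="DecDate", y="Avg", data=co2)   # spot the -99.99 sentinel values
sns.displot(co2["Days"])                       # distribution of #days per month
sns.scatterplot(x="Yr", y="Days", data=co2)    # missingness vs. year of recording

In a notebook, each plot would typically go in its own cell.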
Start Work on Notebook
LECTURE 7
Text Wrangling and Regex
Using string methods and regular expressions (regex) to work with textual data
Data Science @ Knowledge Stream
Sana Jabbar
This Week
Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions

(Last weeks) Data Wrangling, Intro to EDA
(Today) Working with Text, Regular Expressions
(Next) Data Visualization, Code for plotting data
Goals for this Lecture
Deal with a major challenge of EDA: cleaning text.
• Operate on text data using str methods
• Apply regex to identify patterns in strings
Agenda
• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions
Why Work With Text?
Why Work With Text? Two Common Goals
1. Canonicalization: Convert data that has
more than one possible presentation into
a standard form.
Ex: Join tables with mismatched labels
2. Extract information into a new feature.
Ex: Extract dates and times from log files

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

day, month, year = "26", "Jan", "2014"
hour, minute, seconds = "10", "47", "58"
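A minimal sketch of pulling those pieces out of one such log line using base-Python string methods (a regex, covered later in this lecture, would be more robust); the variable names mirror the slide:

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

timestamp = line.split("[")[1].split("]")[0]       # '26/Jan/2014:10:47:58 -0800'
date_part, hour, minute, rest = timestamp.split(":")
day, month, year = date_part.split("/")            # '26', 'Jan', '2014'
seconds = rest.split(" ")[0]                       # '58'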
pandas str Methods
From String to str
In “base” Python, we have various string operations to work with text data.
Recall:
Operation              Python (single string)
transformation         s.lower(), s.upper()
replacement/deletion   s.replace(…)
split                  s.split(…)
substring              s[1:4]
membership             'ab' in s
length                 len(s)
Problem: Python assumes we are working with one string at a time.
We would need to loop over each entry – slow for large datasets!
str Methods
Fortunately, pandas offers a method of vectorizing text operations: the .str operator
Series.str.string_operation()
Apply the function string_operation to every string contained in the Series
populations["County"].str.lower()
populations["County"].str.replace('&', 'and')
.str Methods
Most base Python string operations have a pandas str equivalent
Operation              Python (single string)   pandas (Series of strings)
transformation         s.lower()                ser.str.lower()
                       s.upper()                ser.str.upper()
replacement/deletion   s.replace(…)             ser.str.replace(…)
split                  s.split(…)               ser.str.split(…)
substring              s[1:4]                   ser.str[1:4]
membership             'ab' in s                ser.str.contains(…)
length                 len(s)                   ser.str.len()
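A quick illustration of a few of these vectorized methods on a small made-up Series (the county names are purely illustrative):

import pandas as pd

ser = pd.Series(["Alameda County", "Contra Costa County"])

ser.str.lower()                 # transformation: lowercase every entry
ser.str.replace(" County", "")  # replacement/deletion: drop the word "County"
ser.str.split(" ")              # split each entry into a list of words
ser.str[0:4]                    # substring: first four characters of each entry
ser.str.contains("Costa")       # membership: which entries contain "Costa"
ser.str.len()                   # length of each string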
Demo 1: Canonicalization
def canonicalize_county(county_series):
    return (county_series
            .str.lower()                 # lowercase
            .str.replace(' ', '')        # remove spaces
            .str.replace('&', 'and')     # replace & with "and"
            .str.replace('.', '')        # remove dots
            .str.replace('county', '')   # remove the word "county"
            .str.replace('parish', ''))  # remove the word "parish"
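A hypothetical usage of this function for the canonicalization goal from earlier, joining two tables whose county labels do not match; the DataFrame names (populations, election) and column names are assumptions for illustration:

# Canonicalize the county names in both tables, then merge on the cleaned key
populations["clean_county"] = canonicalize_county(populations["County"])
election["clean_county"] = canonicalize_county(election["County"])
merged = populations.merge(election, on="clean_county")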