From Lec 06:
Structure
• Multiple Files
• More File Formats
Scope and Temporality
Faithfulness (and Missing Values)
• Demo: Mauna Loa CO2
JSON: JavaScript Object Notation
A less common file format.
● Very similar to Python dictionaries
● Strict formatting and "quoting" address some of the issues present in CSV/TSV
● Self-documenting: Can save metadata (data
about the data) along with records in the
same file
To read a JSON file, use the pd.read_json() function, which works for most simple JSON files.
You will dive deeper into exactly how a JSON file can be structured in today's notebook.
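A minimal sketch of that call, assuming a simple records-oriented file named data.json (the file name and its contents are hypothetical):

import pandas as pd

# Works when the JSON file is a flat list of records, e.g. [{"name": "A", "count": 1}, ...]
df = pd.read_json("data.json")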
JSON Example: Berkeley COVID cases by day
Issues
● Not rectangular
● Each record can have different fields
● Nesting means records can contain tables – complicated
Reading a JSON into pandas often
requires some EDA.
JSON File
1. JSON (JavaScript Object Notation) is a lightweight data-interchange format that
machines can parse and generate easily.
2. Use: data storage and exchange between a server and a web application, as well as configuration files and data serialization.
3. Syntax: JSON data is represented as key-value pairs.
Keys are strings enclosed in double quotes ("), and values can be strings, numbers,
objects, arrays, Boolean values (true or false), null, or nested JSON objects.
4. File extension: .json
JSON File: Example
{
  "name": "John",
  "age": 30,
  "isStudent": false,
  "courses": ["Math", "Science"],
  "address": {
    "street": "123 Main St",
    "city": "Cityville"
  }
}

• Most programming languages have libraries or built-in support for parsing and generating JSON data.
• Compact and Efficient: JSON is relatively compact and efficient for data transmission and storage, making it suitable for various use cases, including mobile applications.
• Common Use Cases: JSON is used in a wide range of applications, including web development (for AJAX requests and data storage), configuration files (e.g., package.json in Node.js projects), and as an interchange format in APIs.
• Support for Nested Data: JSON allows for nested data structures, which can represent complex relationships and hierarchies.
JSON File: Reading
import json
with open(covid_file, "rb") as f:
covid_json = json.load(f)
1. Check the type: type(covid_json)
2. List the keys of the dictionary:
covid_json.keys()
Output: dict_keys(['meta', 'data'])
3. Go one level deeper:
covid_json['meta'].keys()
Output: dict_keys(['view'])
JSON File: Reading
covid_json['meta']['view'].keys()
Output:
dict_keys(['id', 'name', 'assetType', 'attribution',
'averageRating', 'category', 'createdAt', 'description',
'displayType', 'downloadCount', 'hideFromCatalog',
'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid',
'provenance', 'publicationAppendEnabled', 'publicationDate',
'publicationGroup', 'publicationStage', 'rowsUpdatedAt',
'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount',
'viewLastModified', 'viewType', 'approvals', 'clientContext',
'columns', 'grants', 'metadata', 'owner', 'query', 'rights',
'tableAuthor', 'tags', 'flags'])
JSON File: Reading [1/2]
covid_json['meta']['view']['columns']
{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data',
'fieldName': ':sid', 'position': 0, 'renderTypeName':
'meta_data', 'format': {}, 'flags': ['hidden']}
JSON File: Reading [2/2]
covid_json['meta']['view']['columns']
{'id': 542388893, 'name': 'New Cases', 'dataTypeName': 'number',
'description': 'Total number of new cases reported by date created in
CalREDIE. ', 'fieldName': 'bklhj_newcases', 'position': 2,
'renderTypeName': 'number', 'tableColumnId': 98765830, 'cachedContents':
{'non_null': '1387', 'largest': '326', 'null': '0', 'top': [{'item':
'0', 'count': '144'}, {'item': '1', 'count': '99'}, {'item': '2',
'count': '88'}, {'item': '4', 'count': '87'}, {'item': '3', 'count':
'86'}, {'item': '5', 'count': '65'}, {'item': '6', 'count': '62'},
{'item': '7', 'count': '54'}, {'item': '8', 'count': '45'}, {'item':
'11', 'count': '40'}, {'item': '9', 'count': '40'}, {'item': '12',
'count': '36'}, {'item': '13', 'count': '34'}, {'item': '10', 'count':
'34'}, {'item': '16', 'count': '24'}, {'item': '17', 'count': '23'},
{'item': '14', 'count': '23'}, {'item': '19', 'count': '22'}, {'item':
'18', 'count': '21'}, {'item': '15', 'count': '21'}], 'smallest': '0',
'count': '1387', 'cardinality': '114'}, 'format': {}}
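Once the structure is understood, the nested JSON can be assembled into a DataFrame. A minimal sketch, assuming (as this dataset's metadata suggests) that covid_json['data'] holds the records and that the column names live under covid_json['meta']['view']['columns']:

import pandas as pd

# Records live in 'data'; the column descriptions under meta -> view -> columns
# each carry a 'name' field we can use as the header.
covid = pd.DataFrame(
    covid_json['data'],
    columns=[c['name'] for c in covid_json['meta']['view']['columns']])
covid.head()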
Example: Calls data
● Looks like there are three columns with dates/times: EVENTDT, EVENTTM,
and InDbDate.
● Most likely, EVENTDT stands for the date when the event took place
● EVENTTM stands for the time of day the event took place (in 24-hr format)
● InDbDate is the date this call was recorded in the database.
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
calls["EVENTDT"].dt.month
calls["EVENTDT"].dt.dayofweek
Structure
• Multiple Files
• More File Formats
Scope and Temporality
Faithfulness (and Missing Values)
• Example: Mauna Loa CO2
Aside: An update to the Mauna Loa Dataset
https://gml.noaa.gov/ccgg/trends/data.html
What Are Our Variable Feature Types?
EDA step:
Understand what each record and each feature represents.
First, read the file description:
● All measurement variables (average, interpolated, trend) are monthly mean CO2 mole fractions
○ i.e., monthly average CO2 in ppm (parts per million)
○ Computed from daily means
● #days: number of daily means in a month (i.e., the number of days the equipment worked)
What variables define the first three columns?
● Year, month, and date in decimal
The Search for the Missing Values
EDA step:
Hypothesize why these values were missing,
then use that knowledge to decide whether to
drop or impute missing values
From the file description:
● -99.99: missing value for the monthly average (Avg)
● -1: missing value for the # days that the equipment was in operation that month
Which approach?
• Drop missing values
• Keep missing values as NaN
• Impute
How should we address the
missing Avg data?
Summary: Dealing with Missing Values
Mauna Loa Observatory CO2 levels (NOAA)
-99.99: missing value for the monthly average (Avg)
Option A: Drop those records
Option B: Keep missing values as NaN
Option C: Impute using the interpolated column Int
All three are probably fine since there are few missing values, but we chose Option C based on our EDA.
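A minimal sketch of Option C, assuming the data has been loaded into a DataFrame named co2 with columns Avg (monthly average, using -99.99 as the missing-value sentinel) and Int (interpolated value); the DataFrame name and exact column names are assumptions:

import numpy as np

# Replace the -99.99 sentinel in Avg with the interpolated value from Int
# (real monthly averages are always positive, so "< 0" only matches the sentinel)
co2["Avg"] = np.where(co2["Avg"] < 0, co2["Int"], co2["Avg"])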
With numeric data, you generally wrangle as
you do EDA.
With text data, wrangling is upfront and
requires new tools: Python string manipulation
and regular expressions.
Txt File
● Note: the Mauna Loa CO2 data is a .txt file
● Use the same pd.read_csv function to read the file (see the sketch after this list)
● Use the skiprows parameter to skip the header rows
● Use sep=r'\s+' as the delimiter for runs of whitespace (stay tuned for regex)
In this example:
● Visualize the monthly average CO2 concentration using sns.lineplot to check for missing values
● Verify that all records are listed correctly using .shape
● Check the distribution of #days using sns.displot
● Check the connection between missingness and the year of the recording using sns.scatterplot
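A minimal reading-and-checking sketch along these lines; the file name, the skiprows count, and the column names are assumptions to adjust to the actual file (the columns follow the file description: year, month, decimal date, average, interpolated, trend, #days):

import pandas as pd
import seaborn as sns

co2 = pd.read_csv(
    "co2_mm_mlo.txt",              # assumed file name
    header=None, skiprows=72,      # assumed number of header/comment rows
    sep=r"\s+",                    # delimiter: one or more whitespace characters
    names=["Yr", "Mo", "DecDate", "Avg", "Int", "Trend", "Days"])

co2.shape                                      # verify all records were read
sns.lineplot(x="DecDate", y="Avg", data=co2)   # spot the -99.99 sentinel values
sns.displot(co2["Days"])                       # distribution of #days per month
sns.scatterplot(x="Yr", y="Days", data=co2)    # missingness vs. year of recording

In a notebook, each plot would typically go in its own cell.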
Start Work on Notebook
LECTURE 7
Text Wrangling and Regex
Using string methods and regular expressions (regex) to work with textual data
Data Science @ Knowledge Stream
Sana Jabbar
This Week
Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions

(Last weeks) Data Wrangling, Intro to EDA
(Today) Working with Text, Regular Expressions
(Next) Data Visualization, Code for plotting data
Goals for this Lecture
Deal with a major challenge of EDA: cleaning text.
• Operate on text data using str methods
• Apply regex to identify patterns in strings
Agenda
• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions
Why Work With Text?
Why Work With Text? Two Common Goals
1. Canonicalization: Convert data that has
more than one possible presentation into
a standard form.
Ex: Join tables with mismatched labels
2. Extract information into a new feature.
Ex: Extract dates and times from log files

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

day, month, year = "26", "Jan", "2014"
hour, minute, seconds = "10", "47", "58"
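A minimal sketch of pulling those pieces out of one such log line using base-Python string methods (a regex, covered later in this lecture, would be more robust); the variable names mirror the slide:

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

timestamp = line.split("[")[1].split("]")[0]       # '26/Jan/2014:10:47:58 -0800'
date_part, hour, minute, rest = timestamp.split(":")
day, month, year = date_part.split("/")            # '26', 'Jan', '2014'
seconds = rest.split(" ")[0]                       # '58'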
pandas str Methods
From String to str
In “base” Python, we have various string operations to work with text data.
Recall:
Operation              Python (single string)
transformation         s.lower(), s.upper()
replacement/deletion   s.replace(…)
split                  s.split(…)
substring              s[1:4]
membership             'ab' in s
length                 len(s)
Problem: Python assumes we are working with one string at a time.
We would need to loop over each entry – slow for large datasets!
str Methods
Fortunately, pandas offers a method of vectorizing text operations: the .str operator
Series.str.string_operation()
Apply the function string_operation to every string contained in the Series
populations["County"].str.lower()
populations["County"].str.replace('&', 'and')
.str Methods
Most base Python string operations have a pandas str equivalent
Operation              Python (single string)   pandas (Series of strings)
transformation         s.lower()                ser.str.lower()
                       s.upper()                ser.str.upper()
replacement/deletion   s.replace(…)             ser.str.replace(…)
split                  s.split(…)               ser.str.split(…)
substring              s[1:4]                   ser.str[1:4]
membership             'ab' in s                ser.str.contains(…)
length                 len(s)                   ser.str.len()
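A quick illustration of a few of these vectorized methods on a small made-up Series (the county names are purely illustrative):

import pandas as pd

ser = pd.Series(["Alameda County", "Contra Costa County"])

ser.str.lower()                 # transformation: lowercase every entry
ser.str.replace(" County", "")  # replacement/deletion: drop the word "County"
ser.str.split(" ")              # split each entry into a list of words
ser.str[0:4]                    # substring: first four characters of each entry
ser.str.contains("Costa")       # membership: which entries contain "Costa"
ser.str.len()                   # length of each string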
Demo 1: Canonicalization
def canonicalize_county(county_series):
    return (county_series
            .str.lower()                 # lowercase
            .str.replace(' ', '')        # remove spaces
            .str.replace('&', 'and')     # replace & with "and"
            .str.replace('.', '')        # remove dots
            .str.replace('county', '')   # remove the word "county"
            .str.replace('parish', ''))  # remove the word "parish"
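A hypothetical usage of this function for the canonicalization goal from earlier, joining two tables whose county labels do not match; the DataFrame names (populations, election) and column names are assumptions for illustration:

# Canonicalize the county names in both tables, then merge on the cleaned key
populations["clean_county"] = canonicalize_county(populations["County"])
election["clean_county"] = canonicalize_county(election["County"])
merged = populations.merge(election, on="clean_county")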