Lec 07-I-DSFa23

The document discusses JSON file formats and provides examples of reading JSON data into Python. It also discusses addressing missing values when reading in a text file of monthly CO2 concentration data from Mauna Loa Observatory.


Structure

• Multiple Files
• More File Formats
• Scope and Temporality
• Faithfulness (and Missing Values)
• Demo: Mauna Loa CO2

From Lec 06:

Lecture 07
1
JSON: JavaScript Object Notation

A less common file format.
● Very similar to Python dictionaries
● Strict formatting ("quoting") addresses some issues in CSV/TSV
● Self-documenting: can save metadata (data about the data) along with records in the same file

To read a JSON file, use the pd.read_json() function, which works for most simple JSON files. You will dive deeper into exactly how a JSON file can be structured in today's notebook.
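As a minimal, runnable sketch of pd.read_json() on a simple records-style file (the filename and fields below are made up for illustration, not the actual lecture dataset):

```python
import json
import pandas as pd

# Hypothetical records-oriented JSON: a list of objects with the same
# keys reads directly into a rectangular DataFrame.
records = [
    {"date": "2021-01-01", "new_cases": 5},
    {"date": "2021-01-02", "new_cases": 8},
]
with open("cases.json", "w") as f:
    json.dump(records, f)

df = pd.read_json("cases.json")
print(df.shape)  # (2, 2)
```

Note that read_json only works this cleanly when the JSON is already rectangular; the nested files discussed next need more manual work.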
2
JSON: JavaScript Object Notation

Example: Berkeley covid cases by day

Issues
● Not rectangular
● Each record can have different fields
● Nesting means records can contain tables – complicated

Reading a JSON into pandas often requires some EDA.
3
JSON File

1. JSON (JavaScript Object Notation) is a lightweight data-interchange format that machines can parse and generate easily.
2. Use: data storage and exchange between a server and a web application, as well as configuration files and data serialization.
3. Syntax: JSON data is represented as key-value pairs. Keys are strings enclosed in double quotes ("), and values can be strings, numbers, objects, arrays, Boolean values (true or false), null, or nested JSON objects.
4. File extension: .json

4
JSON File: Example

{
  "name": "John",
  "age": 30,
  "isStudent": false,
  "courses": ["Math", "Science"],
  "address": {
    "street": "123 Main St",
    "city": "Cityville"
  }
}

• Most programming languages have libraries or built-in support for parsing and generating JSON data.
• Compact and Efficient: JSON is relatively compact and efficient for data transmission and storage, making it suitable for various use cases, including mobile applications.
• Common Use Cases: JSON is used in a wide range of applications, including web development (for AJAX requests and data storage), configuration files (e.g., package.json in Node.js projects), and as an interchange format in APIs.
• Support for Nested Data: JSON allows for nested data structures, which can represent complex relationships and hierarchies.

5
JSON File: Reading

import json
with open(covid_file, "rb") as f:
    covid_json = json.load(f)

1. Check the type: type(covid_json)
2. List the keys of the dictionary: covid_json.keys()
   Output: dict_keys(['meta', 'data'])
3. covid_json['meta'].keys()
   Output: dict_keys(['view'])

6
JSON File: Reading

covid_json['meta']['view'].keys()

Output:
dict_keys(['id', 'name', 'assetType', 'attribution',
'averageRating', 'category', 'createdAt', 'description',
'displayType', 'downloadCount', 'hideFromCatalog',
'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid',
'provenance', 'publicationAppendEnabled', 'publicationDate',
'publicationGroup', 'publicationStage', 'rowsUpdatedAt',
'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount',
'viewLastModified', 'viewType', 'approvals', 'clientContext',
'columns', 'grants', 'metadata', 'owner', 'query', 'rights',
'tableAuthor', 'tags', 'flags'])

7
JSON File: Reading [1/2]

covid_json['meta']['view']['columns']

The first column entry:
{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data',
'fieldName': ':sid', 'position': 0, 'renderTypeName':
'meta_data', 'format': {}, 'flags': ['hidden']}

8
JSON File: Reading [2/2]

covid_json['meta']['view']['columns']

A later column entry ('New Cases'):
{'id': 542388893, 'name': 'New Cases', 'dataTypeName': 'number',
'description': 'Total number of new cases reported by date created in
CalREDIE. ', 'fieldName': 'bklhj_newcases', 'position': 2,
'renderTypeName': 'number', 'tableColumnId': 98765830, 'cachedContents':
{'non_null': '1387', 'largest': '326', 'null': '0', 'top': [{'item':
'0', 'count': '144'}, {'item': '1', 'count': '99'}, {'item': '2',
'count': '88'}, {'item': '4', 'count': '87'}, {'item': '3', 'count':
'86'}, {'item': '5', 'count': '65'}, {'item': '6', 'count': '62'},
{'item': '7', 'count': '54'}, {'item': '8', 'count': '45'}, {'item':
'11', 'count': '40'}, {'item': '9', 'count': '40'}, {'item': '12',
'count': '36'}, {'item': '13', 'count': '34'}, {'item': '10', 'count':
'34'}, {'item': '16', 'count': '24'}, {'item': '17', 'count': '23'},
{'item': '14', 'count': '23'}, {'item': '19', 'count': '22'}, {'item':
'18', 'count': '21'}, {'item': '15', 'count': '21'}], 'smallest': '0',
'count': '1387', 'cardinality': '114'}, 'format': {}}
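One common pattern for flattening a nested JSON like this into a DataFrame is to pull the column names from the metadata and the rows from 'data'. The structure below is a minimal stand-in for the real covid_json file; the field names and values are illustrative, not the actual dataset's:

```python
import pandas as pd

# Minimal stand-in for the nested covid_json structure described above.
covid_json = {
    "meta": {"view": {"columns": [{"name": "Date"}, {"name": "New Cases"}]}},
    "data": [["2021-01-01", 5], ["2021-01-02", 8]],
}

# Column names live in the metadata; rows live under 'data'.
col_names = [c["name"] for c in covid_json["meta"]["view"]["columns"]]
covid = pd.DataFrame(covid_json["data"], columns=col_names)
print(covid.head())
```

This is exactly the kind of EDA-driven wrangling the slides refer to: you first explore the keys, then decide which pieces become the rectangular table.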
9
Example: Calls data

● Looks like there are three columns with dates/times: EVENTDT, EVENTTM, and InDbDate.
● Most likely, EVENTDT is the date when the event took place,
● EVENTTM is the time of day the event took place (in 24-hour format), and
● InDbDate is the date this call was recorded in the database.

calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
calls["EVENTDT"].dt.month
calls["EVENTDT"].dt.dayofweek
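A small runnable sketch of the datetime conversion above, using made-up EVENTDT values in mm/dd/yyyy form (the real calls table has more columns):

```python
import pandas as pd

# Illustrative calls data; dates follow the mm/dd/yyyy pattern.
calls = pd.DataFrame({"EVENTDT": ["04/01/2021", "04/03/2021"]})

# Convert the string column to proper datetimes, then use .dt accessors.
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"], format="%m/%d/%Y")
print(calls["EVENTDT"].dt.month)      # month number for each row
print(calls["EVENTDT"].dt.dayofweek)  # Monday=0 ... Sunday=6
```

Once the column is a datetime64 dtype, the .dt accessors are vectorized, so no explicit loop over rows is needed.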

10
Structure
• Multiple Files
• More File Formats
• Scope and Temporality
• Faithfulness (and Missing Values)
• Demo: Mauna Loa CO2

Lecture 07

11
Aside: An update to the Mauna Loa Dataset
https://gml.noaa.gov/ccgg/trends/data.html

NPS 12
What Are Our Variable Feature Types?

EDA step:
Understand what each record and each feature represents.

First, read the file description:
● All measurement variables (average, interpolated, trend) are monthly mean CO2 mole fraction
  ○ i.e., monthly average CO2 ppm (parts per million)
  ○ Computed from daily means
● #days: number of daily means in a month (i.e., # days the equipment worked)

What variables define the first three columns?
● Year, month, and date in decimal
13
The Search for the Missing Values

EDA step:
Hypothesize why these values were missing, then use that knowledge to decide whether to drop or impute missing values.

From the file description:
● -99.99: missing monthly average Avg
● -1: missing value for # days the equipment was in operation that month.

Which approach?
• Drop missing values
• Keep missing values as NaN
• Impute

14
How should we address the missing Avg data?

Summary: Dealing with Missing Values
Mauna Loa Observatory CO2 levels (NOAA)
-99.99: missing monthly average Avg

Option A: Drop records
Option B: Keep missing values as NaN
Option C: Impute using the interpolated column Int

All three are probably fine since there are few missing values, but we chose Option C based on our EDA.

With numeric data, you generally wrangle as you do EDA.

With text data, wrangling is upfront and requires new tools: Python string manipulation and regular expressions.
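Options B and C can be sketched as follows on a toy slice of the table (the values below are illustrative, not actual Mauna Loa readings):

```python
import numpy as np
import pandas as pd

# Toy slice of the table; -99.99 is the sentinel for a missing
# monthly average, and Int is the interpolated column.
co2 = pd.DataFrame({
    "Avg": [315.71, -99.99, 316.79],
    "Int": [315.71, 317.45, 316.79],
})

# Option B: turn the sentinel value into a proper NaN.
co2["Avg"] = co2["Avg"].replace(-99.99, np.nan)

# Option C: impute missing averages from the interpolated column.
co2["Avg"] = co2["Avg"].fillna(co2["Int"])
print(co2["Avg"].tolist())  # [315.71, 317.45, 316.79]
```

Converting the sentinel to NaN first matters: pandas only treats NaN (not -99.99) as missing in aggregations, plots, and fillna.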
16
Txt File

● Note: the Mauna Loa CO2 data is a .txt file
● Use the same pd.read_csv to read the file
● Use the skiprows parameter to skip header rows
● Use sep=r'\s+' as the delimiter for continuous whitespace (stay tuned for regex)

In this example:
● Visualize the monthly average CO2 concentration using sns.lineplot to check the missing values.
● Verify that all records are listed correctly using .shape
● Check the distribution of #days using sns.displot
● Check the connection between missingness and the year of the recording using sns.scatterplot
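Putting those parameters together, a self-contained sketch (the file written here mimics the Mauna Loa layout with made-up header lines and rows):

```python
import pandas as pd

# A tiny whitespace-delimited file: two comment lines, then
# year / month / Avg columns (values illustrative).
with open("co2_sample.txt", "w") as f:
    f.write("# Mauna Loa CO2\n"
            "# monthly means\n"
            "1958  3  315.71\n"
            "1958  4  317.45\n")

# skiprows drops the header comments; sep=r'\s+' splits on any
# run of whitespace; names labels the unlabeled columns.
co2 = pd.read_csv("co2_sample.txt", skiprows=2, sep=r"\s+",
                  names=["Yr", "Mo", "Avg"])
print(co2.shape)  # (2, 3)
```

The real file has more columns and more header lines, so skiprows and names would need adjusting, but the pattern is the same.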

17
Start Work on Notebook

18
LECTURE 7

Text Wrangling and Regex

Using string methods and regular expressions (regex) to work with textual data

Data Science @ Knowledge Stream

Sana Jabbar

19
This Week

[Data science lifecycle diagram: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]

(Last weeks)        (Today)                  (Next)
Data Wrangling      Working with Text        Data Visualization
Intro to EDA        Regular Expressions      Code for plotting data
20
Goals for this Lecture

Deal with a major challenge of EDA: cleaning text
• Operate on text data using str methods
• Apply regex to identify patterns in strings

Lecture 07

21
Agenda
• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions

Lecture 07

22
Why Work With Text?
• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions

Lecture 07

23
Why Work With Text? Two Common Goals

1. Canonicalization: Convert data that has more than one possible presentation into a standard form.

Ex: Join tables with mismatched labels
24
Why Work With Text? Two Common Goals

1. Canonicalization: Convert data that has more than one possible presentation into a standard form.
   Ex: Join tables with mismatched labels

2. Extract information into a new feature.
   Ex: Extract dates and times from log files:

   169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

   day, month, year = "26", "Jan", "2014"
   hour, minute, seconds = "10", "47", "58"
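An extraction like this can be sketched with Python's built-in re module; the pattern below assumes the common Apache-style bracketed timestamp shown in the log line:

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585')

# Capture day, month, year, hour, minute, seconds from the
# bracketed timestamp.
m = re.search(r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)', line)
day, month, year, hour, minute, seconds = m.groups()
print(day, month, year)  # 26 Jan 2014
```

The regex syntax used here (character classes, groups) is exactly what the upcoming "Regex basics" section covers.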

25
pandas str Methods
• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions

Lecture 07

26
From String to str

In "base" Python, we have various string operations to work with text data. Recall:

Operation               Example
transformation          s.lower(), s.upper()
replacement/deletion    s.replace(…)
split                   s.split(…)
substring               s[1:4]
membership              'ab' in s
length                  len(s)

Problem: Python assumes we are working with one string at a time, so we need to loop over each entry – slow in large datasets!

27
str Methods

Fortunately, pandas offers a way to vectorize text operations: the .str accessor.

Series.str.string_operation()

Apply the function string_operation to every string contained in the Series.

populations["County"].str.lower()
populations["County"].str.replace('&', 'and')

28
.str Methods

Most base Python string operations have a pandas str equivalent.

Operation               Python (single string)   pandas (Series of strings)
transformation          s.lower()                ser.str.lower()
                        s.upper()                ser.str.upper()
replacement/deletion    s.replace(…)             ser.str.replace(…)
split                   s.split(…)               ser.str.split(…)
substring               s[1:4]                   ser.str[1:4]
membership              'ab' in s                ser.str.contains(…)
length                  len(s)                   ser.str.len()
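As a quick runnable sketch of the vectorized equivalents above (the populations table here is illustrative, not the actual dataset):

```python
import pandas as pd

# Illustrative county names.
populations = pd.DataFrame({"County": ["Alameda & Contra Costa", "Yolo"]})

# Each .str call applies the operation to every string in the Series
# at once, with no explicit Python loop.
lowered = populations["County"].str.lower()
replaced = populations["County"].str.replace("&", "and")
print(lowered.tolist())
print(replaced.tolist())
```
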

29
Demo 1: Canonicalization

def canonicalize_county(county_series):
    return (county_series
            .str.lower()                 # lowercase
            .str.replace(' ', '')        # remove spaces
            .str.replace('&', 'and')     # replace & with "and"
            .str.replace('.', '')        # remove periods
            .str.replace('county', '')   # drop "county"
            .str.replace('parish', ''))  # drop "parish"

30