Introduction to PySpark
Benjamin Schmidt
Data Engineer
Meet your instructor
Almost a Decade of Data Experience with PySpark
Used PySpark for machine learning, ETL tasks, and much more
Enthusiastic teacher of new tools for all!
What is PySpark?
Distributed data processing: Designed to handle large datasets across clusters
Supports various data formats including CSV, Parquet, and JSON
SQL integration allows querying of data using both Python and SQL syntax (see the sketch after this list)
Optimized for speed at scale
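To illustrate the SQL-integration point above, here is a minimal sketch, assuming a small in-session DataFrame (the data and names are made up):

# Import and start a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SQLExample").getOrCreate()

# Build a tiny example DataFrame (hypothetical data)
df = spark.createDataFrame([("Ana", 34), ("Bo", 29)], ["name", "age"])

# Query the same data with the Python DataFrame API...
df.filter(df["age"] > 30).show()

# ...and with SQL syntax via a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()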
When would we use PySpark?
Big data analytics
Distributed data processing
Real-time data streaming
Machine learning on large datasets
ETL and ELT pipelines
Working with diverse data sources:
1. CSV
2. JSON
3. Parquet
4. Many more
Spark cluster
Master node: manages the cluster, coordinates tasks, and schedules jobs
Worker nodes: execute the tasks assigned by the master; responsible for performing the actual computations and storing data in memory or on disk
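A minimal sketch of pointing a SparkSession at a cluster; the master URL here is an assumption (local[*] simply uses all cores on one machine, while a real cluster would use something like a spark:// URL):

# Connect to a cluster by specifying a master URL
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ClusterExample")
         .master("local[*]")  # assumed value; replace with your cluster's master URL
         .getOrCreate())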
SparkSession
SparkSessions allow you to access your Spark cluster and are critical for using PySpark.
# Import SparkSession
from pyspark.sql import SparkSession
# Initialize a SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
.builder exposes a builder for configuring the session
.appName() sets a name that identifies the application (helpful when managing multiple applications)
.getOrCreate() creates a new session or retrieves an existing one
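A quick sketch showing that getOrCreate() reuses an existing session rather than creating a second one (illustrative only):

from pyspark.sql import SparkSession

# First call creates the session
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

# A later call returns the same session
same_spark = SparkSession.builder.getOrCreate()
print(spark is same_spark)  # True
print(spark.version)        # inspect the Spark version in use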
PySpark DataFrames
Similar to other DataFrames, but optimized for distributed processing at scale
# Import and initialize a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
# Create a DataFrame from a CSV file, naming the columns explicitly
census_df = spark.read.csv("census.csv").toDF(
    "gender", "age", "zipcode", "salary_range_usd", "marriage_status")
# Show the DataFrame
census_df.show()
Let's practice!
Introduction to PySpark DataFrames
Benjamin Schmidt
Data Engineer
About DataFrames
DataFrames: Tabular format (rows/columns)
Supports SQL-like operations
Comparable to a pandas DataFrame or a SQL table
Structured Data
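A small sketch of that comparison, assuming a SparkSession named spark already exists; toPandas() collects the distributed data onto the driver, so it only suits small results:

# Hypothetical small DataFrame
df = spark.createDataFrame([(1, "analyst"), (2, "engineer")], ["id", "role"])

# SQL-like operation on the Spark DataFrame
df.filter(df["id"] > 1).show()

# Convert to a pandas DataFrame for local, in-memory work
pandas_df = df.toPandas()
print(type(pandas_df))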
Creating DataFrames from filestores
# Create a DataFrame from CSV
census_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)
Printing the DataFrame
# Show the first 5 rows of the DataFrame
census_df.show(5)
+---+-------------+--------------+-----------------+------+
|age|education.num|marital.status|       occupation|income|
+---+-------------+--------------+-----------------+------+
| 90|            9|       Widowed|                ?| <=50K|
| 82|            9|       Widowed|  Exec-managerial| <=50K|
| 66|           10|       Widowed|                ?| <=50K|
| 54|            4|      Divorced|Machine-op-inspct| <=50K|
| 41|           10|     Separated|   Prof-specialty| <=50K|
+---+-------------+--------------+-----------------+------+
Printing DataFrame Schema
# Show the schema
census_df.printSchema()
Output:
root
|-- age: integer (nullable = true)
|-- education.num: integer (nullable = true)
|-- marital.status: string (nullable = true)
|-- occupation: string (nullable = true)
|-- income: string (nullable = true)
Basic analytics on PySpark DataFrames
# .count() returns the total number of rows in the DataFrame
row_count = census_df.count()
print(f'Number of rows: {row_count}')
# .groupBy() enables SQL-like aggregations
census_df.groupBy('gender').agg({'salary_usd': 'avg'}).show()
Other aggregate functions (used in the sketch after this list) are:
sum()
min()
max()
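A minimal sketch applying these aggregates with agg(); it reuses the census columns from the earlier slides:

# Compute several aggregates per group in one pass
from pyspark.sql import functions as F

census_df.groupBy("occupation").agg(
    F.min("age").alias("min_age"),
    F.max("age").alias("max_age"),
    F.sum("age").alias("total_age")
).show()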
Key functions for PySpark analytics
.select() : Selects specific columns from the DataFrame
.filter() : Filters rows based on specific conditions
.groupBy() : Groups rows based on one or more columns
.agg() : Applies aggregate functions to grouped data
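These functions chain together naturally; a brief sketch, again reusing the census columns from earlier slides:

# Chain select, filter, groupBy, and agg in a single expression
(census_df
    .select("age", "occupation", "income")
    .filter(census_df["age"] > 40)
    .groupBy("income")
    .agg({"age": "avg"})
    .show())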
Key functions: an example
# Using filter and select, we can narrow down our DataFrame
filtered_census_df = census_df.filter(census_df['age'] > 50).select('age', 'occupation')
filtered_census_df.show()
Output
+---+-----------------+
|age|       occupation|
+---+-----------------+
| 90|                ?|
| 82|  Exec-managerial|
| 66|                ?|
| 54|Machine-op-inspct|
+---+-----------------+
Let's practice!
More on Spark DataFrames
Benjamin Schmidt
Data Engineer
Creating DataFrames from various data sources
CSV files: common for structured, delimited data
Example: spark.read.csv("path/to/file.csv")
JSON files: semi-structured, hierarchical data format
Example: spark.read.json("path/to/file.json")
Parquet files: optimized for storage and querying, often used in data engineering
Example: spark.read.parquet("path/to/file.parquet")
1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv
Schema inference and manual schema definition
Spark can infer schemas from data with inferSchema=True
Manually define a schema for better control, useful for fixed data structures
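A brief sketch of both approaches; the file path and the DDL-style schema string are assumptions:

# Option 1: let Spark infer the column types from the data
inferred_df = spark.read.csv("census.csv", header=True, inferSchema=True)

# Option 2: define the schema explicitly as a DDL-formatted string
manual_df = spark.read.csv(
    "census.csv",
    header=True,
    schema="age INT, occupation STRING, income STRING")

manual_df.printSchema()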
DataTypes in PySpark DataFrames
IntegerType : Whole numbers
E.g., 1 , 3478 , -1890456
LongType: Larger whole numbers
E.g., an 8-byte signed integer such as 922334775806
FloatType and DoubleType: Floating-point numbers for decimal values
E.g., 3.14159
StringType: Used for text or string data
E.g., "This is an example of a string."
...
DataTypes Syntax for PySpark DataFrames
# Import the necessary types as classes
from pyspark.sql.types import (StructType,
StructField, IntegerType,
StringType, ArrayType)
# Construct the schema
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("scores", ArrayType(IntegerType()), True)
])
# Example rows matching the schema (hypothetical data)
data = [(1, "Alina", [85, 92]), (2, "Boris", [77, 81])]

# Create the DataFrame with the explicit schema
df = spark.createDataFrame(data, schema=schema)
DataFrame operations - selection and filtering
Use .select() to choose specific columns
Use .filter() or .where() to filter rows based on conditions
Use .sort() to order by a collection of columns
# Select and show only the name and age columns
df.select("name", "age").show()
# Filter on age > 30
df.filter(df["age"] > 30).show()
# Use where() to filter on a specific value
df.where(df["age"] == 30).show()
Sorting and dropping missing values
Order data using .sort() or .orderBy()
Use na.drop() to remove rows with null values
# Sort using the age column
df.sort("age", ascending=False).show()
# Drop missing values
df.na.drop().show()
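A short sketch of .orderBy() with multiple columns and of dropping rows that are null only in a specific column; the column names are assumptions:

# Order by age descending, then by name ascending
df.orderBy(df["age"].desc(), "name").show()

# Drop rows only when the age column is null
df.na.drop(subset=["age"]).show()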
Cheatsheet
spark.read.json() : Load data from JSON
spark.read.schema() : Define a schema explicitly when reading
.na.drop() : Drop rows with missing values
.select() , .filter() , .sort() , .orderBy() : Basic data manipulation functions
Let's practice!