PySpark Functions Summary
PySpark Core (RDD Functions)
RDD Creation and Loading
sc.textFile() - Load text file into RDD
sc.parallelize() - Create RDD from collection with partitions
sc.range() - Create an RDD from a range of numbers
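A minimal sketch of the creation calls above, assuming an existing SparkContext named sc (the file path is illustrative):

    lines = sc.textFile("data.txt")          # RDD of lines from a text file
    nums = sc.parallelize([1, 2, 3, 4], 2)   # RDD from a Python list, split into 2 partitions
    rng = sc.range(0, 10)                    # RDD of the numbers 0..9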
Transformations (Narrow)
map() - Apply function to each element
flatMap() - Apply function and flatten results
filter() - Filter elements based on condition
distinct() - Remove duplicates
union() - Combine RDDs
sortBy() - Sort elements by a given key function
mapPartitions() - Apply function to each partition
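A short sketch of the narrow transformations above on a toy RDD (data is illustrative):

    rdd = sc.parallelize([1, 2, 2, 3, 4], 2)
    rdd.map(lambda x: x * 10)                     # multiply every element by 10
    rdd.flatMap(lambda x: [x, x + 100])           # each element expands to two, then flattened
    rdd.filter(lambda x: x % 2 == 0)              # keep only even numbers
    rdd.distinct()                                # drop the duplicate 2
    rdd.union(sc.parallelize([5, 6]))             # append another RDD
    rdd.sortBy(lambda x: x, ascending=False)      # sort by a key function
    rdd.mapPartitions(lambda it: [sum(it)])       # one partial sum per partition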
Transformations (Wide)
groupByKey() - Group values by key (Note: no aggregation is done; apply map()/mapValues() afterwards to aggregate)
reduceByKey() - Reduce by key (Note: performs local aggregation within each partition before the final reduce, so far less shuffling and better performance than groupByKey())
sortByKey() - Sort by key (Note: sorting in each partition is done separately, then results are merged)
join() - Inner join
rightOuterJoin() - Right outer join
leftOuterJoin() - Left outer join
cogroup() - Group values from multiple RDDs by key
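A sketch contrasting the wide transformations above on a small paired RDD (collect() order may vary):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    other = sc.parallelize([("a", "x"), ("c", "y")])
    pairs.groupByKey().mapValues(sum).collect()       # [('a', 4), ('b', 2)] -- all values are shuffled
    pairs.reduceByKey(lambda x, y: x + y).collect()   # same result, but combines locally first
    pairs.sortByKey().collect()                       # sorted by key
    pairs.join(other).collect()                       # inner join: [('a', (1, 'x')), ('a', (3, 'x'))]
    pairs.cogroup(other)                              # per key: (values from pairs, values from other)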
Transformations on Paired RDDs
mapValues() - Apply function to values only
keys() - Extract keys
values() - Extract values
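On the same kind of paired RDD, the key/value helpers:

    pairs = sc.parallelize([("a", 1), ("b", 2)])
    pairs.mapValues(lambda v: v * 10).collect()   # [('a', 10), ('b', 20)] -- keys untouched
    pairs.keys().collect()                        # ['a', 'b']
    pairs.values().collect()                      # [1, 2]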
Actions
collect() - Return RDD as list
take(n) - Take first n records
first() - Take only 1st element
count() - Count number of records in RDD
reduce() - Aggregate elements (only commutative and associative operations allowed)
saveAsTextFile() - Save RDD to text file
countByKey() - Count occurrences of each key, returns a dictionary (defaultdict)
collectAsMap() - Convert to Python dictionary
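A sketch of the actions above; the commented results assume the toy data shown (the output path is illustrative):

    nums = sc.parallelize([1, 2, 3, 4])
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    nums.collect()                    # [1, 2, 3, 4]
    nums.take(2)                      # [1, 2]
    nums.first()                      # 1
    nums.count()                      # 4
    nums.reduce(lambda x, y: x + y)   # 10 -- operation must be commutative and associative
    nums.saveAsTextFile("out_dir")    # writes one part file per partition; directory must not exist
    pairs.countByKey()                # defaultdict: {'a': 2, 'b': 1}
    pairs.collectAsMap()              # plain dict; one value kept per duplicate key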
Spark Joins
join() - Inner join
rightOuterJoin() - Keep every key of the 2nd (right) RDD; unmatched keys from the 1st get None
leftOuterJoin() - Keep every key of the 1st (left) RDD; unmatched keys from the 2nd get None
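A sketch showing which keys survive each join (None fills in missing matches):

    left = sc.parallelize([("a", 1), ("b", 2)])
    right = sc.parallelize([("a", 10), ("c", 30)])
    left.join(right).collect()             # [('a', (1, 10))]        -- keys present in both
    left.leftOuterJoin(right).collect()    # adds ('b', (2, None))   -- all keys of the 1st RDD
    left.rightOuterJoin(right).collect()   # adds ('c', (None, 30))  -- all keys of the 2nd RDD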
PySpark DataFrame
DataFrame Creation
From collections: createDataFrame()
From named tuples: collections.namedtuple or typing.NamedTuple
From Spark RDD: toDF()
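A sketch of the three creation routes above, assuming an active SparkSession named spark (data and column names are illustrative):

    from collections import namedtuple
    Person = namedtuple("Person", ["name", "age"])

    df1 = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])   # from a collection
    df2 = spark.createDataFrame([Person("Alice", 30), Person("Bob", 25)])        # from named tuples
    df3 = sc.parallelize([("Alice", 30), ("Bob", 25)]).toDF(["name", "age"])     # from an RDD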
DataFrame Methods
toDF() - Convert an RDD to a DataFrame
createDataFrame() - Create DataFrame from RDD
show() - Display DataFrame
select() - Select specific columns
filter() - Filter rows based on condition
where() - Alternative to filter
groupBy() - Group by column, then apply aggregation (sum, max, count)
agg() - Aggregation function
limit(n) - First n records
distinct() - Remove duplicate rows, returns a DataFrame (select a column first for its unique values)
orderBy() - Sort by column (ascending=False for descending)
printSchema() - Display DataFrame schema
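A sketch chaining the methods above (the df, its columns, and the data are illustrative):

    df = spark.createDataFrame(
        [("Alice", "HR", 50), ("Bob", "IT", 60), ("Eve", "IT", 70)],
        ["name", "dept", "salary"])
    df.show()
    df.select("name", "salary").show()
    df.filter(df.salary > 55).show()                  # where() is an alias for filter()
    df.groupBy("dept").agg({"salary": "max"}).show()  # one aggregate per group
    df.orderBy("salary", ascending=False).limit(2).show()
    df.select("dept").distinct().show()
    df.printSchema()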
DataFrame Metadata
df.columns - Column names
df.dtypes - Data types of each column
df.schema - Structure with column and type
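The metadata attributes on the same illustrative df:

    df.columns   # ['name', 'dept', 'salary']
    df.dtypes    # [('name', 'string'), ('dept', 'string'), ('salary', 'bigint')]
    df.schema    # StructType with one StructField per column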
PySpark SQL
SQL Context and Temporary Views
createOrReplaceTempView() - Create temporary view (table)
spark.sql() - Execute SQL queries
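A sketch of registering the illustrative df as a view and querying it with SQL:

    df.createOrReplaceTempView("employees")
    spark.sql("SELECT dept, MAX(salary) AS max_salary FROM employees GROUP BY dept").show()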
Date and Time Functions
current_date() - Get current date
df.select(current_date()).show(n) - Display the result, limited to n rows
date_format() - Format date to other format
to_date() - Convert string to date
date_add() - Add specific days to date column
date_sub() - Subtract specific days from column
months_between() - Similar to datediff() but returns the difference in months as a float
year(), month(), weekofyear() - Extract date parts from a date column; next_day() - First date after the given date that falls on the specified weekday
current_timestamp() - Current timestamp
hour(), minute(), second() - Extract time parts from a timestamp column (analogous to year(), month())
to_timestamp() - Convert string to timestamp
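A sketch of the date and time helpers above (the column name d and the sample date are illustrative):

    from pyspark.sql import functions as F

    dates = spark.createDataFrame([("2024-01-15",)], ["d"]).withColumn("d", F.to_date("d"))
    dates.select(
        F.current_date().alias("today"),
        F.date_format("d", "dd/MM/yyyy").alias("formatted"),
        F.date_add("d", 7).alias("plus_week"),
        F.date_sub("d", 7).alias("minus_week"),
        F.months_between(F.current_date(), "d").alias("months_diff"),   # returns a float
        F.year("d"), F.month("d"), F.weekofyear("d"),
        F.next_day("d", "Mon").alias("next_monday"),
        F.current_timestamp().alias("now"),
        F.hour(F.current_timestamp()).alias("hr"),
        F.to_timestamp(F.lit("2024-01-15 10:30:00")).alias("ts"),
    ).show(truncate=False)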
Window Functions
Operate over a window defined by a partition, an ordering, and an optional frame specification (see the sketch below)
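A sketch of a window function over a partition and ordering spec, reusing the illustrative df from the DataFrame section:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("dept").orderBy(F.desc("salary"))
    df.withColumn("rank_in_dept", F.row_number().over(w)).show()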
Storage Levels and Persistence
Persistence Methods
cache() - Cache RDD (doesn't take any parameter)
persist() - Persist with a chosen storage level; with no argument it is the same as persist(pyspark.StorageLevel.MEMORY_ONLY)
Storage Levels
1. MEMORY_ONLY - Only in RAM
2. MEMORY_AND_DISK - If RAM full, use DISK
3. MEMORY_ONLY_SER - RDD stored as serialized Java objects, in RAM only
4. MEMORY_AND_DISK_SER - Serialized Java objects in RAM; if RAM is full, spill to DISK
5. DISK_ONLY - Only on DISK
6. MEMORY_ONLY_2 - Same as MEMORY_ONLY but each partition is replicated on 2 cluster nodes
7. MEMORY_AND_DISK_2 - Same as MEMORY_AND_DISK, with the same 2-node replication
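A sketch of caching and persisting an RDD with an explicit storage level (the path is illustrative):

    from pyspark import StorageLevel

    rdd = sc.textFile("data.txt")
    rdd.cache()                                 # shorthand for persist(StorageLevel.MEMORY_ONLY)
    rdd.count()                                 # first action materialises the cache
    rdd.unpersist()                             # release the storage before changing the level
    rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk when RAM is full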
Key Concepts
Partitioning
Data distribution across partitions (P1, P2, P3)
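A small sketch for inspecting how data is spread across partitions:

    rdd = sc.parallelize(range(6), 3)
    rdd.getNumPartitions()   # 3
    rdd.glom().collect()     # [[0, 1], [2, 3], [4, 5]] -- contents of each partition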
Narrow vs Wide Transformations
Shuffling operations and performance impact
Optimization Techniques
RDD persistence and caching
Avoiding wide transformations when possible
Using appropriate storage levels
Network congestion and performance issues with shuffling