Spark Zero to Hero - PySpark Cheat Sheet
1. Spark Architecture:
- Driver: Runs the main program, builds the execution plan (DAG), and schedules tasks.
- Executors: JVMs executing tasks on worker nodes.
- Tasks: Operate on partitions (1 task per partition).
- Cores: Each core runs one task at a time, so cores × executors caps parallelism.
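
A minimal sketch of the partition-to-task mapping, assuming local mode (the data size is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ArchDemo").master("local[4]").getOrCreate()

df = spark.range(1_000_000)                    # distributed range of longs
print(df.rdd.getNumPartitions())               # one task per partition per stage
print(spark.sparkContext.defaultParallelism)   # typically equals available cores
```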
2. Transformations:
- Narrow: No shuffle; each output partition depends on a single input partition (e.g., map, filter).
- Wide: Requires a shuffle; an output partition may depend on many input partitions (e.g., groupByKey, join).
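
A quick illustration, assuming the spark session from the sketch above: filter stays narrow, while groupBy forces an Exchange (shuffle) into the plan:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])

narrow = df.filter(F.col("id") > 1)   # narrow: partitions map one-to-one
wide = df.groupBy("key").count()      # wide: equal keys must meet, so data shuffles
wide.explain()                        # look for an Exchange node in the plan
```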
3. SparkSession:
- spark = SparkSession.builder.appName("App").getOrCreate()
- Local mode: .master("local[*]")
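
A fuller builder sketch; the config key shown is a standard Spark SQL setting and the value is illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("App")
    .master("local[*]")                           # all local cores; omit on a cluster
    .config("spark.sql.shuffle.partitions", "8")  # tune down for small local jobs
    .getOrCreate()
)
print(spark.version)
```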
4. DataFrame Operations:
- withColumn(), lit(), withColumnRenamed()
- drop(), select(), selectExpr()
- limit(), distinct(), union(), unionByName(); unionAll() is a deprecated alias of union()
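
A short chain combining the calls above; the column names are made up for the example:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

out = (
    df.withColumn("country", F.lit("IN"))                # constant column via lit()
      .withColumnRenamed("name", "full_name")
      .select("id", "full_name", "country")
      .selectExpr("id", "upper(full_name) AS full_name") # SQL expression syntax
      .distinct()
      .limit(10)
)
out.show()
```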
5. Read/Write:
- CSV/JSON: spark.read.csv(), .json()
- Modes: PERMISSIVE (default; keeps rows, nulls out bad fields), DROPMALFORMED, FAILFAST
- Schema: inferSchema costs an extra pass over the data; an explicit schema is faster and safer
- from_json(), to_json(), explode()
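
A read sketch tying these together; the path, columns, and schema are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("tags", StringType(), True),   # a JSON array stored as a string
])

df = (
    spark.read
    .option("header", "true")
    .option("mode", "DROPMALFORMED")           # or PERMISSIVE / FAILFAST
    .schema(schema)                            # explicit schema, no inference pass
    .csv("/tmp/input.csv")
)

parsed = df.withColumn("tags", F.from_json("tags", "array<string>"))
exploded = parsed.select("id", F.explode("tags").alias("tag"))  # one row per tag
```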
6. Performance:
- coalesce() merges partitions without a shuffle (decrease only); repartition() triggers a full shuffle (increase or decrease)
- cache() & persist(): DataFrame cache() defaults to MEMORY_AND_DISK (RDD cache() is MEMORY_ONLY)
- Shuffle partition default: spark.sql.shuffle.partitions = 200
- AQE (Adaptive Query Execution): runtime shuffle-partition coalescing, skew handling, join-strategy switching
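
A sketch of these knobs in code; the values are illustrative, not recommendations:

```python
from pyspark import StorageLevel

df = spark.range(10_000_000)

wide = df.repartition(200)                  # full shuffle; count up or down
narrow = wide.coalesce(10)                  # no shuffle; count down only

df.persist(StorageLevel.MEMORY_AND_DISK)    # what DataFrame cache() does
df.unpersist()

spark.conf.set("spark.sql.shuffle.partitions", "200")   # the 200 default
spark.conf.set("spark.sql.adaptive.enabled", "true")    # AQE (default on in 3.2+)
```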
7. Joins & Skew:
- Joins usually shuffle both inputs; broadcasting the small side avoids the shuffle.
- Skew: a few hot keys leave some partitions far larger than others; handle with salting or AQE.
- Key configs: spark.sql.adaptive.skewJoin.enabled and spark.sql.adaptive.advisoryPartitionSizeInBytes (see the sketch below)
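
The skew settings above spelled out, plus a broadcast join that skips the shuffle for a small table:

```python
from pyspark.sql import functions as F

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")

facts = spark.createDataFrame([(1, 10), (1, 20), (2, 30)], ["key", "amount"])
dims = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "label"])
result = facts.join(F.broadcast(dims), "key")  # small side shipped to executors
```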
8. File Formats:
- Parquet (columnar, efficient), ORC, Avro
- Avoid high-cardinality columns in partitioning
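
A Parquet write partitioned on a low-cardinality column; the data and path are hypothetical:

```python
df = spark.createDataFrame([(1, "IN"), (2, "US")], ["id", "country"])

(df.write
   .mode("overwrite")
   .partitionBy("country")            # few distinct values => few folders
   .parquet("/tmp/events_parquet"))

back = spark.read.parquet("/tmp/events_parquet")
back.filter("country = 'IN'").show()  # reads only the country=IN folder
```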
9. Spark Submit:
- --executor-cores: cores per executor (one concurrent task per core)
- YARN: --num-executors sets the number of executors
- Standalone: --total-executor-cores caps cores across all executors; the executor count follows from it
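
These flags belong on the spark-submit command line; as a rough, illustrative equivalent, the same properties can be set on the builder before the session starts (cluster managers may override some of them):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("SubmitDemo")
    .config("spark.executor.cores", "4")        # --executor-cores
    .config("spark.executor.instances", "10")   # --num-executors (YARN)
    .config("spark.cores.max", "40")            # --total-executor-cores (standalone)
    .getOrCreate()
)
```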
10. Catalog & Views:
- spark.catalog.listTables()
- Temp Views: df.createOrReplaceTempView("view_name")
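
Registering a temp view and querying it with SQL:

```python
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "key"])
df.createOrReplaceTempView("view_name")

spark.sql("SELECT COUNT(*) AS n FROM view_name").show()
print(spark.catalog.listTables())   # temp views show up alongside catalog tables
```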
11. Security:
- Store secrets in secure stores (e.g., an Azure Key Vault-backed secret scope on Databricks); never hardcode credentials in code or notebooks.
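
A Databricks-only sketch; dbutils exists in Databricks notebooks, and the scope/key names here are hypothetical placeholders:

```python
# Fetch a secret from an Azure Key Vault-backed scope (Databricks notebooks only).
jdbc_password = dbutils.secrets.get(scope="kv-backed-scope", key="db-password")
# Use it directly; Databricks redacts secret values in notebook output.
```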
12. Optimization Tips:
- Z-ordering (a Delta Lake feature) co-locates related values so data skipping works across multiple filter columns
- Avoid partitioning on unique or high-cardinality fields (one tiny file per value)
- Partitioning folders: e.g., country=IN
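
A partitioned write that produces country=IN style folders, plus the Delta Lake Z-order command; the DataFrame and table name are hypothetical, and OPTIMIZE ... ZORDER requires Delta Lake:

```python
df.write.partitionBy("country").parquet("/tmp/sales")   # yields .../country=IN/

# Z-ordering is a Delta Lake operation, issued as SQL against a Delta table:
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id, order_date)")
```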