
Spark Zero to Hero - PySpark Cheat Sheet

1. Spark Architecture:

- Driver: Runs the main program, builds the execution plan, and schedules tasks on executors.

- Executors: JVM processes on worker nodes that execute tasks and hold cached data.

- Tasks: The unit of work; each task processes one partition.

- Cores: Each executor core runs one task at a time.
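
A quick way to see this mapping, as a minimal sketch in local mode:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App").getOrCreate()
print(spark.sparkContext.defaultParallelism)  # cores available for running tasks
df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())              # one task runs per partition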

2. Transformations:

- Narrow: One-to-one partition mapping (e.g., map, filter).

- Wide: Output partitions depend on many input partitions, so a shuffle occurs (e.g., groupByKey, join).
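
A minimal sketch contrasting the two (column names are illustrative; spark is the active session, created as in section 3):

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "val"])
narrow = df.filter(F.col("val") > 1)        # narrow: no shuffle, stays within partitions
wide = df.groupBy("key").agg(F.sum("val"))  # wide: rows are shuffled by key
wide.show()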

3. SparkSession:

- spark = SparkSession.builder.appName("App").getOrCreate()

- Local mode: .master("local[*]")
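
Putting the two lines together (the app name is arbitrary):

from pyspark.sql import SparkSession

# local[*] runs Spark in-process using all available cores
spark = (SparkSession.builder
         .appName("App")
         .master("local[*]")
         .getOrCreate())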

4. DataFrame Operations:

- withColumn(), lit(), withColumnRenamed()

- drop(), select(), selectExpr()

- limit(), distinct(), union() (unionAll() is a deprecated alias of union())
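
A short chain exercising these (data and column names are illustrative):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = (df.withColumn("country", F.lit("IN"))      # add a constant column
         .withColumnRenamed("name", "full_name")  # rename a column
         .select("id", "full_name", "country"))
df2.union(df2).distinct().limit(10).show()        # union keeps duplicates; distinct removes them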

5. Read/Write:

- CSV/JSON: spark.read.csv(), .json()

- Read modes for malformed records: permissive (default), dropMalformed, failFast

- Schema: inferSchema or define explicitly

- from_json(), to_json(), explode()
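
A hedged read example (the file path and schema are placeholders):

from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType)

schema = StructType([
    StructField("id", IntegerType()),
    StructField("payload", StringType()),
])
# failFast aborts on malformed rows; permissive (the default) nulls them out
df = spark.read.csv("/path/to/input.csv", schema=schema, header=True, mode="FAILFAST")

# Parse a JSON string column, then flatten its array field to one row per element
json_schema = StructType([StructField("tags", ArrayType(StringType()))])
parsed = df.withColumn("data", F.from_json("payload", json_schema))
parsed.select("id", F.explode("data.tags").alias("tag")).show()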


6. Performance:

- coalesce() merges partitions without a full shuffle; repartition() performs a full shuffle

- cache() & persist(); DataFrame cache() uses MEMORY_AND_DISK

- Shuffle partition default: spark.sql.shuffle.partitions = 200

- AQE: Adaptive Query Execution (coalesces shuffle partitions and splits skewed ones at runtime)
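
A minimal sketch of these knobs:

from pyspark import StorageLevel

df = spark.range(10_000_000)
repartitioned = df.repartition(8)             # full shuffle into 8 partitions
merged = repartitioned.coalesce(4)            # merge down to 4 without a full shuffle
merged.persist(StorageLevel.MEMORY_AND_DISK)  # the level DataFrame cache() uses
merged.count()                                # an action materializes the cache

spark.conf.set("spark.sql.shuffle.partitions", "200")  # the default value
spark.conf.set("spark.sql.adaptive.enabled", "true")   # turn on AQE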

7. Joins & Skew:

- Joins generally trigger a shuffle; broadcasting a small table avoids it.

- Skew: Unevenly distributed keys; handle with salting or AQE.

- Key settings: spark.sql.adaptive.skewJoin.enabled and spark.sql.adaptive.advisoryPartitionSizeInBytes
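
A minimal salting sketch (big_df, small_df, the key column, and num_salts are illustrative), plus the AQE settings named above:

from pyspark.sql import functions as F

num_salts = 8
# Large side: scatter each row across num_salts sub-keys
big = big_df.withColumn("salt", (F.rand() * num_salts).cast("int"))
# Small side: replicate each row once per salt value
salts = F.array([F.lit(i) for i in range(num_salts)])
small = small_df.withColumn("salt", F.explode(salts))
joined = big.join(small, ["key", "salt"]).drop("salt")

# Or let AQE split skewed shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")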

8. File Formats:

- Parquet (columnar, efficient), ORC, Avro

- Avoid high-cardinality columns in partitioning
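
A write/read round trip (df, the country column, and the path are placeholders):

# country has few distinct values, so it is a safe partition column
df.write.mode("overwrite").partitionBy("country").parquet("/path/to/out")
# Columnar storage enables column pruning and predicate pushdown on read
spark.read.parquet("/path/to/out").filter("country = 'IN'").show()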

9. Spark Submit:

- --executor-cores: cores per executor

- YARN: --num-executors sets the number of executors

- Standalone: --total-executor-cores caps total cores across the cluster
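
An illustrative submit command (app.py and the resource numbers are placeholders; on Standalone, swap --num-executors for --total-executor-cores):

spark-submit --master yarn \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  app.py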

10. Catalog & Views:

- spark.catalog.listTables()

- Temp Views: df.createOrReplaceTempView("view_name")
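
A small sketch (df and the view name are illustrative):

df.createOrReplaceTempView("people")             # session-scoped; gone when the session ends
spark.sql("SELECT COUNT(*) FROM people").show()
for t in spark.catalog.listTables():
    print(t.name, t.isTemporary)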

11. Security:

- Store secrets in secure stores (e.g., Azure Key Vault-backed secret scopes on Databricks), never in code or notebooks
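
A Databricks-specific sketch (dbutils is predefined in notebooks; the scope, key, and connection details are placeholders):

password = dbutils.secrets.get(scope="my-scope", key="db-password")
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "public.users")
      .option("user", "app_user")
      .option("password", password)  # never hard-code credentials
      .load())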


12. Optimization Tips:

- Z-ordering (Delta Lake): co-locates data across multiple filter columns

- Avoid partitioning on unique or high-cardinality fields

- Partitioning creates Hive-style folders, e.g., country=IN
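
Hedged examples; the Z-order step assumes a Delta Lake table, and all table, column, and path names are placeholders:

# Delta Lake: Z-order clusters files on multiple filter columns
spark.sql("OPTIMIZE events ZORDER BY (country, event_date)")

# Hive-style partition folders: .../country=IN/part-*.parquet
df.write.partitionBy("country").parquet("/path/to/events")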
