Spark Zero to Hero - PySpark Cheat Sheet
1. Spark Architecture:
- Driver: Runs the main program, builds the execution plan (DAG), and schedules tasks.
- Executors: JVMs executing tasks on worker nodes.
- Tasks: Operate on partitions (1 task per partition).
- Cores: Each core runs one task at a time, so cores × executors caps parallelism.
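
A minimal sketch of the partition-to-task mapping, assuming local mode (the data size is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ArchDemo").master("local[4]").getOrCreate()

df = spark.range(1_000_000)                    # distributed range of longs
print(df.rdd.getNumPartitions())               # one task per partition per stage
print(spark.sparkContext.defaultParallelism)   # typically equals available cores
```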
2. Transformations:
- Narrow: No shuffle; each output partition depends on a single input partition (e.g., map, filter).
- Wide: Requires a shuffle; an output partition may depend on many input partitions (e.g., groupByKey, join).
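
A quick illustration, assuming the spark session from the sketch above: filter stays narrow, while groupBy forces an Exchange (shuffle) into the plan:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])

narrow = df.filter(F.col("id") > 1)   # narrow: partitions map one-to-one
wide = df.groupBy("key").count()      # wide: equal keys must meet, so data shuffles
wide.explain()                        # look for an Exchange node in the plan
```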
3. SparkSession:
- spark = SparkSession.builder.appName("App").getOrCreate()
- Local mode: .master("local[*]")
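
A fuller builder sketch; the config key shown is a standard Spark SQL setting and the value is illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("App")
    .master("local[*]")                           # all local cores; omit on a cluster
    .config("spark.sql.shuffle.partitions", "8")  # tune down for small local jobs
    .getOrCreate()
)
print(spark.version)
```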
4. DataFrame Operations:
- withColumn(), lit(), withColumnRenamed()
- drop(), select(), selectExpr()
- limit(), distinct(), union(), unionByName(); unionAll() is a deprecated alias of union()
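
A short chain combining the calls above; the column names are made up for the example:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

out = (
    df.withColumn("country", F.lit("IN"))                # constant column via lit()
      .withColumnRenamed("name", "full_name")
      .select("id", "full_name", "country")
      .selectExpr("id", "upper(full_name) AS full_name") # SQL expression syntax
      .distinct()
      .limit(10)
)
out.show()
```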
5. Read/Write:
- CSV/JSON: spark.read.csv(), .json()
- Modes: PERMISSIVE (default; keeps rows, nulls out bad fields), DROPMALFORMED, FAILFAST
- Schema: inferSchema costs an extra pass over the data; an explicit schema is faster and safer
- from_json(), to_json(), explode()
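
A read sketch tying these together; the path, columns, and schema are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("tags", StringType(), True),   # a JSON array stored as a string
])

df = (
    spark.read
    .option("header", "true")
    .option("mode", "DROPMALFORMED")           # or PERMISSIVE / FAILFAST
    .schema(schema)                            # explicit schema, no inference pass
    .csv("/tmp/input.csv")
)

parsed = df.withColumn("tags", F.from_json("tags", "array<string>"))
exploded = parsed.select("id", F.explode("tags").alias("tag"))  # one row per tag
```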
6. Performance:
- coalesce() merges partitions without a shuffle (decrease only); repartition() triggers a full shuffle (increase or decrease)
- cache() & persist(): DataFrame cache() defaults to MEMORY_AND_DISK (RDD cache() is MEMORY_ONLY)
- Shuffle partition default: spark.sql.shuffle.partitions = 200
- AQE (Adaptive Query Execution): runtime shuffle-partition coalescing, skew handling, join-strategy switching
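
A sketch of these knobs in code; the values are illustrative, not recommendations:

```python
from pyspark import StorageLevel

df = spark.range(10_000_000)

wide = df.repartition(200)                  # full shuffle; count up or down
narrow = wide.coalesce(10)                  # no shuffle; count down only

df.persist(StorageLevel.MEMORY_AND_DISK)    # what DataFrame cache() does
df.unpersist()

spark.conf.set("spark.sql.shuffle.partitions", "200")   # the 200 default
spark.conf.set("spark.sql.adaptive.enabled", "true")    # AQE (default on in 3.2+)
```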
7. Joins & Skew:
- Joins usually shuffle both inputs; broadcasting the small side avoids the shuffle.
- Skew: a few hot keys leave some partitions far larger than others; handle with salting or AQE.
- Key configs: spark.sql.adaptive.skewJoin.enabled and spark.sql.adaptive.advisoryPartitionSizeInBytes (see the sketch below)
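
The skew settings above spelled out, plus a broadcast join that skips the shuffle for a small table:

```python
from pyspark.sql import functions as F

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")

facts = spark.createDataFrame([(1, 10), (1, 20), (2, 30)], ["key", "amount"])
dims = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "label"])
result = facts.join(F.broadcast(dims), "key")  # small side shipped to executors
```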
8. File Formats:
- Parquet (columnar, efficient), ORC, Avro
- Avoid high-cardinality columns in partitioning
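
A Parquet write partitioned on a low-cardinality column; the data and path are hypothetical:

```python
df = spark.createDataFrame([(1, "IN"), (2, "US")], ["id", "country"])

(df.write
   .mode("overwrite")
   .partitionBy("country")            # few distinct values => few folders
   .parquet("/tmp/events_parquet"))

back = spark.read.parquet("/tmp/events_parquet")
back.filter("country = 'IN'").show()  # reads only the country=IN folder
```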
9. Spark Submit:
- --executor-cores: cores per executor (one concurrent task per core)
- YARN: --num-executors sets the number of executors
- Standalone: --total-executor-cores caps cores across all executors; the executor count follows from it
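
These flags belong on the spark-submit command line; as a rough, illustrative equivalent, the same properties can be set on the builder before the session starts (cluster managers may override some of them):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("SubmitDemo")
    .config("spark.executor.cores", "4")        # --executor-cores
    .config("spark.executor.instances", "10")   # --num-executors (YARN)
    .config("spark.cores.max", "40")            # --total-executor-cores (standalone)
    .getOrCreate()
)
```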
10. Catalog & Views:
- spark.catalog.listTables()
- Temp Views: df.createOrReplaceTempView("view_name")
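
Registering a temp view and querying it with SQL:

```python
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "key"])
df.createOrReplaceTempView("view_name")

spark.sql("SELECT COUNT(*) AS n FROM view_name").show()
print(spark.catalog.listTables())   # temp views show up alongside catalog tables
```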
11. Security:
- Store secrets in secure stores (e.g., an Azure Key Vault-backed secret scope on Databricks); never hardcode credentials in code or notebooks.
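
A Databricks-only sketch; dbutils exists in Databricks notebooks, and the scope/key names here are hypothetical placeholders:

```python
# Fetch a secret from an Azure Key Vault-backed scope (Databricks notebooks only).
jdbc_password = dbutils.secrets.get(scope="kv-backed-scope", key="db-password")
# Use it directly; Databricks redacts secret values in notebook output.
```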
12. Optimization Tips:
- Z-ordering (a Delta Lake feature) co-locates related values so data skipping works across multiple filter columns
- Avoid partitioning on unique or high-cardinality fields (one tiny file per value)
- Partitioning folders: e.g., country=IN
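
A partitioned write that produces country=IN style folders, plus the Delta Lake Z-order command; the DataFrame and table name are hypothetical, and OPTIMIZE ... ZORDER requires Delta Lake:

```python
df.write.partitionBy("country").parquet("/tmp/sales")   # yields .../country=IN/

# Z-ordering is a Delta Lake operation, issued as SQL against a Delta table:
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id, order_date)")
```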