engine
11 September 2025 13:30
Spark SQL Engine (Detailed Explanation)
1. Introduction
Spark SQL is the structured data processing engine in Apache Spark.
It lets you run SQL queries and work with DataFrames/Datasets, and it provides a unified API for processing structured and semi-structured data at scale.
Internally, it uses two key components:
• Catalyst Optimizer → Optimizes query plans.
• Tungsten Execution Engine → Handles efficient execution (memory & CPU).
2. Execution Flow of a Query in Spark SQL
When you run a SQL/DataFrame query, Spark SQL goes through several stages before execution:
Step 1: Parsing → Unresolved Logical Plan
• Your query is first parsed by a parser (built on ANTLR).
• It checks for syntax correctness (e.g., missing keywords, commas).
• The parser generates an Unresolved Logical Plan:
○ Represents the query structure.
○ At this stage, table/column names are not yet verified.
○ Example: if you write SELECT namee FROM employees, parsing succeeds; the fact that the namee column doesn't exist is only caught in the next stage.
Step 2: Analysis → Resolved Logical Plan
• The Analyzer takes the unresolved plan.
• It uses the Catalog (metadata) to resolve tables, columns, and functions.
• Checks for:
○ Whether tables/columns exist.
○ Whether data types are compatible.
• Output → Resolved Logical Plan (all references now mapped to actual data schema).
Step 3: Optimization → Optimized Logical Plan
• The Catalyst Optimizer applies a set of rule-based and cost-based optimizations:
○ Predicate Pushdown → Push filters close to data source.
○ Constant Folding → Simplify expressions (2+3 → 5).
○ Projection Pruning → Read only required columns.
○ Join Reordering → Choose best join sequence.
• Output → Optimized Logical Plan (a better but still abstract plan).
Step 4: Physical Planning → Physical Plan(s)
• The Planner translates the optimized logical plan into one or more physical plans (actual execution strategies).
• Examples:
○ Join could be done via Broadcast Hash Join or Sort-Merge Join.
• The Cost Model evaluates and picks the best plan.
• Output → Final Physical Plan.
Step 5: Code Generation & Execution
• Spark uses the Tungsten Engine and Whole-Stage Code Generation:
○ Fuses chains of operators in the physical plan into single functions compiled to Java bytecode.
○ Improves CPU efficiency by eliminating virtual function calls and keeping intermediate data in CPU registers rather than memory.
• The plan is executed in parallel on Spark executors using RDDs and tasks.
• Final output is returned as a DataFrame/Table/ResultSet.
3. Plan Types in Spark SQL
Here is a breakdown of each plan type:
1. Unresolved Logical Plan → Generated after parsing, contains query structure but unresolved references.
2. Resolved Logical Plan → After Analyzer step, all columns, tables, and functions are verified using metadata.
3. Optimized Logical Plan → Catalyst Optimizer applies optimization rules for efficiency.
4. Physical Plan(s) → Multiple execution strategies are generated, cost model chooses the best one.
5. Final Execution Plan → Sent to Spark Core for distributed execution.
4. Key Components
• Catalyst Optimizer → Rule-based + cost-based query optimization.
• Tungsten Execution Engine → Handles memory management, caching, whole-stage codegen.
• Catalog → Metadata store for tables, columns, and schemas.
• Data Sources API → Enables reading from Hive, Parquet, ORC, JSON, JDBC, Delta, etc.
5. Why Spark SQL is Powerful
• Unified access via SQL, DataFrames, and Datasets.
• Advanced query optimization (Catalyst).
• High performance execution (Tungsten + CodeGen).
• Works across structured & semi-structured data.
• Connects with BI tools (Power BI, Tableau, JDBC).
✅ In short:
Spark SQL Engine converts a query into multiple plans — unresolved → resolved → optimized logical → physical plan → execution.
Catalyst Optimizer + Tungsten Execution together make Spark SQL fast, scalable, and efficient.
pyspark Page 1