
engine

11 September 2025 13:30

Spark SQL Engine (Detailed Explanation)


1. Introduction
Spark SQL is the structured data processing engine in Apache Spark.
It lets you run SQL queries and work with DataFrames/Datasets, and it provides a unified API for processing structured and semi-structured data at scale.
Internally, it uses two key components:
• Catalyst Optimizer → Optimizes query plans.
• Tungsten Execution Engine → Handles efficient execution (memory & CPU).

2. Execution Flow of a Query in Spark SQL


When you run a SQL/DataFrame query, Spark SQL goes through several stages before execution:

Step 1: Parsing → Unresolved Logical Plan


• Your query is first parsed by a parser (built on ANTLR).
• It checks for syntax correctness (e.g., missing keywords, commas).
• The parser generates an Unresolved Logical Plan:
○ Represents the query structure.
○ At this stage, table/column names are not yet verified.
○ Example: SELECT namee FROM employees parses successfully; the parser doesn't yet know that the namee column doesn't exist.

Step 2: Analysis → Resolved Logical Plan


• The Analyzer takes the unresolved plan.
• It uses the Catalog (metadata) to resolve tables, columns, and functions.
• Checks for:
○ Whether tables/columns exist.
○ Whether data types are compatible.
• Output → Resolved Logical Plan (all references now mapped to actual data schema).

Step 3: Optimization → Optimized Logical Plan


• The Catalyst Optimizer applies a set of rule-based and cost-based optimizations:
○ Predicate Pushdown → Push filters down to the data source.
○ Constant Folding → Simplify constant expressions (2+3 → 5).
○ Projection Pruning → Read only the required columns.
○ Join Reordering → Choose the best join sequence.
• Output → Optimized Logical Plan (a better but still abstract plan).

Step 4: Physical Planning → Physical Plan(s)


• The Planner translates the optimized logical plan into one or more physical plans (actual execution strategies).
• Examples:
○ Join could be done via Broadcast Hash Join or Sort-Merge Join.
• The Cost Model evaluates and picks the best plan.
• Output → Final Physical Plan.

Step 5: Code Generation & Execution


• Spark uses the Tungsten Engine and Whole-Stage Code Generation:
○ Collapses chains of physical operators into single functions of optimized Java bytecode.
○ Improves CPU efficiency by avoiding virtual function calls and per-row interpretation overhead.
• The plan is executed in parallel on Spark executors using RDDs and tasks.
• Final output is returned as a DataFrame/Table/ResultSet.

3. Plan Types in Spark SQL


Here’s the breakdown of the plan types:
1. Unresolved Logical Plan → Generated after parsing, contains query structure but unresolved references.
2. Resolved Logical Plan → After Analyzer step, all columns, tables, and functions are verified using metadata.
3. Optimized Logical Plan → Catalyst Optimizer applies optimization rules for efficiency.
4. Physical Plan(s) → Multiple execution strategies are generated, cost model chooses the best one.
5. Final Execution Plan → Sent to Spark Core for distributed execution.

4. Key Components
• Catalyst Optimizer → Rule-based + cost-based query optimization.
• Tungsten Execution Engine → Handles memory management, caching, whole-stage codegen.
• Catalog → Metadata store for tables, columns, and schemas.
• Data Sources API → Enables reading from Hive, Parquet, ORC, JSON, JDBC, Delta, etc.

5. Why Spark SQL is Powerful


• Unified access via SQL, DataFrames, and Datasets.
• Advanced query optimization (Catalyst).
• High performance execution (Tungsten + CodeGen).
• Works across structured & semi-structured data.
• Connects with BI tools (Power BI, Tableau, JDBC).

✅ In short:
Spark SQL Engine converts a query into multiple plans — unresolved → resolved → optimized logical → physical plan → execution.
Catalyst Optimizer + Tungsten Execution together make Spark SQL fast, scalable, and efficient.
