CHENCHU’S
C.R. Anil Kumar Reddy
Associate Developer for Apache Spark 3.0
🚀 Mastering PySpark and
Databricks 🚀
Optimization Techniques-3
Avoid inferSchema
www.linkedin.com/in/chenchuanil
With inferSchema=True, the data type of the date column was inferred as string instead of
date, and the command took 1.87 seconds to complete execution.
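A minimal sketch of this kind of read, assuming a hypothetical file path and a header row (the exact file and options from the original notebook cell are not reproduced here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Hypothetical path, used only for illustration.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)   # Spark scans the data up front to guess the types
    .csv("/FileStore/tables/unemployment_data.csv")
)

df.printSchema()
# Date-like columns are often inferred as string when their format
# does not match what the CSV parser expects.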
Why we should not use inferSchema
In PySpark, using inferSchema=True can impact lazy evaluation, which is one of the core
features of the Spark framework. Here's how and why that happens:
Understanding Lazy Evaluation in PySpark
Lazy evaluation means that Spark does not immediately compute the result of a
transformation (like map, filter, or select). Instead, it builds an execution plan and
waits until an action (such as show(), collect(), or write()) is called. This approach
optimizes execution by combining multiple transformations into a single stage,
reducing the amount of data read and shuffled.
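A minimal sketch of this behaviour, using a small made-up in-memory DataFrame (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small made-up DataFrame, just to illustrate lazy evaluation.
df = spark.createDataFrame(
    [("AP", 4.2), ("TS", 6.1), ("KA", 5.7)],
    ["state", "rate"],
)

# Transformations: Spark only records these in a logical plan.
high = df.filter(df.rate > 5.0).select("state")

# Nothing has been computed yet; explain() just prints the plan.
high.explain()

# An action (show, collect, write, ...) is what triggers execution.
high.show()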
How inferSchema=True affects Lazy Evaluation
When inferSchema=True is used, Spark needs to determine the data type of each
column. To do this, it must read part of the data to analyze it, which forces Spark to
execute part of the data-loading process immediately, effectively "breaking" lazy
evaluation.
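To see the difference, compare the two reads below (hypothetical path and an assumed DDL schema string); the inferSchema read launches a Spark job right away, while the read with a supplied schema stays lazy until an action is called:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Read with an explicit schema: no job runs at this point.
lazy_df = (
    spark.read
    .option("header", True)
    .schema("id INT, state STRING, date DATE, unemployed_count INT")
    .csv("/FileStore/tables/unemployment_data.csv")
)

# Read with inferSchema=True: Spark scans the file immediately to guess the types.
eager_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/FileStore/tables/unemployment_data.csv")
)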
Performance Overhead Due to Immediate Execution
Since schema inference forces Spark to load part of the data before an action
is called, this step adds extra overhead. Spark can no longer wait to batch
and optimize transformations, which can slow down the process, especially
with large files.
In scenarios where data files are numerous or large, the overhead of inferring
schemas for each file can become a bottleneck.
Also, blindly loading data with inferSchema=True can introduce issues, especially if
the incoming data contains unexpected or "bad" records, such as incorrect formats,
extra columns, or invalid values. That is why a structured approach, with Change
Requests (CRs) and client discussions, is essential.
Explicit Schema Preserves Lazy Evaluation
By explicitly defining the schema (using StructType and StructField), you bypass the need for
Spark to inspect the data in advance, allowing Spark to delay reading the data until an action is
called. This keeps the entire execution pipeline lazy and optimizable, reducing unnecessary
computation and improving efficiency.
Defining a Schema in PySpark Using StructType and StructField
The cell below shows the definition of a schema for a DataFrame in PySpark using StructType
and StructField. The schema, named unemployment_schema, defines the structure of an
unemployment data table. Each field is specified with its data type (IntegerType, StringType,
DateType). This schema is crucial for ensuring data consistency and defining explicit data
types when working with structured data in PySpark.
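A minimal sketch, assuming hypothetical field names, date format, and file path (the actual columns of the original notebook cell are not reproduced here):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Field names are illustrative; use the actual column names of your file.
unemployment_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("state", StringType(), True),
    StructField("date", DateType(), True),
    StructField("unemployed_count", IntegerType(), True),
])

df = (
    spark.read
    .option("header", True)
    .option("dateFormat", "yyyy-MM-dd")   # assumed date format; adjust to the file
    .schema(unemployment_schema)
    .csv("/FileStore/tables/unemployment_data.csv")
)

df.printSchema()  # the date column is now a true DateType, not a string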
We can observe how drastically the time taken to read the file is reduced when we
define the schema (refer to the earlier slide for the time taken without a defined schema).
Benefits of Defining Schema Explicitly in Spark
Avoids Reading the Entire Table:
Spark does not need to infer the schema by scanning the entire file, which reduces
initial computation and speeds up processing.
Preserves Lazy Evaluation:
Explicit schema definition ensures that Spark maintains lazy evaluation, as it does
not trigger unnecessary actions during schema inference.
Reduces Time Taken to Read the File:
By skipping the schema inference process, Spark reads the file faster since the
structure is already known.
Prevents Incorrect Formats:
Explicit schemas enforce the correct data types, so invalid or mismatched values are
not silently accepted.
Ignores Extra Columns:
Extra columns that are not part of the defined schema are ignored, ensuring only
expected data is processed.
Catches Invalid Values Early:
Invalid or corrupt data that does not match the defined schema is flagged as soon as
the data is read, improving data quality (see the sketch after this list).
Explicitly defining schemas improves performance, enforces data integrity, and
avoids unnecessary overhead in Spark.
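A minimal sketch of the last point, reusing the unemployment_schema from the earlier sketch together with a stricter CSV parse mode (FAILFAST), so rows that violate the declared types raise an error when the data is actually read:

strict_df = (
    spark.read
    .option("header", True)
    .option("mode", "FAILFAST")    # fail on rows that do not match the schema
    .schema(unemployment_schema)   # schema defined in the earlier sketch
    .csv("/FileStore/tables/unemployment_data.csv")
)

# The check only runs when an action is called, so the pipeline stays lazy.
strict_df.show(5)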
Torture the data, and it will confess to anything
DATA ANALYTICS
Happy Learning
SHARE IF YOU LIKE THE POST
Let's connect to discuss more on Data
www.linkedin.com/in/chenchuanil