CHENCHU’S
C.R. Anil Kumar Reddy
Associate Developer for Apache Spark 3.0
🚀 Mastering PySpark and
Databricks 🚀
Optimization Techniques-3
Avoid inferSchema
www.linkedin.com/in/chenchuanil
With inferSchema=True, the data type of the date column was inferred as string instead of
date, and the command took 1.87 seconds to complete execution.
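A minimal sketch of this kind of read, assuming a hypothetical file path and a header row (the exact file and options from the original notebook cell are not reproduced here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Hypothetical path, used only for illustration.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)   # Spark scans the data up front to guess the types
    .csv("/FileStore/tables/unemployment_data.csv")
)

df.printSchema()
# Date-like columns are often inferred as string when their format
# does not match what the CSV parser expects.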
Why we should not use inferSchema
In PySpark, using inferSchema=True can impact lazy evaluation, which is one of the core
features of the Spark framework. Here's how and why that happens:
Understanding Lazy Evaluation in PySpark
Lazy evaluation means that Spark does not immediately compute the result of a
transformation (like map, filter, or select). Instead, it builds an execution plan and
waits until an action (such as show(), collect(), or write()) is called. This approach
optimizes execution by combining multiple transformations into a single stage,
reducing the amount of data read and shuffled.
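A minimal sketch of this behaviour, using a small made-up in-memory DataFrame (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small made-up DataFrame, just to illustrate lazy evaluation.
df = spark.createDataFrame(
    [("AP", 4.2), ("TS", 6.1), ("KA", 5.7)],
    ["state", "rate"],
)

# Transformations: Spark only records these in a logical plan.
high = df.filter(df.rate > 5.0).select("state")

# Nothing has been computed yet; explain() just prints the plan.
high.explain()

# An action (show, collect, write, ...) is what triggers execution.
high.show()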
How inferSchema=True affects Lazy Evaluation
When inferSchema=True is used, Spark needs to determine the data type of each
column. To do this, it must read part of the data to analyze it, which forces Spark to
execute part of the data-loading process immediately, effectively "breaking" lazy
evaluation.
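To see the difference, compare the two reads below (hypothetical path and an assumed DDL schema string); the inferSchema read launches a Spark job right away, while the read with a supplied schema stays lazy until an action is called:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Read with an explicit schema: no job runs at this point.
lazy_df = (
    spark.read
    .option("header", True)
    .schema("id INT, state STRING, date DATE, unemployed_count INT")
    .csv("/FileStore/tables/unemployment_data.csv")
)

# Read with inferSchema=True: Spark scans the file immediately to guess the types.
eager_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/FileStore/tables/unemployment_data.csv")
)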
Performance Overhead Due to Immediate Execution
Since schema inference forces Spark to load part of the data before an action
is called, this step adds extra overhead. Spark can no longer wait to batch
and optimize transformations, which can slow down the process, especially
with large files.
In scenarios where data files are numerous or large, the overhead of inferring
schemas for each file can become a bottleneck.
Also, blindly loading data with inferSchema=True can introduce issues, especially if
the incoming data contains unexpected or "bad" records, such as incorrect formats,
extra columns, or invalid values. That is why a structured approach, with Change
Requests (CRs) and client discussions, is essential.
Explicit Schema Preserves Lazy Evaluation
By explicitly defining the schema (using StructType and StructField), you bypass the need for
Spark to inspect the data in advance, allowing Spark to delay reading the data until an action is
called. This keeps the entire execution pipeline lazy and optimizable, reducing unnecessary
computation and improving efficiency.
Defining a Schema in PySpark Using StructType and StructField
The cell below shows the definition of a schema for a DataFrame in PySpark using StructType
and StructField. The schema, named unemployment_schema, defines the structure of an
unemployment data table. Each field is specified with its data type (IntegerType, StringType,
DateType). This schema is crucial for ensuring data consistency and defining explicit data
types when working with structured data in PySpark.
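A minimal sketch, assuming hypothetical field names, date format, and file path (the actual columns of the original notebook cell are not reproduced here):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Field names are illustrative; use the actual column names of your file.
unemployment_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("state", StringType(), True),
    StructField("date", DateType(), True),
    StructField("unemployed_count", IntegerType(), True),
])

df = (
    spark.read
    .option("header", True)
    .option("dateFormat", "yyyy-MM-dd")   # assumed date format; adjust to the file
    .schema(unemployment_schema)
    .csv("/FileStore/tables/unemployment_data.csv")
)

df.printSchema()  # the date column is now a true DateType, not a string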
We can observe how drastically the time taken to read the file is reduced when we
define the schema (refer to the earlier slide for the time taken without a defined schema).
Benefits of Defining Schema Explicitly in Spark
Avoids Reading the Entire Table:
Spark does not need to infer the schema by scanning the entire file, which reduces
initial computation and speeds up processing.
Preserves Lazy Evaluation:
Explicit schema definition ensures that Spark maintains lazy evaluation, as it does
not trigger unnecessary actions during schema inference.
Reduces Time Taken to Read the File:
By skipping the schema inference process, Spark reads the file faster since the
structure is already known.
Prevents Incorrect Formats:
Explicit schemas enforce the correct data types, so invalid or mismatched values are
not silently accepted.
Ignores Extra Columns:
Extra columns that are not part of the defined schema are ignored, ensuring only
expected data is processed.
Catches Invalid Values Early:
Invalid or corrupt data that does not match the defined schema is flagged as soon as
the data is read, improving data quality (see the sketch after this list).
Explicitly defining schemas improves performance, enforces data integrity, and
avoids unnecessary overhead in Spark.
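A minimal sketch of the last point, reusing the unemployment_schema from the earlier sketch together with a stricter CSV parse mode (FAILFAST), so rows that violate the declared types raise an error when the data is actually read:

strict_df = (
    spark.read
    .option("header", True)
    .option("mode", "FAILFAST")    # fail on rows that do not match the schema
    .schema(unemployment_schema)   # schema defined in the earlier sketch
    .csv("/FileStore/tables/unemployment_data.csv")
)

# The check only runs when an action is called, so the pipeline stays lazy.
strict_df.show(5)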
Torture the data, and it will confess to anything
DATA ANALYTICS
Happy Learning
SHARE IF YOU LIKE THE POST
Let's connect to discuss more on Data
www.linkedin.com/in/chenchuanil