PySpark Cheat Sheet
What is PySpark?
PySpark is the Python API for Apache Spark. It lets you process huge amounts of data quickly and in parallel across many computers. Think of it as pandas for big data, but faster and built for scale. You can use PySpark to:

- Clean and analyze large datasets
- Run SQL queries on big data
- Build machine learning models
- Handle real-time data
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App").getOrCreate()
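The entries on the following pages call Spark's built-in functions through an F alias, which assumes one extra import next to the session setup:

from pyspark.sql import functions as F  # used below as F.log, F.when, F.col, etc.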
Reading & Writing Data
spark.read.csv("file.csv", header=True, inferSchema=True): Read CSV with headers and infer schema.
spark.read.json("file.json"): Read JSON file into DataFrame.
spark.read.parquet("file.parquet"): Read Parquet format file.
spark.read.option("multiLine", True).json("file.json"): Handle multi-line JSON.
spark.read.text("file.txt"): Read a plain text file.
spark.read.format("jdbc").options(...).load(): Load data from a JDBC source.
df.write.csv("output.csv", header=True): Write to CSV.
df.write.mode("overwrite").parquet("out.parquet"): Overwrite and write to Parquet.
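A minimal sketch chaining a read and a write; the file paths are placeholders, and it assumes the spark session created above:

# read a CSV with a header row, letting Spark guess column types
df = spark.read.csv("file.csv", header=True, inferSchema=True)
# note: Spark writes a directory of part files, not a single file
df.write.mode("overwrite").parquet("out.parquet")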
Data Exploration
df.show(): Display first 20 rows.
df.show(10, truncate=False): Show 10 rows with full column values.
df.printSchema(): Print schema of DataFrame.
df.describe().show(): Summary stats for numeric columns.
df.summary().show(): Count, mean, stddev, min, max, and percentiles.
df.columns: List of column names.
df.dtypes: Get column names and their data types.
df.count(): Total number of rows.
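A runnable sketch of the exploration calls on a tiny in-memory DataFrame (the data is hypothetical):

df = spark.createDataFrame([("A", 10), ("B", 25), ("C", None)], ["name", "score"])
df.show(10, truncate=False)               # full column values
df.printSchema()                          # name: string, score: long
print(df.columns, df.dtypes, df.count())  # column names, types, and row count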
Data Cleaning
df.dropna(): Drop rows with any nulls.
df.dropna(how="all"): Drop rows where all values are null.
df.dropna(subset=["col1", "col2"]): Drop rows with nulls in specific columns.
df.fillna(0): Replace all nulls with 0.
df.fillna({"col": "missing"}): Replace nulls in a column with a specific value.
df.dropDuplicates(): Remove duplicate rows.
df.dropDuplicates(["col1", "col2"]): Remove duplicates based on columns.
df = df.withColumn("col", df["col"].cast("integer")): Convert data type.
df = df.filter(df["col"] <= 1000): Remove outliers conditionally.
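A sketch chaining several cleaning steps, reusing the toy df from the exploration sketch (column names are hypothetical):

clean = (
    df.dropna(subset=["score"])       # drop rows with a null score
      .fillna({"name": "missing"})    # fill null names
      .dropDuplicates(["name"])       # keep one row per name
      .withColumn("score", F.col("score").cast("integer"))
)
clean.show()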
Data Manipulation
df.withColumn("d_col", df["col"] * 2):
Create new column with transformation.
df.withColumn("log_col",
F.log(df["col"])): Log transformation.
df.withColumn("flag", F.when(df["col"] >
100, 1).otherwise(0)): Conditional flag
column.
df.withColumnRenamed("old", "new"):
Rename a column.
df.selectExpr("col1 + col2 as total"): Use
SQL expression to manipulate columns.
df = df.drop("col1", "col2"): Drop multiple
columns.
df = df.select(F.col("col1"), F.col("col2")):
Select multiple columns using ‘col’. df =
df.withColumn("day",F.dayofmonth("da
te_col")): Extract day from date column.
Filtering & Conditions
df.filter(df["col"] > 50): Filter rows where
column > 50.
df.where(df["status"] == "active"): Filter
rows using `where`.
df.filter((df["col1"] > 10) & (df["col2"] <
100)): Filter with multiple conditions.
df.filter(df["col"].isin("A", "B")): Filter with
multiple matching values.
df.filter(df["col"].isNotNull()): Keep rows
with non-null values.
df.withColumn("category",
F.when(df["score"] > 80,
"High").otherwise("Low")): Categorize
values. df.where(~df["col"].isin("A", "B")):
Filter rows not in list.
df = df.limit(100): Limit rows for preview or
sampling.
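Combining conditions on the same toy df; each condition needs its own parentheses, and PySpark uses &, |, ~ rather than and, or, not:

subset = df.filter(
    (F.col("score") > 10) & (F.col("name").isin("A", "B"))
)
subset.limit(100).show()  # cap the output for a quick preview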
Aggregation & Grouping
df.groupBy("col").count(): Count rows per
group.
df.groupBy("col").sum("sales"): Sum of a
column per group.
df.groupBy("col").avg("score"): Average
per group.
df.groupBy("region").agg(F.max(" sales"),
F.min("sales")): Multiple aggregations.
df.agg(F.mean("amount")).show():
Aggregate without grouping.
df.groupBy("col1",
"col2").agg(F.sum("val")): Group by
multiple columns.
df.rollup("col").sum("val").show (): Rollup
total + subtotals.
df.cube("col").sum("val").show() : Cube
(all combinations).
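A grouped-aggregation sketch on hypothetical sales data, using .alias() to name the output columns:

sales = spark.createDataFrame(
    [("east", "A", 100), ("east", "B", 50), ("west", "A", 75)],
    ["region", "product", "sales"],
)
sales.groupBy("region").agg(
    F.sum("sales").alias("total_sales"),
    F.avg("sales").alias("avg_sales"),
).show()
sales.rollup("region").sum("sales").show()  # per-region subtotals plus a grand-total row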
Sorting & Duplicates
df.orderBy("col"): Sort ascending.
df.orderBy(df["col"].desc()): Sort
descending.
df.sort("col1", "col2"): Sort by multiple
columns.
df.sortWithinPartitions("col") : Sort data
within partition.
df = df.dropDuplicates(): Remove all
duplicate rows.
df.dropDuplicates(["col"]): Remove
duplicates on column.
df = df.distinct(): Return unique rows.
df = df.limit(10): Return top 10 rows.
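Deduplicating and then sorting the hypothetical sales data from the previous sketch:

top = (
    sales.dropDuplicates(["region", "product"])  # unique region/product pairs
         .orderBy(F.col("sales").desc())         # highest sales first
         .limit(10)
)
top.show()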
Joins & Merge
df1.join(df2, "key"): Inner join on column.
df1.join(df2, "key", "left"): Left join.
df1.join(df2, "key", "right"): Right join.
df1.join(df2, "key", "outer"): Full outer join.
df1.join(df2, df1["id"] == df2["id"], "inner"):
Join using condition.
df1.crossJoin(df2): Cartesian join.
df1.union(df2): Append rows (same schema).
df1.unionByName(df2): Union using column
names.
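A join sketch with two tiny hypothetical tables, showing how left and outer joins differ in which rows survive:

orders = spark.createDataFrame([(1, 250), (2, 90)], ["id", "amount"])
users = spark.createDataFrame([(1, "Ann"), (3, "Bo")], ["id", "name"])

orders.join(users, "id", "left").show()   # all orders; name is null where no user matches
orders.join(users, "id", "outer").show()  # every id from both tables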