Spark SQL using Python
By
Prof Shibdas Dutta
Associate Professor,
DCG Data Core Systems India Pvt. Ltd.
Kolkata
Table of Contents
• Introduction
• Basic Commands

Introduction
• Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data, and Spark SQL is a tool that enables you to do exactly that.
What Is Spark SQL?
Apache Spark is an open-source framework for processing large datasets in a distributed manner (across a cluster).
Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries. In this post, let's focus on nine basic commands for running SQL queries.
A dataset is a distributed collection of data. A DataFrame is a
Dataset organised into named columns.
Basic commands
Getting to Know Spark SQL
1 — Creating a SparkSession
A SparkSession can be used to create DataFrames, register DataFrames
as tables, execute SQL over tables, cache tables, and read parquet files.
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a SparkSession (or get the existing one) and call it spark
spark = SparkSession.builder.getOrCreate()
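If you want your application to show up with a recognisable name in the Spark UI, you can also set an app name while building the session. A small sketch (the name spark_sql_demo here is just an illustration):

# Build (or reuse) a SparkSession with an explicit application name
spark = SparkSession.builder.appName("spark_sql_demo").getOrCreate()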
2 — Creating DataFrames
There are many ways to create DataFrames in Spark. One of them is reading from an external data source, as in the BigQuery example below.
# Creating DataFrames
df = (spark.read.format('bigquery')
      .option('project', '<your_project_ID>')
      .option('table', '<your_table_name>')
      .load())
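If you don't have the BigQuery connector set up, you can also build a small DataFrame directly from local data for practice. A quick sketch (the sample rows and the columns customer_id, order_id and order_quantity are made up here so that the later examples have something to run against):

# Create a DataFrame from an in-memory list of rows (assumed sample data)
data = [(1, "ORD-100", 5), (2, "ORD-101", 12), (3, "ORD-102", 8)]
df = spark.createDataFrame(data, ["customer_id", "order_id", "order_quantity"])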
3 — Inspecting Data
After you've created your DataFrame, the next thing you may want to do is quickly inspect your data. Here are a few commands!
#print the schema of df
df.printSchema()
#display the content of df
df.show()
#display the first 5 rows of df
df.show(5)
# Print the SparkSession object
print(spark)
# Print the tables in the catalog
print(spark.catalog.listTables())
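A few more quick checks you may find handy (a sketch, using the same df):

# List the column names
print(df.columns)
# Count the number of rows
print(df.count())
# Basic summary statistics (count, mean, stddev, min, max)
df.describe().show()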
Manipulating Data
4 — Creating columns
Let's say you want to create a new column named bonus_quantity and display everything in the new DataFrame newdf.
# Creating or replacing a local temporary view with this DataFrame.
df.createOrReplaceTempView("people")
# Define my query
query = "SELECT *, (order_quantity*0.3) as bonus_quantity from people"
newdf = spark.sql(query)
#display the content of new dataframe
newdf.show()
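The same column can also be added with the DataFrame API instead of SQL, using .withColumn(). A small sketch on the same df:

# Add bonus_quantity with the DataFrame API
from pyspark.sql.functions import col
newdf = df.withColumn("bonus_quantity", col("order_quantity") * 0.3)
newdf.show()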
5 — Selecting
.select
You can select a column with .select():
newdf.select("customer_id").show()
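You can also select several columns at once, or compute an expression while selecting. A quick sketch:

# Select more than one column
newdf.select("customer_id", "order_id").show()
# Select with a SQL-style expression
newdf.selectExpr("customer_id", "order_quantity * 0.3 as bonus_quantity").show()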
6 — Filtering
.filter
Filter the rows of df where order_quantity is greater than 10.
# Filtering
df.filter(df["order_quantity"]>10).show()
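The condition can also be written as a SQL-style string, and you can chain a select after the filter. A small sketch:

# Same filter as a SQL-style string, keeping only two columns
df.filter("order_quantity > 10").select("order_id", "order_quantity").show()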
7 — Aggregating
.min() .max() .count()
All of the common aggregation methods, like .min(), .max(), and .count(), are GroupedData methods. A GroupedData object is created by calling the .groupBy() DataFrame method, so to use these aggregations you first call .groupBy() on the DataFrame. For example, to find the minimum value of order_quantity in df, you could do:
df.groupBy().min("order_quantity").show()
This creates a GroupedData object (so you can use the .min() method), finds the minimum value of order_quantity, and returns it as a DataFrame.
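If you need several aggregations at once, .agg() together with the functions in pyspark.sql.functions comes in handy. A small sketch on the same column:

# Compute min, max and count of order_quantity in one pass
import pyspark.sql.functions as F
df.groupBy().agg(F.min("order_quantity"),
                 F.max("order_quantity"),
                 F.count("order_quantity")).show()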
8 — Grouping and Aggregating
.groupBy() | .avg()
For example, suppose you want to calculate the average order_quantity grouped by order_id.
#calculate average order_quantity, group by order_id
df.groupBy("order_id").avg("order_quantity").show()
Notice that when you pass the name of one or more columns of your DataFrame to the .groupBy() method, the aggregation methods behave just like a GROUP BY statement in a SQL query:
SELECT order_id, avg(order_quantity)
FROM df
GROUP BY order_id
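The same grouped aggregation can be written with .agg(), which also lets you give the output column a nicer name. A small sketch:

# Group by order_id and name the aggregated column avg_quantity
import pyspark.sql.functions as F
df.groupBy("order_id").agg(F.avg("order_quantity").alias("avg_quantity")).show()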
9 — Running Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
You can save the DataFrame as a temporary view and then run SQL queries against it.
# Creating or replacing a local temporary view with this DataFrame.
df.createOrReplaceTempView("people")
# SQL statements can be run by using the sql method
query = "SELECT order_id, order_quantity from people where order_quantity < 10"
peopleCountDf = spark.sql(query)
# Display the content of the result DataFrame
peopleCountDf.show()
Output
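Once you are done, you can optionally drop the temporary view and stop the session. A small sketch:

# Drop the temporary view registered earlier and stop the SparkSession
spark.catalog.dropTempView("people")
spark.stop()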
Happy Learning