Spark SQL using Python
By
Prof Shibdas Dutta
Associate Professor,
DCG Data Core Systems India Pvt. Ltd.
Kolkata
Table of Contents
• Introduction
• Basic Commands

Introduction
• Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data, and Spark SQL is a tool that enables you to do exactly that.
What Is Spark SQL?
Apache Spark is an open-source framework for processing large datasets in a distributed manner (across a cluster).
Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries. In this post, let's focus on nine basic commands for running SQL queries.
A dataset is a distributed collection of data. A DataFrame is a
Dataset organised into named columns.
Basic commands
Getting to Know Spark SQL
1 — Creating a SparkSession
A SparkSession can be used to create DataFrames, register DataFrames
as tables, execute SQL over tables, cache tables, and read parquet files.
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a SparkSession (or get the existing one) and call it spark
spark = SparkSession.builder.getOrCreate()
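If you want your application to show up with a recognisable name in the Spark UI, you can also set an app name while building the session. A small sketch (the name spark_sql_demo here is just an illustration):

# Build (or reuse) a SparkSession with an explicit application name
spark = SparkSession.builder.appName("spark_sql_demo").getOrCreate()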
2 — Creating DataFrames
There are many ways to create DataFrames in Spark. One of them is reading from an external data source, as in the BigQuery example below.
# Creating DataFrames
df = (spark.read.format('bigquery')
      .option('project', '<your_project_ID>')
      .option('table', '<your_table_name>')
      .load())
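If you don't have the BigQuery connector set up, you can also build a small DataFrame directly from local data for practice. A quick sketch (the sample rows and the columns customer_id, order_id and order_quantity are made up here so that the later examples have something to run against):

# Create a DataFrame from an in-memory list of rows (assumed sample data)
data = [(1, "ORD-100", 5), (2, "ORD-101", 12), (3, "ORD-102", 8)]
df = spark.createDataFrame(data, ["customer_id", "order_id", "order_quantity"])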
3 — Inspecting Data
After you've created your DataFrame, the next thing you may want to do is quickly inspect your data. Here are a few commands!
#print the schema of df
df.printSchema()
#display the content of df
df.show()
#display the first 5 rows of df
df.show(5)
# Print the SparkSession object
print(spark)
# Print the tables in the catalog
print(spark.catalog.listTables())
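A few more quick checks you may find handy (a sketch, using the same df):

# List the column names
print(df.columns)
# Count the number of rows
print(df.count())
# Basic summary statistics (count, mean, stddev, min, max)
df.describe().show()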
Manipulating Data
4 — Creating columns
Let's say you want to create a new column named bonus_quantity and display everything in the new DataFrame newdf.
# Creating or replacing a local temporary view with this DataFrame.
df.createOrReplaceTempView("people")
# Define my query
query = "SELECT *, (order_quantity*0.3) as bonus_quantity from people"
newdf = spark.sql(query)
#display the content of new dataframe
newdf.show()
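The same column can also be added with the DataFrame API instead of SQL, using .withColumn(). A small sketch on the same df:

# Add bonus_quantity with the DataFrame API
from pyspark.sql.functions import col
newdf = df.withColumn("bonus_quantity", col("order_quantity") * 0.3)
newdf.show()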
5 — Selecting
.select
You can select a column with .select():
newdf.select("customer_id").show()
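You can also select several columns at once, or compute an expression while selecting. A quick sketch:

# Select more than one column
newdf.select("customer_id", "order_id").show()
# Select with a SQL-style expression
newdf.selectExpr("customer_id", "order_quantity * 0.3 as bonus_quantity").show()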
6 — Filtering
.filter
Filter the rows of df where order_quantity is greater than 10.
# Filtering
df.filter(df["order_quantity"]>10).show()
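The condition can also be written as a SQL-style string, and you can chain a select after the filter. A small sketch:

# Same filter as a SQL-style string, keeping only two columns
df.filter("order_quantity > 10").select("order_id", "order_quantity").show()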
7 — Aggregating
.min() .max() .count()
All of the common aggregation methods, like .min(), .max(), and .count(), are GroupedData methods. A GroupedData object is created by calling the .groupBy() DataFrame method, so to use these aggregations you first call .groupBy() on the DataFrame. For example, to find the minimum value of order_quantity in df, you could do:
df.groupBy().min("order_quantity").show()
This creates a GroupedData object (so you can use the .min() method), finds the minimum value of order_quantity, and returns it as a DataFrame.
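If you need several aggregations at once, .agg() together with the functions in pyspark.sql.functions comes in handy. A small sketch on the same column:

# Compute min, max and count of order_quantity in one pass
import pyspark.sql.functions as F
df.groupBy().agg(F.min("order_quantity"),
                 F.max("order_quantity"),
                 F.count("order_quantity")).show()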
8 — Grouping and Aggregating
.groupBy() | .avg()
For example, suppose you want to calculate the average order_quantity grouped by order_id.
#calculate average order_quantity, group by order_id
df.groupBy("order_id").avg("order_quantity").show()
Notice that when you pass the name of one or more columns of your DataFrame to the .groupBy() method, the aggregation methods behave just like a GROUP BY statement in a SQL query:
SELECT order_id, avg(order_quantity)
FROM df
GROUP BY order_id
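The same grouped aggregation can be written with .agg(), which also lets you give the output column a nicer name. A small sketch:

# Group by order_id and name the aggregated column avg_quantity
import pyspark.sql.functions as F
df.groupBy("order_id").agg(F.avg("order_quantity").alias("avg_quantity")).show()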
9 — Running Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
You can save the DataFrame as a temporary view and then run SQL queries against it.
# Creating or replacing a local temporary view with this DataFrame.
df.createOrReplaceTempView("people")
# SQL statements can be run by using the sql method
query = "SELECT order_id, order_quantity from people where order_quantity < 10"
peopleCountDf = spark.sql(query)
# Display the content of the result DataFrame
peopleCountDf.show()
Output
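Once you are done, you can optionally drop the temporary view and stop the session. A small sketch:

# Drop the temporary view registered earlier and stop the SparkSession
spark.catalog.dropTempView("people")
spark.stop()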
Happy Learning