Step 4: Data Manipulation
Perform various DataFrame operations such as filtering, selecting columns, grouping, and
aggregating.
1. Filtering Data:
# Filter rows where age > 21
df_filtered = df.filter(df.age > 21)
o filter(condition): Filters rows based on the given condition.
2. Selecting Specific Columns:
# Select specific columns
df_selected = df_filtered.select("name", "age", "city")
o select(*columns): Selects specified columns from the DataFrame.
3. Grouping and Aggregating Data:
# Group by city and count the number of occurrences
df_grouped = df_selected.groupBy("city").count()
o groupBy(*cols): Groups the DataFrame using the specified columns.
o count(): Counts the number of rows in each group. (A sketch combining compound filters with richer aggregations appears after this list.)
4. Displaying Results:
df_filtered.show()
df_selected.show()
df_grouped.show()
o show(): Displays the first 20 rows of the DataFrame by default.
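Conditions passed to filter() can be combined with & (and) and | (or), and groupBy() can drive several aggregates at once through agg(). A minimal sketch, assuming the same name/age/city schema as above (the "New York" literal is a hypothetical value):
# Combine two conditions; each condition must be wrapped in parentheses
from pyspark.sql import functions as F

df_adults_ny = df.filter((F.col("age") > 21) & (F.col("city") == "New York"))

# Compute several aggregates per group in a single pass
df_stats = df.groupBy("city").agg(
    F.count("*").alias("count"),
    F.avg("age").alias("avg_age"),
)
df_stats.show()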
Step 5: Run SQL Queries
Register the DataFrame as a temporary SQL view and execute SQL queries on it.
1. Registering the DataFrame as a SQL Temporary View:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
o createOrReplaceTempView(viewName): Registers the DataFrame as a temporary view with the given name.
2. Executing SQL Queries:
# Count the number of people in each city where age > 21
sql_result = spark.sql(
    "SELECT city, COUNT(*) AS count FROM people WHERE age > 21 GROUP BY city"
)
o spark.sql(query): Executes the specified SQL query and returns the result as a DataFrame. (A sketch of chaining DataFrame operations onto this result follows the list.)
3. Displaying SQL Query Results:
sql_result.show()
o show(): Displays the query result (the first 20 rows by default).
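Because spark.sql() returns an ordinary DataFrame, DataFrame operations can be chained directly onto the result. A minimal sketch, reusing sql_result from the query above:
# Sort the SQL result by its count column, largest city first
from pyspark.sql import functions as F

top_cities = sql_result.orderBy(F.col("count").desc())
top_cities.show()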
Step 6: Stop the SparkSession
After completing the operations, stop the SparkSession to free up resources.
# Stop the Spark session
spark.stop()
o stop(): Shuts down the SparkSession and releases its resources.
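In standalone scripts, one common pattern (a sketch, not a Spark requirement; the app name here is hypothetical) is to wrap the workload in try/finally so the session is stopped even if a step raises an exception:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SafeShutdownExample").getOrCreate()
try:
    # Any workload goes here
    spark.read.json("path/to/json/file.json").show()
finally:
    # Runs whether or not the workload raised, releasing cluster resources
    spark.stop()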
Complete Example Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Step 1: Initialize SparkSession
spark = (
    SparkSession.builder
    .appName("End-to-End DataFrame Workflow")
    .getOrCreate()
)
# Step 2: Create DataFrame from JSON file
df = spark.read.json("path/to/json/file.json")
# Step 3: Explore the DataFrame
df.printSchema()
df.show()
# Step 4: Data Manipulation
df_filtered = df.filter(col("age") > 21)
df_selected = df_filtered.select("name", "age", "city")
df_grouped = df_selected.groupBy("city").count()
df_filtered.show()
df_selected.show()
df_grouped.show()
# Step 5: Run SQL Queries
df.createOrReplaceTempView("people")
sql_result = spark.sql(
    "SELECT city, COUNT(*) AS count FROM people WHERE age > 21 GROUP BY city"
)
sql_result.show()
# Step 6: Stop the SparkSession
spark.stop()
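For reference, spark.read.json() expects line-delimited JSON (one object per line) by default, so a hypothetical input file matching the columns used above could look like:
{"name": "Alice", "age": 25, "city": "New York"}
{"name": "Bob", "age": 19, "city": "Chicago"}
{"name": "Carol", "age": 32, "city": "New York"}
With this input, the age > 21 filter keeps Alice and Carol, and both the grouped count and the SQL query report a count of 2 for New York.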