Step 4: Data Manipulation
Perform various DataFrame operations such as filtering, selecting columns, grouping, and
aggregating.
1. Filtering Data:
# Filter rows where age > 21
df_filtered = df.filter(df.age > 21)
o filter(condition): Filters rows based on the given condition.
2. Selecting Specific Columns:
# Select specific columns
df_selected = df_filtered.select("name", "age", "city")
o select(*columns): Selects specified columns from the DataFrame.
3. Grouping and Aggregating Data:
# Group by city and count the number of occurrences
df_grouped = df_selected.groupBy("city").count()
o groupBy(*cols): Groups the DataFrame using the specified columns.
o count(): Counts the number of rows in each group. (A sketch combining compound filters with richer aggregations appears after this list.)
4. Displaying Results:
df_filtered.show()
df_selected.show()
df_grouped.show()
o show(): Displays the first 20 rows of the DataFrame by default.
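Conditions passed to filter() can be combined with & (and) and | (or), and groupBy() can drive several aggregates at once through agg(). A minimal sketch, assuming the same name/age/city schema as above (the "New York" literal is a hypothetical value):
# Combine two conditions; each condition must be wrapped in parentheses
from pyspark.sql import functions as F

df_adults_ny = df.filter((F.col("age") > 21) & (F.col("city") == "New York"))

# Compute several aggregates per group in a single pass
df_stats = df.groupBy("city").agg(
    F.count("*").alias("count"),
    F.avg("age").alias("avg_age"),
)
df_stats.show()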
Step 5: Run SQL Queries
Register the DataFrame as a temporary SQL view and execute SQL queries on it.
1. Registering the DataFrame as a SQL Temporary View:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
o createOrReplaceTempView(viewName): Registers the DataFrame as a temporary view with the given name.
2. Executing SQL Queries:
# Count the number of people in each city where age > 21
sql_result = spark.sql(
    "SELECT city, COUNT(*) AS count FROM people WHERE age > 21 GROUP BY city"
)
o spark.sql(query): Executes the specified SQL query and returns the result as a DataFrame. (A sketch of chaining DataFrame operations onto this result follows the list.)
3. Displaying SQL Query Results:
sql_result.show()
o show(): Displays the query result (the first 20 rows by default).
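Because spark.sql() returns an ordinary DataFrame, DataFrame operations can be chained directly onto the result. A minimal sketch, reusing sql_result from the query above:
# Sort the SQL result by its count column, largest city first
from pyspark.sql import functions as F

top_cities = sql_result.orderBy(F.col("count").desc())
top_cities.show()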
Step 6: Stop the SparkSession
After completing the operations, stop the SparkSession to free up resources.
# Stop the Spark session
spark.stop()
o stop(): Shuts down the SparkSession and releases its resources.
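In standalone scripts, one common pattern (a sketch, not a Spark requirement; the app name here is hypothetical) is to wrap the workload in try/finally so the session is stopped even if a step raises an exception:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SafeShutdownExample").getOrCreate()
try:
    # Any workload goes here
    spark.read.json("path/to/json/file.json").show()
finally:
    # Runs whether or not the workload raised, releasing cluster resources
    spark.stop()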
Complete Example Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Step 1: Initialize SparkSession
spark = (
    SparkSession.builder
    .appName("End-to-End DataFrame Workflow")
    .getOrCreate()
)
# Step 2: Create DataFrame from JSON file
df = spark.read.json("path/to/json/file.json")
# Step 3: Explore the DataFrame
df.printSchema()
df.show()
# Step 4: Data Manipulation
df_filtered = df.filter(col("age") > 21)
df_selected = df_filtered.select("name", "age", "city")
df_grouped = df_selected.groupBy("city").count()
df_filtered.show()
df_selected.show()
df_grouped.show()
# Step 5: Run SQL Queries
df.createOrReplaceTempView("people")
sql_result = spark.sql(
    "SELECT city, COUNT(*) AS count FROM people WHERE age > 21 GROUP BY city"
)
sql_result.show()
# Step 6: Stop the SparkSession
spark.stop()
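For reference, spark.read.json() expects line-delimited JSON (one object per line) by default, so a hypothetical input file matching the columns used above could look like:
{"name": "Alice", "age": 25, "city": "New York"}
{"name": "Bob", "age": 19, "city": "Chicago"}
{"name": "Carol", "age": 32, "city": "New York"}
With this input, the age > 21 filter keeps Alice and Carol, and both the grouped count and the SQL query report a count of 2 for New York.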