
PySpark Functions Summary

PySpark Core (RDD Functions)

RDD Creation and Loading


sc.textFile() - Load text file into RDD

sc.parallelize() - Create RDD from collection with partitions

sc.range() - Generate an RDD over a range of numbers
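A minimal sketch of the creation calls above, assuming a local Spark install and a hypothetical input file "data.txt":

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.textFile("data.txt")           # load a text file into an RDD of lines (hypothetical path)
nums = sc.parallelize(range(1, 11), 4)    # create an RDD from a Python collection with 4 partitions
ids = sc.range(0, 100)                    # generate an RDD over a range of numbers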

Transformations (Narrow)
map() - Apply function to each element

flatMap() - Apply function and flatten results

filter() - Filter elements based on condition

distinct() - Remove duplicates (note: triggers a shuffle under the hood)

union() - Combine RDDs

sortBy() - Sort the RDD by a supplied key function (shuffles data to produce a global order)

mapPartitions() - Apply function to each partition
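A small sketch of these transformations, assuming the SparkContext sc from the earlier example:

words = sc.parallelize(["a b", "b c", "a b"])

tokens = words.flatMap(lambda line: line.split())    # flatten each line into individual words
upper = tokens.map(lambda w: w.upper())              # apply a function to every element
not_c = upper.filter(lambda w: w != "C")             # keep only elements matching the condition
unique = not_c.distinct()                            # drop duplicates
combined = unique.union(upper)                       # combine two RDDs
ordered = combined.sortBy(lambda w: w)               # sort by a key function
per_part = ordered.mapPartitions(lambda it: [sum(1 for _ in it)])  # one count per partition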

Transformations (Wide)
groupByKey() - Group values by key (Note: does no aggregation itself; all values are shuffled, and aggregation must be done afterwards with map()/mapValues())

reduceByKey() - Reduce by key (Note: performs local aggregation within each partition before the final reduce, so there is much less shuffling and higher performance)
sortByKey() - Sort by key (Note: sorting in each partition is done separately, then results are merged)

join() - Inner join

rightOuterJoin() - Right outer join

leftOuterJoin() - Left outer join

cogroup() - Group multiple RDDs
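A sketch contrasting reduceByKey() and groupByKey() on the same pair RDD, assuming sc is available:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# reduceByKey aggregates inside each partition first, so far less data crosses the network
sums = pairs.reduceByKey(lambda x, y: x + y)       # [("a", 4), ("b", 2)]

# groupByKey ships every value across the network; aggregation happens in a separate mapValues step
sums_slow = pairs.groupByKey().mapValues(sum)      # same result, more shuffling

by_key = pairs.sortByKey()                         # sort the pairs by key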

Transformations on Paired RDDs


mapValues() - Apply function to values only

keys() - Extract keys

values() - Extract values
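For example, on a pair RDD (assuming sc as before):

prices = sc.parallelize([("apple", 2.0), ("pear", 3.0)])

with_tax = prices.mapValues(lambda p: p * 1.1)    # transform values only, keys untouched
prices.keys().collect()                           # ['apple', 'pear']
prices.values().collect()                         # [2.0, 3.0]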

Actions
collect() - Return RDD as list
take(n) - Take first n records

first() - Take only 1st element

count() - Count number of records in RDD

reduce() - Aggregate elements (only commutative and associative operations allowed)

saveAsTextFile() - Save RDD to text file

countByKey() - Count occurrences of each key, returns a dictionary (not an RDD)

collectAsMap() - Convert to Python dictionary
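A sketch of these actions; the commented results assume the sample data shown:

nums = sc.parallelize([1, 2, 3, 4, 5])

nums.collect()                      # [1, 2, 3, 4, 5]
nums.take(2)                        # [1, 2]
nums.first()                        # 1
nums.count()                        # 5
nums.reduce(lambda x, y: x + y)     # 15 - the operation must be commutative and associative
nums.saveAsTextFile("out_dir")      # hypothetical output directory; must not already exist

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
pairs.countByKey()                  # {'a': 2, 'b': 1} - returned as a dictionary
pairs.collectAsMap()                # {'a': 2, 'b': 3} - last value per key wins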

Spark Joins
join() - Inner join

rightOuterJoin() - Keeps every key present in the 2nd (right) RDD

leftOuterJoin() - Keeps every key present in the 1st (left) RDD
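A sketch showing which keys each join keeps, assuming sc is available (output order may vary):

left = sc.parallelize([("a", 1), ("b", 2)])
right = sc.parallelize([("a", "x"), ("c", "y")])

left.join(right).collect()            # [('a', (1, 'x'))]                      keys present in both
left.leftOuterJoin(right).collect()   # [('a', (1, 'x')), ('b', (2, None))]    every key from the left RDD
left.rightOuterJoin(right).collect()  # [('a', (1, 'x')), ('c', (None, 'y'))]  every key from the right RDD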

PySpark DataFrame

DataFrame Creation
From collections: createDataFrame()
From named tuples: collections.namedtuple (field names become column names)

From Spark RDD: toDF()
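A minimal sketch of these creation routes, assuming a SparkSession is available:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# From a collection of Rows (named tuples work the same way)
df1 = spark.createDataFrame([Row(name="Alice", age=30), Row(name="Bob", age=25)])

# From an RDD of tuples via toDF(), supplying column names
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
df2 = rdd.toDF(["name", "age"])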

DataFrame Methods
toDF() - Convert an RDD to a DataFrame

createDataFrame() - Create DataFrame from RDD

show() - Display DataFrame

select() - Select specific columns

filter() - Filter rows based on condition

where() - Alternative to filter

groupBy() - Group by column, then apply aggregation (sum, max, count)

agg() - Aggregation function

limit(n) - First n records

distinct() - Distinct rows (use after select() to get unique values of a column), returns a DataFrame

orderBy() - Sort by column (ascending=False for descending)

printSchema() - Display DataFrame schema
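A sketch chaining several of these methods on the df2 DataFrame from the creation example:

from pyspark.sql import functions as F

df2.select("name", "age").show()                               # pick columns and display
df2.filter(df2.age > 26).show()                                # same effect as where()
df2.groupBy("name").agg(F.max("age").alias("max_age")).show()  # group, then aggregate
df2.orderBy("age", ascending=False).limit(1).show()            # top record by age
df2.select("name").distinct().show()                           # unique values of one column
df2.printSchema()                                              # display the schema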

DataFrame Metadata
df.columns - Column names
df.dtypes - Data types of each column

df.schema - Structure with column and type
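For example, on the df2 DataFrame from above:

df2.columns    # ['name', 'age']
df2.dtypes     # [('name', 'string'), ('age', 'bigint')]
df2.schema     # StructType with one StructField per column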

PySpark SQL

SQL Context and Temporary Views


createOrReplaceTempView() - Create temporary view (table)

spark.sql() - Execute SQL queries
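A minimal sketch, registering the df2 DataFrame from above as a view:

df2.createOrReplaceTempView("people")                        # temporary "table" scoped to this session
spark.sql("SELECT name FROM people WHERE age > 26").show()   # query the view with SQL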

Date and Time Functions


current_date() - Get current date

df.select(current_date()).show(n) - Display the current date for a limited number of rows

date_format() - Format date to other format

to_date() - Convert string to date

date_add() - Add specific days to date column

date_sub() - Subtract specific days from column

months_between() - Like datediff(), but returns the difference in months as a float

year() , month() , weekofyear() - Extract date parts from a given date column; next_day() - date of the next given weekday

current_timestamp() - Current timestamp

hour() , minute() , second() - Extract time parts, analogous to year() and month()

to_timestamp() - Convert string to timestamp
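A sketch exercising several of the date and time functions above, assuming the SparkSession spark from earlier:

from pyspark.sql import functions as F

dates = spark.createDataFrame([("2024-01-15",)], ["d"])

dates.select(
    F.current_date().alias("today"),
    F.to_date("d").alias("as_date"),                           # string -> date
    F.date_format(F.to_date("d"), "dd/MM/yyyy").alias("fmt"),  # date -> formatted string
    F.date_add(F.to_date("d"), 7).alias("plus_7_days"),
    F.months_between(F.current_date(), F.to_date("d")).alias("months_diff"),
    F.year(F.to_date("d")).alias("yr"),
    F.to_timestamp(F.lit("2024-01-15 10:30:00")).alias("ts"),  # string -> timestamp
).show()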

Window Functions
Applied over a window defined with partitionBy(), an ordering, and an optional frame (e.g. row_number(), rank(), lag(), lead()); see the sketch below
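A minimal sketch of a window function, assuming the SparkSession spark from earlier; row_number() is computed separately within each partition defined by partitionBy(), with orderBy() fixing the ordering:

from pyspark.sql import Window, functions as F

sales = spark.createDataFrame(
    [("east", "a", 100), ("east", "b", 200), ("west", "c", 150)],
    ["region", "item", "amount"],
)

w = Window.partitionBy("region").orderBy(F.desc("amount"))    # one window per region
sales.withColumn("rank", F.row_number().over(w)).show()       # rank items inside each region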

Storage Levels and Persistence

Persistence Methods
cache() - Cache RDD (doesn't take any parameter)

persist() - Without arguments, same as persist(pyspark.StorageLevel.MEMORY_ONLY)
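A sketch of both calls, assuming sc and spark from earlier; a storage level can only be assigned once, so unpersist() before changing it:

from pyspark import StorageLevel

rdd = sc.parallelize(range(1_000_000))
rdd.cache()                                  # shorthand for the default MEMORY_ONLY level
rdd.count()                                  # the first action materialises the cache

df = spark.range(1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)     # pick an explicit storage level
df.count()
df.unpersist()                               # release the cached blocks when done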

Storage Levels
1. MEMORY_ONLY - Only in RAM

2. MEMORY_AND_DISK - If RAM full, use DISK

3. MEMORY_ONLY_SER - RDD stored as serialized Java objects, in RAM only

4. MEMORY_AND_DISK_SER - Serialized Java objects in RAM; spill to DISK if RAM is full

5. DISK_ONLY - Only DISK


6. MEMORY_ONLY_2 - Same as MEMORY_ONLY, but replicates each partition on 2 cluster nodes
7. MEMORY_AND_DISK_2 - Same as MEMORY_AND_DISK, with each partition replicated on 2 nodes

Key Concepts

Partitioning
Data distribution across partitions (P1, P2, P3)

Narrow vs Wide Transformations

Shuffling operations and performance impact
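A small sketch for inspecting how data is distributed across partitions, assuming sc from earlier:

rdd = sc.parallelize(range(12), 3)    # explicitly request 3 partitions (P1, P2, P3)
rdd.getNumPartitions()                # 3
rdd.glom().collect()                  # list of lists: the elements held in each partition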

Optimization Techniques
RDD persistence and caching

Avoiding wide transformations when possible

Using appropriate storage levels

Minimizing shuffling to reduce network congestion and its performance impact
