PySpark Functions Summary
PySpark Core (RDD Functions)
RDD Creation and Loading
sc.textFile() - Load text file into RDD
sc.parallelize() - Create RDD from collection with partitions
sc.range() - Create an RDD from a range of numbers
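A minimal sketch of the creation calls above, assuming an existing SparkContext named sc (the file path is illustrative):

    lines = sc.textFile("data.txt")          # RDD of lines from a text file
    nums = sc.parallelize([1, 2, 3, 4], 2)   # RDD from a Python list, split into 2 partitions
    rng = sc.range(0, 10)                    # RDD of the numbers 0..9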
Transformations (Narrow)
map() - Apply function to each element
flatMap() - Apply function and flatten results
filter() - Filter elements based on condition
distinct() - Remove duplicates
union() - Combine RDDs
sortBy() - Sort elements by a given key function
mapPartitions() - Apply function to each partition
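A short sketch of the narrow transformations above on a toy RDD (data is illustrative):

    rdd = sc.parallelize([1, 2, 2, 3, 4], 2)
    rdd.map(lambda x: x * 10)                     # multiply every element by 10
    rdd.flatMap(lambda x: [x, x + 100])           # each element expands to two, then flattened
    rdd.filter(lambda x: x % 2 == 0)              # keep only even numbers
    rdd.distinct()                                # drop the duplicate 2
    rdd.union(sc.parallelize([5, 6]))             # append another RDD
    rdd.sortBy(lambda x: x, ascending=False)      # sort by a key function
    rdd.mapPartitions(lambda it: [sum(it)])       # one partial sum per partition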
Transformations (Wide)
groupByKey() - Group values by key (Note: no aggregation is done; apply map()/mapValues() afterwards to aggregate)
reduceByKey() - Reduce by key (Note: performs local aggregation within each partition before the final reduce, so far less shuffling and better performance than groupByKey())
sortByKey() - Sort by key (Note: sorting in each partition is done separately, then results are merged)
join() - Inner join
rightOuterJoin() - Right outer join
leftOuterJoin() - Left outer join
cogroup() - Group values from multiple RDDs by key
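A sketch contrasting the wide transformations above on a small paired RDD (collect() order may vary):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    other = sc.parallelize([("a", "x"), ("c", "y")])
    pairs.groupByKey().mapValues(sum).collect()       # [('a', 4), ('b', 2)] -- all values are shuffled
    pairs.reduceByKey(lambda x, y: x + y).collect()   # same result, but combines locally first
    pairs.sortByKey().collect()                       # sorted by key
    pairs.join(other).collect()                       # inner join: [('a', (1, 'x')), ('a', (3, 'x'))]
    pairs.cogroup(other)                              # per key: (values from pairs, values from other)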
Transformations on Paired RDDs
mapValues() - Apply function to values only
keys() - Extract keys
values() - Extract values
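On the same kind of paired RDD, the key/value helpers:

    pairs = sc.parallelize([("a", 1), ("b", 2)])
    pairs.mapValues(lambda v: v * 10).collect()   # [('a', 10), ('b', 20)] -- keys untouched
    pairs.keys().collect()                        # ['a', 'b']
    pairs.values().collect()                      # [1, 2]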
Actions
collect() - Return RDD as list
take(n) - Take first n records
first() - Take only 1st element
count() - Count number of records in RDD
reduce() - Aggregate elements (only commutative and associative operations allowed)
saveAsTextFile() - Save RDD to text file
countByKey() - Count occurrences of each key, returns a dictionary (defaultdict)
collectAsMap() - Convert to Python dictionary
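A sketch of the actions above; the commented results assume the toy data shown (the output path is illustrative):

    nums = sc.parallelize([1, 2, 3, 4])
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    nums.collect()                    # [1, 2, 3, 4]
    nums.take(2)                      # [1, 2]
    nums.first()                      # 1
    nums.count()                      # 4
    nums.reduce(lambda x, y: x + y)   # 10 -- operation must be commutative and associative
    nums.saveAsTextFile("out_dir")    # writes one part file per partition; directory must not exist
    pairs.countByKey()                # defaultdict: {'a': 2, 'b': 1}
    pairs.collectAsMap()              # plain dict; one value kept per duplicate key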
Spark Joins
join() - Inner join
rightOuterJoin() - Keep every key of the 2nd (right) RDD; unmatched keys from the 1st get None
leftOuterJoin() - Keep every key of the 1st (left) RDD; unmatched keys from the 2nd get None
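A sketch showing which keys survive each join (None fills in missing matches):

    left = sc.parallelize([("a", 1), ("b", 2)])
    right = sc.parallelize([("a", 10), ("c", 30)])
    left.join(right).collect()             # [('a', (1, 10))]        -- keys present in both
    left.leftOuterJoin(right).collect()    # adds ('b', (2, None))   -- all keys of the 1st RDD
    left.rightOuterJoin(right).collect()   # adds ('c', (None, 30))  -- all keys of the 2nd RDD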
PySpark DataFrame
DataFrame Creation
From collections: createDataFrame()
From named tuples: collections.namedtuple or typing.NamedTuple
From Spark RDD: toDF()
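A sketch of the three creation routes above, assuming an active SparkSession named spark (data and column names are illustrative):

    from collections import namedtuple
    Person = namedtuple("Person", ["name", "age"])

    df1 = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])   # from a collection
    df2 = spark.createDataFrame([Person("Alice", 30), Person("Bob", 25)])        # from named tuples
    df3 = sc.parallelize([("Alice", 30), ("Bob", 25)]).toDF(["name", "age"])     # from an RDD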
DataFrame Methods
toDF() - Convert an RDD to a DataFrame
createDataFrame() - Create DataFrame from RDD
show() - Display DataFrame
select() - Select specific columns
filter() - Filter rows based on condition
where() - Alternative to filter
groupBy() - Group by column, then apply aggregation (sum, max, count)
agg() - Aggregation function
limit(n) - First n records
distinct() - Remove duplicate rows, returns a DataFrame (select a column first for its unique values)
orderBy() - Sort by column (ascending=False for descending)
printSchema() - Display DataFrame schema
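A sketch chaining the methods above (the df, its columns, and the data are illustrative):

    df = spark.createDataFrame(
        [("Alice", "HR", 50), ("Bob", "IT", 60), ("Eve", "IT", 70)],
        ["name", "dept", "salary"])
    df.show()
    df.select("name", "salary").show()
    df.filter(df.salary > 55).show()                  # where() is an alias for filter()
    df.groupBy("dept").agg({"salary": "max"}).show()  # one aggregate per group
    df.orderBy("salary", ascending=False).limit(2).show()
    df.select("dept").distinct().show()
    df.printSchema()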
DataFrame Metadata
df.columns - Column names
df.dtypes - Data types of each column
df.schema - Structure with column and type
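The metadata attributes on the same illustrative df:

    df.columns   # ['name', 'dept', 'salary']
    df.dtypes    # [('name', 'string'), ('dept', 'string'), ('salary', 'bigint')]
    df.schema    # StructType with one StructField per column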
PySpark SQL
SQL Context and Temporary Views
createOrReplaceTempView() - Create temporary view (table)
spark.sql() - Execute SQL queries
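A sketch of registering the illustrative df as a view and querying it with SQL:

    df.createOrReplaceTempView("employees")
    spark.sql("SELECT dept, MAX(salary) AS max_salary FROM employees GROUP BY dept").show()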
Date and Time Functions
current_date() - Get current date
df.select(current_date()).show(n) - Display the result, limited to n rows
date_format() - Format date to other format
to_date() - Convert string to date
date_add() - Add specific days to date column
date_sub() - Subtract specific days from column
months_between() - Similar to datediff() but returns the difference in months as a float
year(), month(), weekofyear() - Extract date parts from a date column; next_day() - First date after the given date that falls on the specified weekday
current_timestamp() - Current timestamp
hour(), minute(), second() - Extract time parts from a timestamp column (analogous to year(), month())
to_timestamp() - Convert string to timestamp
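A sketch of the date and time helpers above (the column name d and the sample date are illustrative):

    from pyspark.sql import functions as F

    dates = spark.createDataFrame([("2024-01-15",)], ["d"]).withColumn("d", F.to_date("d"))
    dates.select(
        F.current_date().alias("today"),
        F.date_format("d", "dd/MM/yyyy").alias("formatted"),
        F.date_add("d", 7).alias("plus_week"),
        F.date_sub("d", 7).alias("minus_week"),
        F.months_between(F.current_date(), "d").alias("months_diff"),   # returns a float
        F.year("d"), F.month("d"), F.weekofyear("d"),
        F.next_day("d", "Mon").alias("next_monday"),
        F.current_timestamp().alias("now"),
        F.hour(F.current_timestamp()).alias("hr"),
        F.to_timestamp(F.lit("2024-01-15 10:30:00")).alias("ts"),
    ).show(truncate=False)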
Window Functions
Operate over a window defined by a partition, an ordering, and an optional frame specification (see the sketch below)
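A sketch of a window function over a partition and ordering spec, reusing the illustrative df from the DataFrame section:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("dept").orderBy(F.desc("salary"))
    df.withColumn("rank_in_dept", F.row_number().over(w)).show()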
Storage Levels and Persistence
Persistence Methods
cache() - Cache RDD (doesn't take any parameter)
persist() - Persist with a chosen storage level; with no argument it is the same as persist(pyspark.StorageLevel.MEMORY_ONLY)
Storage Levels
1. MEMORY_ONLY - Only in RAM
2. MEMORY_AND_DISK - If RAM full, use DISK
3. MEMORY_ONLY_SER - RDD stored as serialized Java objects, in RAM only
4. MEMORY_AND_DISK_SER - Serialized Java objects in RAM; if RAM is full, spill to DISK
5. DISK_ONLY - Only on DISK
6. MEMORY_ONLY_2 - Same as MEMORY_ONLY but each partition is replicated on 2 cluster nodes
7. MEMORY_AND_DISK_2 - Same as MEMORY_AND_DISK, with the same 2-node replication
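A sketch of caching and persisting an RDD with an explicit storage level (the path is illustrative):

    from pyspark import StorageLevel

    rdd = sc.textFile("data.txt")
    rdd.cache()                                 # shorthand for persist(StorageLevel.MEMORY_ONLY)
    rdd.count()                                 # first action materialises the cache
    rdd.unpersist()                             # release the storage before changing the level
    rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk when RAM is full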
Key Concepts
Partitioning
Data distribution across partitions (P1, P2, P3)
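A small sketch for inspecting how data is spread across partitions:

    rdd = sc.parallelize(range(6), 3)
    rdd.getNumPartitions()   # 3
    rdd.glom().collect()     # [[0, 1], [2, 3], [4, 5]] -- contents of each partition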
Narrow vs Wide Transformations
Shuffling operations and performance impact
Optimization Techniques
RDD persistence and caching
Avoiding wide transformations when possible
Using appropriate storage levels
Network congestion and performance issues with shuffling