Introduction to PySpark RDD
Upendra Devisetty
Science Analyst, CyVerse
What is RDD?
RDD = Resilient Distributed Datasets
Decomposing RDDs
Resilient Distributed Datasets
Resilient: Ability to withstand failures
Distributed: Spanning across multiple machines
Datasets: Collection of partitioned data, e.g., arrays, tables, tuples, etc.
Creating RDDs. How to do it?
Parallelizing an existing collection of objects
External datasets:
Files in HDFS
Objects in Amazon S3 bucket
Lines in a text file
From existing RDDs
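A minimal sketch of the last option, assuming the map() transformation introduced later in this course: applying a transformation to an existing RDD produces a new RDD.
numRDD = sc.parallelize([1, 2, 3, 4])
# doubledRDD is a new RDD derived from the existing numRDD
doubledRDD = numRDD.map(lambda x: x * 2)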
Parallelized collection (parallelizing)
parallelize() for creating RDDs from Python lists
numRDD = sc.parallelize([1,2,3,4])
helloRDD = sc.parallelize("Hello world")
type(helloRDD)
<class 'pyspark.rdd.PipelinedRDD'>
From external datasets
textFile() for creating RDDs from external datasets
fileRDD = sc.textFile("README.md")
type(fileRDD)
<class 'pyspark.rdd.PipelinedRDD'>
Understanding Partitioning in PySpark
A partition is a logical division of a large distributed data set
parallelize() method
numRDD = sc.parallelize(range(10), minPartitions = 6)
textFile() method
fileRDD = sc.textFile("README.md", minPartitions = 6)
The number of partitions in an RDD can be found using the getNumPartitions() method
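For example, checking the fileRDD created above (the actual count can exceed the requested value, since minPartitions is only a lower bound):
# Returns the number of partitions, at least the requested minimum of 6
fileRDD.getNumPartitions()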
Let's practice
RDD operations in PySpark
Overview of PySpark operations
Transformations create new RDDs
Actions perform computation on the RDDs
RDD Transformations
Transformations follow lazy evaluation: Spark records the transformations but delays computation until an action is called (see the sketch after the list below)
Basic RDD Transformations
map(), filter(), flatMap(), and union()
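A minimal sketch of lazy evaluation, using a hypothetical numbers RDD: the map() and filter() calls only record the lineage, and no computation runs until the collect() action is called.
numbers = sc.parallelize([1, 2, 3, 4, 5])
# No computation happens yet; Spark only records the transformations
squares_over_five = numbers.map(lambda x: x * x).filter(lambda x: x > 5)
# The action below triggers the actual computation
squares_over_five.collect()
[9, 16, 25]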
map() Transformation
map() transformation applies a function to all elements in the RDD
RDD = sc.parallelize([1,2,3,4])
RDD_map = RDD.map(lambda x: x * x)
filter() Transformation
filter() transformation returns a new RDD with only the elements that pass the condition
RDD = sc.parallelize([1,2,3,4])
RDD_filter = RDD.filter(lambda x: x > 2)
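Collecting RDD_filter (collect() is an action, covered later in this chapter) would return only the elements greater than 2:
RDD_filter.collect()
[3, 4]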
flatMap() Transformation
flatMap() transformation returns multiple values for each element in the original RDD
RDD = sc.parallelize(["hello world", "how are you"])
RDD_flatmap = RDD.flatMap(lambda x: x.split(" "))
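Collecting RDD_flatmap would show that each input string produces multiple elements, one per word:
RDD_flatmap.collect()
['hello', 'world', 'how', 'are', 'you']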
union() Transformation
union() transformation combines the elements of two RDDs into a single RDD
inputRDD = sc.textFile("logs.txt")
errorRDD = inputRDD.filter(lambda x: "error" in x.split())
warningsRDD = inputRDD.filter(lambda x: "warnings" in x.split())
combinedRDD = errorRDD.union(warningsRDD)
RDD Actions
Actions are operations that return a value after running a computation on the RDD
Basic RDD Actions
collect()
take(N)
first()
count()
collect() and take() Actions
collect() returns all the elements of the dataset as an array
take(N) returns an array with the first N elements of the dataset
RDD_map.collect()
[1, 4, 9, 16]
RDD_map.take(2)
[1, 4]
first() and count() Actions
first() returns the first element of the RDD
RDD_map.first()
1
count() returns the number of elements in the RDD
RDD_flatmap.count()
5
Let's practice RDD operations
Working with Pair RDDs in PySpark
Introduction to pair RDDs in PySpark
Real-life datasets are usually key/value pairs
Each row is a key and maps to one or more values
Pair RDD is a special data structure for working with this kind of dataset
Pair RDD: Key is the identifier and value is the data
Creating pair RDDs
Two common ways to create pair RDDs
From a list of key-value tuples
From a regular RDD
Get the data into key/value form for a pair RDD
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]
pairRDD_tuple = sc.parallelize(my_tuple)
my_list = ['Sam 23', 'Mary 34', 'Peter 25']
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
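Collecting pairRDD_RDD would show the resulting key-value pairs; note that the values are strings here, since split() returns strings:
pairRDD_RDD.collect()
[('Sam', '23'), ('Mary', '34'), ('Peter', '25')]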
Transformations on pair RDDs
All regular transformations work on pair RDDs
Have to pass functions that operate on key-value pairs rather than on individual elements
Examples of paired RDD Transformations
reduceByKey(func): Combine values with the same key
groupByKey(): Group values with the same key
sortByKey(): Return an RDD sorted by the key
join(): Join two pair RDDs based on their key
reduceByKey() transformation
reduceByKey() transformation combines values with the same key
It runs parallel operations for each key in the dataset
It is a transformation, not an action
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34),
("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y)
pairRDD_reducebykey.collect()
[('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)]
sortByKey() transformation
sortByKey() operation orders pair RDD by key
It returns an RDD sorted by key in ascending or descending order
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
[(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]
groupByKey() transformation
groupByKey() groups all the values with the same key in the pair RDD
airports = [("US", "JFK"),("UK", "LHR"),("FR", "CDG"),("US", "SFO")]
regularRDD = sc.parallelize(airports)
pairRDD_group = regularRDD.groupByKey().collect()
for cont, air in pairRDD_group:
    print(cont, list(air))
FR ['CDG']
US ['JFK', 'SFO']
UK ['LHR']
join() transformation
join() transformation joins the two pair RDDs based on their key
RDD1 = sc.parallelize([("Messi", 34),("Ronaldo", 32),("Neymar", 24)])
RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])
RDD1.join(RDD2).collect()
[('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]
Let's practice
More actions
reduce() action
reduce(func) action is used for aggregating the elements of a regular RDD
The function should be commutative (changing the order of the operands does not change the result) and associative (the grouping of the operands does not change the result); see the sketch after the example below
An example of reduce() action in PySpark
x = [1,3,4,6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)
14
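A minimal sketch of why this requirement matters, using a hypothetical two-partition RDD: a non-associative function such as subtraction can produce different results depending on how Spark combines the per-partition partial results.
# Subtraction is neither commutative nor associative, so the result of reduce()
# may depend on the partitioning; avoid such functions with reduce()
RDD = sc.parallelize([1, 3, 4, 6], 2)
RDD.reduce(lambda x, y: x - y)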
saveAsTextFile() action
saveAsTextFile() action saves an RDD into a text file inside a directory, with each partition as a separate file
RDD.saveAsTextFile("tempFile")
The coalesce() method can be used to save an RDD as a single text file
RDD.coalesce(1).saveAsTextFile("tempFile")
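The saved directory can later be loaded back into an RDD with textFile(); a minimal sketch, assuming the tempFile directory written above:
# textFile() accepts the output directory and reads its part files
reloadedRDD = sc.textFile("tempFile")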
Action Operations on pair RDDs
RDD actions available for PySpark pair RDDs
Pair RDD actions leverage the key-value data
A few examples of pair RDD actions include
countByKey()
collectAsMap()
countByKey() action
countByKey() is only available for RDDs of type (K, V)
countByKey() action counts the number of elements for each key
Example of countByKey() on a simple list
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
for kee, val in rdd.countByKey().items():
    print(kee, val)
a 2
b 1
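One caution worth noting: countByKey() brings all of the counts back to the driver as a dictionary-like object, so it should only be used when the number of distinct keys is small enough to fit in driver memory. For the rdd defined above:
# The counts live on the driver, not in a distributed RDD
dict(rdd.countByKey())
{'a': 2, 'b': 1}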
collectAsMap() action
collectAsMap() returns the key-value pairs in the RDD as a dictionary
Example of collectAsMap() on a simple list of tuples
sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
{1: 2, 3: 4}
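A related caution, shown as a minimal sketch: if a key appears more than once in the RDD, collectAsMap() keeps only one value for that key, so duplicate keys are silently dropped.
# The duplicate key 1 ends up with a single value in the resulting dictionary
sc.parallelize([(1, 2), (1, 3), (4, 5)]).collectAsMap()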
Let's practice