Python For Data Science
PySpark RDD Cheat Sheet
Learn PySpark RDD online at www.DataCamp.com

> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.
> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
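As a quick orientation, the snippet below strings the pieces above together in a standalone script: build a configuration, start a context, run one trivial job, and stop the context. It is a minimal sketch meant to run on its own; the app name and master URL are arbitrary placeholders.

>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster("local[2]").setAppName("sketch")
>>> sc = SparkContext(conf = conf)
>>> sc.parallelize(range(10)).sum() #Run a trivial job to confirm the context works
45
>>> sc.stop() #Release the context when the script is done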
> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])

External Data

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
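Text files load as RDDs of lines, so the transformations shown in the following sections apply to them directly. A common first job is a word count; the sketch below is illustrative only, the input path is a hypothetical placeholder, and operator.add simply stands in for lambda x,y: x+y.

>>> from operator import add
>>> lines = sc.textFile("/my/directory/sample.txt") #Hypothetical input file
>>> counts = (lines.flatMap(lambda line: line.split()) #Split each line into words
...                .map(lambda word: (word, 1)) #Pair each word with a count of 1
...                .reduceByKey(add)) #Sum the counts per word
>>> counts.take(3)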
> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True
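Note that countByKey(), countByValue() and collectAsMap() are actions that return plain Python mappings on the driver rather than RDDs, so their results can be used directly; a small illustrative check:

>>> dict(rdd.countByKey()) #Convert the defaultdict to a regular dict
{'a': 2, 'b': 1}
>>> rdd.collectAsMap()['a'] #Look up a key; duplicate keys keep only one value
2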
Summary

>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
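The stats() call returns a single StatCounter object, so the summary values above can also be read from one pass over the data; a brief sketch:

>>> st = rdd3.stats() #One pass over the data
>>> st.mean() #Same value as rdd3.mean()
49.5
>>> st.stdev() #Same value as rdd3.stdev()
28.866070047722118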
> Applying Functions

>>> rdd.map(lambda x: x+(x[1],x[0])).collect() #Apply a function to each RDD element
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) #Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect() #Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
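The difference between map() and flatMap() is easy to see on the same input: map() keeps one output element per input element, while flatMap() unrolls each returned sequence. A small sketch, assuming the rdd defined under Loading Data:

>>> rdd.map(lambda x: list(x)).collect() #One list per element
[['a', 7], ['a', 2], ['b', 2]]
>>> rdd.flatMap(lambda x: list(x)).collect() #The lists are flattened into one RDD
['a', 7, 'a', 2, 'b', 2]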
> Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
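These selections compose with the transformations above, so a result can be narrowed before it is pulled to the driver. A small sketch using rdd3 from Loading Data:

>>> rdd3.filter(lambda x: x % 10 == 0).take(3) #Only the first few matching elements are returned
[0, 10, 20]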
> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2) #Return RDD of grouped values
         .mapValues(list)
         .collect()
>>> rdd.groupByKey() #Group rdd by key
        .mapValues(list)
        .collect()
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
>>> rdd3.aggregate((0,0),seqOp,combOp) #Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect() #Aggregate values of each RDD key
[('a',(9,2)),('b',(2,1))]
>>> from operator import add
>>> rdd3.fold(0,add) #Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0,add).collect() #Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect() #Create tuples of RDD elements by applying a function
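The (0,0) zero value in aggregate() acts as a (sum, count) accumulator: seqOp folds each element into a partition's running pair and combOp merges the per-partition pairs, which is why the result (4950,100) is the sum and count of rdd3. That pair gives the mean in one pass; a small sketch:

>>> total, n = rdd3.aggregate((0,0), seqOp, combOp)
>>> total / float(n) #Same value as rdd3.mean()
49.5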
> Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect() #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
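cartesian() pairs every element of rdd with every element of rdd2, so the result grows multiplicatively; with the sample RDDs defined under Loading Data it yields 3 x 3 = 9 pairs. A quick check:

>>> rdd.cartesian(rdd2).count() #3 elements x 3 elements
9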
> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
[('a',2),('b',1),('d',1)]
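Both sorts also accept an ascending flag for descending order; a brief sketch:

>>> rdd2.sortByKey(ascending=False).collect() #Sort (key,value) RDD by key, descending
[('d',1),('b',1),('a',2)]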
> Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
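Both calls return a new RDD rather than changing rdd in place, which can be confirmed with getNumPartitions(); a small sketch:

>>> rdd.repartition(4).getNumPartitions()
4
>>> rdd.coalesce(1).getNumPartitions()
1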
> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')
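saveAsTextFile() writes one line per element (its string form) into a directory of part files, so a round trip comes back as strings rather than tuples; a minimal sketch:

>>> lines = sc.textFile("rdd.txt") #Read the saved directory back in
>>> lines.count() #Same number of elements, now as strings
3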
> Stopping SparkContext

>>> sc.stop()
> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
Learn Data Skills Online at www.DataCamp.com