Python String Manipulation
• Concatenate: "str1" + "str2"
• Change letter case: .lower(), .upper()
• Get length: len("str1")
• Check substring in string: "str" in "new_str"
• Check string starts with: "ori_str".startswith("str")
• Split string to list: "str".split("<delimiter>")
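
A minimal sketch combining the operations above; the sample string is made up:

s = "Today is Monday"
parts = s.split(" ")                  # split to list: ['Today', 'is', 'Monday']
greeting = parts[0] + " " + parts[2]  # concatenate: 'Today Monday'
print(s.lower(), s.upper())           # change letter case
print(len(s))                         # get length: 15
print("Monday" in s)                  # substring check: True
print(s.startswith("Today"))          # starts-with check: True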
Spark Shell
• Launch terminal
• Type "pyspark"
• Rename terminal
• Open NEW terminal
• Type "sc" to test the SparkContext

Spark Application
• Run: export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/python
• *Run: spark-submit <filename.py>
* Make sure you are in the same directory as the file; if not, "cd" to that directory first.

Spark Application - Local
• local[*] run with the maximum number of threads (all cores)
• local[n] run with n threads
• local run with a single thread
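
A minimal sketch of how the master setting is passed when building a context yourself; the application name is a placeholder:

from pyspark import SparkConf, SparkContext

# "local[*]" = all cores, "local[2]" = 2 threads, "local" = a single thread
conf = SparkConf().setMaster("local[*]").setAppName("DemoApp")
sc = SparkContext(conf=conf)
print(sc.defaultParallelism)   # default number of partitions for this master setting
sc.stop()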
RDD Initialization
• Memory-based:
  myData = ["Today", "is", "Monday"]
  myRDD = sc.parallelize(myData)
• File-based:
  *myData = sc.textFile("filename.txt")
  *myData = sc.textFile("path/*.txt")
  *myData = sc.textFile("filename1.txt, filename2.txt")
  **myData = sc.wholeTextFiles("dir")
* each line in each file becomes a separate record in the RDD
** returns (filename, content) pairs

RDD Transformation
• RDD.map(lambda x: func(x))
• RDD.filter(lambda x: func(x))
• RDD.flatMap(lambda x: func(x))
• RDD.distinct()
• RDD.sortBy(lambda x: x[index], True)
• RDD1.zip(RDD2) merge as columns
• RDD1.union(RDD2) merge as rows
• RDD1.subtract(RDD2) set difference (elements in RDD1 not in RDD2)
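
A minimal sketch of a few of these transformations, assuming sc is an existing SparkContext (e.g. from the pyspark shell); the data values are made up:

words = sc.parallelize(["today", "is", "monday", "today"])
nums = sc.parallelize([3, 1, 2, 3])

upper = words.map(lambda x: x.upper())             # TODAY, IS, MONDAY, TODAY
long_words = words.filter(lambda x: len(x) > 2)    # today, monday, today
unique = words.distinct()                          # today, is, monday
columns = words.zip(nums)                          # ('today', 3), ('is', 1), ...
rows = words.union(words)                          # 8 elements
leftover = words.subtract(sc.parallelize(["is"]))  # removes every 'is'
print(columns.collect(), leftover.collect())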
PairRDD Transformation
• RDD.countByKey() count each key
• RDD.groupByKey() group each key
• RDD.sortByKey() sort asc/desc order
• RDD.join() pairs with matching keys from 2 RDDs
• leftOuterJoin, rightOuterJoin
• RDD.keyBy() set key by index
• RDD.reduceByKey()
• RDD.mapValues()
• RDD.flatMapValues()
• RDD.keys()
• RDD.values()
• RDD.lookup(key) values for a key
• RDD.mean()
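
A minimal sketch of the key/value operations above; sc is assumed to exist and the data is made up:

sales = sc.parallelize([("mon", 2), ("tue", 5), ("mon", 3)])
labels = sc.parallelize([("mon", "Monday"), ("tue", "Tuesday")])

totals = sales.reduceByKey(lambda a, b: a + b)    # ('mon', 5), ('tue', 5)
print(sales.countByKey())                         # {'mon': 2, 'tue': 1}
print(sales.sortByKey().collect())                # sorted by key, ascending
print(sales.keys().collect(), sales.values().collect())
print(sales.lookup("mon"))                        # [2, 3]
print(sales.join(labels).collect())               # ('mon', (2, 'Monday')), ...
print(sales.mapValues(lambda v: v * 2).collect())
print(sales.values().mean())                      # 3.33...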
RDD Action
• RDD.count() return number of elements
• RDD.collect() return array of all elements
• RDD.take(n) return array of first n elements
• RDD.saveAsTextFile(dir) save to text file(s)
• RDD.toDebugString() return lineage of RDD
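
A minimal sketch of these actions, again assuming an existing sc; the output directory is a placeholder and must not already exist:

data = sc.parallelize(range(10))

print(data.count())                               # 10
print(data.take(3))                               # [0, 1, 2]
print(data.collect())                             # [0, 1, ..., 9]
print(data.map(lambda x: x * x).toDebugString())  # lineage of the mapped RDD
data.saveAsTextFile("output_dir")                 # one part-* file per partition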
RDD Processing Example
myData = sc.textFile("filename.txt", <num_of_partition>)
myDataUpper = myData.map(lambda x: x.upper())
myDataFind = myDataUpper.filter(lambda x: "TODAY" in x)
myDataUpper.collect()
print(myDataFind.collect())

RDD Persistence
myData1 = sc.textFile("data.txt").map(lambda x: x.upper())
myData1.persist()
*myData2 = myData1.filter(lambda x: "TODAY" in x)
myData2.collect()
* If re-run, this takes less time because myData1 is kept in memory
RDD to DataFrame
• df = sqlContext.createDataFrame(RDD)
• df = sqlContext.read.json("filename.json")

DataFrame to RDD
• df_RDD = df.rdd
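
A minimal round-trip sketch, assuming an existing sc; the column names and values are made up:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
rows = sc.parallelize([Row(name="Ann", age=21), Row(name="Bob", age=30)])
df = sqlContext.createDataFrame(rows)          # RDD -> DataFrame
df.printSchema()

df_RDD = df.rdd                                # DataFrame -> RDD of Row objects
print(df_RDD.map(lambda r: r.name).collect())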
DataFrame Operations
• printSchema()
• toDF()
• groupBy()
• show()
• filter()
• join()
• columns
• describe()
• orderBy()
• cache()
• distinct()
• where()
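
A minimal sketch chaining several of these operations; it reuses the sqlContext from the sketch above, and the column names/values are made up:

df = sqlContext.createDataFrame(
    [("Ann", 21), ("Bob", 30), ("Ann", 21)], ["name", "age"])

df.printSchema()                      # column names and types
print(df.columns)                     # ['name', 'age']
df.describe().show()                  # summary statistics
df.filter(df.age > 25).show()         # where() works the same way
df.groupBy("name").count().show()     # number of rows per name
df.distinct().orderBy("age").show()   # drop duplicate rows, then sort by age
df.cache()                            # keep this DataFrame in memory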
Using SQL Queries on DataFrame
• df.registerTempTable("df_sql")
• sqlContext.sql("select name from df_sql").show(5)
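
A minimal sketch using the df and sqlContext from the sketches above; the temp table name follows the entry above and the second query is only an illustration:

df.registerTempTable("df_sql")
sqlContext.sql("select name from df_sql").show(5)
sqlContext.sql("select name, count(*) as n from df_sql group by name").show()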
IP Address: Port Numbers
• Hue: http://localhost:8889
• Jupyter: http://localhost:8890
• Cloudera Manager: http://localhost:7180
• Spark Application Web UI: http://localhost:18080
• PC backend IP: 10.0.2.15 (check via ifconfig)

HDFS Command
• hdfs dfs -ls list files in Hadoop
• hdfs dfs -get <filename> transfer from Hadoop to local
• hdfs dfs -put <filename> transfer from local to Hadoop

Troubleshoot Command
> su hdfs
> hdfs dfsadmin -safemode leave
> su root
Example

from textblob import TextBlob
from pyspark import SparkConf, SparkContext

def checkPolarity(line):
    if line > 0: return "positive"
    elif line < 0: return "negative"
    else: return "neutral"

def run_sc(sc, filename):
    mydata = sc.textFile(filename)

    # transform data
    mydata_clean = mydata \
        .map(lambda x: x.lower())\
        .map(lambda x: x.split(","))\
        .filter(lambda x: len(x) == 8)\
        .filter(lambda x: len(x[1]) > 1)\
        .map(lambda x: (x[4], x[0], x[2], x[1], x[3], x[5], x[6], x[7]))

    mydata_clean.take(5)

    # sentiment analysis on the first field of each cleaned record
    mydata_SA = mydata_clean \
        .map(lambda x: x[0])\
        .map(lambda x: x.replace("'", "").replace('"', ""))\
        .map(lambda x: TextBlob(x).sentiment.polarity)\
        .map(lambda x: checkPolarity(x))

    # combine data
    mydata_combined = mydata_SA.zip(mydata_clean)\
        .map(lambda x: str(x).replace("'", "").replace('"', "")
             .replace('(', "").replace(')', "")
             .replace('[', "").replace(']', ""))

    # save text file
    output_folder = "<project_name>"
    mydata_combined.saveAsTextFile(output_folder)

if __name__ == "__main__":
    # Configure your Spark environment
    conf = SparkConf().setMaster("local[1]").setAppName("Project")
    sc = SparkContext(conf=conf)

    filename = "filename.csv"
    run_sc(sc, filename)
    sc.stop()
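
To run this example outside the pyspark shell, save it as a .py file and submit it with spark-submit as listed under Spark Application (e.g. spark-submit project.py, where the file name is a placeholder); the input CSV name and the output folder also need to be replaced with real paths.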