ADE Training

Python String Manipulation
• Concatenate: "str1" + "str2"
• Change letter case: .lower(), .upper()
• Get length: len("str1")
• Check substring in string: "str" in "new_str"
• Check string starts with: "ori_str".startswith("str")
• Split string to list: "str".split("<delimiter>")

RDD Transformation
• RDD.map(lambda x: func(x))
• RDD.filter(lambda x: func(x))
• RDD.flatMap(lambda x: func(x))
• RDD.distinct()
• RDD.sortBy(lambda x: x[index], True)
• RDD1.zip(RDD2) → merge column-wise
• RDD1.union(RDD2) → merge row-wise
• RDD1.subtract(RDD2) → elements of RDD1 not found in RDD2

PairRDD Transformation
• RDD.countByKey() → count each key
• RDD.groupByKey() → group each key
• RDD.sortByKey() → sort in asc/desc order
• RDD.join() → pairs with matching keys from 2 RDDs
• RDD.keyBy() → set key by index
• RDD.reduceByKey()
• RDD.mean()
• RDD.keys()
• RDD.values()
• RDD.lookup(key) → values for a key
• leftOuterJoin, rightOuterJoin
• RDD.mapValues()
• RDD.flatMapValues()
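As a quick illustration of the pair-RDD operations above, here is a minimal sketch (not part of the original sheet; the word list and variable names are made up), assuming a pyspark shell where sc is already defined (see Spark Shell below):

# build (word, 1) pairs from a small hypothetical word list
words = sc.parallelize(["today", "is", "monday", "today", "is", "sunny"])
pairs = words.map(lambda w: (w, 1))

counts = pairs.reduceByKey(lambda a, b: a + b)   # [("today", 2), ("is", 2), ...]
counts.sortByKey().collect()                     # sorted by word
pairs.countByKey()                               # dict-like count per key

# join two pair RDDs on matching keys
extra = sc.parallelize([("today", "cloudy"), ("monday", "workday")])
counts.join(extra).collect()                     # [("today", (2, "cloudy")), ...]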
Spark Shell
• Launch terminal
• Type "pyspark"
• Rename terminal
• Open NEW terminal
• Type "sc" to test the SparkContext

RDD Action
• RDD.count() → return number of elements
• RDD.collect() → return array of all elements
• RDD.take(n) → return array of first n elements
• RDD.saveAsTextFile(dir) → save to text file(s)
• RDD.toDebugString() → return lineage of RDD
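A hedged sketch of these actions typed straight into the pyspark shell (the sample lines and the output folder name are made up; sc already exists there):

days = sc.parallelize(["Today is Monday", "Tomorrow is Tuesday"])
days.count()                      # 2
days.collect()                    # ['Today is Monday', 'Tomorrow is Tuesday']
days.take(1)                      # ['Today is Monday']
days.toDebugString()              # lineage of the RDD
days.saveAsTextFile("days_out")   # writes part-* files into the days_out directory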
Spark Application
• Run: export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/python
• *Run: spark-submit <filename.py>
* make sure you are in the same directory as the file; if not, "cd" to that directory first

Spark Application - Local
• local[*] → run with as many threads as there are cores
• local[n] → run with n threads
• local → run with a single thread

RDD Initialization
• Memory-based:
  myData = ["Today", "is", "Monday"]
  myRDD = sc.parallelize(myData)
• File-based:
  *myData = sc.textFile("filename.txt")
  *myData = sc.textFile("path/*.txt")
  *myData = sc.textFile("filename1.txt, filename2.txt")
  **myData = sc.wholeTextFiles("dir")
* each line in each file is a separate record in the RDD
** returns (filename, content) pairs
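To make the textFile/wholeTextFiles distinction concrete, a small sketch (an illustration, not from the sheet; the notes directory is hypothetical, and sc comes from the pyspark shell):

lines = sc.textFile("notes/*.txt")     # one record per line, across all matched files
docs = sc.wholeTextFiles("notes")      # one (filename, full_content) pair per file

lines.take(3)                          # first three lines
docs.keys().collect()                  # the file paths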
RDD Processing Example
myData = sc.textFile("filename.txt", <num_of_partition>)
myDataUpper = myData.map(lambda x: x.upper())
myDataFind = myDataUpper.filter(lambda x: "TODAY" in x)
myDataUpper.collect()
print()

RDD Persistence
myData1 = sc.textFile("data.txt").map(lambda x: x.upper())
myData1.persist()
*myData2 = myData1.filter(lambda x: "TODAY" in x)
myData2.collect()
* If re-run, this takes a shorter time because myData1 is kept in memory
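To see why the re-run is shorter, a hedged timing sketch (not in the original; the file name and the use of time.time() are assumptions):

import time

upper = sc.textFile("data.txt").map(lambda x: x.upper())
upper.persist()                                    # mark for caching; filled on the first action

start = time.time()
upper.filter(lambda x: "TODAY" in x).collect()     # first run: reads data.txt and caches it
print("first run :", time.time() - start)

start = time.time()
upper.filter(lambda x: "TODAY" in x).collect()     # second run: served from the cached partitions
print("second run:", time.time() - start)

upper.unpersist()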
RDD to DataFrame
• df = sqlContext.createDataFrame(RDD)
• df = sqlContext.read.json("filename.json")

DataFrame to RDD
• df_RDD = df.rdd

DataFrame Operations
• printSchema()
• toDF()
• groupBy()
• show()
• filter()
• join()
• columns
• describe()
• orderBy()
• cache()
• distinct()
• where()

Using SQL Queries on DataFrame
• df.registerTempTable("df_sql")
• sqlContext.sql("select name from df_sql").show(5)
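A minimal sketch tying the DataFrame pieces together (not from the original sheet; the Row data is made up, and sc/sqlContext are assumed to exist as in the shell environment above):

from pyspark.sql import Row

people = sc.parallelize([Row(name="Amy", age=23), Row(name="Ben", age=31)])
df = sqlContext.createDataFrame(people)

df.printSchema()
df.filter(df.age > 25).show()
df.orderBy("name").show()

df.registerTempTable("df_sql")
sqlContext.sql("select name from df_sql").show(5)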
IP Address: Port Numbers
• Hue: http://localhost:8889
• Jupyter: http://localhost:8890
• Cloudera Manager: http://localhost:7180
• Spark Application Web UI: http://localhost:18080
• PC backend IP: 10.0.2.15 (check via ifconfig)

HDFS Command
• hdfs dfs -ls → list files in Hadoop
• hdfs dfs -get <filename> → transfer from Hadoop to local
• hdfs dfs -put <filename> → transfer from local to Hadoop

Troubleshoot Command
> su hdfs
> hdfs dfsadmin -safemode leave
> su root
Example
from textblob import TextBlob
from pyspark import SparkConf, SparkContext

def checkPolarity(line):
    if line > 0: return "positive"
    elif line < 0: return "negative"
    else: return "neutral"

def run_sc(sc, filename):
    mydata = sc.textFile(filename)

    # transform data: lower-case, split on commas, keep well-formed 8-field rows,
    # then reorder the fields
    mydata_clean = mydata \
        .map(lambda x: x.lower()) \
        .map(lambda x: x.split(",")) \
        .filter(lambda x: len(x) == 8) \
        .filter(lambda x: len(x[1]) > 1) \
        .map(lambda x: (x[4], x[0], x[2], x[1], x[3], x[5], x[6], x[7]))
    mydata_clean.take(5)

    # sentiment analysis on the first field of each cleaned record
    mydata_SA = mydata_clean \
        .map(lambda x: x[0]) \
        .map(lambda x: x.replace("'", "").replace('"', "")) \
        .map(lambda x: TextBlob(x).sentiment.polarity) \
        .map(lambda x: checkPolarity(x))

    # combine data: pair each polarity label with its cleaned record
    mydata_combined = mydata_SA.zip(mydata_clean) \
        .map(lambda x: str(x).replace("'", "").replace('"', "")
                             .replace('(', "").replace(')', "")
                             .replace('[', "").replace(']', ""))

    # save text file
    output_folder = "<project_name>"
    mydata_combined.saveAsTextFile(output_folder)

if __name__ == "__main__":
    # Configure your Spark environment
    conf = SparkConf().setMaster("local[1]").setAppName("Project")
    sc = SparkContext(conf=conf)

    filename = "filename.csv"
    run_sc(sc, filename)
    sc.stop()
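As a small optional check (an assumption, not part of the sheet), the folder written by saveAsTextFile() can be read back in the pyspark shell to confirm the run:

result = sc.textFile("<project_name>")   # same placeholder folder as in the example
result.take(5)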
