Python String Manipulation
• Concatenate: "str1" + "str2"
• Change letter case: .lower(), .upper()
• Get length: len("str1")
• Check substring in string: "str" in "new_str"
• Check string starts with: "ori_str".startswith("str")
• Split string to list: "str".split("<delimiter>")
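
A minimal sketch combining the operations above; the sample string is made up:

s = "Today is Monday"
parts = s.split(" ")                  # split to list: ['Today', 'is', 'Monday']
greeting = parts[0] + " " + parts[2]  # concatenate: 'Today Monday'
print(s.lower(), s.upper())           # change letter case
print(len(s))                         # get length: 15
print("Monday" in s)                  # substring check: True
print(s.startswith("Today"))          # starts-with check: True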
Spark Shell
• Launch terminal
• Type "pyspark"
• Rename terminal
• Open NEW terminal
• Type "sc" to test the SparkContext

Spark Application
• Run: export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/python
• *Run: spark-submit <filename.py>
* Make sure you are in the same directory as the file; if not, "cd" to that directory first.

Spark Application - Local
• local[*] run with the maximum number of threads (all cores)
• local[n] run with n threads
• local run with a single thread
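
A minimal sketch of how the master setting is passed when building a context yourself; the application name is a placeholder:

from pyspark import SparkConf, SparkContext

# "local[*]" = all cores, "local[2]" = 2 threads, "local" = a single thread
conf = SparkConf().setMaster("local[*]").setAppName("DemoApp")
sc = SparkContext(conf=conf)
print(sc.defaultParallelism)   # default number of partitions for this master setting
sc.stop()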
RDD Initialization
• Memory-based:
  myData = ["Today", "is", "Monday"]
  myRDD = sc.parallelize(myData)
• File-based:
  *myData = sc.textFile("filename.txt")
  *myData = sc.textFile("path/*.txt")
  *myData = sc.textFile("filename1.txt, filename2.txt")
  **myData = sc.wholeTextFiles("dir")
* each line in each file becomes a separate record in the RDD
** returns (filename, content) pairs

RDD Transformation
• RDD.map(lambda x: func(x))
• RDD.filter(lambda x: func(x))
• RDD.flatMap(lambda x: func(x))
• RDD.distinct()
• RDD.sortBy(lambda x: x[index], True)
• RDD1.zip(RDD2) merge as columns
• RDD1.union(RDD2) merge as rows
• RDD1.subtract(RDD2) set difference (elements in RDD1 not in RDD2)
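
A minimal sketch of a few of these transformations, assuming sc is an existing SparkContext (e.g. from the pyspark shell); the data values are made up:

words = sc.parallelize(["today", "is", "monday", "today"])
nums = sc.parallelize([3, 1, 2, 3])

upper = words.map(lambda x: x.upper())             # TODAY, IS, MONDAY, TODAY
long_words = words.filter(lambda x: len(x) > 2)    # today, monday, today
unique = words.distinct()                          # today, is, monday
columns = words.zip(nums)                          # ('today', 3), ('is', 1), ...
rows = words.union(words)                          # 8 elements
leftover = words.subtract(sc.parallelize(["is"]))  # removes every 'is'
print(columns.collect(), leftover.collect())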
PairRDD Transformation
• RDD.countByKey() count each key
• RDD.groupByKey() group each key
• RDD.sortByKey() sort asc/desc order
• RDD.join() pairs with matching keys from 2 RDDs
• leftOuterJoin, rightOuterJoin
• RDD.keyBy() set key by index
• RDD.reduceByKey()
• RDD.mapValues()
• RDD.flatMapValues()
• RDD.keys()
• RDD.values()
• RDD.lookup(key) values for a key
• RDD.mean()
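
A minimal sketch of the key/value operations above; sc is assumed to exist and the data is made up:

sales = sc.parallelize([("mon", 2), ("tue", 5), ("mon", 3)])
labels = sc.parallelize([("mon", "Monday"), ("tue", "Tuesday")])

totals = sales.reduceByKey(lambda a, b: a + b)    # ('mon', 5), ('tue', 5)
print(sales.countByKey())                         # {'mon': 2, 'tue': 1}
print(sales.sortByKey().collect())                # sorted by key, ascending
print(sales.keys().collect(), sales.values().collect())
print(sales.lookup("mon"))                        # [2, 3]
print(sales.join(labels).collect())               # ('mon', (2, 'Monday')), ...
print(sales.mapValues(lambda v: v * 2).collect())
print(sales.values().mean())                      # 3.33...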
RDD Action
• RDD.count() return number of elements
• RDD.collect() return array of all elements
• RDD.take(n) return array of first n elements
• RDD.saveAsTextFile(dir) save to text file(s)
• RDD.toDebugString() return lineage of RDD
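
A minimal sketch of these actions, again assuming an existing sc; the output directory is a placeholder and must not already exist:

data = sc.parallelize(range(10))

print(data.count())                               # 10
print(data.take(3))                               # [0, 1, 2]
print(data.collect())                             # [0, 1, ..., 9]
print(data.map(lambda x: x * x).toDebugString())  # lineage of the mapped RDD
data.saveAsTextFile("output_dir")                 # one part-* file per partition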
RDD Processing Example
myData = sc.textFile("filename.txt", <num_of_partition>)
myDataUpper = myData.map(lambda x: x.upper())
myDataFind = myDataUpper.filter(lambda x: "TODAY" in x)
myDataUpper.collect()
print(myDataFind.collect())

RDD Persistence
myData1 = sc.textFile("data.txt").map(lambda x: x.upper())
myData1.persist()
*myData2 = myData1.filter(lambda x: "TODAY" in x)
myData2.collect()
* If re-run, this takes less time because myData1 is kept in memory
RDD to DataFrame
• df = sqlContext.createDataFrame(RDD)
• df = sqlContext.read.json("filename.json")

DataFrame to RDD
• df_RDD = df.rdd
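
A minimal round-trip sketch, assuming an existing sc; the column names and values are made up:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
rows = sc.parallelize([Row(name="Ann", age=21), Row(name="Bob", age=30)])
df = sqlContext.createDataFrame(rows)          # RDD -> DataFrame
df.printSchema()

df_RDD = df.rdd                                # DataFrame -> RDD of Row objects
print(df_RDD.map(lambda r: r.name).collect())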
DataFrame Operations
• printSchema()
• toDF()
• groupBy()
• show()
• filter()
• join()
• columns
• describe()
• orderBy()
• cache()
• distinct()
• where()
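
A minimal sketch chaining several of these operations; it reuses the sqlContext from the sketch above, and the column names/values are made up:

df = sqlContext.createDataFrame(
    [("Ann", 21), ("Bob", 30), ("Ann", 21)], ["name", "age"])

df.printSchema()                      # column names and types
print(df.columns)                     # ['name', 'age']
df.describe().show()                  # summary statistics
df.filter(df.age > 25).show()         # where() works the same way
df.groupBy("name").count().show()     # number of rows per name
df.distinct().orderBy("age").show()   # drop duplicate rows, then sort by age
df.cache()                            # keep this DataFrame in memory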
Using SQL Queries on DataFrame
• df.registerTempTable("df_sql")
• sqlContext.sql("select name from df_sql").show(5)
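
A minimal sketch using the df and sqlContext from the sketches above; the temp table name follows the entry above and the second query is only an illustration:

df.registerTempTable("df_sql")
sqlContext.sql("select name from df_sql").show(5)
sqlContext.sql("select name, count(*) as n from df_sql group by name").show()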
IP Address: Port Numbers
• Hue: http://localhost:8889
• Jupyter: http://localhost:8890
• Cloudera Manager: http://localhost:7180
• Spark Application Web UI: http://localhost:18080
• PC backend IP: 10.0.2.15 (check via ifconfig)

HDFS Command
• hdfs dfs -ls list files in Hadoop
• hdfs dfs -get <filename> transfer from Hadoop to local
• hdfs dfs -put <filename> transfer from local to Hadoop

Troubleshoot Command
> su hdfs
> hdfs dfsadmin -safemode leave
> su root
Example

from textblob import TextBlob
from pyspark import SparkConf, SparkContext

def checkPolarity(line):
    if line > 0: return "positive"
    elif line < 0: return "negative"
    else: return "neutral"

def run_sc(sc, filename):
    mydata = sc.textFile(filename)

    # transform data
    mydata_clean = mydata \
        .map(lambda x: x.lower())\
        .map(lambda x: x.split(","))\
        .filter(lambda x: len(x) == 8)\
        .filter(lambda x: len(x[1]) > 1)\
        .map(lambda x: (x[4], x[0], x[2], x[1], x[3], x[5], x[6], x[7]))

    mydata_clean.take(5)

    # sentiment analysis on the first field of each cleaned record
    mydata_SA = mydata_clean \
        .map(lambda x: x[0])\
        .map(lambda x: x.replace("'", "").replace('"', ""))\
        .map(lambda x: TextBlob(x).sentiment.polarity)\
        .map(lambda x: checkPolarity(x))

    # combine data
    mydata_combined = mydata_SA.zip(mydata_clean)\
        .map(lambda x: str(x).replace("'", "").replace('"', "")
             .replace('(', "").replace(')', "")
             .replace('[', "").replace(']', ""))

    # save text file
    output_folder = "<project_name>"
    mydata_combined.saveAsTextFile(output_folder)

if __name__ == "__main__":
    # Configure your Spark environment
    conf = SparkConf().setMaster("local[1]").setAppName("Project")
    sc = SparkContext(conf=conf)

    filename = "filename.csv"
    run_sc(sc, filename)
    sc.stop()
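
To run this example outside the pyspark shell, save it as a .py file and submit it with spark-submit as listed under Spark Application (e.g. spark-submit project.py, where the file name is a placeholder); the input CSV name and the output folder also need to be replaced with real paths.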