Indrani Cheat Sheet

SPARK and RDD

- The entry point for Spark functionality is the SparkContext (sc) object; only one may be active per application.
- It represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.
- File types / input methods supported: wholeTextFiles, SequenceFiles, sc.hadoopRDD.
- RDDs have lazy evaluation and recover through lineage graphs.
- RDD – 2 operations – Transformation: creates a new dataset from an existing one; Action: returns a value to the driver program after running a computation on the dataset.
- When you create an RDD by reading a Hadoop file, by default Spark creates one partition for each block of the file.
- Transformations are lazy.
- Key-value operations: reduceByKey(), groupByKey(), subtractByKey(), rdd1.join(rdd2), rdd1.cogroup(rdd2) {groups data from both RDDs sharing the same key}.
- Broadcast variables let the programmer keep a read-only variable cached on each machine (see the PySpark sketch at the end of this section).
- Accumulators are variables that can only be added to through an associative operation. Used to implement counters and sums efficiently in parallel. They are not read-only.
- Accumulator in code (Scala): val accum = sc.accumulator(0); sc.parallelize(Array(1,2,3)).foreach(x => accum += x); accum.value
- Transformations: [map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues]
- Actions: [collect, reduce, count, saveAsTextFile(path), lookup(key), first]
- Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute.
- Spark has 3 options to persist an RDD [1. in-memory storage as deserialized Java objects {fastest}, 2. in-memory storage as serialized data {limited space}, 3. on-disk storage {costly computation}].
- When you persist (cache) an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset.
- An RDD maintains lineage for fault tolerance [ filtered_rdd.toDebugString() shows the lineage of an RDD; filtered_rdd.getNumPartitions() shows its partition count ].
- Narrow transformations: transformations where each partition of the parent RDD contributes to at most one partition of the child RDD. No data shuffling – computation happens locally – faster and more efficient – map, filter, flatMap (sketch at the end of this section).
- Wide transformations: transformations where each partition of the parent RDD can contribute to multiple partitions of the child RDD. Data shuffling takes place – triggers a stage boundary – groupByKey, reduceByKey, join.
- Spark can interact with the following cluster frameworks: standalone scheduler, Apache Mesos, Hadoop YARN, Kubernetes.
- The RDD API lives in the Spark Core engine.
- The driver program accesses Spark through the sc object. The Spark driver works with the cluster manager to acquire executors on nodes in the cluster.
- The Spark driver is responsible for scheduling the execution of data by the worker nodes in cluster mode.
- The driver program must listen for and accept incoming connections from its executors throughout its lifetime.
- Hadoop achieves fault tolerance using replication; RDDs achieve it using lineage tracking.
- For a grouped data object, .count() is a transformation.

DATAFRAME

- DataFrames are immutable, distributed and partitioned.
- They have named columns and a specialized API for working with tabular data.

df = spark.read.load("PATH", format="csv", sep=",", inferSchema="true", header="true")
df.select(["column1", "column2"])  # Selecting columns
df.filter(df['column'] > 21).show()  # Filtering data

ORDER BY COLUMN
from pyspark.sql.functions import desc, asc
from pyspark.sql.functions import col, column, expr
df.orderBy(expr("count desc")).show(2)  # WAY 1
df.orderBy(col("first").desc(), col("second").asc()).show(2)  # WAY 2
df.orderBy("age", desc("name")).show()  # WAY 3

GROUP BY
from pyspark.sql.functions import sum, avg, max
df.groupBy("age").avg("salary")
(df.groupBy("department")
   .agg(sum("salary").alias("sum_salary"),
        avg("salary").alias("avg_salary"),
        sum("bonus").alias("sum_bonus"),
        max("bonus").alias("max_bonus")))

JOIN
df = left.join(right, left.name == right.name, "inner")  # anything can come in place of "inner"
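
A minimal PySpark sketch of the broadcast-variable and accumulator notes above (not from the original; the dictionary, words and variable names are illustrative, and an active SparkContext sc is assumed):

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy cached on each machine
missing = sc.accumulator(0)               # tasks can only add to it; the driver reads .value

def count_missing(word):
    if word not in lookup.value:
        missing.add(1)

sc.parallelize(["a", "b", "c", "d"]).foreach(count_missing)
print(missing.value)  # 2 -> "c" and "d" are not in the broadcast dictionary

And a small sketch contrasting narrow and wide transformations (again illustrative, not from the original):

rdd = sc.parallelize(range(10), 4)
narrow = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)                # narrow: no shuffle, stays in one stage
wide = narrow.map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)  # wide: shuffles data, starts a new stage
print(wide.toDebugString())      # lineage shows the shuffle introduced by reduceByKey
print(wide.getNumPartitions())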

MOVIE – SPARK SQL

from pyspark.sql.functions import count, desc
# 1. Movie with the highest count of ratings
ratings.groupBy("movieId").count().orderBy(desc("count")).show(1)
# 2. Movie with the lowest count of ratings
ratings.groupBy("movieId").count().orderBy("count").show(1)
# 3. Average rating for each movie
avgRatings = ratings.groupBy("movieId").avg("rating").toDF("movieId", "avgRating")
# 4. Movies with the highest average rating
ratings.groupBy("movieId").avg("rating").toDF("movieId", "avgRating").orderBy(desc("avgRating")).show(10)
# 5. Movies with the lowest average ratings
ratings.groupBy("movieId").avg("rating").toDF("movieId", "avgRating").orderBy("avgRating").show(10)
# 6. Top 10 movies with the highest ratings after joining with movie names
avgRatings.join(movies, "movieId").orderBy(desc("avgRating")).select("title", "avgRating").show(10)
# 7. Movies with the tag 'mathematics'
tagMovies = tags.join(movies, "movieId")
tagMovies.filter("tag like 'math%'").select("title").show()
# 8. Average rating of movies tagged as 'artificial intelligence'
tagMovies.filter("tag like 'artificial intell%'").join(avgRatings, "movieId").select("title", "avgRating").show()
# 9. Average rating of movies in the 'Crime' genre
crimMovies = movies.filter("genres like '%Crim%'")
crimMovies.join(avgRatings, "movieId").select("title", "avgRating").show()
# 10. Most popular tag
tags.groupBy("tag").count().orderBy(desc("count")).show(1)

RDD AVG AGE BY FIRST LETTER

data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
rdd = sc.parallelize(data)
average_age_by_letter = (
    rdd.map(lambda x: (x[0][0], (x[1], 1)))                   # Map to (first_letter, (age, 1))
       .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # Sum ages and counts by letter
       .mapValues(lambda x: x[0] / x[1])                      # Compute the average age
       .collect()
)

SWAP KEY-VALUE FIRST, THEN SORT BASED ON THE NEW KEY, THEN SWAP BACK [sorts based on value in descending order]

data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
rdd = sc.parallelize(data)

def sortByValue(rdd):
    # Swap key-value, sort by value in descending order, then swap back
    return (rdd.map(lambda x: (x[1], x[0]))
               .sortByKey(ascending=False)
               .map(lambda x: (x[1], x[0])))

# Call the function and collect results
sorted_rdd = sortByValue(rdd)
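
A shorter equivalent worth noting (a sketch, not from the original): RDD.sortBy takes a key function directly, so the swap / swap-back steps can be avoided.

sorted_rdd = rdd.sortBy(lambda x: x[1], ascending=False)  # sort records by the age value directly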
COUNT HOW MANY STUDENTS ARE ENROLLED IN EACH MAJOR

majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
majors = sc.parallelize(majors_data)
majors.values().countByValue()  # WAY 1
majors.values().map(lambda x: (x, 1)).countByKey()  # WAY 2

MAJOR AS THE KEY AND VALUE AS THE AGE OF THE OLDEST PERSON ENROLLED IN THAT MAJOR

data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
rdd = sc.parallelize(data)
majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
majors = sc.parallelize(majors_data)
majors.join(rdd).values().reduceByKey(max)  # WAY 1
rdd.join(majors).values().map(lambda x: (x[1], x[0])).reduceByKey(lambda x, y: max(x, y))  # WAY 2

FIND THE MAJOR AND AGE OF 'JOHN' USING A SINGLE-LINE QUERY

rdd.join(majors).lookup('john')
rdd.join(majors).filter(lambda x: x[0] == 'john').values()
majors.join(rdd).lookup('john')

MACHINE LEARNING
- ML: a program learns from Experience (E) with respect to a Task (T), as measured by Performance (P).
- Data mining: non-trivial extraction of implicit, previously unknown and potentially useful information from data.
- 3 types: supervised, unsupervised, reinforcement.
- Supervised: if the target is numerical it is regression; if the target is categorical it is classification.
- Unsupervised includes clustering and association rule mining (market-basket analysis).
- Classification – Logistic regression.
- Feature engineering transforms columns into features.
- Good features are meaningful, independent, correlated with the outcome, and free of null values.
- Scaling of data – 2 ways – Min-Max scaling and Standard scaling [converts data to mean = 0 and SD = 1] (see the sketch after this list).
- Regularization is the solution to overfitting: it penalizes large weights.
- Regularization comes in 2 types [ L1 = |w1| + |w2| + ... + |wn|; L2 = (w1^2 + w2^2 + ... + wn^2)^(1/2) ].
- Regularized objective: f(w) = λ·R(w) + (1/n) Σ_{i=1..n} L(w; x_i, y_i)
- Loss function = Training Error + Regularization Penalty.
- The goal is to minimize f(w); L(·) can be different types of loss functions.
- Increasing regularization adds to the loss function and pushes toward underfitting; decreasing it allows overfitting.
- Linear regression tries to best fit the model to a set of data points.
- Class imbalance is when one class is more likely than the other.
- MLlib feature transformers: IndexToString, Tokenizer, StopWordsRemover.
- In logistic regression, if regParam decreases then overfitting increases; if it increases then underfitting increases.
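
A minimal sketch of the two scaling options above, plus the VectorAssembler step that produces the "features" vector column the logistic regression snippet below expects (not from the original; the data and column names are illustrative, and an active SparkSession spark is assumed):

from pyspark.ml.feature import VectorAssembler, MinMaxScaler, StandardScaler

ages = spark.createDataFrame([(20.0,), (25.0,), (30.0,)], ["age"])
vec = VectorAssembler(inputCols=["age"], outputCol="features").transform(ages)  # assemble numeric columns into a vector
MinMaxScaler(inputCol="features", outputCol="minmax").fit(vec).transform(vec).show()  # rescales each feature to [0, 1]
StandardScaler(inputCol="features", outputCol="standard", withMean=True, withStd=True).fit(vec).transform(vec).show()  # mean = 0, SD = 1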
LOGISTIC REGRESSION

df = spark.read.option("header", "true").option("inferSchema", "true").csv("PATH")
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, featuresCol="features", labelCol="automatic")
model = lr.fit(train)
result = model.transform(test)
result.select("automatic", "prediction").show()

FIND AVG AGE AND COUNT OF STUDENTS FOR EACH MAJOR

names_data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
names = sc.parallelize(names_data)
majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
majors = sc.parallelize(majors_data)

from pyspark.sql.functions import avg, desc, count
namesDF.join(majorsDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))
majorsDF.join(namesDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))
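
The two queries above use namesDF and majorsDF; a minimal sketch building them from the same lists (the column names are assumptions):

namesDF = spark.createDataFrame(names_data, ["name", "age"])      # assumed schema: name, age
majorsDF = spark.createDataFrame(majors_data, ["name", "major"])  # assumed schema: name, major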

MapReduce
- MapReduce is the processing engine that works on data in HDFS. Map and Reduce tasks work in isolation.
- Map = a higher-order function and part of functional programming; it has 2 args: a list of data + a lower-order function.
- It applies the lower-order function to each element in the list in parallel (see the short Python sketch below).
- Reduce = a higher-order function; it processes a list of elements by applying a function pairwise and returns a scalar.
- for loops can't be parallelized.
- A mapper is presented with data containing multiple keys, which it transforms in a 1-to-1 fashion.
- The output of the mapper is stored on local disk.
- A reducer is presented with data that has only one key.
- Reduction is done on a per-key basis.
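
A tiny plain-Python illustration of map and reduce as higher-order functions (illustrative data, not from the original):

from functools import reduce

squared = list(map(lambda x: x * x, [1, 2, 3, 4]))  # map applies the function to every element
total = reduce(lambda a, b: a + b, squared)          # reduce combines elements pairwise into a scalar
print(squared, total)                                # [1, 4, 9, 16] 30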
1. Reading input data from HDFS.
2. Performing the Map operation to extract meaningful (key, value) pairs from the input data.
3. Storing intermediate outputs locally on disk.
4. Sorting and shuffling data to group values by their keys.
5. Executing the Reduce operation to aggregate results.
6. Writing the final output back to HDFS.
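
A minimal PySpark word-count sketch that mirrors the steps above (a sketch, assuming an active SparkContext sc; the HDFS paths are placeholders):

lines = sc.textFile("hdfs:///PATH/input.txt")            # read input from HDFS (step 1)
counts = (lines.flatMap(lambda line: line.split())        # map: split lines into words (step 2)
               .map(lambda word: (word, 1))               # emit (key, value) pairs (step 2)
               .reduceByKey(lambda a, b: a + b))          # shuffle by key, then reduce (steps 4-5)
counts.saveAsTextFile("hdfs:///PATH/output")              # write the final output back to HDFS (step 6)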
- The number of map tasks depends on the size of the input data.
- If the node running a map task fails, the application master will re-run that node's completed map tasks on another node, since their results are lost when the node crashes.
- Spark is better than traditional MapReduce because of in-memory computation, DAG execution, richer APIs, and lazy evaluation.
- Traditional MapReduce is inefficient for iterative and multi-pass algorithms.
- It lacks primitives for data sharing.
- It is inefficient at achieving fault tolerance, as each block is replicated multiple times.

HDFS
- Distributed, partitioned, fault-tolerant through replication, write-once read-many, commodity hardware, files stored as blocks, high-latency and high-throughput.
- Follows a master-slave architecture; Master = NameNode and Slaves = worker nodes.
- Has a consistent namespace.
- Locality of computation: computation is scheduled where the data is located.
- Replication in Hadoop takes place at the block level.
- The NameNode persists metadata in 2 ways [ 1. the fsimage, 2. the edits log ].
- The fsimage file is a point-in-time snapshot of the file system; incremental changes are written to the edits log file.
- Checkpointing: taking the fsimage and the edits log and compacting them into a new fsimage. Helps prevent the edits log from becoming too large.
- The Secondary NameNode takes responsibility for periodically merging the two files, since NameNode restarts are rare in production.
- HDFS is designed for scale-out: adding more commodity hardware.
- HDFS provides streaming access to data.
- The block size is large in HDFS to minimize disk seek time compared to data transfer time.
- HDFS configuration is managed through .xml files.
- hdfs dfs -copyFromLocal (or hadoop fs -copyFromLocal) [copies a file from local to HDFS]
- hdfs dfs -find (or hadoop fs -find) [used to search for files that match a specified pattern]
- hdfs dfs -get (or hadoop fs -get) [copies a file from HDFS to a local directory]
- hdfs dfs -getfattr (or hadoop fs -getfattr) [gets extended attributes of a file]

YARN
- Splits resource management and job scheduling into separate daemons.
- An application is either a single job or a DAG of jobs.
- ResourceManager: the ultimate authority that arbitrates resources among all applications.
- NodeManager: per-machine framework agent responsible for containers, monitoring their resource usage, and reporting the same to the RM [runs on worker nodes].
- The ResourceManager has 2 components: the Scheduler and the ApplicationManager.
- Scheduler: pure scheduling and nothing else. It has a pluggable policy (FIFO scheduler, Capacity scheduler). It allocates resources based on the idea of containers.
- ApplicationManager: responsible for accepting job submissions and provides the service for restarting the ApplicationMaster container in case of failure.
- The per-application ApplicationMaster has the responsibility of negotiating appropriate resources from the Scheduler.
- It also tracks status and reports failures.
- One ApplicationMaster can manage multiple application containers.
- A container is an abstraction representing a collection of physical resources, such as RAM and CPU, on a single node.
- The ApplicationMaster can request more containers by contacting the Scheduler in the ResourceManager.
- The ApplicationMaster monitors and tracks the progress of containers.
- YARN command execution method: go to the Hadoop home directory and call the bin/yarn command.
Feature Hasher:
- Works on categorical or numerical variables; outputs a lower-dimensionality feature vector.

TF-IDF Feature Extractor:
- For computing TF, either HashingTF or CountVectorizer can be used (see the sketch after the Term Weighting notes).
- The size of the output of the HashingTF transformer is determined by the numFeatures parameter.

Term Weighting
- Term weights consist of 2 components:
  - Local: how important is the term in this document?
  - Global: how important is the term in the collection?
- Intuition: terms that appear often in a document should get high weights, and terms that appear in many documents should get low weights.
- To capture this mathematically we use term frequency (local) and inverse document frequency (global).
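
A minimal sketch of that TF-IDF pipeline in MLlib (not from the original; an active SparkSession spark is assumed, and the sample sentences, column names and numFeatures value are illustrative):

from pyspark.ml.feature import Tokenizer, HashingTF, IDF

docs = spark.createDataFrame([(0, "spark makes big data simple"),
                              (1, "spark uses rdds and dataframes")], ["id", "text"])
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32).transform(words)  # local term frequency
idf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)                             # global inverse document frequency
idf.transform(tf).select("id", "features").show(truncate=False)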

AVERAGE USING MapReduce

data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)
# Map each number to (value, 1) pairs
mapped_rdd = rdd.map(lambda x: (x, 1))
# Reduce by adding up values and counts
sum_count = mapped_rdd.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
# Calculate the average
average = sum_count[0] / sum_count[1]
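
An equivalent one-pass alternative (a sketch, not from the original): RDD.aggregate carries the (sum, count) pair directly, without building the intermediate (value, 1) tuples.

total, n = rdd.aggregate((0, 0),
                         lambda acc, x: (acc[0] + x, acc[1] + 1),   # fold one value into a partition's (sum, count)
                         lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge per-partition (sum, count) pairs
average = total / n  # 30.0 for the sample data above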
