Indrani Cheat Sheet

SPARK and RDD

- The entry point for Spark functionality is the SparkContext (sc) object; only one may be active per application.
- It represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.
- File types / input methods supported: wholeTextFiles, SequenceFiles, sc.hadoopRDD.
- RDDs have lazy evaluation and recover through lineage graphs.
- RDD – 2 operations – Transformation: creates a new dataset from an existing one; Action: returns a value to the driver program after running a computation on the dataset.
- When you create an RDD by reading a Hadoop file, by default Spark creates one partition for each block of the file.
- Transformations are lazy.
- Key-value operations: reduceByKey(), groupByKey(), subtractByKey(), rdd1.join(rdd2), rdd1.cogroup(rdd2) {groups data from both RDDs sharing the same key}.
- Broadcast variables let the programmer keep a read-only variable cached on each machine (see the PySpark sketch at the end of this section).
- Accumulators are variables that can only be added to through an associative operation. Used to implement counters and sums efficiently in parallel. They are not read-only.
- Accumulator in code (Scala): val accum = sc.accumulator(0); sc.parallelize(Array(1,2,3)).foreach(x => accum += x); accum.value
- Transformations: [map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues]
- Actions: [collect, reduce, count, saveAsTextFile(path), lookup(key), first]
- Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute.
- Spark has 3 options to persist an RDD [1. in-memory storage as deserialized Java objects {fastest}, 2. in-memory storage as serialized data {limited space}, 3. on-disk storage {costly computation}].
- When you persist (cache) an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset.
- An RDD maintains lineage for fault tolerance [ filtered_rdd.toDebugString() shows the lineage of an RDD; filtered_rdd.getNumPartitions() shows its partition count ].
- Narrow transformations: transformations where each partition of the parent RDD contributes to at most one partition of the child RDD. No data shuffling – computation happens locally – faster and more efficient – map, filter, flatMap (sketch at the end of this section).
- Wide transformations: transformations where each partition of the parent RDD can contribute to multiple partitions of the child RDD. Data shuffling takes place – triggers a stage boundary – groupByKey, reduceByKey, join.
- Spark can interact with the following cluster frameworks: standalone scheduler, Apache Mesos, Hadoop YARN, Kubernetes.
- The RDD API lives in the Spark Core engine.
- The driver program accesses Spark through the sc object. The Spark driver works with the cluster manager to acquire executors on nodes in the cluster.
- The Spark driver is responsible for scheduling the execution of data by the worker nodes in cluster mode.
- The driver program must listen for and accept incoming connections from its executors throughout its lifetime.
- Hadoop achieves fault tolerance using replication; RDDs achieve it using lineage tracking.
- For a grouped data object, .count() is a transformation.

DATAFRAME

- DataFrames are immutable, distributed and partitioned.
- They have named columns and a specialized API for working with tabular data.

df = spark.read.load("PATH", format="csv", sep=",", inferSchema="true", header="true")
df.select(["column1", "column2"])  # Selecting columns
df.filter(df['column'] > 21).show()  # Filtering data

ORDER BY COLUMN
from pyspark.sql.functions import desc, asc
from pyspark.sql.functions import col, column, expr
df.orderBy(expr("count desc")).show(2)  # WAY 1
df.orderBy(col("first").desc(), col("second").asc()).show(2)  # WAY 2
df.orderBy("age", desc("name")).show()  # WAY 3

GROUP BY
from pyspark.sql.functions import sum, avg, max
df.groupBy("age").avg("salary")
(df.groupBy("department")
   .agg(sum("salary").alias("sum_salary"),
        avg("salary").alias("avg_salary"),
        sum("bonus").alias("sum_bonus"),
        max("bonus").alias("max_bonus")))

JOIN
df = left.join(right, left.name == right.name, "inner")  # anything can come in place of "inner"
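
A minimal PySpark sketch of the broadcast-variable and accumulator notes above (not from the original; the dictionary, words and variable names are illustrative, and an active SparkContext sc is assumed):

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy cached on each machine
missing = sc.accumulator(0)               # tasks can only add to it; the driver reads .value

def count_missing(word):
    if word not in lookup.value:
        missing.add(1)

sc.parallelize(["a", "b", "c", "d"]).foreach(count_missing)
print(missing.value)  # 2 -> "c" and "d" are not in the broadcast dictionary

And a small sketch contrasting narrow and wide transformations (again illustrative, not from the original):

rdd = sc.parallelize(range(10), 4)
narrow = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)                # narrow: no shuffle, stays in one stage
wide = narrow.map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)  # wide: shuffles data, starts a new stage
print(wide.toDebugString())      # lineage shows the shuffle introduced by reduceByKey
print(wide.getNumPartitions())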

MOVIE – SPARK SQL

from pyspark.sql.functions import count, desc
# 1. Movie with the highest count of ratings
ratings.groupBy("movieId").count().orderBy(desc("count")).show(1)
# 2. Movie with the lowest count of ratings
ratings.groupBy("movieId").count().orderBy("count").show(1)
# 3. Average rating for each movie
avgRatings = ratings.groupBy("movieId").avg("rating").toDF("movieId", "avgRating")
# 4. Movies with the highest average rating
ratings.groupBy("movieId").avg("rating").toDF("movieId", "avgRating").orderBy(desc("avgRating")).show(10)
# 5. Movies with the lowest average ratings
ratings.groupBy("movieId").avg("rating").toDF("movieId", "avgRating").orderBy("avgRating").show(10)
# 6. Top 10 movies with the highest ratings after joining with movie names
avgRatings.join(movies, "movieId").orderBy(desc("avgRating")).select("title", "avgRating").show(10)
# 7. Movies with the tag 'mathematics'
tagMovies = tags.join(movies, "movieId")
tagMovies.filter("tag like 'math%'").select("title").show()
# 8. Average rating of movies tagged as 'artificial intelligence'
tagMovies.filter("tag like 'artificial intell%'").join(avgRatings, "movieId").select("title", "avgRating").show()
# 9. Average rating of movies in the 'Crime' genre
crimMovies = movies.filter("genres like '%Crim%'")
crimMovies.join(avgRatings, "movieId").select("title", "avgRating").show()
# 10. Most popular tag
tags.groupBy("tag").count().orderBy(desc("count")).show(1)

RDD AVG AGE BY FIRST LETTER

data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
rdd = sc.parallelize(data)
average_age_by_letter = (
    rdd.map(lambda x: (x[0][0], (x[1], 1)))                   # Map to (first_letter, (age, 1))
       .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # Sum ages and counts by letter
       .mapValues(lambda x: x[0] / x[1])                      # Compute the average age
       .collect()
)

SWAP KEY-VALUE FIRST, THEN SORT BASED ON THE NEW KEY, THEN SWAP BACK [sorts based on value in descending order]

data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
rdd = sc.parallelize(data)

def sortByValue(rdd):
    # Swap key-value, sort by value in descending order, then swap back
    return (rdd.map(lambda x: (x[1], x[0]))
               .sortByKey(ascending=False)
               .map(lambda x: (x[1], x[0])))

# Call the function and collect results
sorted_rdd = sortByValue(rdd)
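
A shorter equivalent worth noting (a sketch, not from the original): RDD.sortBy takes a key function directly, so the swap / swap-back steps can be avoided.

sorted_rdd = rdd.sortBy(lambda x: x[1], ascending=False)  # sort records by the age value directly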
COUNT HOW MANY STUDENTS ARE ENROLLED IN EACH MAJOR

majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
majors = sc.parallelize(majors_data)
majors.values().countByValue()  # WAY 1
majors.values().map(lambda x: (x, 1)).countByKey()  # WAY 2

MAJOR AS THE KEY AND VALUE AS THE AGE OF THE OLDEST PERSON ENROLLED IN THAT MAJOR

data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
rdd = sc.parallelize(data)
majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
majors = sc.parallelize(majors_data)
majors.join(rdd).values().reduceByKey(max)  # WAY 1
rdd.join(majors).values().map(lambda x: (x[1], x[0])).reduceByKey(lambda x, y: max(x, y))  # WAY 2

FIND THE MAJOR AND AGE OF 'JOHN' USING A SINGLE-LINE QUERY

rdd.join(majors).lookup('john')
rdd.join(majors).filter(lambda x: x[0] == 'john').values()
majors.join(rdd).lookup('john')

MACHINE LEARNING
- ML: a program learns from Experience (E) with respect to a Task (T), as measured by Performance (P).
- Data mining: non-trivial extraction of implicit, previously unknown and potentially useful information from data.
- 3 types: supervised, unsupervised, reinforcement.
- Supervised: if the target is numerical it is regression; if the target is categorical it is classification.
- Unsupervised includes clustering and association rule mining (market-basket analysis).
- Classification – Logistic regression.
- Feature engineering transforms columns into features.
- Good features are meaningful, independent, correlated with the outcome, and free of null values.
- Scaling of data – 2 ways – Min-Max scaling and Standard scaling [converts data to mean = 0 and SD = 1] (see the sketch after this list).
- Regularization is the solution to overfitting: it penalizes large weights.
- Regularization comes in 2 types [ L1 = |w1| + |w2| + ... + |wn|; L2 = (w1^2 + w2^2 + ... + wn^2)^(1/2) ].
- Regularized objective: f(w) = λ·R(w) + (1/n) Σ_{i=1..n} L(w; x_i, y_i)
- Loss function = Training Error + Regularization Penalty.
- The goal is to minimize f(w); L(·) can be different types of loss functions.
- Increasing regularization adds to the loss function and pushes toward underfitting; decreasing it allows overfitting.
- Linear regression tries to best fit the model to a set of data points.
- Class imbalance is when one class is more likely than the other.
- MLlib feature transformers: IndexToString, Tokenizer, StopWordsRemover.
- In logistic regression, if regParam decreases then overfitting increases; if it increases then underfitting increases.
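
A minimal sketch of the two scaling options above, plus the VectorAssembler step that produces the "features" vector column the logistic regression snippet below expects (not from the original; the data and column names are illustrative, and an active SparkSession spark is assumed):

from pyspark.ml.feature import VectorAssembler, MinMaxScaler, StandardScaler

ages = spark.createDataFrame([(20.0,), (25.0,), (30.0,)], ["age"])
vec = VectorAssembler(inputCols=["age"], outputCol="features").transform(ages)  # assemble numeric columns into a vector
MinMaxScaler(inputCol="features", outputCol="minmax").fit(vec).transform(vec).show()  # rescales each feature to [0, 1]
StandardScaler(inputCol="features", outputCol="standard", withMean=True, withStd=True).fit(vec).transform(vec).show()  # mean = 0, SD = 1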
LOGISTIC REGRESSION

df = spark.read.option("header", "true").option("inferSchema", "true").csv("PATH")
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, featuresCol="features", labelCol="automatic")
model = lr.fit(train)
result = model.transform(test)
result.select("automatic", "prediction").show()

FIND AVG AGE AND COUNT OF STUDENTS FOR EACH MAJOR

names_data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
names = sc.parallelize(names_data)
majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
majors = sc.parallelize(majors_data)

from pyspark.sql.functions import avg, desc, count
namesDF.join(majorsDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))
majorsDF.join(namesDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))
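
The two queries above use namesDF and majorsDF; a minimal sketch building them from the same lists (the column names are assumptions):

namesDF = spark.createDataFrame(names_data, ["name", "age"])      # assumed schema: name, age
majorsDF = spark.createDataFrame(majors_data, ["name", "major"])  # assumed schema: name, major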

MapReduce
- MapReduce is the processing engine that works on data in HDFS. Map and Reduce tasks work in isolation.
- Map = a higher-order function and part of functional programming; it has 2 args: a list of data + a lower-order function.
- It applies the lower-order function to each element in the list in parallel (see the short Python sketch below).
- Reduce = a higher-order function; it processes a list of elements by applying a function pairwise and returns a scalar.
- for loops can't be parallelized.
- A mapper is presented with data containing multiple keys, which it transforms in a 1-to-1 fashion.
- The output of the mapper is stored on local disk.
- A reducer is presented with data that has only one key.
- Reduction is done on a per-key basis.
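
A tiny plain-Python illustration of map and reduce as higher-order functions (illustrative data, not from the original):

from functools import reduce

squared = list(map(lambda x: x * x, [1, 2, 3, 4]))  # map applies the function to every element
total = reduce(lambda a, b: a + b, squared)          # reduce combines elements pairwise into a scalar
print(squared, total)                                # [1, 4, 9, 16] 30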
1. Reading input data from HDFS.
2. Performing the Map operation to extract meaningful (key, value) pairs from the input data.
3. Storing intermediate outputs locally on disk.
4. Sorting and shuffling data to group values by their keys.
5. Executing the Reduce operation to aggregate results.
6. Writing the final output back to HDFS.
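
A minimal PySpark word-count sketch that mirrors the steps above (a sketch, assuming an active SparkContext sc; the HDFS paths are placeholders):

lines = sc.textFile("hdfs:///PATH/input.txt")            # read input from HDFS (step 1)
counts = (lines.flatMap(lambda line: line.split())        # map: split lines into words (step 2)
               .map(lambda word: (word, 1))               # emit (key, value) pairs (step 2)
               .reduceByKey(lambda a, b: a + b))          # shuffle by key, then reduce (steps 4-5)
counts.saveAsTextFile("hdfs:///PATH/output")              # write the final output back to HDFS (step 6)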
- The number of map tasks depends on the size of the input data.
- If the node running a map task fails, the application master will re-run that node's completed map tasks on another node, since their results are lost when the node crashes.
- Spark is better than traditional MapReduce because of in-memory computation, DAG execution, richer APIs, and lazy evaluation.
- Traditional MapReduce is inefficient for iterative and multi-pass algorithms.
- It lacks primitives for data sharing.
- It is inefficient at achieving fault tolerance, as each block is replicated multiple times.

HDFS
- Distributed, partitioned, fault-tolerant through replication, write-once read-many, commodity hardware, files stored as blocks, high-latency and high-throughput.
- Follows a master-slave architecture; Master = NameNode and Slaves = worker nodes.
- Has a consistent namespace.
- Locality of computation: computation is scheduled where the data is located.
- Replication in Hadoop takes place at the block level.
- The NameNode persists metadata in 2 ways [ 1. the fsimage, 2. the edits log ].
- The fsimage file is a point-in-time snapshot of the file system; incremental changes are written to the edits log file.
- Checkpointing: taking the fsimage and the edits log and compacting them into a new fsimage. Helps prevent the edits log from becoming too large.
- The Secondary NameNode takes responsibility for periodically merging the two files, since NameNode restarts are rare in production.
- HDFS is designed for scale-out: adding more commodity hardware.
- HDFS provides streaming access to data.
- The block size is large in HDFS to minimize disk seek time compared to data transfer time.
- HDFS configuration is managed through .xml files.
- hdfs dfs -copyFromLocal (or hadoop fs -copyFromLocal) [copies a file from local to HDFS]
- hdfs dfs -find (or hadoop fs -find) [used to search for files that match a specified pattern]
- hdfs dfs -get (or hadoop fs -get) [copies a file from HDFS to a local directory]
- hdfs dfs -getfattr (or hadoop fs -getfattr) [gets extended attributes of a file]

YARN
- Splits resource management and job scheduling into separate daemons.
- An application is either a single job or a DAG of jobs.
- ResourceManager: the ultimate authority that arbitrates resources among all applications.
- NodeManager: per-machine framework agent responsible for containers, monitoring their resource usage, and reporting the same to the RM [runs on worker nodes].
- The ResourceManager has 2 components: the Scheduler and the ApplicationManager.
- Scheduler: pure scheduling and nothing else. It has a pluggable policy (FIFO scheduler, Capacity scheduler). It allocates resources based on the idea of containers.
- ApplicationManager: responsible for accepting job submissions and provides the service for restarting the ApplicationMaster container in case of failure.
- The per-application ApplicationMaster has the responsibility of negotiating appropriate resources from the Scheduler.
- It also tracks status and reports failures.
- One ApplicationMaster can manage multiple application containers.
- A container is an abstraction representing a collection of physical resources, such as RAM and CPU, on a single node.
- The ApplicationMaster can request more containers by contacting the Scheduler in the ResourceManager.
- The ApplicationMaster monitors and tracks the progress of containers.
- YARN command execution method: go to the Hadoop home directory and call the bin/yarn command.
Feature Hasher:
- Works on categorical or numerical variables; outputs a lower-dimensionality feature vector.

TF-IDF Feature Extractor:
- For computing TF, either HashingTF or CountVectorizer can be used (see the sketch after the Term Weighting notes).
- The size of the output of the HashingTF transformer is determined by the numFeatures parameter.

Term Weighting
- Term weights consist of 2 components:
  - Local: how important is the term in this document?
  - Global: how important is the term in the collection?
- Intuition: terms that appear often in a document should get high weights, and terms that appear in many documents should get low weights.
- To capture this mathematically we use term frequency (local) and inverse document frequency (global).
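
A minimal sketch of that TF-IDF pipeline in MLlib (not from the original; an active SparkSession spark is assumed, and the sample sentences, column names and numFeatures value are illustrative):

from pyspark.ml.feature import Tokenizer, HashingTF, IDF

docs = spark.createDataFrame([(0, "spark makes big data simple"),
                              (1, "spark uses rdds and dataframes")], ["id", "text"])
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32).transform(words)  # local term frequency
idf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)                             # global inverse document frequency
idf.transform(tf).select("id", "features").show(truncate=False)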

AVERAGE USING MapReduce

data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)
# Map each number to (value, 1) pairs
mapped_rdd = rdd.map(lambda x: (x, 1))
# Reduce by adding up values and counts
sum_count = mapped_rdd.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
# Calculate the average
average = sum_count[0] / sum_count[1]
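
An equivalent one-pass alternative (a sketch, not from the original): RDD.aggregate carries the (sum, count) pair directly, without building the intermediate (value, 1) tuples.

total, n = rdd.aggregate((0, 0),
                         lambda acc, x: (acc[0] + x, acc[1] + 1),   # fold one value into a partition's (sum, count)
                         lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge per-partition (sum, count) pairs
average = total / n  # 30.0 for the sample data above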
