Indian Institute of Science Department of Computational and Data Sciences
Bangalore, India
भारतीय विज्ञान संस्थान
बंगलौर, भारत
Big Data Platforms
Yogesh Simmhan
simmhan@iisc.ac.in
Slide Credits:
• https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
• https://www.slideshare.net/deanchen11/scala-bay-spark-talk
• https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing, M. Zaharia, et al., NSDI 2012
• http://spark.apache.org/docs/latest/programming-guide.html
2020/01/23
©Department of Computational and Data Science, IISc, 2016
This work is licensed under a Creative Commons Attribution 4.0 International License
Copyright for external content used with attribution is retained by their original authors
CDS.IISc.ac.in | Department of Computational and Data Sciences
What is Big Data?
Image credits: http://www.seekbig.in/1128-tnpsc-economics-questions/
The term is fuzzy … Handle with care!
Wordle of “Thought Leaders’” definition of Big Data, © Jennifer Dutcher, 2014
https://datascience.berkeley.edu/what-is-big-data/
So…What is Big Data?
Data whose characteristics exceed
the capabilities of conventional
algorithms, systems and
techniques to derive useful value.
https://www.oreilly.com/ideas/what-is-big-data
Image Credits: https://community.uservoice.com/wp-content/uploads/benefits-of-effective-questions-800x448-300x168.jpg
And, where does Big
Data come from?
Web & Social Media
▪ Web search, Social Networks & Micro-blogs
http://static4.businessinsider.com/image/56b089cedd0895437c8b45ef-2390-1265/untitled.png
http://www.internetlivestats.com/twitter-statistics/
Web & Social Media
▪ Social Networks & Micro-blogs
1.79 billion monthly active users as of September 30, 2016
https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
http://www.wsj.com/articles/facebook-profit-jumps-sharply-1478117646
http://newsroom.fb.com/company-info/
Enterprises & Government
▪ Online retail & eCommerce
http://blogs.ft.com/beyond-brics/2014/02/28/online-retail-in-india-learning-to-evolve/
http://www.peridotcapital.com/2014/04/amazon-sales-growth-projections-for-next-two-years-appear-overly-optimistic.html
Enterprises & Government:
Finance
▪ Mobile Transactions & FinTech
Since November 8, 2016, Paytm has surpassed its metrics,
tripling transactions per day to 7.5 million
http://www.pymnts.com/in-depth/2015/mobile-transactions/
Is Paytm the Xerox of mobile payments?, ETtech.com-03-Jan-2017
Internet of Everything
▪ Personal Devices
‣ Smart Phones, Fitbit
▪ Smart Appliances
▪ Smart Cities
‣ Power, Water, Transportation, Environment
▪ Smart Retail
▪ Millions of sensor data streams
smartx.cds.iisc.ac.in
Why is Big Data
Difficult?
Infographic: The Four V's of Big Data, http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Data Analysis Lifecycle
Acquire
• Acquire Data
• Sensors, Web logs & crawls, Transactions
Goal
• Define Analytics
• Trends, Clusters, Outliers, Classification
Process
• Translate to Scalable Applications
• Develop algorithms, Map to abstractions, Implement on Platforms
Data Platforms
▪ Acquire, manage, process Big Data
▪ At large scales
▪ To meet application needs
Distributed Systems
▪ Distributed Computing
‣ Clusters of machines
‣ Connected over network
▪ Distributed Storage
‣ Disks attached to clusters of machines
‣ Network Attached Storage
▪ How can we make effective use of multiple machines?
▪ Commodity clusters vs. HPC clusters
‣ Commodity: Available off the shelf at large volumes
‣ Lower Cost of Acquisition
‣ Cost vs. Performance
• Low disk bandwidth, and high network latency
• CPU typically comparable (Xeon vs. i3/5/7)
• Virtualization overhead on Cloud
▪ How can we use many machines of modest capability?
Growth of Cloud Data Centers
Cisco Global Cloud Index: Forecast and Methodology, 2015–2020, White Paper © 2016, Cisco
Ideal Strong/Weak Scaling
Weak scaling: Problem size per processor is fixed
Strong scaling: Problem size is fixed
Scaling Theory and Machine Abstractions, Martha A. Kim, October 10, 2012
Scalability
▪ Strong vs. Weak Scaling
▪ Strong Scaling: How the performance varies with
the # of processors for a fixed total problem size
▪ Weak Scaling: How the performance varies with
the # of processors for a fixed problem size per
processor
‣ Big Data platforms are intended for “Weak Scaling”
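The two regimes above can be made concrete with a small calculation. This is only a sketch: the efficiency formulas are the common textbook definitions (an assumption, since the slide defines only the regimes), and the timings are made-up numbers, not measurements.

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: the ideal time on p processors is t1 / p."""
    return t1 / (p * tp)


def weak_scaling_efficiency(t1, tp):
    """Fixed problem size per processor: the ideal time stays at t1."""
    return t1 / tp


# Hypothetical timings in seconds (illustrative only)
strong = strong_scaling_efficiency(t1=100.0, tp=30.0, p=4)
weak = weak_scaling_efficiency(t1=100.0, tp=125.0)
print(strong, weak)
```

An efficiency of 1.0 corresponds to the ideal curves; values below 1.0 indicate overheads such as communication or coordination.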
Ease of Programming
▪ Programming distributed systems is difficult
‣ Divide a job into multiple tasks
‣ Understand dependencies between tasks: Control, Data
‣ Coordinate and synchronize execution of tasks
‣ Pass information between tasks
‣ Avoid race conditions, deadlocks
▪ Parallel and distributed programming
models/languages/abstractions/platforms try to
make these easy
‣ E.g. Assembly programming vs. C++ programming
‣ E.g. C++ programming vs. Matlab programming
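As a rough illustration of the steps listed above (divide, coordinate, pass results), the sketch below sums a list using Python's standard `concurrent.futures` thread pool on a single machine. It is not a distributed system; the helper names `split` and `task` are hypothetical, and the tasks are deliberately independent so that no locking is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_tasks):
    """Divide the job: cut the input into n_tasks roughly equal chunks."""
    size = (len(data) + n_tasks - 1) // n_tasks
    return [data[i:i + size] for i in range(0, len(data), size)]


def task(chunk):
    """One independent task: it shares no state with the other tasks."""
    return sum(chunk)


data = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    # The pool coordinates execution and passes each result back in order.
    partials = list(pool.map(task, split(data, 4)))
total = sum(partials)  # combine the partial results
print(partials, total)
```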
Availability, Failure
▪ Commodity clusters have lower reliability
‣ Mass-produced
‣ Cheaper materials
‣ Shorter lifetime (~3 years)
▪ How can applications easily deal with failures?
▪ How can we ensure availability in the presence of faults?
Early Technologies
▪ MapReduce is a distributed data-parallel programming
model from Google
▪ MapReduce works best with a distributed file system,
such as the Google File System (GFS)
▪ Hadoop is the open source framework implementation
from Apache that can execute the MapReduce
programming model
▪ Hadoop Distributed File System (HDFS) is the open
source implementation of the GFS design
▪ Elastic MapReduce (EMR) is Amazon’s PaaS offering for running Hadoop MapReduce
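To give a feel for the MapReduce programming model named above, here is a minimal single-process word count in plain Python. It is a sketch of the map, shuffle and reduce phases only; the function names are hypothetical and it does not use Hadoop or any distributed file system.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data platforms", "big data is big"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real deployment the map and reduce calls run on different machines and the shuffle moves data over the network; only the shape of the computation is shown here.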
Platforms…Think in terms of Stacks
Cloudera
practicalanalytics.co
Platforms…Think in terms of Stacks
BDAS
https://amplab.cs.berkeley.edu/software/
Platforms…Think in terms of Stacks
HortonWorks
http://hortonworks.com/products/data-center/hdp/
Apache Spark
Slides & Additional Reading Courtesy
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Resilient Distributed Datasets, Matei Zaharia
http://spark.apache.org/docs/2.1.1/programming-guide.html
http://spark.apache.org/docs/latest/api/java/index.html
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
Apache Spark Internals, Pietro Michiardi, Eurecom
Why Spark?
▪ Ease of language definition
‣ Typing, dataflows,
‣ But Pig, Hive, HBase, etc. give you that
▪ Better performance using “In memory” compute
‣ Multiple stages part of same job
‣ Lazy evaluation, caching/persistence
In-memory computation
▪ Operate on data in (distributed) memory
‣ Allows many operations to be performed locally
‣ Write to disk only when data sharing is required across workers
▪ This is unlike Hadoop MapReduce, which writes intermediate results to disk between jobs
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, M. Zaharia, et al., NSDI 2012
RDD: The Secret Sauce
▪ RDD: Resilient Distributed Dataset
‣ Immutable, partitioned collection of tuples
‣ Operated on by deterministic transformations
• Object-oriented flavor
• RDD.operation() → RDD
▪ Recovery by re-computation
‣ Maintains lineage of transformations
‣ Recompute missing partitions if failure happens
‣ Not possible/not automatic in Pig
▪ Allows caching & persistence for reuse
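The idea of recovery by re-computation can be sketched in a few lines of plain Python. `ToyRDD` below is a hypothetical stand-in, not Spark's implementation: each dataset remembers its parent and the deterministic function that produced it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: remembers how to rebuild itself."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source = source_partitions  # only set for the root dataset
        self.parent = parent             # lineage: the dataset this came from
        self.fn = fn                     # deterministic per-item transformation
        self.cached = {}                 # partition index -> materialized items

    def map(self, fn):
        # Transformations never modify self; they return a new ToyRDD.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is missing."""
        if i not in self.cached:
            if self.parent is None:
                self.cached[i] = list(self.source[i])
            else:
                self.cached[i] = [self.fn(x) for x in self.parent.compute(i)]
        return self.cached[i]


base = ToyRDD(source_partitions=[[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
before = squares.compute(1)
del squares.cached[1]        # simulate losing one partition
after = squares.compute(1)   # rebuilt from the parent via lineage
print(before, after)
```

Only the missing partition is recomputed; partition 0 is never touched in this run.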
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, M. Zaharia, et al., NSDI 2012
RDD Partitions
▪ RDD is internally a collection of partitions
‣ Each partition holds a list of items
▪ Partitions may be present on different machines
‣ Partition is the unit of execution
‣ Partition is the unit of parallelism
▪ They are immutable
‣ Each transformation on an RDD generates a new RDD with
different partitions
‣ Allows recovery of individual partitions
RDD Operations
Allows composability into Dataflows
https://grouplens.org/datasets/movielens/
A Sample Spark Program
▪ Movielens dataset, movies.csv
‣ movieId,title,genres
m = sc.textFile("hdfs:///ml/movies.csv").cache()
# ['movieId,title,genres', ...]
mcols = m.map(lambda l: l.split(","))
mg = mcols.filter(lambda l: l[2] != 'genres')  # drop the header row
# [['92363', 'Toy Story', 'cartoon|action|children'], ...]
mgc = mg.map(lambda l: (len(l[2].split("|")), l))  # key each movie by its genre count
# [(3, ['92363', 'Toy Story', 'cartoon|action|children']), ...]
maxgc = mgc.max()[0]  # largest genre count
# 3
maxgcm = mgc.lookup(maxgc)  # values (movies) stored under that key
# [['92363', 'Toy Story', 'cartoon|action|children'], ...]
What is the average number of ratings given by users?
What is the average value of the ratings given by users?
m = sc.textFile("hdfs:///user/ml/movies.csv").cache()
r = sc.textFile("hdfs:///user/ml/ratings.csv").cache()
# rating column as floats, header row dropped
rv = r.map(lambda l: l.split(",")[2]).filter(lambda v: v != 'rating').map(float)
rvs = rv.reduce(lambda a, b: a + b)  # sum of ratings
rvc = rv.count()  # ratings count
print('Avg rating value is', rvs / rvc)
rc = r.count() - 1  # number of ratings (header row excluded)
rud = r.map(lambda l: l.split(",")[0]).distinct()
ruc = rud.count() - 1  # number of distinct users (header value excluded)
print('Avg ratings per user is', float(rc) / ruc)
For movies with more than 1 genre, which pairs of genres are
most and least likely to occur together?
# Quoted titles may contain commas; replace those with ';' so split(",") stays safe
me = m.map(lambda l: l if l.find("\"") == -1 else l.partition("\"")[0] +
    l[l.find("\"")+1:l.rfind("\"")].replace(",", ";") +
    l.rpartition("\"")[2])
mg = me.map(lambda l: l.split(",")).filter(lambda l: l[2] != 'genres')
# one (movieId, genre) pair for every genre of every movie
mgf = mg.flatMap(lambda l: [(l[0], g) for g in l[2].split("|")])
# self-join on movieId; g1 < g2 keeps each unordered genre pair exactly once
mgj = mgf.join(mgf).filter(lambda kv: kv[1][0] < kv[1][1])
mgpc = mgj.map(lambda kv: ('+'.join(kv[1]), 1))
msgp = mgpc.reduceByKey(lambda a, b: a + b).map(lambda kv: (kv[1], kv[0]))
gpmax = msgp.max()
gpmin = msgp.min()
print('Genre pair most likely to occur is', gpmax[1], 'with a freq', gpmax[0])
print('Genre pair least likely to occur is', gpmin[1], 'with a freq', gpmin[0])
Creating RDD
▪ Load external data from distributed storage
▪ Create logical RDD on which you can operate
▪ Support for different input formats
‣ HDFS files, Cassandra, Java serialized, directory, gzipped
▪ Can control the number of partitions in loaded RDD
‣ Default depends on external DFS, e.g. one partition per 128MB block on HDFS
m = sc.textFile("hdfs:///ml/movies.csv").cache()
RDD Operations
▪ Transformations
‣ From one RDD to one or more RDDs
‣ Lazy evaluation: nothing runs until an “action” is invoked…use with care
‣ Executed in a distributed manner
▪ Actions
‣ Perform aggregations on RDD items
‣ Return single (or distributed) results to “driver” code
‣ RDD.collect() brings RDD partitions to single driver machine
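The split between lazy transformations and eager actions can be mimicked in plain Python. `LazyPipeline` below is a hypothetical toy, not PySpark: `map` and `filter` only record what to do, and the work happens when the action `collect` is called.

```python
class LazyPipeline:
    """Hypothetical toy: records transformations, runs them on an action."""

    def __init__(self, items, steps=()):
        self.items = items
        self.steps = steps

    def map(self, fn):
        return LazyPipeline(self.items, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self.items, self.steps + (("filter", fn),))

    def collect(self):
        """Action: only now are the recorded steps applied."""
        out = list(self.items)
        for kind, fn in self.steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


calls = []


def traced_double(x):
    calls.append(x)  # side effect lets us observe when work happens
    return 2 * x


p = LazyPipeline([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # no transformation has run yet
result = p.collect()
print(calls_before_action, result, calls)
```

The "use with care" caveat follows from this design: errors and costs in a transformation surface only when an action finally runs it.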
RDD and PairRDD
▪ RDD is logically a collection of items with a generic
type
▪ PairRDD is an RDD of 2-tuples, like a “Map”, where each item
in the collection is a <key,value> pair
‣ But can have duplicate keys
▪ Transformation functions use RDD or PairRDD as
input/output
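The difference between a pair collection and a true map shows up as soon as keys repeat. The plain-Python sketch below uses made-up data and a hypothetical helper `group_by_key`; it only illustrates the data shape, not Spark's API.

```python
from collections import defaultdict

# A pair collection is a list of (key, value) tuples; keys may repeat.
ratings = [("u1", 4.0), ("u2", 3.0), ("u1", 5.0)]

# A dict keeps one value per key, so the first "u1" rating is overwritten.
as_map = dict(ratings)


def group_by_key(pairs):
    """Collect every value seen for each key, preserving duplicates."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)


grouped = group_by_key(ratings)
print(as_map, grouped)
```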
Transformations
Table of RDD transformations (callouts: “Implicit in PySpark”, “Also removes duplicates”)
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
Transformations on PairRDD
Aggregation: Average value
of ratings given by users
# r rows: userId,movieId,rating,timestamp
rv = r.map(lambda l: l.split(",")[2])
rfv = rv.filter(lambda l: l != 'rating')  # drop the header row
# rfv items: rating strings
rvs = rfv.map(float).reduce(lambda a, b: a + b)  # Action
rvc = rfv.count()  # Action
print(rvs / rvc)
Actions
Samples: Per-key average
# pair every value with a count of 1, then add sums and counts per key
sumCount = rdd.mapValues(lambda x: (x, 1)).reduceByKey(
    lambda x, y: (x[0] + y[0], x[1] + y[1]))
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
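The per-key sum/count pattern on this slide can be traced end to end without a cluster. The sketch below re-creates `mapValues` and `reduceByKey` as hypothetical plain-Python helpers over a list of pairs, then adds the final division that turns each (sum, count) into an average. The input data is illustrative only.

```python
def map_values(pairs, fn):
    """Apply fn to the value of every (key, value) pair."""
    return [(k, fn(v)) for k, v in pairs]


def reduce_by_key(pairs, fn):
    """Fold all values that share a key into one value using fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc


pairs = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]
sum_count = reduce_by_key(
    map_values(pairs, lambda x: (x, 1)),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {k: total / count for k, (total, count) in sum_count.items()}
print(sum_count, averages)
```

Carrying the count alongside the sum is what makes the reduction safe to apply in any order, which a plain running average would not be.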
RDD Persistence & Caching
▪ RDDs can be reused in a dataflow
‣ Branch, iteration
▪ But it will be re-evaluated each time it is reused!
▪ Explicitly persist RDD to reuse output of a dataflow
path multiple times
▪ Multiple storage levels for persistence
‣ Disk or memory
‣ Serialized or object form in memory
‣ Partial spill-to-disk possible
‣ Cache is shorthand for “persist” to memory
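The cost of reusing an unpersisted dataset can be shown with a counter. In the plain-Python sketch below, `Dataset` and `expensive_transform` are hypothetical stand-ins: each use of the result re-runs the transformation unless persistence is switched on.

```python
evaluations = {"count": 0}


def expensive_transform(items):
    """Stands in for a costly dataflow path; counts how often it runs."""
    evaluations["count"] += 1
    return [x * x for x in items]


class Dataset:
    """Hypothetical toy dataset with optional in-memory persistence."""

    def __init__(self, items, persist=False):
        self.items = items
        self.persist = persist
        self._stored = None

    def evaluate(self):
        if self.persist and self._stored is not None:
            return self._stored  # reuse the persisted result
        result = expensive_transform(self.items)
        if self.persist:
            self._stored = result
        return result


plain = Dataset([1, 2, 3])
total, size = sum(plain.evaluate()), len(plain.evaluate())
runs_without_persist = evaluations["count"]

evaluations["count"] = 0
kept = Dataset([1, 2, 3], persist=True)
total2, size2 = sum(kept.evaluate()), len(kept.evaluate())
runs_with_persist = evaluations["count"]
print(runs_without_persist, runs_with_persist)
```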
Distributed Execution
Execution Dependency
NARROW DEPENDENCY: Each partition of the parent RDD is used by at most one partition of the child RDD. Task can be executed locally and we don’t have to shuffle.
WIDE DEPENDENCY: Multiple child partitions may depend on one partition of the parent RDD. We have to shuffle data unless the parents are hash-partitioned.
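The two dependency types can be contrasted on lists of partitions in plain Python. This is a sketch with hypothetical helper names: integer keys and modulo placement stand in for a real hash partitioner.

```python
def narrow_map(partitions, fn):
    """Narrow: child partition i is built from parent partition i alone."""
    return [[fn(x) for x in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    """Wide: each child partition may pull records from every parent."""
    out = [dict() for _ in range(num_out)]
    for part in partitions:  # shuffle: every parent partition is scanned
        for key, value in part:
            target = out[key % num_out]  # integer keys, modulo placement
            target.setdefault(key, []).append(value)
    return out


parents = [[(1, "a"), (2, "b")], [(1, "c"), (4, "d")]]
upper = narrow_map(parents, lambda kv: (kv[0], kv[1].upper()))
grouped = wide_group_by_key(parents, num_out=2)
print(upper)
print(grouped)
```

Key 1 appears in both parent partitions, so its values can only be brought together by moving data between partitions, which is the shuffle.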
Lazy Execution
From DAG to RDD lineage
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-transformations.html