Indian Institute of Science Department of Computational and Data Sciences

Bangalore, India
भारतीय विज्ञान संस्थान
बंगलौर, भारत

Big Data Platforms

Yogesh Simmhan
simmhan@iisc.ac.in
Slide Credits:
• https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
• https://www.slideshare.net/deanchen11/scala-bay-spark-talk
• https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, M. Zaharia, et al., NSDI 2012
• http://spark.apache.org/docs/latest/programming-guide.html
2020/01/23
©Department of Computational and Data Science, IISc, 2016
This work is licensed under a Creative Commons Attribution 4.0 International License
Copyright for external content used with attribution is retained by their original authors
CDS.IISc.ac.in | Department of Computational and Data Sciences

What is Big Data?

Image credits: http://www.seekbig.in/1128-tnpsc-economics-questions/

The term is fuzzy … Handle with care!

Wordle of “Thought Leaders’” definition of Big Data, © Jennifer Dutcher, 2014

https://datascience.berkeley.edu/what-is-big-data/

So…What is Big Data?

Data whose characteristics exceed the capabilities of conventional algorithms, systems and techniques to derive useful value.
https://www.oreilly.com/ideas/what-is-big-data

Image Credits: https://community.uservoice.com/wp-content/uploads/benefits-of-effective-questions-800x448-300x168.jpg

And, where does Big Data come from?


Web & Social Media

▪ Web search, Social Networks & Micro-blogs

http://static4.businessinsider.com/image/56b089cedd0895437c8b45ef-2390-1265/untitled.png
http://www.internetlivestats.com/twitter-statistics/

Web & Social Media

▪ Social Networks & Micro-blogs

1.79 billion monthly active users as of September 30, 2016

https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
http://www.wsj.com/articles/facebook-profit-jumps-sharply-1478117646
http://newsroom.fb.com/company-info/

Enterprises & Government

▪ Online retail & eCommerce

http://blogs.ft.com/beyond-brics/2014/02/28/online-retail-in-india-learning-to-evolve/
http://www.peridotcapital.com/2014/04/amazon-sales-growth-projections-for-next-two-years-appear-overly-optimistic.html


Enterprises & Government: Finance
▪ Mobile Transactions & FinTech

Since November 8, 2016, Paytm has surpassed its metrics, tripling transactions per day to 7.5 million

http://www.pymnts.com/in-depth/2015/mobile-transactions/
Is Paytm the Xerox of mobile payments?, ETtech.com, 03-Jan-2017

Internet of Everything
▪ Personal Devices
‣ Smart Phones, Fitbit
▪ Smart Appliances
▪ Smart Cities
‣ Power, Water, Transportation, Environment
▪ Smart Retail
▪ Millions of sensor data streams
smartx.cds.iisc.ac.in

Why is Big Data Difficult?


[Infographic: the four Vs of Big Data] http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Data Analysis Lifecycle

▪ Acquire: Acquire Data
‣ Sensors, Web logs & crawls, Transactions
▪ Goal: Define Analytics
‣ Trends, Clusters, Outliers, Classification
▪ Process: Translate to Scalable Applications
‣ Develop algorithms, Map to abstractions, Implement on Platforms


Data Platforms
▪ Acquire, manage, process Big Data
▪ At large scales
▪ To meet application needs


Distributed Systems
▪ Distributed Computing
‣ Clusters of machines
‣ Connected over network
▪ Distributed Storage
‣ Disks attached to clusters of machines
‣ Network Attached Storage
▪ How can we make effective use of multiple machines?

▪ Commodity clusters vs. HPC clusters

‣ Commodity: Available off the shelf at large volumes
‣ Lower Cost of Acquisition
‣ Cost vs. Performance
• Low disk bandwidth, and high network latency
• CPU typically comparable (Xeon vs. i3/5/7)
• Virtualization overhead on Cloud
▪ How can we use many machines of modest capability?

Growth of Cloud Data Centers

Cisco Global Cloud Index: Forecast and Methodology, 2015–2020, White Paper © 2016, Cisco

Ideal Strong/Weak Scaling

[Figure: ideal scaling curves. Weak scaling: problem size per processor is fixed. Strong scaling: problem size is fixed.]
Scaling Theory and Machine Abstractions, Martha A. Kim, October 10, 2012

Scalability
▪ Strong vs. Weak Scaling
▪ Strong Scaling: How the performance varies with
the # of processors for a fixed total problem size
▪ Weak Scaling: How the performance varies with
the # of processors for a fixed problem size per
processor
‣ Big Data platforms are intended for “Weak Scaling”
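
The two definitions can be made concrete with a small calculation. The sketch below is plain Python; the timings are made-up numbers chosen for illustration, not measurements, and the helper names `strong_scaling_efficiency` and `weak_scaling_efficiency` are hypothetical.

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: the ideal time on p processors is t1 / p."""
    speedup = t1 / tp
    return speedup / p


def weak_scaling_efficiency(t1, tp):
    """Fixed problem size per processor: the ideal time stays equal to t1."""
    return t1 / tp


# Illustrative (made-up) timings in seconds.
# Strong scaling: the same job on 1 processor vs. 8 processors.
strong = strong_scaling_efficiency(t1=800.0, tp=125.0, p=8)
# Weak scaling: 1 unit of data on 1 processor vs. 8 units on 8 processors.
weak = weak_scaling_efficiency(t1=800.0, tp=1000.0)

print(round(strong, 3), round(weak, 3))
```

An efficiency of 1.0 corresponds to the ideal curve in either case; values below 1.0 measure how far the platform falls short of it.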


Ease of Programming
▪ Programming distributed systems is difficult
‣ Divide a job into multiple tasks
‣ Understand dependencies between tasks: Control, Data
‣ Coordinate and synchronize execution of tasks
‣ Pass information between tasks
‣ Avoid race conditions, deadlocks
▪ Parallel and distributed programming
models/languages/abstractions/platforms try to
make these easy
‣ E.g. Assembly programming vs. C++ programming
‣ E.g. C++ programming vs. Matlab programming

Availability, Failure
▪ Commodity clusters have lower reliability
‣ Mass-produced
‣ Cheaper materials
‣ Shorter lifetime (~3 years)
▪ How can applications easily deal with failures?
▪ How can we ensure availability in the presence of faults?


Early Technologies
▪ MapReduce is a distributed data-parallel programming
model from Google
▪ MapReduce works best with a distributed file system,
called Google File System (GFS)
▪ Hadoop is the open source framework implementation
from Apache that can execute the MapReduce
programming model
▪ Hadoop Distributed File System (HDFS) is the open
source implementation of the GFS design
▪ Elastic MapReduce (EMR) is Amazon’s PaaS offering for running Hadoop
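
To illustrate the MapReduce programming model named above, here is a minimal single-machine sketch of word count in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the shuffle that a real framework performs across machines is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data platforms", "big data is big"])
print(counts)
```

The user writes only the map and reduce functions; splitting the input, grouping by key and handling failures are left to the platform.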

Platforms…Think in terms of Stacks

Cloudera

practicalanalytics.co

Platforms…Think in terms of Stacks

BDAS

https://amplab.cs.berkeley.edu/software/

Platforms…Think in terms of Stacks

HortonWorks

http://hortonworks.com/products/data-center/hdp/

Apache Spark
Slides & Additional Reading Courtesy
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Resilient Distributed Datasets, Matei Zaharia
http://spark.apache.org/docs/2.1.1/programming-guide.html
http://spark.apache.org/docs/latest/api/java/index.html
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
Apache Spark Internals, Pietro Michiardi, Eurecom


Why Spark?
▪ Ease of language definition
‣ Typing, dataflows,
‣ But Pig, Hive, HBase, etc. give you that

▪ Better performance using “In memory” compute

‣ Multiple stages part of same job
‣ Lazy evaluation, caching/persistence


In-memory computation
▪ Operate on data in (distributed) memory
‣ Allows many operations to be performed locally
‣ Write to disk only when data sharing required across workers
▪ This is unlike Hadoop MapReduce, which writes intermediate results to disk between jobs

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, M. Zaharia, et al., NSDI 2012

RDD: The Secret Sauce

▪ RDD: Resilient Distributed Dataset
‣ Immutable, partitioned collection of tuples
‣ Operated on by deterministic transformations
• Object-oriented flavor
• RDD.operation() → RDD
▪ Recovery by re-computation
‣ Maintains lineage of transformations
‣ Recompute missing partitions if failure happens
‣ Not possible/not automatic in Pig
▪ Allows caching & persistence for reuse
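
Recovery by re-computation can be sketched with a toy class in plain Python. This is not Spark's implementation: `ToyRDD` and its methods are hypothetical, it evaluates eagerly, it supports only `map`, and it assumes the source data itself is never lost (e.g. it sits on durable storage).

```python
class ToyRDD:
    """Immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = list(partitions)  # a None entry marks a lost partition
        self.parent = parent                 # lineage: the dataset this came from
        self.fn = fn                         # lineage: per-item function applied

    def map(self, fn):
        # A transformation returns a new ToyRDD; the parent is never modified.
        data = [[fn(x) for x in self.get_partition(i)]
                for i in range(len(self._partitions))]
        return ToyRDD(data, parent=self, fn=fn)

    def lose_partition(self, i):
        # Simulate a worker failure that wipes one partition.
        self._partitions[i] = None

    def get_partition(self, i):
        if self._partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source data lost; nothing to recompute from")
            # Recompute only the missing partition by replaying the lineage.
            parent_part = self.parent.get_partition(i)
            self._partitions[i] = [self.fn(x) for x in parent_part]
        return self._partitions[i]


source = ToyRDD([[1, 2], [3, 4], [5, 6]])
squared = source.map(lambda x: x * x)
plus_one = squared.map(lambda x: x + 1)

# Lose partition 1 of both derived datasets, then ask for it again.
squared.lose_partition(1)
plus_one.lose_partition(1)
recovered = plus_one.get_partition(1)
print(recovered)
```

Only partition 1 is rebuilt, by walking back through the lineage to the source; the other partitions are untouched.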

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, M. Zaharia, et al., NSDI 2012


RDD Partitions
▪ RDD is internally a collection of partitions
‣ Each partition holds a list of items
▪ Partitions may reside on different machines
‣ Partition is the unit of execution
‣ Partition is the unit of parallelism
▪ They are immutable
‣ Each transformation on an RDD generates a new RDD with
different partitions
‣ Allows recovery of individual partitions


RDD Operations Allow Composability into Dataflows
https://grouplens.org/datasets/movielens/

A Sample Spark Program

▪ Movielens dataset, movies.csv
‣ movieId,title,genres
# Load the file as an RDD of lines and keep it in memory
m = sc.textFile("hdfs:///ml/movies.csv").cache()
# ['movieId,title,genres', ...]
mcols = m.map(lambda l: l.split(","))
# Drop the header row
mg = mcols.filter(lambda l: l[2] != 'genres')
# [['92363', 'Toy Story', 'cartoon|action|children'], ...]
# Key each movie by its number of genres
mgc = mg.map(lambda l: (len(l[2].split("|")), l))
# [(3, ['92363', 'Toy Story', 'cartoon|action|children']), ...]
maxgc = mgc.max()[0]
# 3
# All movies having the maximum number of genres
maxgcm = mgc.lookup(maxgc)
# [['92363', 'Toy Story', 'cartoon|action|children'], ...]

What is the average number of ratings given by users? What is the average value of the ratings given by users?
m = sc.textFile("hdfs:///user/ml/movies.csv").cache()
r = sc.textFile("hdfs:///user/ml/ratings.csv").cache()

# Rating column, skipping the header row
rv = r.map(lambda l: l.split(",")[2]).filter(lambda l: l != 'rating')
rvs = rv.reduce(lambda a, b: float(a) + float(b))  # sum of ratings
rvc = rv.count()  # ratings count
print('Avg rating value is', float(rvs) / rvc)

rc = r.count() - 1  # number of ratings (minus the header row)

rud = r.map(lambda l: l.split(",")[0]).distinct()
ruc = rud.count() - 1  # number of distinct users (minus the header 'userId')
print('Avg ratings per user is', float(rc) / ruc)


For movies with more than 1 genre, what are the most and least likely pair of genres to occur together?
# Protect commas inside quoted titles by replacing them with ';'
me = m.map(lambda l: l if l.find('"') == -1 else
           l.partition('"')[0] +
           l[l.find('"') + 1:l.rfind('"')].replace(",", ";") +
           l.rpartition('"')[2])

mg = me.map(lambda l: l.split(",")).filter(lambda l: l[2] != 'genres')

# One (movieId, genre) pair for every genre of every movie
mgf = mg.flatMap(lambda l: [(l[0], g) for g in l[2].split("|")])

# Self-join on movieId; g1 < g2 keeps each unordered genre pair exactly once
mgj = mgf.join(mgf).filter(lambda kv: kv[1][0] < kv[1][1])

mgpc = mgj.map(lambda kv: ('+'.join(kv[1]), 1))
msgp = mgpc.reduceByKey(lambda a, b: a + b).map(lambda kv: (kv[1], kv[0]))
gpmax = msgp.max()
gpmin = msgp.min()

print('Genre pair most likely to occur is', gpmax[1], 'with a freq', gpmax[0])
print('Genre pair least likely to occur is', gpmin[1], 'with a freq', gpmin[0])


Creating RDD
▪ Load external data from distributed storage
▪ Create logical RDD on which you can operate
▪ Support for different input formats
‣ HDFS files, Cassandra, Java serialized, directory, gzipped
▪ Can control the number of partitions in loaded RDD
‣ Default depends on the external DFS, e.g. one partition per 128MB block on HDFS

m = sc.textFile("hdfs:///ml/movies.csv").cache()


RDD Operations
▪ Transformations
‣ From one RDD to one or more RDDs
‣ Lazily evaluated: nothing runs until an “action” is invoked…use with care
‣ Executed in a distributed manner

▪ Actions
‣ Perform aggregations on RDD items
‣ Return single (or distributed) results to “driver” code
‣ RDD.collect() brings RDD partitions to single driver
machine
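
The split between transformations and actions can be sketched in plain Python. `LazyPipeline` below is a hypothetical single-machine toy, not the Spark API: transformations only record what should be done, and an action replays the recorded steps and returns a plain result to the caller.

```python
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded, not yet executed

    # Transformations: return a new pipeline and do no work.
    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger execution and return a result to the "driver".
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def traced_double(x):
    calls.append(x)  # record that the function actually ran
    return 2 * x


pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # no work has been done yet
result = pipeline.collect()        # the action triggers the whole chain
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Because nothing runs until the action, a mistake inside a transformation surfaces only when an action is finally invoked, which is one reason to use laziness with care.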


RDD and PairRDD

▪ RDD is logically a collection of items with a generic
type
▪ PairRDD is a 2-tuple, like a “Map”, where each item
in the collection is a <key,value> pair
‣ But can have duplicate keys
▪ Transformation functions use RDD or PairRDD as
input/output


Transformations

[Table: RDD transformations. Annotations: “Implicit in PySpark”; “Also removes duplicates”.]
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

Transformations on PairRDD

Aggregation: Average value of ratings given by users

# r rows: [userId,movieId,rating,timestamp]
rv = r.map(lambda l: l.split(",")[2])
rfv = rv.filter(lambda l: l != 'rating')
# [rating]...
rvs = rfv.reduce(lambda a, b: float(a) + float(b))  # Action
rvc = rfv.count()  # Action
print(float(rvs) / rvc)

Actions


Samples: Per-key average

# rdd is a PairRDD of (key, value); the result holds (key, (sum, count))
sumCount = rdd.mapValues(lambda x: (x, 1)).reduceByKey(
    lambda x, y: (x[0] + y[0], x[1] + y[1]))
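
The same (sum, count) pattern can be traced on a single machine in plain Python. The helpers `map_values` and `reduce_by_key` below are hypothetical stand-ins that mimic the behaviour of the PairRDD operations on a list of pairs, and the sample ratings are made up; the final division step that turns each (sum, count) into an average is added here for completeness.

```python
def map_values(pairs, fn):
    """Apply fn to the value of every (key, value) pair; keys are unchanged."""
    return [(k, fn(v)) for k, v in pairs]


def reduce_by_key(pairs, fn):
    """Merge all values that share a key using the binary function fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())


# Made-up (userId, rating) pairs.
ratings = [("u1", 4.0), ("u2", 3.0), ("u1", 2.0), ("u2", 5.0), ("u3", 1.0)]

# Step 1: value -> (value, 1). Step 2: add sums and counts per key.
sum_count = reduce_by_key(
    map_values(ratings, lambda x: (x, 1)),
    lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Step 3: divide sum by count for each key.
averages = dict(map_values(sum_count, lambda s: s[0] / s[1]))
print(averages)
```

Carrying the count alongside the sum keeps the merge function associative, which is what allows partial results from different partitions to be combined in any order.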

https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html

RDD Persistence & Caching

▪ RDDs can be reused in a dataflow
‣ Branch, iteration
▪ But it will be re-evaluated each time it is reused!
▪ Explicitly persist RDD to reuse output of a dataflow
path multiple times
▪ Multiple storage levels for persistence
‣ Disk or memory
‣ Serialized or object form in memory
‣ Partial spill-to-disk possible
‣ Cache indicates “persist” to memory
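
The cost of reusing an unpersisted dataflow can be shown with a small counter. `Dataflow` below is a hypothetical plain-Python toy, not the Spark API, and it models only in-memory persistence of a whole result.

```python
class Dataflow:
    """A lazily computed result that is re-evaluated unless persisted."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the data
        self._cached = None
        self._persist = False
        self.evaluations = 0      # how many times the dataflow actually ran

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached   # reuse the stored output
        self.evaluations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


def expensive():
    return [x * x for x in range(4)]


plain = Dataflow(expensive)
plain.collect()
plain.collect()

cached = Dataflow(expensive).persist()
cached.collect()
cached.collect()

print(plain.evaluations, cached.evaluations)
```

The unpersisted dataflow runs once per use, while the persisted one runs once in total; this matters most in branches and iterations that revisit the same intermediate result.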

Distributed Execution


Execution Dependency
NARROW DEPENDENCY: Each partition of the parent RDD is used by at most one partition of the child RDD. Task can be executed locally and we don’t have to shuffle.
WIDE DEPENDENCY: Multiple child partitions may depend on one partition of the parent RDD. We have to shuffle data unless the parents are hash-partitioned.
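
The difference between the two dependency types can be sketched on lists of partitions in plain Python. The helper names are hypothetical, integer keys are used so that `hash(key) % n` is stable across runs, and "moved" simply counts records whose child partition differs from the parent partition they started in.

```python
def hash_partition(pairs, n):
    """Place each (key, value) pair in partition hash(key) % n."""
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts


def map_values_narrow(partitions, fn):
    # Narrow: child partition i is built from parent partition i only.
    return [[(k, fn(v)) for k, v in part] for part in partitions]


def group_by_key_wide(partitions, n):
    # Wide: any parent partition may contribute to any child partition.
    children = [dict() for _ in range(n)]
    moved = 0
    for src, part in enumerate(partitions):
        for k, v in part:
            dst = hash(k) % n
            if dst != src:
                moved += 1  # this record crosses partitions (a shuffle)
            children[dst].setdefault(k, []).append(v)
    return children, moved


# Parent partitions in arbitrary order: key 1 appears in both of them.
parent = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

upper = map_values_narrow(parent, str.upper)        # no data movement
grouped, moved = group_by_key_wide(parent, 2)       # needs a shuffle

# If the parent is already hash-partitioned, grouping moves nothing.
flat = [pair for part in parent for pair in part]
_, moved_prepartitioned = group_by_key_wide(hash_partition(flat, 2), 2)

print(moved, moved_prepartitioned)
```

With the arbitrarily partitioned parent one record has to cross partitions, whereas with the hash-partitioned parent none do, which is the exception noted for wide dependencies.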


Lazy Execution


From DAG to RDD lineage

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-transformations.html
