Scala and the JVM for Big Data:
Lessons from Spark
polyglotprogramming.com/talks
[email protected]
@deanwampler
©Dean Wampler 2014-2019, All Rights Reserved
Spark

A Distributed Computing Engine on the JVM

[Diagram: a cluster of nodes, with an RDD (Resilient Distributed Dataset) split into one partition per node.]
Productivity?
Very concise, elegant, functional APIs.
•Scala, Java
•Python, R
•... and SQL!
Productivity?
Interactive shell (REPL)
•Scala, Python, R, and SQL
Notebooks
•Jupyter
•Spark Notebook
•Zeppelin
•Beaker
•Databricks
Example:
Inverted Index
Web Crawl

[Diagram: a web crawl writes (url, content) records into blocks of an index file, e.g. (wikipedia.org/hadoop, "Hadoop provides MapReduce and HDFS..."), (wikipedia.org/hbase, "HBase stores data in HDFS..."), ...]
Compute Inverted Index

[Diagram: "Miracle!!" — the index of (url, content) records, e.g. (wikipedia.org/hadoop, "Hadoop provides..."), (wikipedia.org/hbase, "HBase stores..."), (wikipedia.org/hive, "Hive queries..."), is transformed into an inverse index from each word to its (url, count) pairs:

hadoop  (.../hadoop,1)
hbase   (.../hbase,1), (.../hive,1)
hdfs    (.../hadoop,1), (.../hbase,1), (.../hive,1)
hive    (.../hive,1)
...]
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                          // (id, content)
  }.flatMap {
    case (id, content) =>
      toWords(content).map(word => ((word, id), 1)) // toWords not shown
  }.reduceByKey(_ + _).
  map {
    case ((word, id), n) => (word, (id, n))
  }.groupByKey.
  mapValues {
    seq => sortByCount(seq)                       // Sort the value seq by count, desc.
  }.saveAsTextFile("/path/to/output")
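The helpers toWords and sortByCount are referenced but not shown on the slide. A minimal illustrative sketch of what they might look like (these implementations are assumptions, not the talk's code):

// Illustrative only; the talk does not show these helpers.
def toWords(content: String): Seq[String] =
  content.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

// Sort each word's (id, count) pairs by count, descending.
def sortByCount(seq: Iterable[(String, Int)]): Seq[(String, Int)] =
  seq.toSeq.sortBy { case (_, count) => -count }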
Stepping through the same pipeline, the intermediate RDD types are:

•textFile: RDD[String], e.g. ".../hadoop, Hadoop provides..."
•map: RDD[(String,String)], e.g. (.../hadoop, "Hadoop provides...")
•flatMap then reduceByKey: RDD[((String,String),Int)], e.g. ((Hadoop, .../hadoop), 20)
•map then groupByKey: RDD[(String, Iterable[(String,Int)])], e.g. (Hadoop, seq((.../hadoop, 20), ...))
•mapValues sorts each sequence by count; saveAsTextFile writes the result.
Productivity?

Intuitive API:
•Dataflow of steps: textFile → map → flatMap → reduceByKey → map → groupByKey → mapValues → saveAsTextFile.
•Inspired by Scala collections and functional programming.
Performance?

Lazy API (same dataflow of steps):
•Combines steps into "stages".
•Cache intermediate data in memory.
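None of those transformations runs until an action (here saveAsTextFile) is called; Spark then plans the stages, and reused intermediate RDDs can be cached explicitly. A minimal sketch of the laziness and caching (the input path and the reuse pattern are assumptions for illustration):

// Transformations are lazy: nothing executes yet.
val wordCounts = sparkContext.textFile("/path/to/input")
  .flatMap(line => line.split("""\W+"""))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Cache it because two actions below reuse it.
wordCounts.cache()

println(wordCounts.count())                    // first action: triggers the job
wordCounts.saveAsTextFile("/path/to/counts")   // second action: reuses the cache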
Higher-Level
APIs
SQL:
Datasets/DataFrames
Example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Queries")
  .getOrCreate()

val flights = spark.read.parquet(".../flights")
val planes  = spark.read.parquet(".../planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
Both queries return another Dataset.

flights("tailNum") === planes("tailNum") is not an "arbitrary" anonymous function, but a Column instance, an expression Spark can analyze and optimize.
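Because === yields a Column, join and filter conditions are ordinary expressions that compose with operators like && and >. A small sketch (the column names origin and delay are hypothetical, not from the flights schema shown above):

// Column expressions compose; Spark sees the whole expression tree.
val delayedFromSFO = flights
  .where(flights("origin") === "SFO" && flights("delay") > 15)
  .select(flights("tailNum"), flights("delay"))
  .limit(10)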
Performance

The Dataset API has the same performance for all languages: Scala, Java, Python, R, and SQL!
def join(right: Dataset[_], joinExprs: Column): DataFrame = {
def groupBy(cols: Column*): RelationalGroupedDataset = {
def orderBy(sortExprs: Column*): Dataset[T] = {
def select(cols: Column*): Dataset[...] = {
def where(condition: Column): Dataset[T] = {
def limit(n: Int): Dataset[T] = {
def intersect(other: Dataset[T]): Dataset[T] = {
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = {
def drop(col: Column): DataFrame = {
def map[U](f: T => U): Dataset[U] = {
def flatMap[U](f: T => Traversable[U]): Dataset[U] = {
def foreach(f: T => Unit): Unit = {
def take(n: Int): Array[Row] = {
def count(): Long = {
def distinct(): Dataset[T] = {
def agg(exprs: Map[String, String]): DataFrame = {
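A few of these operators in use, on the same flights DataFrame (the column names dest and delay are hypothetical assumptions):

import org.apache.spark.sql.functions.{avg, col}

// Average delay per destination, worst first.
val delaysByDest = flights
  .groupBy(col("dest"))
  .agg(avg(col("delay")).as("avg_delay"))
  .orderBy(col("avg_delay").desc)
  .limit(20)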
Structured
Streaming
DStream (discretized stream)

[Diagram: a stream of events is discretized into time-based batches, each an RDD (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD, ...); sliding windows ("Window of 3 RDD Batches" #1 and #2) span several consecutive batch RDDs.]
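A minimal sketch of that model with the older Spark Streaming API, reusing the sparkContext from earlier (the socket source, port, and 10-second batch interval are assumptions for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// One RDD of events every 10 seconds.
val ssc = new StreamingContext(sparkContext, Seconds(10))
val events = ssc.socketTextStream("localhost", 9999)

// A sliding window covering 3 batches, recomputed every batch.
val windowed = events.window(Seconds(30), Seconds(10))
windowed.count().print()

ssc.start()
ssc.awaitTermination()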
ML/MLlib: K-Means
•Machine Learning requires:
•Iterative training of models.
•Good linear algebra performance.

GraphX: PageRank
•Graph algorithms require:
•Incremental traversal.
•Efficient edge and node representations.

Foundation: The JVM
20 Years of DevOps
Lots of Java Devs
Tools and Libraries
Akka
Breeze
Algebird
Spire & Cats
Axle
...
Big Data Ecosystem
But it’s not perfect...
Richer data libs. in Python & R
Garbage Collection
GC Challenges
•Typical Spark heaps: 10s-100s GB.
•Uncommon for "generic", non-data services.
GC Challenges
•Too many cached RDDs lead to huge old-generation garbage.
•Billions of objects => long GC pauses.
Tuning GC
•Best for Spark:
•-XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx...
 -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=...

databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
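These flags go on the executor (and driver) JVMs, typically via Spark's extraJavaOptions settings, while the heap size itself is set with spark.executor.memory. A sketch of how they might be passed (the sizes and thresholds are placeholders, not recommendations):

import org.apache.spark.SparkConf

// Placeholders: tune heap size and GC thresholds for your own workload.
val conf = new SparkConf()
  .set("spark.executor.memory", "32g")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20")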
JVM Object Model
Java Objects?
•“abcd”: 4 bytes for raw UTF8, right?
•48 bytes for the Java object:
•12 byte header.
•8 bytes for hash code.
•20 bytes for array overhead.
•8 bytes for UTF16 chars.
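One way to check such overheads empirically is Spark's own SizeEstimator (used again later in this talk); a quick sketch:

import org.apache.spark.util.SizeEstimator

// Far more than the 4 bytes of raw UTF-8, because of the object header,
// the backing char array, and alignment padding.
println(SizeEstimator.estimate("abcd"))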
[Diagrams: how many separate objects and references sit behind common JVM data structures.

Arrays — val myArray: Array[String]: the array cells (0, 1, 2, 3) hold references to separate String objects ("zeroth", "first", "second", "third").

Class instances — val person: Person: name is a reference to a String ("Buck Trends"), age is an Int (29), addr is a reference to an Address object, and so on.

Hash maps — each entry holds a hash code plus references to a key object ("a key") and a value object ("a value").]
Improving Performance
Why obsess about this?
Spark jobs are CPU bound:
•Improve network I/O? ~2% better.
•Improve disk I/O? ~20% better.
What changed?
•Faster HW (compared to ~2000)
•10Gb/s networks.
•SSDs.
What changed?
•Smarter use of I/O
•Pruning unneeded data sooner.
•Caching more effectively.
•Efficient formats, like Parquet.
What changed?
•But more CPU use today:
•More Serialization.
•More Compression.
•More Hashing (joins, group-bys).
Improving Performance
To improve performance, we need to focus on the CPU:
•Better algorithms, sure.
•And optimized use of memory.
Project Tungsten
Initiative to greatly improve
Dataset/DataFrame performance.
Goals
Reduce References

[Diagram: the same Array[String], Person instance, and hash map pictures as before, each a web of object references.]
Reduce References
•Fewer, bigger objects to GC.
•Fewer cache misses.

[Diagram: the same reference-heavy array, class instance, and hash map pictures.]
Less Expression Overhead

sql("SELECT a + b FROM table")

•Evaluating expressions billions of times:
•Virtual function calls.
•Boxing/unboxing.
•Branching (if statements, etc.)
Implementation
Object Encoding

New CompactRow type:

  [ null bit set (1 bit/field) | values (8 bytes/field) | variable-length data ]
  (each fixed-width value slot holds either the value or an offset to its variable-length data)

•Compute hashCode and equals on raw bytes.
•Compare:

[Diagram: the Person instance from before (name: a reference to "Buck Trends", age: 29, addr: a reference to an Address, ...) versus the same record laid out as a single CompactRow: null bit set, fixed 8-byte value slots, then the variable-length data.]
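To make the layout concrete, here is a toy encoder (not Tungsten's actual code) for a row of nullable Long fields, using the null bit set plus fixed 8-byte slots described above:

import java.nio.ByteBuffer

// Toy illustration of the CompactRow idea: one word of null bits followed
// by an 8-byte slot per field. The real UnsafeRow is more involved
// (variable-length regions, word alignment, off-heap pages).
def encodeRow(fields: Array[Option[Long]]): Array[Byte] = {
  val buf = ByteBuffer.allocate(8 + 8 * fields.length)
  var nullBits = 0L
  fields.zipWithIndex.foreach { case (f, i) =>
    if (f.isEmpty) nullBits |= (1L << i)
  }
  buf.putLong(nullBits)
  fields.foreach(f => buf.putLong(f.getOrElse(0L)))
  buf.array()
}

// equals and hashCode can now work directly on the raw bytes.
val row = encodeRow(Array(Some(29L), None))
println(java.util.Arrays.hashCode(row))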
•BytesToBytesMap:

[Diagram: hash codes (h/c1, h/c2, h/c3, h/c4, ...) point into a Tungsten memory page where keys and values are stored contiguously as raw bytes: k1 v1 k2 v2 k3 v3 k4 v4 ...]
•Compare:

[Diagram: the conventional hash map (each entry: a hash code plus references to separate key and value objects on the heap) versus the BytesToBytesMap layout above.]
Memory Management
•Some allocations off heap.
•sun.misc.Unsafe.
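A minimal sketch of the kind of raw off-heap allocation involved, obtaining Unsafe reflectively since it is not a public API (this is illustration, not Spark's code):

import sun.misc.Unsafe

// Grab the Unsafe instance reflectively (not meant for application code).
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

// Allocate 8 bytes off heap, write and read a Long, then free it.
val address = unsafe.allocateMemory(8)
unsafe.putLong(address, 42L)
println(unsafe.getLong(address))
unsafe.freeMemory(address)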
Less Expression Overhead
sql("SELECT a + b FROM table")
•Solution:
•Generate custom byte code.
•Spark 1.X - for subexpressions.
•Spark 2.0 - for whole queries.
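In Spark 2.x you can inspect the generated code for a query with the debug helpers, e.g. (assuming the debug implicits are available in your Spark version, and that a table named "table" exists as in the slide's query):

// Prints the whole-stage generated Java code for this query plan.
import org.apache.spark.sql.execution.debug._
spark.sql("SELECT a + b FROM table").debugCodegen()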
No Value Types
(Planned for Java 9 or 10)
case class Timestamp(epochMillis: Long) {
  override def toString: String = { ... }
  def add(delta: TimeDelta): Timestamp = {
    /* return new shifted time */
  }
  ...
}

Don't allocate on the heap; just push the primitive long on the stack. (scalac does this now.)
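For reference, this is roughly what scalac's value classes already give you: a single-field wrapper that usually compiles down to the unboxed Long. A hedged sketch (the Long delta parameter stands in for the TimeDelta type, which isn't defined on the slide):

// Value class: in most call sites no Timestamp object is allocated;
// operations work on the underlying primitive Long.
// (Boxing still happens in generic contexts, e.g. storing it in a List[Any].)
case class Timestamp(epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp = Timestamp(epochMillis + deltaMillis)
}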
Long operations aren't atomic
(according to the JVM spec)
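Specifically, the JVM spec allows a non-volatile long or double write to be split into two 32-bit writes, so another thread may observe a torn value; marking the field @volatile (or using java.util.concurrent.atomic) restores atomicity. A tiny sketch:

import java.util.concurrent.atomic.AtomicLong

class Counter {
  // Without @volatile, a reader thread on a 32-bit JVM may see a half-written value.
  @volatile var lastTimestamp: Long = 0L

  // AtomicLong additionally makes read-modify-write operations atomic.
  private val total = new AtomicLong(0L)
  def add(n: Long): Long = total.addAndGet(n)
}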
No Unsigned Types

What's factorial(-1)?
Arrays Indexed with Ints

Byte Arrays limited to 2GB!
scala> val N = 1100*1000*1000
N: Int = 1100000000              // 1.1 billion

scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)

scala> import org.apache.spark.util.SizeEstimator
scala> SizeEstimator.estimate(array)
res3: Long = 2200000016          // 2.2GB
scala> val b = sc.broadcast(array)
...broadcast.Broadcast[Array[Short]] = ...
scala> SizeEstimator.estimate(b)
res0: Long = 2368
scala> sc.parallelize(0 until 100000).
| map(i => b.value(i))
Boom!

java.lang.OutOfMemoryError:
  Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
But wait... I actually lied to you...
Spark handles large broadcast variables by breaking them into blocks.
Scala REPL
java.lang.OutOfMemoryError:
  Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
  at java.io.ByteArrayOutputStream.write(...)
  ...
  at java.io.ObjectOutputStream.writeObject(...)
  at ...spark.serializer.JavaSerializationStream.writeObject(...)
  ...
  at ...spark.util.ClosureCleaner$.ensureSerializable(..)
  ...
  at org.apache.spark.rdd.RDD.map(...)
Reading the same stack trace from the bottom up:

•at org.apache.spark.rdd.RDD.map(...) — we pass the closure i => b.value(i) to RDD.map...
•at ...spark.util.ClosureCleaner$.ensureSerializable(..) — ...which verifies that the closure is "clean" (serializable)...
•at ...spark.serializer.JavaSerializationStream.writeObject(...) / java.io.ObjectOutputStream.writeObject(...) — ...which it does by serializing it to a byte array...
•at java.io.ByteArrayOutputStream.write(...) / java.util.Arrays.copyOf(...) — ...which requires copying an array... What array???

scala> val array = Array.fill[Short](N)(0)
Why did this
happen?
•You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
| map(i => b.value(i))
•Scala compiles:

class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  class $iwC extends Serializable {
    sc.parallelize(...).map(i => b.value(i))
  }
}

So, this closure over "b"... sucks in the whole object!
Lightbend is investigating re-engineering the REPL.
Workarounds...
•Transient is often all you need:
scala> @transient val array =
| Array.fill[Short](N)(0)
scala> ...
object Data {                        // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB                // local ref!
    val rdd = sc.parallelize(...).
      map(i => b.value(i))           // only needs b
    rdd.take(10).foreach(println)
  }
}
Why Scala?

See the longer version of this talk at polyglotprogramming.com/talks
polyglotprogramming.com/talks
lightbend.com/fast-data-platform
[email protected]
@deanwampler
Questions?
Bonus Material
You can find an extended version of this
talk with more details at
polyglotprogramming.com/talks