Machine Learning in Spark
By:
Muhammad Imran
About Me
Current Position:
• Head of Big Data Project & Advanced Analytics at Artha Solution
• Head of Big Data technology at Informatica
• Partner at Evotek-S
• Founder of Data Driven Asia
• Data mentor at IDX Startup Incubator
Past Position:
• Big Data Expert Team at the Coordinating Ministry of Economic Affairs for the Indonesia eCommerce Roadmap (2018)
• Senior Data Analyst at MCA-Indonesia (2016)
• Data Analyst at UNICEF Innovation Lab (2015)
• Data Analyst at World Bank Indonesia (2013)
• And so on...
Our Discussion Today:
1. Big Data Architecture Stack
2. Apache Spark Architecture
3. RDD, DataFrame & DAG (Demo - Local)
4. Spark MLlib
5. Spark ML Pipelines (Demo - Local)
6. Spark Structured Streaming (Demo - Cloud)
7. Spark Stream-Stream Joins (Demo - Cloud)
Next: Big Data Architecture
Apache Spark Stack
Spark vs Hadoop MapReduce
Hadoop MapReduce Process
Spark Execution Process: Master - Worker
Simple Scala code in Spark:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._
RDD
Why use RDDs?
• Your data is unstructured, such as media streams or streams of text.
• You want to manipulate your data with functional programming constructs rather than domain-specific expressions.

DataFrame
A DataFrame is:
• An immutable distributed collection of data.
• Unlike an RDD, data is organized into named columns, like a table in a relational database.
• Designed to make processing of large data sets even easier.

Dataset
• Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API.
RDD vs DataFrame vs Dataset
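To make the comparison concrete, here is a minimal Scala sketch (the local session, the Person case class and the in-line sample data are assumptions made for illustration) that exposes the same records as an RDD, a DataFrame and a strongly-typed Dataset:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RddDfDsDemo").getOrCreate()
import spark.implicits._

// Case class used only for the typed Dataset view (illustrative)
case class Person(name: String, age: Int)

// RDD: a distributed collection manipulated with functional constructs
val rdd = spark.sparkContext.parallelize(Seq(("Ani", 34), ("Budi", 28)))
rdd.filter { case (_, age) => age > 30 }.collect()

// DataFrame: the same data organized into named columns, like a relational table
val df = rdd.toDF("name", "age")
df.filter($"age" > 30).show()

// Dataset: the strongly-typed API, checked at compile time against Person
val ds = df.as[Person]
ds.filter(_.age > 30).show()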
DAG (Directed Acyclic Graph) in Apache Spark
DAG main components:
• Timeline view of Spark events
• Execution DAG
• Visualization of Spark Streaming statistics
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
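For contrast, a minimal sketch of the older RDD-based spark.mllib API (the toy points and parameter values are assumptions); the DataFrame-based spark.ml pipeline API is sketched in the next section:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MllibKMeansDemo").getOrCreate()

// Toy two-dimensional points as an RDD of mllib Vectors (illustrative data only)
val points = spark.sparkContext.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

// The RDD-based API works directly on RDD[Vector]: cluster into k = 2 groups
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)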
Spark MLlib
• ML Workflow
• ML Algorithms
• More features
Spark MLlib & ML Pipeline
The key concepts of the Pipeline API (aka spark.ml components):
• Pipeline
• PipelineStage
• Transformers
• Models
• Estimators
• Evaluator
• Params (and ParamMaps)
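As a hedged illustration of how these pieces fit together, here is a minimal Scala sketch based on the standard spark.ml text-classification pipeline; the tiny training set, column names and parameter values are assumptions for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("PipelineDemo").getOrCreate()

// Tiny illustrative training set with "text" and "label" columns
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers: tokenize text, then hash words into feature vectors
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")

// Estimator with its Params; the Pipeline chains all PipelineStages
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting the Pipeline (an Estimator) produces a PipelineModel (a Transformer)
val model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()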
Spark Streaming Concepts
• Apache Spark Streaming is a scalable, fault-tolerant streaming data processing system. Spark Streaming is part of Apache Spark and also integrates with MLlib, DataFrames & GraphX.

There are four main, most commonly used functions of Spark Streaming:
- Streaming ETL: data is continuously cleaned and aggregated before being pushed to a database.
- Event detection/triggering: real-time detection of data anomalies that triggers downstream actions. Commonly used for IoT-based systems.
- Data enrichment: live data is enriched with additional information from static datasets in the data warehouse to produce a complete real-time analysis.
- Complex sessions & continuous learning: event-triggered data related to a live session is grouped together for deeper analysis. Commonly used in product recommendation engines.

Broadly, Spark Streaming works as shown in the figure below.
Spark Streaming Process
Spark Structured Streaming – Basic Concepts
Stream Operators
• dropDuplicates
• explain
• groupBy
• groupByKey
• withWatermark
dropDuplicates signatures:
dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

groupBy signatures:
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
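A minimal sketch combining two of these operators on a streaming DataFrame (the built-in rate source and the eventTime/eventId column names are assumptions made for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamOperatorsDemo").getOrCreate()

// Streaming DataFrame from the built-in "rate" test source; a real job would
// read from Kafka, files or a socket instead
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumnRenamed("value", "eventId")

// Keep at most 10 minutes of event-time state, then drop duplicate rows
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")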
Spark Structured Streaming – Basic Concepts
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._

// Read lines from a socket source (localhost:9999 assumed for the demo), split into words, count
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
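To complete the word-count demo, the streaming query still has to be started; a minimal sketch using the console sink as an assumed demo output:

// Start the streaming query, printing the complete counts after every micro-batch
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()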
Spark Stream-Stream Joins

impressions = (
  spark
    .readStream
    .format("kafka")
    .option("subscribe", "impressions")
    …
    .load()
)

clicks = (
  spark
    .readStream
    .format("kafka")
    .option("subscribe", "clicks")
    …
    .load()
)

impressions.join(clicks, "adId")
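In practice, stream-stream joins usually add watermarks and an event-time range condition so Spark can bound the join state. A hedged Scala sketch, assuming impressions and clicks are the streaming DataFrames above (written here in Scala) with impressionTime/clickTime timestamp columns; the column names and time bounds are illustrative only:

import org.apache.spark.sql.functions.expr

// Limit how long each side's state is retained
val impressionsWithWatermark = impressions
  .selectExpr("adId AS impressionAdId", "impressionTime")
  .withWatermark("impressionTime", "10 seconds")

val clicksWithWatermark = clicks
  .selectExpr("adId AS clickAdId", "clickTime")
  .withWatermark("clickTime", "20 seconds")

// Join each click to impressions that occurred at most 1 hour before it
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))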
Spark in RStudio
• You can start using Spark from R in the Hortonworks or Cloudera console, but also from RStudio.
• Steps:
1. Install R (from CRAN)
2. Install RStudio
3. Type:
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.3.0")
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")

Then connect RStudio to Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
Spark in RStudio
Using dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
• mutate() adds new variables that are functions of existing variables.
• select() picks variables based on their names.
• filter() picks cases based on their values.
• summarise() reduces multiple values down to a single summary.
• arrange() changes the ordering of the rows.
Spark in RStudio
Using dplyr
In RStudio, type:
install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

To filter the data, try typing:
flights_tbl %>% filter(dep_delay == 5)
Spark in RStudio - Continued
Let's plot the distribution of delays.
In RStudio, type:
library(ggplot2)
# delay is assumed to be a per-aircraft summary of the flights table, e.g.:
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect

ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
Simple Machine Learning
We will use linear regression in Spark from R:
Function                            Description
ml_kmeans                           K-Means Clustering
ml_linear_regression                Linear Regression
ml_logistic_regression              Logistic Regression
ml_survival_regression              Survival Regression
ml_generalized_linear_regression    Generalized Linear Regression
ml_decision_tree                    Decision Trees
ml_random_forest                    Random Forests
ml_gradient_boosted_trees           Gradient-Boosted Trees
ml_pca                              Principal Components Analysis
ml_naive_bayes                      Naive-Bayes
Spark in RStudio - Continued
In RStudio, type:
lm_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_linear_regression(Petal_Length ~ Petal_Width)

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  collect %>%
  ggplot(aes(Petal_Length, Petal_Width)) +
  geom_point(aes(Petal_Width, Petal_Length), size = 2, alpha = 0.5) +
  geom_abline(aes(slope = coef(lm_model)[["Petal_Width"]],
                  intercept = coef(lm_model)[["(Intercept)"]]),
              color = "red") +
  labs(x = "Petal Width", y = "Petal Length",
       title = "Linear Regression: Petal Length ~ Petal Width",
       subtitle = "Use Spark.ML linear regression to predict petal length as a function of petal width.")
ML Pipeline - Continued
We will build an ML Pipeline in RStudio. An ML Pipeline lets a data scientist chain multiple data transformations into a single data pipeline.
In RStudio, type:
library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
spark_flights <- sdf_copy_to(sc, flights)

Add the data transformation steps:
df <- spark_flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(month = paste0("m", month),
         day = paste0("d", day)) %>%
  select(dep_delay, sched_dep_time, month, day, distance)

ft_dplyr_transformer(sc, df)
ML Pipeline - Continued
In RStudio, type:
ft_dplyr_transformer(sc, df) %>%
  ml_param("statement")

Create five kinds of data transformation in one pipeline:
• SQL transformer - the result of the ft_dplyr_transformer() transformation
• Binarizer - to determine whether a flight should be considered delayed; this becomes the outcome variable
• Bucketizer - to split the day into blocks of hours
• R formula - to specify the model formula
• Logistic regression model
ML Pipeline - Continued
In RStudio, type:
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input.col = "dep_delay", output.col = "delayed", threshold = 15) %>%
  ft_bucketizer(input.col = "sched_dep_time", output.col = "hours",
                splits = c(400, 800, 1200, 1600, 2000, 2400)) %>%
  ft_r_formula(delayed ~ month + day + hours + distance) %>%
  ml_logistic_regression()

To display the pipeline we have just built, type in RStudio:
flights_pipeline