
Machine Learning in Spark

By:
Muhammad Imran
About Me

Current Position:
• Head of Big Data Project & Advanced Analytics at Artha Solution
• Head of Big Data Technology at Informatica
• Partner at Evotek-S
• Founder of Data Driven Asia
• Data mentor at IDX Startup Incubator

Past Position:
• Big Data Expert Team at the Coordinating Ministry of Economic Affairs, for the Indonesia eCommerce Roadmap (2018)
• Senior Data Analyst at MCA-Indonesia (2016)
• Data Analyst at UNICEF Innovation Lab (2015)
• Data Analyst at World Bank Indonesia (2013)
• And so on...
Our Discussion Today:
1. Big Data Architecture Stack
2. Apache Spark Architecture
3. RDD, DataFrame & DAG (Demo - Local)
4. Spark MLlib
5. Spark ML Pipelines (Demo - Local)
6. Spark Structured Streaming (Demo - Cloud)
7. Spark Stream-Stream Joins (Demo - Cloud)
Big Data Architecture

Apache Spark Stack

Spark vs Hadoop MapReduce

Hadoop MapReduce Process

Spark Execution Process: Master - Worker
Simple Scala code in Spark:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._
RDD vs DataFrame vs Dataset

Why use RDDs?
• Your data is unstructured, such as media streams or streams of text.
• You want to manipulate your data with functional programming constructs rather than domain-specific expressions.

A DataFrame is:
• An immutable distributed collection of data.
• Unlike an RDD, data is organized into named columns, like a table in a relational database.
• Designed to make processing large data sets even easier.

Dataset:
• Starting in Spark 2.0, the Dataset takes on two distinct API characteristics: a strongly typed API and an untyped API.
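
A minimal Scala sketch contrasting the three APIs (not taken from the original slides; the sample data, names, and local master setting are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("ApiComparison")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// RDD: unstructured data plus functional constructs (flatMap, filter, ...)
val linesRdd = spark.sparkContext.parallelize(Seq("spark makes big data simple"))
val wordsRdd = linesRdd.flatMap(_.split(" "))
println(wordsRdd.count())

// DataFrame: named columns, like a table in a relational database (untyped Row API)
val df = Seq(("Ana", 31), ("Budi", 25)).toDF("name", "age")
df.filter($"age" > 30).show()

// Dataset: the same columns, but accessed through a strongly typed API
val ds = df.as[(String, Int)]
ds.filter(_._2 > 30).show()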
DAG (Directed Acyclic Graph) in Apache Spark

Main DAG-related views in the Spark UI (a small job sketch follows below):
• Timeline view of Spark events
• Execution DAG
• Visualization of Spark Streaming statistics
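
To make the execution DAG concrete, here is a hedged sketch (not from the slides): run the job below in spark-shell and open the Spark UI at http://localhost:4040; the shuffle introduced by reduceByKey splits the job into two stages in the DAG visualization.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("DagSketch")
  .master("local[*]")
  .getOrCreate()

// map is a narrow transformation; reduceByKey adds a shuffle boundary,
// so the execution DAG shows two stages for this job
val counts = spark.sparkContext
  .parallelize(Seq("a", "b", "a", "c"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)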
spark.mllib contains the original API, built on top of RDDs.
spark.ml provides a higher-level API, built on top of DataFrames, for constructing ML pipelines.

Spark MLlib
• ML Workflow
• ML Algorithms
• More features
Spark MLlib & ML Pipeline

The key concepts of the Pipeline API (aka spark.ml components), illustrated in the sketch below:
• Pipeline
• PipelineStage
• Transformers
• Models
• Estimators
• Evaluator
• Params (and ParamMaps)
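
A minimal Scala sketch (adapted from the standard spark.ml example, not from these slides) showing how the pieces relate: Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator with Params, the Pipeline chains them as PipelineStages, and fitting it yields a PipelineModel. The toy data is an illustrative assumption.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("PipelineSketch")
  .master("local[*]")
  .getOrCreate()

// Toy training data: id, text, label (illustrative values only)
val training = spark.createDataFrame(Seq(
  (0L, "spark is great", 1.0),
  (1L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers: turn one DataFrame into another
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Estimator: fit on a DataFrame to produce a Model; maxIter is a Param
val lr = new LogisticRegression().setMaxIter(10)

// Pipeline: an ordered sequence of PipelineStages; fit() returns a PipelineModel
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)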
Spark Streaming Concepts

• Apache Spark Streaming is a scalable, fault-tolerant streaming data processing system. Spark Streaming is part of Apache Spark and integrates with MLlib, DataFrames & GraphX.

There are four main, commonly used functions of Spark Streaming (a sketch of the first one follows below):
- Streaming ETL: data is continuously cleaned and aggregated before being pushed into a database.
- Event Detection/Trigger: real-time detection of data anomalies, triggering downstream actions. Commonly used for IoT-based systems.
- Data Enrichment: live data is enriched with additional information from static datasets in a data warehouse to produce a more complete real-time analysis.
- Complex Sessions & Continuous Learning: events related to the live data are grouped together for deeper analysis. Commonly used in product recommendation engines.

At a high level, the Spark Streaming process is illustrated in the figure on the next slide.
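
As a hedged illustration of the Streaming ETL pattern above (not code from the original slides), the Scala sketch below uses Structured Streaming with the built-in rate source and console sink so it runs without external infrastructure; a real pipeline would read from Kafka and push results to a database.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("StreamingEtlSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Source: the built-in "rate" source continuously emits (timestamp, value) rows
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// "Clean" (keep only the rows we want) and aggregate per 10-second window
val counts = events
  .filter($"value" % 2 === 0)
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

// Sink: console here; a production ETL job would write to a database
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()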
Spark Streaming Process

Spark Structured Streaming - Basic Concepts
Stream Operators
• dropDuplicates
• explain
• groupBy
• groupByKey
• withWatermark (see the sketch after the signatures below)

dropDuplicates code example:
dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

groupBy code example:
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
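
withWatermark is listed above but has no example on the slides; a hedged sketch of its signature and a typical use (late-data handling for a windowed aggregation), assuming a streaming DataFrame named events:

import org.apache.spark.sql.functions.{col, window}

// Signature: withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
// Assuming `events` is a streaming DataFrame with an `eventTime` timestamp
// column and a `word` column: drop data arriving more than 10 minutes late,
// then count words per 5-minute event-time window.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
  .count()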
Spark Structured Streaming - Basic Concepts

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._

// The slide omits the definition of `words`; the standard setup from the
// Spark documentation reads lines from a socket and splits them into words:
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))

// Running word count
val wordCounts = words.groupBy("value").count()
Spark Stream-Stream Joins

# Read the two input streams from Kafka; the connection options are omitted
# on the original slide, so the bootstrap servers value here is a placeholder
impressions = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "impressions")
    .load()
)

clicks = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "clicks")
    .load()
)

# Inner equi-join of the two streams on the ad identifier
impressions.join(clicks, "adId")
Spark in RStudio
• You can start using Spark from R in a Hortonworks or Cloudera console, but also from RStudio.
• Steps:
1. Install R from CRAN
2. Install RStudio
3. Type:
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.3.0")
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")

Then connect RStudio to Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
Spark in RStudio - Using dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
• mutate() adds new variables that are functions of existing variables.
• select() picks variables based on their names.
• filter() picks cases based on their values.
• summarise() reduces multiple values down to a single summary.
• arrange() changes the ordering of the rows.
Spark in RStudio - Using dplyr

In RStudio, type:
install.packages(c("nycflights13", "Lahman"))

library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

To filter the data, try typing:
flights_tbl %>% filter(dep_delay == 5)
Spark in RStudio - Continued

Let's plot the distribution of delays.
In RStudio, type:
library(ggplot2)
# `delay` is not defined on the original slide; one way (per the sparklyr docs) to build it:
delay <- flights_tbl %>% group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>% collect()
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
Simple Machine Learning

We will use linear regression in Spark from R. Available ML functions include:

Function                           Description
ml_kmeans                          K-Means Clustering
ml_linear_regression               Linear Regression
ml_logistic_regression             Logistic Regression
ml_survival_regression             Survival Regression
ml_generalized_linear_regression   Generalized Linear Regression
ml_decision_tree                   Decision Trees
ml_random_forest                   Random Forests
ml_gradient_boosted_trees          Gradient-Boosted Trees
ml_pca                             Principal Components Analysis
ml_naive_bayes                     Naive Bayes
Spark in RStudio - Continued

In RStudio, type:
lm_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_linear_regression(Petal_Length ~ Petal_Width)

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  collect %>%
  ggplot(aes(Petal_Length, Petal_Width)) +
  geom_point(aes(Petal_Width, Petal_Length), size = 2, alpha = 0.5) +
  geom_abline(aes(slope = coef(lm_model)[["Petal_Width"]],
                  intercept = coef(lm_model)[["(Intercept)"]]),
              color = "red") +
  labs(x = "Petal Width", y = "Petal Length",
       title = "Linear Regression: Petal Length ~ Petal Width",
       subtitle = "Use Spark.ML linear regression to predict petal length as a function of petal width.")
ML Pipeline - Continued

We will build an ML Pipeline in RStudio. ML Pipelines let a data scientist chain multiple data transformations inside a single data pipeline.

In RStudio, type:
library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

spark_flights <- sdf_copy_to(sc, flights)

Build the data transformation step:
df <- spark_flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(month = paste0("m", month),
         day = paste0("d", day)) %>%
  select(dep_delay, sched_dep_time, month, day, distance)

ft_dplyr_transformer(sc, df)
ML Pipeline - Continued

In RStudio, type:
ft_dplyr_transformer(sc, df) %>%
  ml_param("statement")

We create five kinds of data transformation stages in one pipeline:
- SQL transformer - the result of the ft_dplyr_transformer() transformation
- Binarizer - to determine whether a flight should be considered delayed; this becomes the outcome variable
- Bucketizer - to split the day into buckets of specific hours
- R Formula - to specify the model formula
- Logistic regression model
ML Pipeline - Continued

In RStudio, type:
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input.col = "dep_delay", output.col = "delayed",
               threshold = 15) %>%
  ft_bucketizer(input.col = "sched_dep_time", output.col = "hours",
                splits = c(400, 800, 1200, 1600, 2000, 2400)) %>%
  ft_r_formula(delayed ~ month + day + hours + distance) %>%
  ml_logistic_regression()

To display the pipeline we just built, type in RStudio:
flights_pipeline