
Large-Scale Machine Learning
CSCI316: Big Data Mining Techniques and Implementation
Contents

• Gradient-based optimization in machine learning
• Parallel computation for large-scale ML
• Spark’s machine learning library: MLlib

2
Machine Learning Meets Big Data

3
Machine Learning Meets Big Data

4
How can ML benefit from parallelism?
• In Lecture 8, we saw that common relational and algebraic operations
can benefit from parallel computation (in MapReduce and Spark)
• How about machine learning?
– Observation: Most machine learning algorithms involve a group of
mathematical operations called Gradient-based Optimisation and, in
particular, Gradient Descent.
• Hence, our objective is to see:
– What is gradient-based optimisation, and
– How can it benefit from parallel computation.

5
Recall What ML is.
• Ingredients of a learning algorithm [Goodfellow et al. 2016]
– Task (e.g., classification, regression, clustering, etc.)
– Performance measure (e.g., accuracy, MSE, etc.)
– Experience (e.g., dataset w/o class labels)
• For supervised learning, given a set of observations X with labels y,
learn a model ℳ that predicts the labels of X as well as possible in
terms of some cost function.
– In the following, we assume that X is the output of pre-processing and
contains only numerical values.
– E.g., if the data includes categorical values, encoding applies first.

6
Optimisation in Machine Learning
• Most ML algorithms involve optimisation of a continuous (or piece-wise continuous) function $f(\theta)$ of some sort.
– That is, to find an instance $\theta^*$ such that $f(\theta^*)$ is minimal.
– Most often, $f$ is a cost function (such as the MSE) for the ML model, and $\theta$ represents the model parameters that need to be determined (from training data).
– For simplicity of presentation, let us assume $f$ is continuous.
• Decision tree regression models have piece-wise continuous cost functions.
• If $f$ is also differentiable (or piece-wise differentiable) in $\theta$ (again, this is often the case), you can:
– in some special cases, exploit the algebraic properties of $f$ to find a closed-form expression for $\theta^*$ and $f(\theta^*)$, or
– more generally, apply the generic technique of gradient descent to find $\theta^*$.

7
Linear Regression
• The goal of linear regression is to build a linear function that maps the data to the predicted values.
• Given an input vector $x \in \mathbb{R}^d$, we define the output as $\hat{y} = w^\top x$, where $w$ is a vector of parameters. Let $y$ be the true value.
– Note that if we fix the first element of $x$ as 1, then we obtain the form $\hat{y} = w'^\top x' + b$, where $w'$ is the weights and $b$ is the bias.
• For a set of observations, we have a $d \times m$ matrix $X$ and a $1 \times m$ vector $y$, where $m$ is the number of observations.
• The mean squared error (MSE) is $\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{m}\|\hat{y} - y\|_2^2$.

8
Linear Regression
• One nice property of linear regression is that, to minimise the above MSE, i.e., $\min_w \frac{1}{m}\|\hat{y} - y\|_2^2$, we have a closed-form solution $w^* = (XX^\top)^{-1} X y^\top$ (the normal equation).
• In other words, if $w = w^*$, the MSE (mean squared error) is at its minimum.
• In practice, this means that, for any given $X$ and $y$ (if $X$ is not very large), we can compute the parameter vector quickly!
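As an illustration (not from the original slides), a minimal NumPy sketch of this closed-form solution, assuming $X$ is stored as a $d \times m$ array whose first row is fixed to 1 and $y$ as a length-$m$ array:

import numpy as np

def closed_form_ols(X, y):
    # X is d x m (rows are features), y has length m
    # w* = (X X^T)^{-1} X y^T; use solve() rather than an explicit inverse
    return np.linalg.solve(X @ X.T, X @ y)

# tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = np.vstack([np.ones(100), rng.normal(size=100)])  # first row fixed to 1 (bias)
y = np.array([2.0, -3.0]) @ X + 0.01 * rng.normal(size=100)
print(closed_form_ols(X, y))  # approximately [2, -3]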

9
Logistic Regression
• Logistic regression builds a likelihood function (i.e., a logistic function) that provides a probability that the data belongs to each class.
• $\sigma(\sum_{j=1}^{d} w_j x_j + b) = \sigma(w^\top x + b)$
– $x$ is a tuple (i.e., record),
– $w, b$ are parameters
– (here we keep $w$ and $b$ separate)
– $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} = 1 - \sigma(-z)$
• We model the probability of a label $Y$ of some record with features $x$ being $y \in \{-1, 1\}$ as
$P(Y = y \mid x) = \frac{1}{1 + e^{-y(w^\top x + b)}}$

10
Logistic Regression
• The goal is to seek suitable $w, b$ that maximise the “overall” probability over all $m$ records, i.e.,
$\max_{w,b} L(w, b) := \prod_{i=1}^{m} P(Y = y_i \mid x_i) = \prod_{i=1}^{m} \frac{1}{1 + e^{-y_i(w^\top x_i + b)}}$
• This is equivalent to minimising the following negative log-likelihood function:
$\min_{w,b} \ell(w, b) := \sum_{i=1}^{m} \log(1 + e^{-y_i(w^\top x_i + b)})$
• Unfortunately, there is no closed-form solution for the above problem…
• However, in practice, we can still solve it efficiently by using a generic technique called gradient descent (and its variants).
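To make the objective concrete, here is a small NumPy sketch (an addition, not from the slides) of the negative log-likelihood for labels in {-1, +1}; the record layout (rows of X are records) is an assumption:

import numpy as np

def logistic_loss(w, b, X, y):
    # X: m x d array of records, y: length-m array with entries in {-1, +1}
    margins = y * (X @ w + b)
    # np.logaddexp(0, -z) computes log(1 + exp(-z)) in a numerically stable way
    return np.sum(np.logaddexp(0.0, -margins))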

11
Unconstrained Optimisation
• In general, we have a cost function $f$ which we would like to minimise.
• Unconstrained optimisation involves finding minima of functions that may have multiple inputs.
• $\min f(x) = f(x^*)$, where $x \in \mathbb{R}^n$, $f: \mathbb{R}^n \to \mathbb{R}$.
• Note that we deal only with minima, because any maximum of $f$ is a minimum of $-f$.
• Because the shape of $f$ can be very complex, we do not always aim to find the global minimum in $\mathbb{R}^n$, but instead points that take smaller function values than all of their neighbours.
– A local minimum is good enough.

12
Local and Global Minima

13
A General Algorithm
• Iterative optimization algorithms consist of setting initialization conditions, and then three iteration steps (sketched in code after this list):
1. Initialize: Choose a starting point – an initial guess that can either
be determined by your situation or that can be actively chosen.
2. While a stopping criterion is not true (the solution is not close
enough to the minimum), continue, else break and return the
current solution.
3. Find a descent direction – a direction in which the function
value decreases near the current point.
4. Determine the step size – the length of a step in the given direction
that leads to a good decrease.
5. Go back to step 2 until while loop exits.
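A minimal sketch of this generic loop in Python (an illustrative addition; the gradient function, starting point, step size and tolerance are all assumptions, with the negative gradient used as the descent direction, as introduced on the next slides):

import numpy as np

def iterative_minimise(grad, x0, step_size=0.1, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)        # 1. initialise with a starting point
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # 2. stopping criterion
            break
        x = x - step_size * g              # 3.-4. descent direction and step size
    return x                               # 5. loop until the criterion holds

# usage: minimise f(x, y) = (x - 1)^2 + (y + 2)^2, whose gradient is (2(x-1), 2(y+2))
print(iterative_minimise(lambda p: 2 * (p - np.array([1.0, -2.0])), [0.0, 0.0]))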

14
Local and Global Minima

15
Gradient
• For a function $f$ with multiple inputs $x$ (i.e., a multivariate function), we use partial derivatives to measure how much $f(x)$ changes as only the variable $x_i$ increases at point $x$:
$\frac{\partial f}{\partial x_i}(x)$
• The gradient generalises the notion of derivative to the case where the derivative is with respect to a vector.
• The gradient of $f$ at $x$ is the vector containing all the partial derivatives, denoted by
$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$

16
Gradient Example
• What is the gradient of $f(x, y, z) = x - xy + z^2$?
• Answer: $\nabla f(x, y, z) = (1 - y,\ -x,\ 2z)$

17
Stopping Criterion
• The minimum always has a certain property: the first derivative, i.e., the gradient, has to be zero: $\nabla f(x^*) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right) = \mathbf{0}$
• The gradient can be a vector with two components, and the above equation translates into the vector equation $\nabla f(x, y) = (\partial f/\partial x, \partial f/\partial y)^\top = (0, 0)^\top$, i.e., each component of the gradient must be zero.
• However, zero gradient is not a sufficient condition for achieving a minimum (or maximum).
• We call the points with zero gradient critical points.

18
Gradient Descent for MSE
• Goal: Find $\min_w \frac{1}{m}\|w^\top X - y\|_2^2 \Leftrightarrow \min_w \|w^\top X - y\|_2^2$
• That is, find $w^*$ that minimises $\|w^\top X - y\|_2^2$
• Consider a univariate function (for some index $j \in [1, d]$):
$f(w) = \|w\, X[j, :] - y\|_2^2 = \sum_{i=1}^{m} (w\, x_j^{(i)} - y^{(i)})^2$

19
Gradient Descent
• Start at a random point
• Repeat
– Determine a descent direction
– Choose a step size
– Update
• Until stopping criterion is satisfied

[Figure: the descent steps plotted on an error-vs-weight curve.]
(Source: ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling)
20
Small Steps Down the Gradient

21
Non-Convex Optimisation
• Both linear regression (with an MSE cost function) and logistic regression are convex optimisation problems, i.e., the local minima are the global minima.
• However, in general many ML models and cost functions result in
non-convex problems.
• Gradient descent still works, down to a local minimum.

22
Direction of Descent - Slope
• Know the error function, for example, $E(w) = w^2$ (where $w$ is the weight)
• Then, $\frac{dE(w)}{dw} = 2w$, so the descent direction is $-2w$
• Move in the opposite direction of the gradient.

23
Descent Direction and Magnitude (1D)
• The opposite direction of the slope points in the direction of steepest error descent in weight space:
$w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$
– where $i$ refers to the $i$-th step (or epoch), $\alpha_i$ is the step size, and $\frac{df}{dw}(w_i)$ is the slope (we move along the negative slope).
• Step size is a free parameter that has to be chosen carefully for each problem. It can be (and usually is) updated dynamically during the iteration.

24
Step Size

• An example step size is $\alpha_i = \frac{\alpha}{m\sqrt{i}}$
– where $m$ is the number of training points, $i$ is the iteration step, and $\alpha$ is a constant.

25
Update Rules for MSE
• Now consider how to update all weights for the linear regression.
• Scalar objective:
$f(w) = \|w\,x - y\|_2^2 = \sum_{j=1}^{m} (w\,x^{(j)} - y^{(j)})^2$
• Derivative:
$\frac{df}{dw}(w) = 2\sum_{j=1}^{m} (w\,x^{(j)} - y^{(j)})\, x^{(j)}$
• Scalar update:
$w_{i+1} = w_i - \alpha_i \sum_{j=1}^{m} (w_i\, x^{(j)} - y^{(j)})\, x^{(j)}$
• Vector update:
$\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{m} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
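A compact NumPy sketch of the vector update above (illustrative only; here X is assumed to be an m x d array whose rows are records, and the decaying step size from the earlier slide is used):

import numpy as np

def gd_linear_regression(X, y, alpha=0.5, num_iters=200):
    m, d = X.shape
    w = np.zeros(d)
    for i in range(num_iters):
        # sum_j (w^T x^(j) - y^(j)) x^(j), computed as one matrix-vector product
        gradient = X.T @ (X @ w - y)
        w -= (alpha / (m * np.sqrt(i + 1))) * gradient  # step size alpha / (m sqrt(i))
    return w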

26
Large-Scale ML and Parallel
Computation
Large-Scale ML
• Recall that ML updates a model ℳ based on data $X$
– For linear regression (with the MSE cost function, without regularisation), the model is expressed by the vector $w$.
• Big data: the input data $X$ is too large to hold in the main memory.
• Big model: the model ℳ is too large to hold in the main memory.
• Data parallelism and model parallelism:

Data parallelism Model parallelism

28
Gradient Descent: Big m, Big d
• Vector update: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{m} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
• By data parallelism, we compute the summands in parallel on workers, each receiving all of $\mathbf{w}_i$ at every iteration.
• For example, let $n = 6$ and the number of workers be 3:

[Figure omitted. Note: here $n = m$.]

• Bottleneck: the transfer of $\mathbf{w}_i$ between the driver (or parameter server) and the workers.

29
Gradient Descent: Big m, Huge d
• In linear regression, the data size (i.e., $m \times d$) is much larger than the model size (i.e., $d$). But in other ML algorithms, the model size can be huge even compared with the data size.
– As an example, consider a polynomial regression model with only two data records: $\hat{y} = \sum_{i,j=0}^{n} w_{i,j}\, x^{i} z^{j}$
– In deep learning, the number of parameters can reach billions.
• In this case, model parallelism splits the model across different computing nodes and lets different partitions of parameters be processed by different CPUs and GPUs.
• Divide and Conquer: fully update each partition locally and minimise
the global communication.

30
Stochastic Gradient Descent (SGD)
• Recall gradient descent for linear regression (with MSE); every iteration processes $m$ samples:
$\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{m} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
• The gradient is an expectation that can be approximated using a small set of samples drawn uniformly at random from the whole data set.
– In particular, if the data set is highly redundant, the gradient on the first half will be very similar to that on the second half.
• In SGD, we update the model with only one sample instead of $m$.
• SGD can improve the training speed and mitigate the large-data issue.
• Practice also shows that SGD helps the algorithm jump out of local minima and find the global minimum.

31
Minibatch Gradient Descent
• Increase the batch size from 1 to a number smaller than $m$.
• Divide the data set into small batches of examples, compute the gradient using a single batch, make an update, then move to the next batch of examples.
– The gradient over a batch can be computed with matrix-matrix multiplications, which are efficient, especially on GPUs.
– For classification, ideally mini-batches need to be balanced for classes (e.g., using stratified sampling).
• E.g., with a batch size of 10,
$\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{10} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
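An illustrative NumPy sketch of the minibatch update (not from the slides; the shuffling, batch size and step size are assumptions, and batch_size=1 recovers SGD):

import numpy as np

def minibatch_gd(X, y, batch_size=10, alpha=0.1, num_epochs=20, seed=0):
    # X: m x d array of records, y: length-m array of targets
    m, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        order = rng.permutation(m)                # reshuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb)           # gradient on this batch only
            w -= (alpha / len(idx)) * grad
    return w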

32
Synchronous vs. Asynchronous GD
• In synchronous (minibatch) GD, the Spark Driver (or a parameter
server) will wait until all parallel workers have returned their updated
model before continuing to the next iteration.
– Fast workers wait for slow workers
– Can guarantee fault-tolerance
• In (parallel) asynchronous GD (or minibatch GD), a parameter server will apply model updates from parallel workers immediately, whereupon the worker can immediately get a new copy of the model to work on a new mini-batch.
– Workers train the model concurrently on mini-batches without
blocking
– Needs to work with stale gradients

33
Beyond GD: Gradient-Based Optimisation
• First-order gradient-based optimisation:
– GD, stochastic GD and mini-batch GD
• Second-order gradient-based optimisation
– Newton’s method, which exploits the “gradient of gradient” (i.e.,
Hessian matrix) to determine the step size.
– Quasi-Newton methods, such as Limited-memory BFGS (LBFGS), which approximate the “gradient of gradient” with less complexity.

34
Gradient-Based Optimisation in Spark
• The architecture of Spark is well-suited to data parallelism and synchronous model updates (including gradient-based model updates).
– Model parallelism and asynchronous updates are not really supported in a standard Spark architecture.
– Also, the model must fit in the Spark Driver’s memory.
– A parameter server can handle large models and asynchronous updates (currently a hot research topic).
• More information about the gradient-based optimisation methods implemented in Spark ML:
– https://spark.apache.org/docs/latest/mllib-optimization.html
• Spark also uses methods (e.g., LBFGS) from the Scala scientific computing library Breeze:
– https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGS.scala

35
GD Implementation in Spark from scratch
• Consider the linear regression problem with big $m$ and big $d$:

import numpy as np
# m, alpha and numIters are assumed to be defined beforehand

# training_rdd is an RDD holding the data of the feature matrix X and the
# target vector y; we persist it across iterations
training_rdd.cache()
d = len(training_rdd.take(1)[0])   # length of one record (features plus target)
w = np.zeros(d - 1)
for i in range(numIters):
    # rec is the pair (x, y) drawn from X and y, respectively
    # udf(w, rec) is a user-defined function that produces (w^T x - y) x
    # we pass udf() to the map transformation; sum() is the reduce action
    gradient = training_rdd.map(lambda rec: udf(w, rec)).sum()
    alpha_i = alpha / (m * np.sqrt(i + 1))
    w -= alpha_i * gradient

• Note. In general, the udf is defined w.r.t. the loss function $L(\mathbf{w})$.
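For completeness, a possible definition of udf is sketched below, under the assumption that each record rec is a NumPy array whose last entry is the target y (the name and record layout are not prescribed by the slides):

import numpy as np

def udf(w, rec):
    # assumed record layout: rec = [x_1, ..., x_{d-1}, y]
    x, y = rec[:-1], rec[-1]
    # one summand of the MSE gradient: (w^T x - y) x
    return (np.dot(w, x) - y) * x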
36
Spark MLlib
End-to-End Project in Spark

38
The Process
• The key steps in the process of an end-to-end project:
1. Gathering and collecting the relevant data for your task.
2. Cleaning and inspecting the data to better understand it.
3. Performing feature engineering to allow the algorithm to leverage
the data in a suitable form (e.g., converting the data to numerical
vectors).
4. Using a portion of this data as a training set to train one or more
algorithms to generate some candidate models.
5. Evaluating and comparing models by objectively measuring results
on a subset of the same data that was not used for training.
6. Leveraging the insights from the above process and/or using the
model to make predictions, detect anomalies, or solve more general
business challenges.

39
High-Level MLlib Concepts

40
High-Level MLlib Concepts
• Transformers
– functions that convert raw data in some way.
– E.g., to create a new interaction variable, convert string categorical
values into numerical values
– primarily used in pre-processing and feature engineering
– takes a DataFrame as input and produces a new DataFrame as output

41
High-Level MLlib Concepts
• Estimators
– if provided with data, result in transformers
– algorithms that are used to train models
• Evaluators
– evaluate how a given model performs according to criteria (e.g.,
accuracy, ROC)
• Pipeline
– MLlib’s highest-level data type
– Transformers, estimators and evaluators are all stages in a
pipeline
– Similar to Scikit-Learn’s pipeline API
• Low-Level Data Types
– Vectors and Matrices (local vs distributed, dense vs sparse)
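For example, local dense and sparse vectors can be constructed as follows (a brief illustrative snippet):

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 3.0])
# sparse(size, indices, values): only positions 0 and 2 are non-zero
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])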

42
MLlib in Action
• Example 1: a (synthetic) dataset
df = spark.read.json(dataset)
df.orderBy("value2").show(5)
+-----+----+------+------------------+
|color| lab|value1| value2|
+-----+----+------+------------------+
| red|good| 35|14.386294994851129|
| blue| bad| 12|14.386294994851129|
| red| bad| 2|14.386294994851129|
| blue| bad| 8|14.386294994851129|
| red| bad| 16|14.386294994851129|
+-----+----+------+------------------+
only showing top 5 rows
– a categorical label with two values (good or bad), a categorical variable (color), and two numerical variables.

43
MLlib in Action
• Feature Engineering with Transformers
– As mentioned, transformers manipulate existing columns of a DataFrame and add new columns to it
– In MLlib, all inputs to ML algorithms in Spark must consist of type
Double (for labels) and Vector[Double] (for features).
• Note. our synthetic dataset does not meet this requirement
– RFormula: a declarative language for specifying ML transformers and
is simple to use (supports a limited subset of the R operators):
~ Separate target and terms
+ Concat terms
- Remove terms
: Interaction
. All columns except the target/dependent variable

44
MLlib in Action
• In our case, we use all variables and also add in the interactions
between some columns.
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
• The next step is to fit and apply the RFormula transformer to data
– Just call the fit and transform methods of an RFormula instance:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show(4)
+-----+----+------+------------------+--------------------+-----+
|color| lab|value1| value2| features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good| 1|14.386294994851129|(10,[1,2,3,5,8],[...| 1.0|
| blue| bad| 8|14.386294994851129|(10,[2,3,6,9],[8....| 0.0|
+-----+----+------+------------------+--------------------+-----+

45
MLlib in Action
• Look at the transformer output:
preparedDF.select("features").show(2,False)
+--------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------+
|(10,[1,2,3,5,8],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])|
|(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]) |
+--------------------------------------------------------------------+

– Behind the scenes, RFormula inspects the data during the fit call and
creates an RFormula model
type(fittedRF)
Out: pyspark.ml.feature.RFormulaModel
– When applying this model to transform the DF, Spark converts the
categorical values into Doubles and creates additional features for
interactions between color and value1/value2

46
Logistic Regression with Regularization
• Example: Will someone have a heart attack over the next year?

• Classification: Yes/No (i.e., 1 or − 1)


• Logistic Regression: likelihood of a heart attack, given by
$\sigma(\sum_{j=1}^{d} w_j x_j + b) = \sigma(w^\top x + b)$
– $x$ is a tuple (i.e., record),
– $w, b$ are parameters
– $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} = 1 - \sigma(-z)$

47
Logistic Regression with Regularization
• The goal is to maximize the following log-likelihood function:
$\max_{w,b} L(w, b) := \sum_{i=1}^{m} -\log(1 + e^{-y_i(w^\top x_i + b)})$ (see previous slides)
• In practice, we usually consider adding a regularization (penalty) term:
$\max_{w,b} L(w, b) - \lambda R(w)$, equivalently $\min_{w,b} \ell(w, b) + \lambda R(w)$
– where $\lambda \geq 0$,
– $R(w) = \rho\, \|w\|_2^2 + (1 - \rho)\, \|w\|_1$
• with $\|w\|_2$ the Euclidean norm and $\|w\|_1$ the Manhattan distance.
– $\rho \in [0, 1]$ is a coefficient (e.g., an elastic net parameter)
• This extra term helps prevent overfitting the LR model.
• $\lambda$ and $\rho$ are hyperparameters of the logistic regression classifier.
• Variants and more sophisticated versions of LR are implemented in Spark MLlib.
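In Spark MLlib, these hyperparameters are exposed on LogisticRegression as regParam (the regularization strength) and elasticNetParam (the L1/L2 mixing coefficient); a brief sketch, with the numeric values chosen arbitrarily for illustration:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features",
                        regParam=0.1, elasticNetParam=0.5)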

48
MLlib in Action
• Create a simple test set based on a sample split of the data
train, test = preparedDF.randomSplit([0.7, 0.3])
• Fit a model (logistic regression classifier)
– We can simply instantiate a LogisticRegression model, using the
default configuration or hyperparameters.
– Set the label column and the features column; the column names, label and features, are actually the defaults.
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
– We can kick off a Spark job to train the model by the following:
fittedLR = lr.fit(train)

49
MLlib in Action
– Apply the model to the training dataset and see the prediction:
fittedLR.transform(train).select("label", "prediction").show()
+-----+----------+
|label|prediction|
+-----+----------+
| 0.0| 0.0|
| ...| ...|
| 1.0| 1.0|
| ...| ...|

– Next, we can evaluate this model and calculate the performance metrics
(e.g., TP rate and FN rate)
• In practice, we also need to try out different combinations of model hyperparameters
• call lr.explainParams() to retrieve a list of hyperparameters
– The Pipeline interface saves the manual effort on model selection and
hyperparameter tuning.

50
MLlib in Action
• Pipelining steps in an advanced analytics workflow
– The Pipeline interface allows you to set up a dataflow of a sequence of related operations that ends with an estimator.

51
MLlib in Action
Pipeline as an Estimator:

Pipeline as a Transformer:

52
MLlib in Action
• To build a pipeline, first split the data based on the original dataset df
(not the preparedDF used before)
train, test = df.randomSplit([0.7, 0.3])
• Recall that each stage in a Pipeline is either a transformer or an
estimator.
• There are two estimators in our case: for RFormula and the logistic
regression classifier.
rForm = RFormula()
lr = LogisticRegression()\
.setLabelCol("label").setFeaturesCol("features")
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)
– A “logical” pipeline is built in the last step.

53
MLlib in Action
• Training and Evaluation
– Train several variations of the model by specifying different
combinations of the hyperparameters
– MLlib provides a ParamGridBuilder class for this purpose:
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
.addGrid(rForm.formula, [
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2"])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.addGrid(lr.regParam, [0.1, 2.0])\
.build()
– In the above example code, we have selected:
• 2 versions of the RFormula
• 3 options for the ElasticNet parameter
• 2 options for the regularization parameter

54
MLlib in Action
• Thus, we want to evaluate a total of 12 different combinations of
parameters
• To determine which combination is optimal, we select an evaluator
and an evaluation metric.
– In this case, BinaryClassificationEvaluator and the areaUnderROC
metric.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
.setMetricName("areaUnderROC")\
.setRawPredictionCol("prediction")\
.setLabelCol("label")
• To actually perform the evaluation that determines a best model that
we train, some kind of validation is needed
– which is different from the holdout test dataset (in order to avoid
overfitting).

55
MLlib in Action
• Dataset splits:
– All data is split into a training set and a testing set; the training set is further split into training and validation parts.
– The validation and testing usually can share common methods.
• split the training dataset into two different groups (used below)
• perform K-fold cross-validation, etc.
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
    .setTrainRatio(0.75)\
    .setEstimatorParamMaps(params)\
    .setEstimator(pipeline)\
    .setEvaluator(evaluator)   # trainRatio controls how the training set is split
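Alternatively, K-fold cross-validation can be set up with CrossValidator in the same way (a sketch; the number of folds is an arbitrary choice):

from pyspark.ml.tuning import CrossValidator

cv = CrossValidator()\
    .setNumFolds(3)\
    .setEstimatorParamMaps(params)\
    .setEstimator(pipeline)\
    .setEvaluator(evaluator)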

56
MLlib in Action
• We are ready to train the model:
tvsFitted = tvs.fit(train)
type(tvsFitted)
Out: pyspark.ml.tuning.TrainValidationSplitModel
• Finally, we can test it with our holdout dataset
evaluator.evaluate(tvsFitted.transform(test))
Out: 0.9523809523809523 # may be different each time
Alternatively:
evaluator.evaluate(tvsFitted.bestModel.transform(test))
Out: 0.9523809523809523 # may be different each time
Check out models in the pipeline:
tvsFitted.bestModel
Out: PipelineModel_483685a888d17d310e60
tvsFitted.bestModel.stages
Out: [RFormula_4a689a, LogisticRegression_44dcb9a]

57
MLlib in Action
• Usually, a single pipeline in Spark includes one ML algorithm (as an estimator).
– If you want to try multiple competing ML algorithms (e.g., a DT classifier and an LR classifier), you may need to specify multiple pipelines manually.
• Persisting and Applying Models
– To save the model (to facilitate future usage):
tvsFitted.bestModel.write().save("path...")
– To load the model:
from pyspark.ml.pipeline import PipelineModel
myModel = PipelineModel.load("path...")

58
MLlib in Action
• Model Deployment

59
Deployment Patterns
• Train an ML model offline and then supply it with offline data.
• Train a model offline and then put the results into a database (usually
a key-value store).
• Train an ML algorithm offline, persist the model to disk, and then use
that for serving.
• Manually (or via some other software) convert a distributed model to
one that can run much more quickly on a single machine.
• Train an ML algorithm online and use it online (e.g., in conjunction
with Spark’s streaming modules)

60
Pipeline Example (Decision Tree)
• Dataset: a file sample_libsvm_data.txt in the LIBSVM format

– Each record is in the following format:
<label> <index1>:<value1> <index2>:<value2> ...
– Records are separated by '\n'.

61
Pipeline Example (Decision Tree)
# https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm")\
.load("../sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4
# distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features",
outputCol="indexedFeatures", maxCategories=4).fit(data)

62
Pipeline Example (Decision Tree)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel",
featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)

63
Pipeline Example (Decision Tree)
• Decision Tree model hyperparameters
– maxDepth
– maxBins
continuous features are converted into categorical features, and maxBins determines how many bins should be created from continuous features. The default is 32.
– impurity
entropy or gini (default)
– minInfoGain
– minInstancesPerNode
(A code sketch setting these follows.)
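These hyperparameters can be set on the classifier used earlier; a brief sketch with arbitrary example values (the parameter names are Spark's, the values are assumptions):

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures",
                            maxDepth=5, maxBins=32, impurity="gini",
                            minInfoGain=0.0, minInstancesPerNode=1)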

64
Summary
• Gradient-based optimisation (e.g., gradient descent)
– An important mathematical tool, widely adopted in training popular
ML models
• Large-scale ML: model parallelism vs. data parallelism
– Implementation of gradient-descent in Spark
• Spark MLlib
– High-level concepts: Transformer, estimator, pipeline
– Examples: logistic regression, decision tree

65
Questions?
