
Large-Scale Machine Learning
CSCI316: Big Data Mining Techniques and Implementation
Contents

• Gradient-based optimization in machine learning
• Parallel computation for large-scale ML
• Spark’s machine learning library: MLlib

2
Machine Learning Meets Big Data

3
Machine Learning Meets Big Data

4
How can ML benefit from parallelism?
• In Lecture 8, we saw that common relational and algebraic operations
can benefit from parallel computation (in MapReduce and Spark)
• How about machine learning?
– Observation: Most machine learning algorithms involve a group of
mathematical operations called Gradient-based Optimisation and, in
particular, Gradient Descent.
• Hence, our objective is to see:
– What is gradient-based optimisation, and
– How can it benefit from parallel computation.

5
Recall What ML is.
• Ingredients of a learning algorithm [Goodfellow et al. 2016]
– Task (e.g., classification, regression, clustering, etc.)
– Performance measure (e.g., accuracy, MSE, etc.)
– Experience (e.g., dataset w/o class labels)
• For supervised learning, given a set of observations X with labels y,
learn a model ℳ that predicts the labels of X as well as possible in
terms of some cost function.
– In the following, we assume that X is the output of pre-processing and
contains only numerical values.
– E.g., if the data includes categorical values, encoding applies first.

6
Optimisation in Machine Learning
• Most ML algorithms involve optimisation of a continuous (or piece-wise continuous) function $f(\theta)$ of some sort.
– That is, to find an instance $\theta^*$ such that $f(\theta^*)$ is minimal.
– Most often, $f$ is a cost function (such as the MSE) for the ML model, and $\theta$ represents the model parameters that need to be determined (from training data).
– For simplicity of presentation, let us assume $f$ is continuous.
• Decision tree regression models have piece-wise continuous cost functions.
• If $f$ is also differentiable (or piece-wise differentiable) in $\theta$ (again, this is often the case), you can:
– in some special cases, exploit the algebraic properties of $f$ to find a closed-form expression for $\theta^*$ and $f(\theta^*)$, or
– more generally, apply the generic technique of gradient descent to find $\theta^*$.

7
Linear Regression
• The goal of linear regression is to build a linear function that maps the data to the predicted values.
• Given an input vector $x \in \mathbb{R}^d$, we define the output as $\hat{y} = w^\top x$, where $w$ is a vector of parameters. Let $y$ be the true value.
– Note that if we fix the first element of $x$ as 1, then we obtain the form $\hat{y} = w'^\top x' + b$, where $w'$ is the weights and $b$ is the bias.
• For a set of observations, we have a $d \times m$ matrix $X$ and a $1 \times m$ vector $y$, where $m$ is the number of observations.
• The mean squared error (MSE) is $\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{m}\|\hat{y} - y\|_2^2$.

8
Linear Regression
• One nice property of linear regression is that, to minimise the above MSE, i.e., $\min_w \frac{1}{m}\|\hat{y} - y\|_2^2$, we have a closed-form solution $w^* = (XX^\top)^{-1} X y^\top$ (the normal equation).
• In other words, if $w = w^*$, the MSE (mean squared error) is at its minimum.
• In practice, this means that, for any given $X$ and $y$ (if $X$ is not very large), we can compute the parameter vector quickly!
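As an illustration (not from the original slides), a minimal NumPy sketch of this closed-form solution, assuming $X$ is stored as a $d \times m$ array whose first row is fixed to 1 and $y$ as a length-$m$ array:

import numpy as np

def closed_form_ols(X, y):
    # X is d x m (rows are features), y has length m
    # w* = (X X^T)^{-1} X y^T; use solve() rather than an explicit inverse
    return np.linalg.solve(X @ X.T, X @ y)

# tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = np.vstack([np.ones(100), rng.normal(size=100)])  # first row fixed to 1 (bias)
y = np.array([2.0, -3.0]) @ X + 0.01 * rng.normal(size=100)
print(closed_form_ols(X, y))  # approximately [2, -3]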

9
Logistic Regression
• Logistic regression builds a likelihood function (i.e., a logistic function) that provides a probability that the data belongs to each class.
• $\sigma(\sum_{j=1}^{d} w_j x_j + b) = \sigma(w^\top x + b)$
– $x$ is a tuple (i.e., record),
– $w, b$ are parameters
– (here we keep $w$ and $b$ separate)
– $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} = 1 - \sigma(-z)$
• We model the probability of a label $Y$ of some record with features $x$ being $y \in \{-1, 1\}$ as
$P(Y = y \mid x) = \frac{1}{1 + e^{-y(w^\top x + b)}}$

10
Logistic Regression
• The goal is to seek suitable $w, b$ that maximise the “overall” probability over all $m$ records, i.e.,
$\max_{w,b} L(w, b) := \prod_{i=1}^{m} P(Y = y_i \mid x_i) = \prod_{i=1}^{m} \frac{1}{1 + e^{-y_i(w^\top x_i + b)}}$
• This is equivalent to minimising the following negative log-likelihood function:
$\min_{w,b} \ell(w, b) := \sum_{i=1}^{m} \log(1 + e^{-y_i(w^\top x_i + b)})$
• Unfortunately, there is no closed-form solution for the above problem…
• However, in practice, we can still solve it efficiently by using a generic technique called gradient descent (and its variants).
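To make the objective concrete, here is a small NumPy sketch (an addition, not from the slides) of the negative log-likelihood for labels in {-1, +1}; the record layout (rows of X are records) is an assumption:

import numpy as np

def logistic_loss(w, b, X, y):
    # X: m x d array of records, y: length-m array with entries in {-1, +1}
    margins = y * (X @ w + b)
    # np.logaddexp(0, -z) computes log(1 + exp(-z)) in a numerically stable way
    return np.sum(np.logaddexp(0.0, -margins))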

11
Unconstrained Optimisation
• In general, we have a cost function $f$ which we would like to minimise.
• Unconstrained optimisation involves finding minima of functions that may have multiple inputs.
• $\min f(x) = f(x^*)$, where $x \in \mathbb{R}^n$, $f: \mathbb{R}^n \to \mathbb{R}$.
• Note that we deal only with minima, because any maximum of $f$ is a minimum of $-f$.
• Because the shape of $f$ can be very complex, we do not always aim to find the global minimum in $\mathbb{R}^n$, but instead points that take smaller function values than all of their neighbours.
– A local minimum is good enough.

12
Local and Global Minima

13
A General Algorithm
• Iterative optimization algorithms consist of setting initialization conditions, and then three iteration steps (sketched in code after this list):
1. Initialize: Choose a starting point – an initial guess that can either
be determined by your situation or that can be actively chosen.
2. While a stopping criterion is not true (the solution is not close
enough to the minimum), continue, else break and return the
current solution.
3. Find a descent direction – a direction in which the function
value decreases near the current point.
4. Determine the step size – the length of a step in the given direction
that leads to a good decrease.
5. Go back to step 2 until while loop exits.
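A minimal sketch of this generic loop in Python (an illustrative addition; the gradient function, starting point, step size and tolerance are all assumptions, with the negative gradient used as the descent direction, as introduced on the next slides):

import numpy as np

def iterative_minimise(grad, x0, step_size=0.1, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)        # 1. initialise with a starting point
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # 2. stopping criterion
            break
        x = x - step_size * g              # 3.-4. descent direction and step size
    return x                               # 5. loop until the criterion holds

# usage: minimise f(x, y) = (x - 1)^2 + (y + 2)^2, whose gradient is (2(x-1), 2(y+2))
print(iterative_minimise(lambda p: 2 * (p - np.array([1.0, -2.0])), [0.0, 0.0]))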

14
Local and Global Minima

15
Gradient
• For a function $f$ with multiple inputs $x$ (i.e., a multivariate function), we use partial derivatives to measure how much $f(x)$ changes as only the variable $x_i$ increases at point $x$:
$\frac{\partial f}{\partial x_i}(x)$
• The gradient generalises the notion of derivative to the case where the derivative is with respect to a vector.
• The gradient of $f$ at $x$ is the vector containing all the partial derivatives, denoted by
$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$

16
Gradient Example
• What is the gradient of $f(x, y, z) = x - xy + z^2$?
• Answer: $\nabla f(x, y, z) = (1 - y,\ -x,\ 2z)$

17
Stopping Criterion
• The minimum always has a certain property: the first derivative, i.e., the gradient, has to be zero: $\nabla f(x^*) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right) = \mathbf{0}$
• The gradient can be a vector with two components, and the above equation translates into the vector equation $\nabla f(x, y) = (\partial f/\partial x, \partial f/\partial y)^\top = (0, 0)^\top$, i.e., each component of the gradient must be zero.
• However, zero gradient is not a sufficient condition for achieving a minimum (or maximum).
• We call the points with zero gradient critical points.

18
Gradient Descent for MSE
• Goal: Find $\min_w \frac{1}{m}\|w^\top X - y\|_2^2 \Leftrightarrow \min_w \|w^\top X - y\|_2^2$
• That is, find $w^*$ that minimises $\|w^\top X - y\|_2^2$
• Consider a univariate function (for some index $j \in [1, d]$):
$f(w) = \|w\, X[j, :] - y\|_2^2 = \sum_{i=1}^{m} (w\, x_j^{(i)} - y^{(i)})^2$

19
Gradient Descent
• Start at a random point
• Repeat
– Determine a descent direction
– Choose a step size
– Update
• Until stopping criterion is satisfied

[Figure: the descent steps plotted on an error-vs-weight curve.]
(Source: ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling)
20
Small Steps Down the Gradient

21
Non-Convex Optimisation
• Both linear regression (with an MSE cost function) and logistic regression are convex optimisation problems, i.e., the local minima are the global minima.
• However, in general many ML models and cost functions result in
non-convex problems.
• Gradient descent still works, down to a local minimum.

22
Direction of Descent - Slope
• Know the error function, for example, $E(w) = w^2$ (where $w$ is the weight)
• Then, $\frac{dE(w)}{dw} = 2w$, so the descent direction is $-2w$
• Move in the opposite direction of the gradient.

23
Descent Direction and Magnitude (1D)
• The opposite direction of the slope points in the direction of steepest error descent in weight space:
$w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$
– where $i$ refers to the $i$-th step (or epoch), $\alpha_i$ is the step size, and $\frac{df}{dw}(w_i)$ is the slope (we move along the negative slope).
• Step size is a free parameter that has to be chosen carefully for each problem. It can be (and usually is) updated dynamically during the iteration.

24
Step Size

• An example step size is $\alpha_i = \frac{\alpha}{m\sqrt{i}}$
– where $m$ is the number of training points, $i$ is the iteration step, and $\alpha$ is a constant.

25
Update Rules for MSE
• Now consider how to update all weights for the linear regression.
• Scalar objective:
$f(w) = \|w\,x - y\|_2^2 = \sum_{j=1}^{m} (w\,x^{(j)} - y^{(j)})^2$
• Derivative:
$\frac{df}{dw}(w) = 2\sum_{j=1}^{m} (w\,x^{(j)} - y^{(j)})\, x^{(j)}$
• Scalar update:
$w_{i+1} = w_i - \alpha_i \sum_{j=1}^{m} (w_i\, x^{(j)} - y^{(j)})\, x^{(j)}$
• Vector update:
$\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{m} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
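A compact NumPy sketch of the vector update above (illustrative only; here X is assumed to be an m x d array whose rows are records, and the decaying step size from the earlier slide is used):

import numpy as np

def gd_linear_regression(X, y, alpha=0.5, num_iters=200):
    m, d = X.shape
    w = np.zeros(d)
    for i in range(num_iters):
        # sum_j (w^T x^(j) - y^(j)) x^(j), computed as one matrix-vector product
        gradient = X.T @ (X @ w - y)
        w -= (alpha / (m * np.sqrt(i + 1))) * gradient  # step size alpha / (m sqrt(i))
    return w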

26
Large-Scale ML and Parallel
Computation
Large-Scale ML
• Recall that ML updates a model ℳ based on data $X$
– For linear regression (with the MSE cost function, without regularisation), the model is expressed by the vector $w$.
• Big data: the input data $X$ is too large to hold in the main memory.
• Big model: the model ℳ is too large to hold in the main memory.
• Data parallelism and model parallelism:

Data parallelism Model parallelism

28
Gradient Descent: Big m, Big d
• Vector update: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{m} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
• By data parallelism, we compute the summands in parallel on workers, each receiving all of $\mathbf{w}_i$ at every iteration.
• For example, let $n = 6$ and the number of workers be 3:

[Figure omitted. Note: here $n = m$.]

• Bottleneck: the transfer of $\mathbf{w}_i$ between the driver (or parameter server) and the workers.

29
Gradient Descent: Big m, Huge d
• In linear regression, the data size (i.e., $m \times d$) is much larger than the model size (i.e., $d$). But in other ML algorithms, the model size can be huge even compared with the data size.
– As an example, consider a polynomial regression model with only two data records: $\hat{y} = \sum_{i,j=0}^{n} w_{i,j}\, x^{i} z^{j}$
– In deep learning, the number of parameters can reach billions.
• In this case, model parallelism splits the model across different computing nodes and lets different partitions of parameters be processed by different CPUs and GPUs.
• Divide and Conquer: fully update each partition locally and minimise
the global communication.

30
Stochastic Gradient Descent (SGD)
• Recall gradient descent for linear regression (with MSE); every iteration processes $m$ samples:
$\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{m} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
• The gradient is an expectation that can be approximated using a small set of samples drawn uniformly at random from the whole data set.
– In particular, if the data set is highly redundant, the gradient on the first half will be very similar to that on the second half.
• In SGD, we update the model with only one sample instead of $m$.
• SGD can improve the training speed and mitigate the large-data issue.
• Practice also shows that SGD helps the algorithm jump out of local minima and find the global minimum.

31
Minibatch Gradient Descent
• Increase the batch size from 1 to a number smaller than $m$.
• Divide the data set into small batches of examples, compute the gradient using a single batch, make an update, then move to the next batch of examples.
– The gradient over a batch can be computed with matrix-matrix multiplications, which are efficient, especially on GPUs.
– For classification, ideally mini-batches need to be balanced for classes (e.g., using stratified sampling).
• E.g., with a batch size of 10,
$\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{10} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
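An illustrative NumPy sketch of the minibatch update (not from the slides; the shuffling, batch size and step size are assumptions, and batch_size=1 recovers SGD):

import numpy as np

def minibatch_gd(X, y, batch_size=10, alpha=0.1, num_epochs=20, seed=0):
    # X: m x d array of records, y: length-m array of targets
    m, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        order = rng.permutation(m)                # reshuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb)           # gradient on this batch only
            w -= (alpha / len(idx)) * grad
    return w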

32
Synchronous vs. Asynchronous GD
• In synchronous (minibatch) GD, the Spark Driver (or a parameter
server) will wait until all parallel workers have returned their updated
model before continuing to the next iteration.
– Fast workers wait for slow workers
– Can guarantee fault-tolerance
• In (parallel) asynchronous GD (or minibatch GD), a parameter server will apply model updates from parallel workers immediately, whereupon the worker can immediately get a new copy of the model to work on a new mini-batch.
– Workers train the model concurrently on mini-batches without
blocking
– Needs to work with stale gradients

33
Beyond GD: Gradient-Based Optimisation
• First-order gradient-based optimisation:
– GD, stochastic GD and mini-batch GD
• Second-order gradient-based optimisation
– Newton’s method, which exploits the “gradient of gradient” (i.e.,
Hessian matrix) to determine the step size.
– Quasi-Newton methods, such as Limited-memory BFGS (LBFGS), which approximate the “gradient of gradient” with less complexity.

34
Gradient-Based Optimisation in Spark
• The architecture of Spark is well-suited to data parallelism and synchronous model updates (including gradient-based model updates).
– Model parallelism and asynchronous updates are not really supported in a standard Spark architecture.
– Also, the model must fit in the Spark Driver’s memory.
– A parameter server can handle large models and asynchronous updates (currently a hot research topic).
• More information about the gradient-based optimisation methods implemented in Spark ML:
– https://spark.apache.org/docs/latest/mllib-optimization.html
• Spark also uses methods (e.g., LBFGS) from the Scala scientific computing library Breeze:
– https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGS.scala

35
GD Implementation in Spark from scratch
• Consider the linear regression problem with big $m$ and big $d$:

import numpy as np
# m, alpha and numIters are assumed to be defined beforehand

# training_rdd is an RDD holding the data of the feature matrix X and the
# target vector y; we persist it across iterations
training_rdd.cache()
d = len(training_rdd.take(1)[0])   # length of one record (features plus target)
w = np.zeros(d - 1)
for i in range(numIters):
    # rec is the pair (x, y) drawn from X and y, respectively
    # udf(w, rec) is a user-defined function that produces (w^T x - y) x
    # we pass udf() to the map transformation; sum() is the reduce action
    gradient = training_rdd.map(lambda rec: udf(w, rec)).sum()
    alpha_i = alpha / (m * np.sqrt(i + 1))
    w -= alpha_i * gradient

• Note. In general, the udf is defined w.r.t. the loss function $L(\mathbf{w})$.
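For completeness, a possible definition of udf is sketched below, under the assumption that each record rec is a NumPy array whose last entry is the target y (the name and record layout are not prescribed by the slides):

import numpy as np

def udf(w, rec):
    # assumed record layout: rec = [x_1, ..., x_{d-1}, y]
    x, y = rec[:-1], rec[-1]
    # one summand of the MSE gradient: (w^T x - y) x
    return (np.dot(w, x) - y) * x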
36
Spark MLlib
End-to-End Project in Spark

38
The Process
• The key steps in the process of an end-to-end project:
1. Gathering and collecting the relevant data for your task.
2. Cleaning and inspecting the data to better understand it.
3. Performing feature engineering to allow the algorithm to leverage
the data in a suitable form (e.g., converting the data to numerical
vectors).
4. Using a portion of this data as a training set to train one or more
algorithms to generate some candidate models.
5. Evaluating and comparing models by objectively measuring results
on a subset of the same data that was not used for training.
6. Leveraging the insights from the above process and/or using the
model to make predictions, detect anomalies, or solve more general
business challenges.

39
High-Level MLlib Concepts

40
High-Level MLlib Concepts
• Transformers
– functions that convert raw data in some way.
– E.g., to create a new interaction variable, convert string categorical
values into numerical values
– primarily used in pre-processing and feature engineering
– takes a DataFrame as input and produces a new DataFrame as output

41
High-Level MLlib Concepts
• Estimators
– if provided with data, result in transformers
– algorithms that are used to train models
• Evaluators
– evaluate how a given model performs according to criteria (e.g.,
accuracy, ROC)
• Pipeline
– MLlib’s highest-level data type
– Transformers, estimators and evaluators are all stages in a
pipeline
– Similar to Scikit-Learn’s pipeline API
• Low-Level Data Types
– Vectors and Matrices (local vs distributed, dense vs sparse)
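For example, local dense and sparse vectors can be constructed as follows (a brief illustrative snippet):

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 3.0])
# sparse(size, indices, values): only positions 0 and 2 are non-zero
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])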

42
MLlib in Action
• Example 1: a (synthetic) dataset
df = spark.read.json(dataset)
df.orderBy("value2").show(5)
+-----+----+------+------------------+
|color| lab|value1| value2|
+-----+----+------+------------------+
| red|good| 35|14.386294994851129|
| blue| bad| 12|14.386294994851129|
| red| bad| 2|14.386294994851129|
| blue| bad| 8|14.386294994851129|
| red| bad| 16|14.386294994851129|
+-----+----+------+------------------+
only showing top 5 rows
– a categorical label with two values (good or bad), a categorical variable (color), and two numerical variables.

43
MLlib in Action
• Feature Engineering with Transformers
– As mentioned, transformers manipulate existing columns of a DataFrame and add new columns to it
– In MLlib, all inputs to ML algorithms in Spark must consist of type
Double (for labels) and Vector[Double] (for features).
• Note. our synthetic dataset does not meet this requirement
– RFormula: a declarative language for specifying ML transformers and
is simple to use (supports a limited subset of the R operators):
~ Separate target and terms
+ Concat terms
- Remove terms
: Interaction
. All columns except the target/dependent variable

44
MLlib in Action
• In our case, we use all variables and also add in the interactions
between some columns.
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
• The next step is to fit and apply the RFormula transformer to data
– Just call the fit and transform methods of an RFormula instance:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show(4)
+-----+----+------+------------------+--------------------+-----+
|color| lab|value1| value2| features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good| 1|14.386294994851129|(10,[1,2,3,5,8],[...| 1.0|
| blue| bad| 8|14.386294994851129|(10,[2,3,6,9],[8....| 0.0|
+-----+----+------+------------------+--------------------+-----+

45
MLlib in Action
• Look at the transformer output:
preparedDF.select("features").show(2,False)
+--------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------+
|(10,[1,2,3,5,8],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])|
|(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]) |
+--------------------------------------------------------------------+

– Behind the scenes, RFormula inspects the data during the fit call and
creates an RFormula model
type(fittedRF)
Out: pyspark.ml.feature.RFormulaModel
– When applying this model to transform the DF, Spark converts the
categorical values into Doubles and creates additional features for
interactions between color and value1/value2

46
Logistic Regression with Regularization
• Example: Will someone have a heart attack over the next year?

• Classification: Yes/No (i.e., 1 or − 1)


• Logistic Regression: likelihood of a heart attack, given by
$\sigma(\sum_{j=1}^{d} w_j x_j + b) = \sigma(w^\top x + b)$
– $x$ is a tuple (i.e., record),
– $w, b$ are parameters
– $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} = 1 - \sigma(-z)$

47
Logistic Regression with Regularization
• The goal is to maximize the following log-likelihood function:
$\max_{w,b} L(w, b) := \sum_{i=1}^{m} -\log(1 + e^{-y_i(w^\top x_i + b)})$ (see previous slides)
• In practice, we usually consider adding a regularization (penalty) term:
$\max_{w,b} L(w, b) - \lambda R(w)$, equivalently $\min_{w,b} \ell(w, b) + \lambda R(w)$
– where $\lambda \geq 0$,
– $R(w) = \rho\, \|w\|_2^2 + (1 - \rho)\, \|w\|_1$
• with $\|w\|_2$ the Euclidean norm and $\|w\|_1$ the Manhattan distance.
– $\rho \in [0, 1]$ is a coefficient (e.g., an elastic net parameter)
• This extra term helps prevent overfitting the LR model.
• $\lambda$ and $\rho$ are hyperparameters of the logistic regression classifier.
• Variants and more sophisticated versions of LR are implemented in Spark MLlib.
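In Spark MLlib, these hyperparameters are exposed on LogisticRegression as regParam (the regularization strength) and elasticNetParam (the L1/L2 mixing coefficient); a brief sketch, with the numeric values chosen arbitrarily for illustration:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features",
                        regParam=0.1, elasticNetParam=0.5)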

48
MLlib in Action
• Create a simple test set based on a sample split of the data
train, test = preparedDF.randomSplit([0.7, 0.3])
• Fit a model (logistic regression classifier)
– We can simply instantiate a LogisticRegression model, using the
default configuration or hyperparameters.
– Set the label column and the features column; the column names, label and features, are actually the defaults.
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
– We can kick off a Spark job to train the model by the following:
fittedLR = lr.fit(train)

49
MLlib in Action
– Apply the model to the training dataset and see the prediction:
fittedLR.transform(train).select("label", "prediction").show()
+-----+----------+
|label|prediction|
+-----+----------+
| 0.0| 0.0|
| ...| ...|
| 1.0| 1.0|
| ...| ...|

– Next, we can evaluate this model and calculate the performance metrics
(e.g., TP rate and FN rate)
• In practice, we also need to try out different combinations of model hyperparameters
• call lr.explainParams() to retrieve a list of hyperparameters
– The Pipeline interface saves the manual effort on model selection and
hyperparameter tuning.

50
MLlib in Action
• Pipelining steps in an advanced analytics workflow
– The Pipeline interface allows you to set up a dataflow of a sequence of related operations that ends with an estimator.

51
MLlib in Action
Pipeline as an Estimator:

Pipeline as a Transformer:

52
MLlib in Action
• To build a pipeline, first split the data based on the original dataset df
(not the preparedDF used before)
train, test = df.randomSplit([0.7, 0.3])
• Recall that each stage in a Pipeline is either a transformer or an
estimator.
• There are two estimators in our case: for RFormula and the logistic
regression classifier.
rForm = RFormula()
lr = LogisticRegression()\
.setLabelCol("label").setFeaturesCol("features")
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)
– A “logical” pipeline is built in the last step.

53
MLlib in Action
• Training and Evaluation
– Train several variations of the model by specifying different
combinations of the hyperparameters
– MLlib provides a ParamGridBuilder class for this purpose:
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
.addGrid(rForm.formula, [
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2"])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.addGrid(lr.regParam, [0.1, 2.0])\
.build()
– In the above example code, we have selected:
• 2 versions of the RFormula
• 3 options for the ElasticNet parameter
• 2 options for the regularization parameter

54
MLlib in Action
• Thus, we want to evaluate a total of 12 different combinations of
parameters
• To determine which combination is optimal, we select an evaluator
and an evaluation metric.
– In this case, BinaryClassificationEvaluator and the areaUnderROC
metric.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
.setMetricName("areaUnderROC")\
.setRawPredictionCol("prediction")\
.setLabelCol("label")
• To actually perform the evaluation that determines a best model that
we train, some kind of validation is needed
– which is different from the holdout test dataset (in order to avoid
overfitting).

55
MLlib in Action
• Dataset splits:
– All data is split into a training set and a testing set; the training set is further split into training and validation parts.
– The validation and testing usually can share common methods.
• split the training dataset into two different groups (used below)
• perform K-fold cross-validation, etc.
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
    .setTrainRatio(0.75)\
    .setEstimatorParamMaps(params)\
    .setEstimator(pipeline)\
    .setEvaluator(evaluator)   # trainRatio controls how the training set is split
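Alternatively, K-fold cross-validation can be set up with CrossValidator in the same way (a sketch; the number of folds is an arbitrary choice):

from pyspark.ml.tuning import CrossValidator

cv = CrossValidator()\
    .setNumFolds(3)\
    .setEstimatorParamMaps(params)\
    .setEstimator(pipeline)\
    .setEvaluator(evaluator)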

56
MLlib in Action
• We are ready to train the model:
tvsFitted = tvs.fit(train)
type(tvsFitted)
Out: pyspark.ml.tuning.TrainValidationSplitModel
• Finally, we can test it with our holdout dataset
evaluator.evaluate(tvsFitted.transform(test))
Out: 0.9523809523809523 # may be different each time
Alternatively:
evaluator.evaluate(tvsFitted.bestModel.transform(test))
Out: 0.9523809523809523 # may be different each time
Check out models in the pipeline:
tvsFitted.bestModel
Out: PipelineModel_483685a888d17d310e60
tvsFitted.bestModel.stages
Out: [RFormula_4a689a, LogisticRegression_44dcb9a]

57
MLlib in Action
• Usually, a single pipeline in Spark includes one ML algorithm (as an estimator).
– If you want to try multiple competing ML algorithms (e.g., a DT classifier and an LR classifier), you may need to specify multiple pipelines manually.
• Persisting and Applying Models
– To save the model (to facilitate future usage):
tvsFitted.bestModel.write().save("path...")
– To load the model:
from pyspark.ml.pipeline import PipelineModel
myModel = PipelineModel.load("path...")

58
MLlib in Action
• Model Deployment

59
Deployment Patterns
• Train an ML model offline and then supply it with offline data.
• Train a model offline and then put the results into a database (usually
a key-value store).
• Train an ML algorithm offline, persist the model to disk, and then use
that for serving.
• Manually (or via some other software) convert a distributed model to
one that can run much more quickly on a single machine.
• Train an ML algorithm online and use it online (e.g., in conjunction
with Spark’s streaming modules)

60
Pipeline Example (Decision Tree)
• Dataset: a file sample_libsvm_data.txt in the LIBSVM format

– Each record is in the following format:
<label> <index1>:<value1> <index2>:<value2> ...
– Records are separated by '\n'.

61
Pipeline Example (Decision Tree)
# https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm")\
.load("../sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4
# distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features",
outputCol="indexedFeatures", maxCategories=4).fit(data)

62
Pipeline Example (Decision Tree)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel",
featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)

63
Pipeline Example (Decision Tree)
• Decision Tree model hyperparameters
– maxDepth
– maxBins
continuous features are converted into categorical features, and maxBins determines how many bins should be created from continuous features. The default is 32.
– impurity
entropy or gini (default)
– minInfoGain
– minInstancesPerNode
(A code sketch setting these follows.)
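These hyperparameters can be set on the classifier used earlier; a brief sketch with arbitrary example values (the parameter names are Spark's, the values are assumptions):

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures",
                            maxDepth=5, maxBins=32, impurity="gini",
                            minInfoGain=0.0, minInstancesPerNode=1)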

64
Summary
• Gradient-based optimisation (e.g., gradient descent)
– An important mathematical tool, widely adopted in training popular
ML models
• Large-scale ML: model parallelism vs. data parallelism
– Implementation of gradient-descent in Spark
• Spark MLlib
– High-level concepts: Transformer, estimator, pipeline
– Examples: logistic regression, decision tree

65
Questions?
