
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

LIBS Multivariate Analysis with Machine Learning

OLIVIER NICOLINI

Master in Computer Science
Date: August 3, 2020
Supervisor: Ying Liu
Examiner: Saikat Chatterjee
School of Electrical Engineering and Computer Science
Host company: Swerim
Swedish title: LIBS multivariat analys med maskininlärning

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
Laser-Induced Breakdown Spectroscopy (LIBS) is a spectroscopic technique used for chemical analysis of materials. By analyzing the spectrum obtained with this technique it is possible to determine the chemical composition of a sample. The possibility to analyze materials in a contactless and online fashion, without sample preparation, makes LIBS one of the most interesting techniques for chemical composition analysis. However, despite its intrinsic advantages, LIBS analysis suffers from poor accuracy and limited reproducibility of the results, due to interference effects caused by the chemical composition of the sample or other experimental factors. How to improve the accuracy of the analysis by extracting useful information from high-dimensional LIBS data remains the main challenge of this technique. In the present work, with the purpose of proposing a robust analysis method, I present a pipeline for multivariate regression on LIBS data composed of preprocessing, feature selection, and regression. First, raw data are preprocessed by intensity filtering, normalization and baseline correction to mitigate the effect of interference factors such as laser energy fluctuations or the presence of baseline in the spectrum. Feature selection then finds the most informative lines for an element, which are used as input in the subsequent regression phase to predict the element concentration. Partial Least Squares (PLS) and Elastic Net showed the best predictive ability among the regression methods investigated, while Interval PLS (iPLS) and Iterative Predictor Weighting PLS (IPW-PLS) proved to be the best feature selection algorithms for this type of data. By applying these feature selection algorithms to the full LIBS spectrum before regression with PLS or Elastic Net it is possible to get accurate predictions in a robust fashion.

Sammanfattning
Laser-Induced Breakdown Spectroscopy (LIBS) is a spectroscopic technique used for chemical analysis of materials. By analyzing the spectrum obtained with this technique it is possible to understand the chemical composition of a sample. The possibility to analyze materials in a contactless and online manner, without sample preparation, makes LIBS one of the most interesting techniques for chemical composition analysis. Despite its inherent advantages, LIBS analysis suffers from poor accuracy and limited reproducibility of the results, due to interference effects caused by the chemical composition of the sample or other experimental factors. How to improve the accuracy of the analysis by extracting useful information from high-dimensional LIBS data remains the greatest challenge of this technique. In the present work, with the aim of proposing a robust analysis method, I present a pipeline for multivariate regression on LIBS data consisting of preprocessing, feature selection and regression. First, raw data are preprocessed by applying intensity filtering, normalization and baseline correction to mitigate the effect of interference factors such as laser energy fluctuations or the presence of baseline in the spectrum. Feature selection makes it possible to find the most informative lines for an element, which are then used as input in the subsequent regression phase to predict the element concentration. Partial Least Squares (PLS) and Elastic Net showed the best predictive ability among the regression methods investigated, while Interval PLS (iPLS) and Iterative Predictor Weighting PLS (IPW-PLS) proved to be the best feature selection algorithms for this type of data. By applying these feature selection algorithms to the whole LIBS spectrum before regression with PLS or Elastic Net it is possible to obtain accurate predictions in a robust manner.

Contents

1 Introduction
  1.1 Problem
  1.2 Goal
  1.3 Benefits, Ethics and Sustainability
  1.4 Methodology
  1.5 Delimitations
    1.5.1 Denoising
    1.5.2 Deep Learning
  1.6 Outline

2 Background
  2.1 Univariate calibration analysis
  2.2 Multivariate calibration analysis
    2.2.1 Ordinary Least Squares
    2.2.2 Principal Component Regression
    2.2.3 Partial Least Squares
    2.2.4 Elastic Net
    2.2.5 Random Forest
    2.2.6 Kernel Methods
  2.3 Preprocessing
    2.3.1 Normalization
    2.3.2 Baseline Correction
  2.4 Feature Selection Methods
  2.5 Related Work
    2.5.1 Least Squares Baseline Correction

3 Methods
  3.1 Data
  3.2 Metrics
  3.3 Evaluation
  3.4 Preprocessing
    3.4.1 Intensity filtering
    3.4.2 Baseline Correction
    3.4.3 Normalization
  3.5 Full spectrum multivariate analysis
  3.6 Feature selection methods

4 Results and Discussion
  4.1 Full spectrum results
  4.2 Feature selection results

5 Conclusions
  5.1 Future Work

Bibliography

A Dataset element concentration
Chapter 1

Introduction

Laser-Induced Breakdown Spectroscopy (LIBS) is a spectroscopic technique used for chemical analysis of materials. This is done by focusing consecutive short laser pulses onto a sample. The laser ablates a small portion of material and generates a high-temperature micro-plasma on the sample surface. During the plasma cooldown phase, due to their unique electronic configuration, the various elements present in the plasma (excited atoms and ions) emit light at specific wavelengths as they return to their ground state. This light emission is then collected by a spectrometer. A visual representation of a typical LIBS system can be observed in Figure 1.1.
Since the wavelengths at which emitted light is collected constitute the fin-
gerprint of the elements contained in the sample, it is possible to determine if
an element is present or not. Figure 1.2 shows an example where the spectra
of two different samples are compared. There it can be observed that, even
if spectra come from different samples, each element has its corresponding
emission peaks always at the same wavelengths.
Moreover, as the peak intensities of an element are related to its concentration in the sample, it is also possible to perform a regression analysis and determine the element concentration, usually expressed as a percentage. In the analytical chemistry field this type of regression analysis is called quantitative analysis.
LIBS has been applied in many different fields such as industrial manu-
facturing [2], [3], food analysis [4], soil analysis [5], and for space exploration
[6].
At Swerim, this technique is used to analyze the chemical composition of met-
als or slag materials, which are by-products created during steel production. It
is used to perform quality checks of either raw materials or finished products,

or as a process monitoring tool during the production of new materials, for example in order to control that the correct steel or aluminium alloy is obtained and adjust the process if needed.

Figure 1.1: Schematics of a LIBS system.
The main reasons for which LIBS should be preferred over other chemical
analysis techniques such as X-Ray Fluorescence (XRF) come from the fact
that LIBS analysis requires no sample preparation, and that measurements can
be done in situ and in a contactless fashion, delivering results in real-time.
Other advantages of this technique are the ability to analyze samples in any physical state (solid, liquid or gas) in a minimally destructive manner and the possibility to analyze multiple elements simultaneously.
However, despite all these advantages, LIBS suffers from poor accuracy and limited reproducibility of results. This is due to the uncontrolled atmosphere around the targeted sample and variations in other experimental parameters such as sample inhomogeneity, fluctuations in laser energy, the so-called matrix effects, i.e. the effects that other components present in the sample have on the analysis of the examined element, and self-absorption, a phenomenon where the light emitted by excited species (atoms or ions) is reabsorbed by the atoms or ions in their ground state, and hence is not collected by the spectrometer.
Due to these factors, it is also often the case that the collected LIBS spectra
are not of great quality because of light intensity fluctuations across measure-
ments, low signal-to-noise ratio (SNR) or the presence of baseline, a shift in
amplitude that changes across different wavelengths.
These factors together make the analysis of the spectra unreliable and inaccu-
rate without the application of some preprocessing steps such as normalization

and baseline correction before the regression phase. This is especially crucial in the case of quantitative analysis as, in general, it is more difficult to identify the concentration of an element than to just determine whether it is present inside a sample.

Figure 1.2: Spectra from two different samples are shown. Even if the sample compositions are different, each element has its characteristic peaks always at the same wavelengths. Reprinted from [1].
According to the literature, the most common preprocessing steps needed to reduce interference factors are:

• Normalization [6]–[12]

• Baseline correction [6], [8], [11]

• Denoising [8]

After the preprocessing phase, univariate or multivariate methods are applied to get the concentration of the element of interest. These methods are based on a calibration curve that relates the element concentration to the intensity of the LIBS signal. Once this curve is established, it is possible to calculate the concentration of unknown samples measured under the same circumstances. Methods that use just one variable, i.e. just one of the spectral intensity lines of the element of interest, are called univariate methods, while multivariate methods use multiple emission lines, up to the whole spectrum.
The classical univariate approach requires one to identify one of the intensity peaks in the spectrum for the element of interest and then form the so-called calibration curve between the emission intensity line and the elemental concentration. In order to know at which wavelengths the element peaks in a spectrum can be found, one consults either the NIST [13] or AtomTrace [14] databases. However, the univariate technique is not well suited for quantitative analysis of complex samples where matrix effects and self-absorption phenomena are present, since these negatively affect the quality of the obtained results.
Multivariate analysis can reduce the influence of matrix effects over predic-
tions by taking advantage of the information contained in the whole spectrum,
instead of using just one emission line to predict the analyte concentration, and
therefore improve the accuracy of the analysis.
With the rise of machine learning algorithms that find applications across many different areas, one should not be surprised that machine learning methods have also been applied to LIBS data in the multivariate approach. Over the years many regression methods have been used for quantitative analysis, such as Principal Components Regression (PCR) [15], [16], Partial Least Squares (PLS) [5], [11], [12], [15]–[17], Lasso [17], Elastic Net [5], [16], Random Forest (RF) [18], [19], and kernel methods [16]. In more recent years there have been some attempts to use deep learning models such as Multi Layer Perceptron (MLP) [10], [20] and Convolutional Neural Network (CNN) [21] models.
However, as the average recorded spectra are composed of thousands of
emission lines, using the whole spectrum for multivariate quantitative analysis
can still result in inaccurate predictions as some spectral regions may introduce
sources of error [22]. It has been shown [23] that better predictions can be
achieved using only a small fraction of these lines by using variable selection
methods.
In the present work, I first identify the preprocessing steps needed to improve the quality of raw LIBS data. I then investigate which multivariate methods are most effective with this kind of data, first using the whole spectrum as input. To further improve the prediction quality, feature selection methods such as Genetic Algorithm PLS (GA-PLS) [24], Interval PLS (iPLS) [25] and Iterative Predictor Weighting PLS (IPW-PLS) [26] are used to find an optimal subset of emission lines. This allowed the creation of a pipeline composed of preprocessing, variable selection and finally regression. The first step consisted in enhancing the quality of data by the application of four preprocessing steps: intensity filtering, baseline correction, normalization and averaging. After preprocessing, feature selection methods were used to find an optimal subset of variables, to be used for fitting regression models and getting more accurate predictions.

1.1 Problem
Over the last two decades, due to the previously stated advantages of LIBS
over other chemical analysis techniques, there has been an increasing interest
from the scientific community to improve the accuracy of LIBS predictions.
This has been done by either improving the physical instruments (laser, spec-
trometer, etc.) or by improving the post-experiment analysis.
In the latter case, the post-experiment analysis is often done manually by an ex-
pert with a chemical background and requires a considerable amount of time.
Furthermore, the intrinsic limitations of LIBS, such as the poor accuracy and
the limited reproducibility of results, make predictions unreliable, especially
in the case of complex spectra.
With the rise of the various machine learning methods that find applications across many different fields, there has been an increasing interest in applying these techniques to LIBS data. However, even with machine learning techniques, the analyst must be careful and able to understand which steps and methods are needed to get robust results.
To do this, several questions need to be answered: what are the preprocessing steps needed to improve the quality of data? What is the best multivariate method for working on LIBS data? Is it better to use all the spectral information in the spectra or just a subset of emission lines? In the latter case, how can the most informative variables in the spectrum be found, to be used as input to the regression method and further improve predictions?

1.2 Goal
The goal of this study is to find the best regression method(s) for multivariate quantitative analysis on LIBS data, together with the preprocessing steps and feature selection methods needed to yield the best possible predictions, in a way that requires the least possible intervention from an analyst.

1.3 Benefits, Ethics and Sustainability
The main benefit that this thesis brings is the possibility to discover a new
approach to LIBS quantitative analysis by bringing into play methods that can
help get predictions of higher quality in less time. Another asset of this research is the possibility to use the new approach to discover less common elemental peaks or windows of emission lines that can improve the predictions even when used in a more classical univariate approach.
The improvement in the quality of LIBS predictions could increase the use of this non-destructive technique in new and different applications. The increased use of this technique could also lead to the creation of bigger datasets, which in turn could allow the introduction of deep learning techniques for LIBS analysis, possibly further increasing the quality of predictions.
From a sustainability point of view, the use of LIBS as a monitoring tool would ensure better control in the production of metal and slag pieces. This in turn would avoid unnecessary waste of energy and raw materials and introduce the possibility to use by-products of industrial processes, such as slag, in other industrial applications once their chemical composition is known. This would help in the creation of a greener and more circular economy, in accordance with the United Nations' Sustainable Development Goals (SDGs) [27]. In particular this would be in line with the 12th goal of this agenda, "Responsible consumption and production" [28], by making efficient use of natural resources and reducing waste generation through reuse and recycling.

1.4 Methodology
This study followed a deductive quantitative approach by first studying the literature in order to find out which were the best multivariate approaches used for quantitative analysis on LIBS spectra. A divide-and-conquer approach was followed while verifying the hypothesis. Specifically, the initial hypothesis was divided and expanded into multiple sub-hypotheses that were then verified to support the initial hypothesis. Hypotheses were tested with the implementation of a regression pipeline and results were analyzed quantitatively.

1.5 Delimitations
1.5.1 Denoising
Denoising methods were not investigated in this study since the LIBS experi-
mental setting already yielded signals with little noise.

1.5.2 Deep Learning


Deep Learning is a subfield of machine learning concerned with algorithms that take inspiration from the human brain and are composed of several layers between input and output.
The main difference between classical machine learning and deep learning is that, in the former framework, one selects important features and then feeds them into a model, whereas in a deep learning framework the model learns the important features from raw data and at the same time is able to perform classification or regression. In order to do this, deep learning models require huge datasets, usually composed of at least a few thousand samples.
Deep learning models have been used in many fields of expertise like image and video processing, natural language processing and many more [29].
Some attempts have also been made to apply deep neural networks to spectral data [10], [20], [21]. However, papers using deep learning models on this type of data claim state-of-the-art performance but often lack the amount of data needed for these kinds of networks. Therefore, the high accuracy results are normally just due to heavy overfitting.
As in many other fields, the lack of data is the main problem: the theoretical assumptions may not be wrong, but the way experiments are carried out invalidates the scientific claims.
In this research too, lack of data was the reason deep models were not used, as the available datasets had too small a sample size.

1.6 Outline
In Chapter 2 the theoretical background on univariate and multivariate methods will be discussed. The various regression methods used in this work will be presented from a mathematical point of view. Then normalization and baseline correction methods will be presented together with the related work. Finally, a short introduction to feature selection methods and their use in LIBS is given.
In Chapter 3 the datasets, the metrics and the evaluation procedure will be presented. The various preprocessing steps used are then explained, followed by the description of how the regression methods are implemented in the full spectrum scenario. Finally, the feature selection methods used are discussed.
In Chapter 4 the results of the experiments in the full spectrum scenario and the results of the regression methods after feature selection are presented and discussed.
In Chapter 5 the conclusions and possible future work are discussed.

Chapter 2

Background

There are two main quantitative approaches to analyse samples with LIBS: calibration methods and the so-called CF-LIBS (calibration-free LIBS).
Calibration methods are based on a calibration curve that relates the element concentration to the intensity of the LIBS signal. In order to build this curve, the elemental composition of some reference samples is required. The reference sample composition normally comes from the analysis of the samples with another chemical analysis technique, such as X-Ray Fluorescence (XRF). Once the calibration curve is established, it is then possible to analyze unknown samples and determine the concentration of a particular element.
As mentioned in Chapter 1, calibration methods can be either univariate, where just one of the spectral intensity lines, i.e. one variable, is used, or multivariate, where multiple emission lines, up to the whole spectrum, are used.
In order to avoid the need for calibration curves and for reference sample concentrations obtained with XRF or other chemical analysis techniques, CF-LIBS analysis has been developed by Ciucci [30] to determine chemical composition by accounting for physical and chemical matrix effects theoretically, through analysis of the spectrum. The problem with CF-LIBS is that this technique is based on several physical and experimental assumptions that need to be satisfied to correctly determine the chemical composition, and this is not always possible.

2.1 Univariate calibration analysis


This has been the classical approach to regression analysis of LIBS spectra for many years, and for this reason it has also been called the standard calibration method. Univariate calibration analysis takes advantage of the fact that the characteristic line intensity of an element in the spectrum is proportional to the element concentration in the sample, when there is no self-absorption. In the presence of self-absorption effects, a quadratic relation may be observed.
This method requires assigning each peak of the LIBS spectrum to the corresponding element, in agreement with the NIST or AtomTrace databases. Once the various peaks are paired with their corresponding element, a calibration curve can be established by relating the element concentration to the intensity of one of the element peaks. After establishing this curve, it is possible to calculate the element concentration of unknown samples.
However, common practice is to use the so-called "ratio method" [31], where the ratio between a peak of the element of interest and a peak of the matrix element is used to form the calibration curve, instead of just using one of the emission lines of an element. This second approach normalizes the spectra and corrects for energy fluctuations of the laser pulses between shots.
The main threat to this method comes from the spectral interference between different elements in the same region, which introduces nonlinearities. By relying on just one line, this method is not robust enough in regions where several different elements could have generated a certain peak, as illustrated by Figure 2.1. For example, the quantitative analysis of minor elements in steels is usually complicated by the spectral interference of iron, caused by its high number of intensity lines in the spectrum [32].
Therefore, univariate calibration is not well suited for quantitative analysis of such complex samples since, when interference effects and self-absorption phenomena are present, the quality of the obtained result will be negatively affected.
This is why multivariate techniques have gained more and more attention in the field: by making use of multiple emission lines, they can mitigate spectral interference effects and their impact on the analysis of LIBS spectra.

2.2 Multivariate calibration analysis


As mentioned in the previous section, the main problem of the univariate calibration method is that matrix effects, self-absorption and other interference factors can affect the analysis of the element of interest by introducing nonlinearities that make a prediction based on a single emission line unreliable. Multivariate analysis can reduce the effect of these factors by taking advantage of the information contained in the whole spectrum and thus improve the accuracy of the analysis. Multivariate methods have proven to be better than the univariate
one, since they can exploit all the information present in a spectrum [11], [12].

Figure 2.1: Possible elements to be attributed to the emission peaks in the spectral window between 282 and 296 nm. Next to each element is the characteristic wavelength of emission in nm.
Over the years many regression techniques have been used for quantitative analysis, such as Principal Components Regression (PCR), Partial Least Squares (PLS), Lasso, Elastic Net, Random Forest (RF) and kernel trick methods. In more recent years there have also been some attempts to use deep learning methods, such as Multi Layer Perceptron (MLP) and Convolutional Neural Network (CNN) models, on spectroscopic data.

2.2.1 Ordinary Least Squares


Least squares [33] is a "family" of regression analysis techniques used to approximate the solution of overdetermined systems by minimizing the sum of squared residuals, i.e. the differences between the observed values and the predicted values provided by a model.
The simplest of this family of methods, Ordinary Least Squares (OLS) regression, often called just linear regression, is a statistical method that finds the coefficients β = (β₁, ..., βₚ) of the linear function Y = Xβ, where X is the independent variable of dimensions n × p and Y is the dependent variable of dimensions n × k.
This is done by minimizing the sum of the squares of the differences between the dependent variable Y and the predicted target:

$$\min_{\beta} \|X\beta - Y\|_2^2$$

Assuming that the p columns, i.e. the features, of the matrix X are linearly independent, this minimization problem has a unique solution, found from the normal equation

$$(X^T X)\,\beta = X^T Y$$

which gives as a solution

$$\beta = (X^T X)^{-1} X^T Y$$
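As an illustration, a minimal NumPy sketch of OLS through the normal equations; a least-squares solver is used instead of the explicit inverse, which is numerically safer but yields the same solution when the columns of X are linearly independent (the toy data is invented):

```python
import numpy as np

def ols_fit(X, Y):
    """Solve min_beta ||X beta - Y||_2^2, i.e. the normal equations above."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

# Toy usage: 5 samples, 2 features; the true coefficients are recovered.
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 6.]])
Y = X @ np.array([0.5, -1.0])
print(ols_fit(X, Y))  # -> [ 0.5 -1. ]
```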

2.2.2 Principal Component Regression


Principal component regression (PCR) is a statistical method based on principal component analysis (PCA) [34], used for regression and introduced by Kendall in 1957 [35]. This technique can be useful when data have high multicollinearity, i.e. when two or more variables are highly linearly related. This is the case for LIBS data, where adjacent emission lines generally have high correlation. To overcome this problem, instead of doing linear regression directly on raw data, the data is first analyzed with PCA, which works by converting the unprocessed data into a set of linearly uncorrelated variables: the so-called principal components. These components are linearly independent and are defined in such a way that the first component accounts for the largest possible variance, the second component for the second largest possible variance, and so on. After this decomposition is done, a linear model is used for regression on the principal components. This allows dimensionality reduction, prevents overfitting and at the same time avoids the problems caused by collinearity.

These are the main steps of this technique:

1. Standardization of X, by subtracting µ and dividing by σ, which are respectively the mean and the standard deviation of each column of X:
$$X \leftarrow \frac{X - \mu}{\sigma}$$

2. Covariance matrix calculation:
$$C = \frac{1}{n} X^T X$$

3. Eigendecomposition of C, where the columns of V are the eigenvectors of C and Λ is the diagonal matrix of eigenvalues:
$$V^T C V = \Lambda$$

4. Choose the k eigenvectors of the matrix V with the largest eigenvalues (the principal components) and transform the samples onto the new subspace:
$$W = [v_1, \ldots, v_k], \qquad Z = XW$$

5. The OLS regression β coefficients are obtained by regressing on the principal components Z:
$$\beta_Z = (Z^T Z)^{-1} Z^T Y$$

6. The regression coefficients in the original space X can then be obtained from
$$\hat{Y} = Z\beta_Z = XW\beta_Z = X(W\beta_Z) = X\beta_X, \qquad \beta_X = W\beta_Z$$
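A compact NumPy sketch of steps 1–6, assuming X has shape n × p and Y shape n × 1 (function and variable names are illustrative only):

```python
import numpy as np

def pcr_fit(X, Y, k):
    """Principal Component Regression following steps 1-6 above."""
    # 1. Standardize each column of X.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mu) / sigma
    # 2. Covariance matrix of the standardized data.
    C = Xs.T @ Xs / len(Xs)
    # 3. Eigendecomposition (eigh returns ascending eigenvalues).
    eigvals, V = np.linalg.eigh(C)
    # 4. Keep the k eigenvectors with the largest eigenvalues, project.
    W = V[:, np.argsort(eigvals)[::-1][:k]]
    Z = Xs @ W
    # 5. OLS on the principal components.
    beta_Z = np.linalg.lstsq(Z, Y, rcond=None)[0]
    # 6. Coefficients mapped back to the (standardized) original space.
    return W @ beta_Z
```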

2.2.3 Partial Least Squares


Partial Least Squares (PLS), introduced in 1966 by Herman Wold [36], is a method that has been widely used in many areas of analytical chemistry. Similarly to PCR, this technique reduces data dimensionality, but it also takes the target variable Y into account.
The principle behind PLS is that both the independent and dependent variables X and Y are decomposed into their latent structures. PLS models find the directions in the X space that account for the maximum variance directions in the Y space by decomposing X and Y as X = TPᵀ and Y = UQᵀ, and then doing regression using the T and U matrices. Using this decomposition, the latent variable corresponding to the most variation in Y will be explained by the latent variable in X that describes this variation best.

After standardizing both X and Y, the following steps are applied to find each latent variable, where E and F denote the current deflated versions of X and Y, initialized as E = X and F = Y:

1. A singular value decomposition (SVD) is done on the matrix $S = X^T Y$ to find $w$ and $q$, the first left and right singular vectors respectively.

2. The scores $t$ and $u$ are obtained from
$$t = Xw = Ew, \qquad u = Yq = Fq$$

3. The X and Y loadings $p$ and $q$ are found by regressing E and F against $t$:
$$p = E^T t, \qquad q = F^T t$$

4. The data matrices E and F are deflated:
$$E_{n+1} = E_n - t p^T, \qquad F_{n+1} = F_n - t q^T$$

The estimation of the next component is done by repeating steps 1–4, starting from the SVD of the matrix $E_{n+1}^T F_{n+1}$. After every iteration, the vectors w, t, p and q are saved as columns of the matrices W, T, P and Q, respectively.
Finally, the β coefficients are calculated as

$$\beta = (T^T T)^{-1} T^T U$$

The Y values for any given X can then be found using

$$Y = U Q^T = T\beta\, Q^T = X P \beta\, Q^T$$
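In practice PLS is rarely implemented by hand; a hedged sketch using scikit-learn's PLSRegression, on synthetic stand-in data (the shapes and the single informative line are invented for illustration), with the number of latent variables screened by cross-validated RMSE as described later in Section 3.3:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2048))                       # 30 samples, 2048 "lines"
y = 2.0 * X[:, 100] + rng.normal(scale=0.1, size=30)  # one informative line

# The number of latent variables is the key hyperparameter of PLS.
for n in (1, 2, 5, 10):
    rmse = -cross_val_score(PLSRegression(n_components=n), X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(n, round(rmse, 3))
```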

2.2.4 Elastic Net


Elastic Net (EN) [37] is a regularized regression method that linearly combines the L1 and L2 penalties of the Lasso [38] and ridge [39] methods. This technique performs variable selection and regularization simultaneously and is well suited for data where the number of features is greater than the number of samples.
Elastic Net first emerged as a result of a critique of Lasso, whose variable selection can be unstable when variables come in highly correlated groups, as Lasso tends to choose one variable from such a group and ignore the rest entirely. To overcome this problem, the penalties of ridge regression and Lasso are combined to get the best of both. This combination allows for learning a sparse model where few of the coefficients are non-zero, like Lasso, reducing the number of features upon which the given solution depends, while still maintaining the regularization properties of ridge.
Elastic Net aims at minimizing the following loss function with respect to β:

$$\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\Big(\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}|\beta_j|\Big) = \|Y - X\beta\|_2^2 + \lambda\Big(\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\Big)$$

where X is the independent variable of dimensions n × p, Y is the dependent variable of dimensions n × 1, and β is the regression coefficient vector of dimensions p × 1 in the one-dimensional case.
The hyperparameter λ determines the amount of shrinkage and α is the mixing parameter between ridge (α = 0) and lasso (α = 1). The larger the value of λ, the greater the amount of shrinkage and the smaller the number of selected features.
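A sketch with scikit-learn, whose parametrization differs slightly from the formula above: `l1_ratio` plays the role of the mixing parameter α and `alpha` that of λ, up to a constant rescaling of the loss. The data is synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))                  # more features than samples
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

# ElasticNetCV picks lambda (sklearn's `alpha`) and the mixing parameter
# (`l1_ratio`) by cross-validation; the fitted model is sparse.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10_000)
model.fit(X, y)
print(model.alpha_, model.l1_ratio_, int(np.sum(model.coef_ != 0)))
```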

2.2.5 Random Forest


First introduced by Tin Kam Ho [40], Random Forest (RF) is an ensemble method, i.e. a technique that combines multiple predictors together, yielding better predictions than a single model, and it can be used for either classification or regression. In the latter scenario, it works by constructing multiple decision trees and then outputting their average as the prediction.
Random forests usually avoid the main problem of a single decision tree, overfitting, thanks to bagging (also called bootstrap aggregating). Bagging is a technique that, given the training set X and the response variables Y, involves randomly sampling subsets of the data, to which the individual trees are fitted.
In addition to bagging over the training samples, a random subset of the decision features is also selected for each tree. This ensures that the model potentially takes into account all available features instead of relying too much on just one or a few of them.
Thus, the use of bagging introduces randomization and allows the ensemble model to reduce variance, since each tree is run individually and their results are then aggregated together, while at the same time avoiding overfitting.
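A short scikit-learn sketch on synthetic data; `max_features` controls the random feature subset described above, and the out-of-bag score gives a generalization estimate that comes for free from the bagging procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=100)

rf = RandomForestRegressor(n_estimators=300, max_features="sqrt",
                           oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # out-of-bag R^2, estimated without a separate test set
```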

2.2.6 Kernel Methods
Kernel algorithms can be used to model nonlinear relationships between independent and dependent variables using linear models. This is done by transforming the original data from the input space into a higher dimensional feature space via a mapping function φ(x), and then constructing a linear model in this higher dimensional space [41].
However, some algorithms work only with inner products in the feature space, as for example in the dual representation of Support Vector Machines (SVM) [42]. By taking advantage of this fact, instead of computing the coordinates in the new φ space, the nonlinear relationships can be found implicitly by replacing the scalar products with the kernel function $k(x, x') = \phi(x)^T \phi(x')$, computed directly in the input space.
This procedure is generally referred to as the kernel trick and avoids the explicit mapping needed to learn a nonlinear relation with a linear method. This is done because it is generally much cheaper to compute kernels than to explicitly compute the new coordinates.
Nonlinear versions of many algorithms, such as SVM [42] and PCA [43], can be formulated using the kernel trick.
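To make the trick concrete, a minimal sketch of kernel ridge regression with a Gaussian kernel (not one of the kernel methods evaluated later in this work, just the shortest illustration): the dual solution only ever touches the kernel matrix, never the coordinates in the φ space.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1e-3, sigma=1.0):
    # Dual coefficients: alpha = (K + lam I)^-1 y; phi(x) never appears.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```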

2.3 Preprocessing
One of the major downsides of LIBS is that this technique suffers from fluctuations of laser energy, matrix effects, and other experimental interference factors. This in turn causes the collected LIBS spectra to present light intensity fluctuations across measurements, low signal-to-noise ratio, or the presence of baseline. Due to this, the analysis of the spectra is unreliable and inaccurate without the application of preprocessing steps that reduce the effects of interference factors before the actual analysis with any regression method.
As mentioned in Chapter 1, the most common preprocessing steps are basically three: normalization, baseline correction and denoising.

2.3.1 Normalization
One strategy that helps to mitigate the effect of fluctuations in laser energy across measurements is the use of normalization techniques, which generally consist in an adjustment of values to a common scale. Across the literature several normalization procedures have been proposed. However, there is no specific normalization universally accepted as the best one, since different authors state that one normalization strategy is better than the others. The main normalization strategies are the following (a sketch of them in code is given after the list):

• Max intensity normalization [9]: each spectrum is normalized by dividing all data points, i.e. the intensity value at each pixel, by the maximum peak intensity value.

• Total intensity normalization [9]: each spectrum is normalized by dividing by the sum of all intensity lines.

• Internal standard or reference line normalization [7]–[9], [12]: each spectrum is normalized using the intensity value of one of the peaks of the matrix element, i.e. the major element in the sample.

• Standard normal variate transformation [11]: subtract the mean intensity from every data point and divide by the standard deviation.

• Unit norm normalization [9], [10]: each spectrum is scaled so that it has unit norm.
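A sketch of these strategies for a single spectrum stored as a 1-D NumPy array; the function name and the `ref_index` argument (the pixel of the matrix-element reference peak) are illustrative, not taken from the thesis code:

```python
import numpy as np

def normalize(spectrum, method, ref_index=None):
    """Apply one of the normalization strategies listed above."""
    if method == "max":        # max intensity normalization
        return spectrum / spectrum.max()
    if method == "total":      # total intensity normalization
        return spectrum / spectrum.sum()
    if method == "reference":  # internal standard / reference line
        return spectrum / spectrum[ref_index]
    if method == "snv":        # standard normal variate transformation
        return (spectrum - spectrum.mean()) / spectrum.std()
    if method == "unit":       # unit (L2) norm normalization
        return spectrum / np.linalg.norm(spectrum)
    raise ValueError(f"unknown method: {method}")
```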

2.3.2 Baseline Correction


One of the biggest causes of error in LIBS data comes from the so-called baseline in the spectrum, which is a shift in amplitude that changes across different wavelengths. This problem is common not only in LIBS spectra but also in other types of spectral data, such as Near-Infrared (NIR) or Raman spectra.
The baseline in LIBS is primarily generated by the continuous radiation of free electrons and by recombination processes in the plasma, and it makes the quantitative analysis of emission lines with spectral interference more complicated [32], [44]. Therefore, it is necessary to find the baseline and subtract it from the analytical signal before further analysis. This procedure is called baseline correction or background removal.
The classical manual approach requires one to analyze a few of the sample spectra, recognize the peaks of the different elements, and then subtract a portion of the spectra based on one's own judgement. This means that different people will put the baseline in different points and that some domain knowledge is required.
To avoid a subjective background removal, some attempts have been made to create algorithms that automatically estimate the baseline to be subtracted from the spectrum.
There are several approaches to estimate the baseline, but the most used methods are based on three main techniques: polynomial fitting, penalized or weighted least squares, and the wavelet transform [45], [46]. However, each of these methods has its own drawbacks and there is none that can be employed with perfect results in every practical application.
The literature points out that, even if there is no single best method, least squares baseline correction methods appear to be the best compromise between simplicity of use and baseline accuracy [45]–[47].
As for manual polynomial fitting methods, the main drawback is that the baseline depends on the analyst's experience. Some modified polynomial fitting methods have been proposed that overcome this problem, but they do not work well with low signal-to-noise and low signal-to-background ratio signals.
Wavelet transform baseline correction algorithms only remove the baseline successfully when the transformed domain of the signal is well separated. However, this hypothesis does not hold for most real-world signals. Furthermore, it is very difficult to find the right wavelet family to correctly estimate the baseline.

2.4 Feature Selection Methods


Even if multivariate methods can exploit all the information present in a spectrum, using every emission line for prediction can lead to bad results, since some spectral regions may contain noise or useless information which can reduce the accuracy of the methods [22]. Therefore, it is of paramount importance to avoid those lines which can degrade model predictions and only use those which have a beneficial effect.
Feature selection (also known as variable selection) refers to the process of selecting, either manually or automatically, the variables that contribute the most to model predictions.
Feature selection has been widely used in machine learning but, as stated in [22], [48], variable selection methods have not received the attention they deserve in LIBS spectroscopy, even though this step is often crucial to obtain robust modeling results.

2.5 Related Work
2.5.1 Least Squares Baseline Correction
Least squares baseline correction methods are said to be the best family of background removal techniques, since they are the ones requiring the least domain knowledge and parameter tuning while having good performance on different spectra.
The first to propose a baseline estimation method were Pearson and Walter in 1970 [49]. Their algorithm works iteratively and distinguishes peak points from baseline points by analyzing the standard deviation of the points in a specific interval. In more recent years, methods based on least squares regression have been proposed. The first to introduce this idea were Liang et al. [50], who introduced the roughness penalty method to decrease the influence of measurement noise. Many other methods have then been proposed, starting with Asymmetric Least Squares (AsLS) by Boelens et al. [51]. The AsLS method combines a smoother with asymmetric weighting of deviations from the smooth trend, using a smoothness parameter λ introduced to tune the balance between smoothness and fit, to form an effective baseline estimation method. This has been the first of the "automatic" baseline correction algorithms, since it does not require the selection of peaks or any other user intervention.
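For illustration, a compact AsLS sketch in the spirit of [51] (parameter names and defaults are typical choices, not taken from the paper): the baseline z minimizes $\sum_i w_i (y_i - z_i)^2 + \lambda \sum_j (\Delta^2 z)_j^2$, and the weights are updated asymmetrically so that points above the current baseline, which are likely peaks, count much less.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric Least Squares baseline estimate for a 1-D spectrum y."""
    L = len(y)
    # Second-order difference operator, so lam penalizes baseline roughness.
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(L - 2, L))
    w = np.ones(L)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve(sparse.csc_matrix(W + lam * D.T @ D), w * y)
        # Asymmetric reweighting: peaks (y > z) get weight p << 1 - p.
        w = np.where(y > z, p, 1.0 - p)
    return z  # subtract from y to obtain the corrected spectrum
```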
Recently, Zhang et al. [45] developed a novel algorithm called adaptive iteratively reweighted Penalized Least Squares (airPLS), which works by iteratively changing the weights of the sum-squared errors between the fitted baseline and the original spectrum. A few other variations of least squares baseline correction exist; among the most recent ones is asymmetrically reweighted penalized least squares smoothing (arPLS), proposed by Baek [52] in 2014, which has a different weighting method with respect to airPLS. He et al. also improved the classical AsLS by introducing the improved asymmetric least squares method (IAsLS) [53], which uses a different formula for the penalized least squares function.
To the best of my knowledge, the best least squares method for baseline correction has been introduced by Xu in 2019 [54]. This method, called drPLS (doubly reweighted Penalized Least Squares), is an improvement of the airPLS algorithm and differs from it by adaptively changing the airPLS smoothness parameter λ at each iteration. To obtain a better fitted baseline, constraints on the proximity of the first derivatives of the original spectrum and the estimated baseline are also considered.

Chapter 3

Methods

The process that was followed to obtain the elemental concentrations from the thousands of raw spectra recorded by the LIBS spectrometer is roughly composed of three steps, as shown in Figure 3.1.
First, data is filtered, baseline corrected, normalized and finally averaged during the preprocessing phase. After obtaining one single spectrum per sample, the second phase consists of feature selection, either using a priori knowledge of the sample to choose spectral regions based on the element of interest, or automatically using feature selection algorithms. After selecting the most informative variables (i.e. emission lines), this subset is fed as input to the chosen multivariate method in order to get the concentration of the element of interest as output.

3.1 Data
To carry out the experiments, two datasets were used: the "aluminium" dataset and the "slag" dataset, composed of 27 and 34 samples respectively. The aluminium dataset consists of samples with an aluminium matrix, i.e. composed mainly of aluminium, while the slag dataset has a calcium matrix.
It must be noted that, even though the LIBS technique is the same, the shape of the average spectra in the two datasets varies a lot, due to the different elemental compositions of the samples. The most evident peaks in the two datasets are those of the matrix element: for aluminium-matrix samples the aluminium emission lines around 390 nm, and for slag-matrix samples those of calcium around 315 nm.
These datasets were acquired on two different LIBS setups and are composed of two to three thousand measurements per sample. An important difference
between the two setups, besides the laser and the spectrometer used, is time-gating. In the aluminium setup, the laser and the spectrometer are synchronized. This means that the spectrometer can be time-gated, i.e. it can be activated at specific times after the laser pulse. This in turn allows starting the measurements after a couple of microseconds, when the background baseline is at its minimum. This difference in baselines can clearly be seen by observing the two signals in Figure 3.2.

Figure 3.1: Pipeline schematic. The text in red describes the shape of the data before and after each step.
The setup used for the aluminium dataset had a spectrometer with 4094 pixels covering the wavelengths between 188.20 and 440.78 nm. The slag setup used a spectrometer covering the range between 216.33 and 340.37 nm with 2048 pixels.

(a) Aluminium setup raw spectrum
(b) Slag setup raw spectrum

Figure 3.2: Two measurements recorded using the two different LIBS setups. Due to time gating, figure (a) has almost no baseline, which is instead much more visible in figure (b).

Both datasets are very similar in structure. The input data X (i.e. the raw LIBS spectra measured for each sample) can be viewed as a tensor of dimensions n × m × k, where n is the number of samples, m the number of pixels used by the spectrometer, and k the number of measurements recorded per sample, one measurement being one single spectrum.
As each of the m pixels corresponds to a different wavelength, pixels can also be interpreted as the features of each sample.
Generally speaking, the higher the number of pixels used by the spectrometer (i.e. the higher the resolution of the spectrometer) the better, as emission lines can be distinguished more easily.
The output Y to predict was instead the true concentration value of one of the elements in the sample, expressed as a mass percentage. The true concentration values were obtained using XRF chemical analysis on the dataset samples. The concentration values of the elements composing the aluminium and slag datasets can be found in the Appendix, in Tables A.1 and A.2.

3.2 Metrics
The metrics normally used in the literature to assess the quality of regression predictions are mainly two:

• Coefficient of determination R²: used to evaluate the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In regression, this coefficient is a statistical measure of how well the regression predictions approximate the real data values:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where the $y_i$ are the true target values, the $\hat{y}_i$ the predicted target values, and $\bar{y}$ is the mean of the target variable. The maximum value of R² is 1, indicating that the regression predictions perfectly fit the data.

• Root mean squared error (RMSE): the square root of the average squared difference between the predicted and the actual values:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$

The lower the RMSE value the better; RMSE = 0 indicates a perfect fit between predicted and real target values. It must be noted that comparisons across different targets would be invalid, as this metric depends on the scale of the numbers used. To overcome this problem, the root mean squared error can be normalized to allow comparing different target variables. This is usually done by dividing either by the mean or by the range, i.e. the difference between the maximum and minimum, of the measured variable. This value is referred to as the Normalized Root Mean Squared Error (NRMSE):

$$NRMSE = \frac{RMSE}{\bar{y}} \qquad \text{or} \qquad NRMSE = \frac{RMSE}{y_{max} - y_{min}}$$
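The two metrics and the range-normalized variant in a few lines of NumPy, matching the definitions above:

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Return R^2, RMSE and range-normalized RMSE as defined above."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    nrmse = rmse / (y_true.max() - y_true.min())
    return r2, rmse, nrmse
```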

3.3 Evaluation
To evaluate the performance of the various methods, the datasets were split into a training and a test set using an 80/20 split ratio. After fitting the model on the training set, the model was evaluated on the test set by calculating the R² and RMSE scores.
As the datasets had very reduced size, in order to avoid the performance being solely dependent on a "lucky split" rather than on the model actually having good predictive ability, each experiment was repeated 15 times using different random seeds for the train/test split. The average and standard deviation of the R² and RMSE scores were then calculated from the scores of each split. As the samples used for training and testing changed with the different random seeds, this procedure ensured that the averaged results were closer to reality and not due to chance.
K-fold cross-validation was also used to identify the optimal hyperparameters of the various methods, such as the optimal number of latent variables for Partial Least Squares. K-fold cross-validation divides the training set into k separate sets. Each of these sets is used once as validation set, while the remaining k−1 are used for fitting the model. This process is repeated k times, with each of the k sets used once as the validation data. The k results are averaged to produce a single estimation. This estimation was used to identify the optimal parameters, by selecting those that yielded the lowest RMSE score.
Grid search was instead used to find the optimal hyperparameters α and λ of the Elastic Net model. Grid search is an exhaustive search through a specified subset of the hyperparameter space of a learning algorithm that is normally paired with cross-validation to find the optimal tuple of hyperparameters.
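Put together, the evaluation loop looks roughly like the following scikit-learn sketch; the 15 repetitions, the 80/20 split and the 10-fold inner cross-validation are from the text, while the model and its hyperparameter grid are placeholders:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

def repeated_evaluation(X, y, n_repeats=15):
    r2s, rmses = [], []
    for seed in range(n_repeats):                     # 15 random splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        # Inner 10-fold CV selects the number of latent variables.
        search = GridSearchCV(PLSRegression(),
                              {"n_components": list(range(1, 11))},
                              scoring="neg_root_mean_squared_error", cv=10)
        search.fit(X_tr, y_tr)
        y_hat = search.predict(X_te).ravel()
        r2s.append(r2_score(y_te, y_hat))
        rmses.append(np.sqrt(mean_squared_error(y_te, y_hat)))
    return np.mean(r2s), np.std(r2s), np.mean(rmses), np.std(rmses)
```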

3.4 Preprocessing
LIBS is generally used under the assumption that the sample of interest has a more or less homogeneous chemical composition. For this reason, it is common practice to take multiple single-shot measurements of the same sample (from at least 50–100 shots up to 3000) to be sure to get a sufficient number of good spectra, which are then averaged after the application of preprocessing techniques. This is done in order to get one single representative spectrum that reflects the real composition of the sample, to be used for fitting during the regression phase.
Preprocessing steps such as filtering or normalization are needed since, as mentioned in Section 2.3, LIBS spectra often need improvement, and the use of these techniques helps to correct interference factors such as baseline and light intensity fluctuations across measurements.
A preprocessing pipeline was created for this study, consisting of four consecutive steps applied to the raw measurements of each sample: intensity filtering, baseline correction, normalization and finally averaging over the sample measurements.

3.4.1 Intensity filtering


The first step applied to raw data was a simple intensity filtering. This is needed as it can happen that, due to laser fluctuations, a single-shot spectrum is not strong enough and has a low signal-to-noise ratio. To discard these bad spectra, measurements were filtered based on the intensity value of a reference peak, normally chosen based on the matrix element of the sample. To filter the aluminium matrix samples the aluminium peak at 309.30 nm was used, whereas for the slag matrix the calcium peak at 317.93 nm was used.
The intensity threshold value used was 6000 counts for both datasets, as this was the value previously used in the univariate method for filtering these kinds of low-SNR signals. Spectra with a peak intensity value below that number are generally considered too noisy and therefore must be filtered out.
It can also happen that the spectra are saturated at certain wavelengths, as the spectrometer itself clips the signal at a maximum intensity value when it gets saturated. These spectra should be excluded since, at those wavelengths where the intensity has reached the saturation value of the spectrometer, the signal is cut, and the assumption of a proportional relationship between the concentration and the intensity value does not hold. To filter out these kinds of bad spectra, a filter based on the maximum value of the signal was used: if the maximum was above 60000 counts, the spectrum was automatically excluded from the selection for the next preprocessing phases.
Examples of these two types of bad spectra can be seen in Figure 3.3.
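A sketch of this filter for the stack of single-shot spectra of one sample; the array shapes and the `ref_pixel` argument (the pixel index of the reference peak) are illustrative:

```python
import numpy as np

def intensity_filter(spectra, ref_pixel, low=6000, high=60000):
    """Keep single-shot spectra that are neither too weak nor saturated.

    spectra: array of shape (k, m), k measurements of m pixels each.
    ref_pixel: index of the matrix-element peak (Al 309.30 nm / Ca 317.93 nm).
    """
    strong = spectra[:, ref_pixel] >= low       # reference peak intense enough
    unsaturated = spectra.max(axis=1) < high    # no clipping anywhere
    return spectra[strong & unsaturated]
```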

(a) Low signal-to-noise ratio spectrum due to laser fluctuations
(b) Saturated spectrum

Figure 3.3: Two examples of bad spectra. Figure (a) shows a low intensity signal with a low SNR. Figure (b) shows a saturated spectrum, cut at 65000 digital counts around 280 nm.

3.4.2 Baseline Correction

Baseline is a shift in amplitude that changes across the different wavelengths of a spectrum. This drift negatively affects the results of qualitative or quantitative analysis, especially in a multivariate scenario. Therefore, baseline correction
must be applied to the signal before further proceeding into the analysis. How-
ever, it must be noted that baseline correction is very tricky as it is impossible
to say with certainty what is signal and what is baseline in a spectrum. So it
is also possible to accidentally remove part of the information in the spectrum
and therefore reduce the quality of the signal when applying baseline correc-
tion.
The baseline phenomenon was present in the spectra of both datasets used in this research, especially in the slag one. As mentioned in Section 3.1, time-gating reduced the baseline effect in the measurements of the aluminium dataset.
In accordance with [45]–[47], I decided to focus on least squares methods, since they are those which require the least domain knowledge and parameter tuning, and have good performance on different spectra.
After trying out several methods, namely AsLS [51], airPLS [45], arPLS [52], and drPLS [54], drPLS was identified as the best baseline correction algorithm, based on the statements of [54] and on qualitative analysis.
drPLS is a modification of the original airPLS algorithm. airPLS works by iteratively changing the weights of the sum-squared errors between the fitted baseline and the original spectrum. However, in regions of the spectrum containing overlapping peaks, this algorithm fails to recognise part of the signal as such and incorrectly regards it as baseline. Using a larger λ smoothness parameter (the larger this parameter, the smoother the fitted baseline) can help reduce this problem, but at the same time it introduces sources of error, as the baseline may then not be able to fit other regions of the spectrum properly.
To account for this problem, drPLS adaptively changes the parameter λ at each iteration, by introducing a constraint on the proximity of the first derivatives of the original spectrum and the predicted baseline. This strategy allows one to obtain signal fidelity and smoothness at the same time.

3.4.3 Normalization
All the normalization techniques mentioned in Section 2.3.1 were explored, namely: max intensity normalization, total intensity normalization, reference line normalization, unit norm normalization and the standard normal variate transform.
Each normalization technique was applied after baseline correction of the raw filtered spectra, and the normalized spectra were then averaged across measurements, so that each sample had one corresponding normalized spectrum. This allowed the creation of five different dataframes composed of one corrected and normalized spectrum per sample, each scaled using a different normalization technique, plus a sixth dataframe using no normalization at all to compare results. The preprocessing pipeline was also applied to the raw data without the baseline correction step, to prove the beneficial effects of this technique, thereby creating six further dataframes.

3.5 Full spectrum multivariate analysis


After obtaining one single spectrum per sample by applying all the
preprocessing steps to the raw spectra and then averaging them, several
multivariate methods were tested: Ordinary Least Squares (OLS), Principal
Component Regression (PCR), Partial Least Squares (PLS), Elastic Net (EN),
Random Forest (RF), and kernelized versions of PCR (k-PCR) and Support Vector
Machine regression (k-SVR). Three types of kernel were used for both kernel
methods: a linear kernel $k(x, x') = x^T x'$, a second-order polynomial kernel
$k(x, x') = (x^T x' + 1)^2$, and a Gaussian kernel
$k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$.
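With scikit-learn, the two kernelized models can be assembled as sketched below; the hyperparameter values shown (number of components, C) are illustrative assumptions rather than the tuned values of this study:

```python
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Kernel PCR: kernel PCA on the spectra followed by OLS on the scores
k_pcr = make_pipeline(
    KernelPCA(n_components=15, kernel="poly", degree=2),
    LinearRegression(),
)
# Kernel SVR with a Gaussian (RBF) kernel
k_svr = SVR(kernel="rbf", C=10.0, gamma="scale")
```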
The various methods were evaluated following the procedure explained in 3.3,
i.e. splitting the data into train and test sets using an 80/20 ratio. The
various models were fitted to the training data and then tested on the test
set. This procedure was repeated 15 times using different splits and the
scores obtained were then averaged. R2 and RMSE scores were used to compare
the various methods: the higher the R2 and the lower the RMSE, the better the
model performance.
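A minimal sketch of this evaluation loop, assuming X holds one preprocessed spectrum per row and y the reference concentrations, could look as follows:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X, y, n_runs=15, test_size=0.2):
    """Average R2 and RMSE over repeated random 80/20 train/test splits."""
    r2s, rmses = [], []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model.fit(X_tr, y_tr)
        y_hat = model.predict(X_te)
        r2s.append(r2_score(y_te, y_hat))
        rmses.append(np.sqrt(mean_squared_error(y_te, y_hat)))
    return np.mean(r2s), np.std(r2s), np.mean(rmses), np.std(rmses)
```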
To find the optimal hyperparameter values for PLS, i.e. the optimal number of
latent variables, 10-fold cross-validation was used. The same procedure was
used to find the optimal number of principal components for PCR and the
optimal number of trees for Random Forest. Grid search cross-validation was
used for Elastic Net to tune the λ and α parameters.
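With scikit-learn this tuning step could be sketched as below; note that sklearn's ElasticNet calls the overall shrinkage alpha and the L1/L2 mix l1_ratio, and the grids shown are illustrative assumptions, not the exact grids of this study:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Number of latent variables for PLS via 10-fold cross-validation
pls_search = GridSearchCV(PLSRegression(),
                          {"n_components": range(2, 21)}, cv=10)

# Overall shrinkage (lambda in the text) and L1/L2 mix (alpha in the text)
en_search = GridSearchCV(ElasticNet(max_iter=10000),
                         {"alpha": [1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100],
                          "l1_ratio": [0.1, 0.5, 0.9]}, cv=10)
```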

3.6 Feature selection methods


Using the whole spectrum for prediction often leads to poor results [22],
[48], since some spectral regions contain noise or useless information.
Therefore, it is necessary to discard those emission lines which downgrade
model predictions and use only those which have a beneficial effect.
A first attempt was made with a variable selection method that selected
spectral regions based on a priori knowledge of the chemical composition of
the sample, using the strongest emission peaks of each element. This
approach, which I called the peak-interval method, required knowledge of the
various emission peaks for each element, which were found using the NIST
database. Once the various peaks of one element were known, a small script
selected the emission lines contained in a small interval around each peak.
Four interval coefficients were used: 0.1, 0.2, 0.5 and 1 nm. An interval of
length 0.1 corresponded to around 3–4 emission lines per peak, while one of
length 1 included about 25 emission lines. If the absolute value of the
difference between the peak wavelength and a wavelength of the spectrum was
less than the coefficient value, the corresponding intensity line was
included in the subset of selected variables.
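The selection logic of this script can be sketched as follows, where the peak positions are element emission wavelengths taken from the NIST database (the names used are placeholders for illustration):

```python
import numpy as np

def peak_interval_select(wavelengths, peak_positions, interval=0.1):
    """Return indices of emission lines within `interval` nm of any known peak."""
    mask = np.zeros(len(wavelengths), dtype=bool)
    for peak in peak_positions:
        mask |= np.abs(wavelengths - peak) < interval
    return np.where(mask)[0]

# e.g. selected = peak_interval_select(wl, mn_peaks_from_nist, interval=0.2)
```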
There also exist several variable selection methods that do not require
previous knowledge of the sample composition to find the most informative
variables in the spectrum. However, many of these methods were developed for
other types of spectroscopic data, such as Raman or Near Infrared spectra, as
the use of feature selection methods was not common in papers concerning LIBS
data [22]. The majority of feature selection methods found in the literature
for spectroscopic data are based on Partial Least Squares and exploit this
method to identify which emission lines to select for later use in the actual
predictions. This is probably because PLS is the most used method for
spectroscopic data in quantitative multivariate analysis, LIBS included. The
paper "A review of variable selection methods in Partial Least Squares
Regression" by Mehmood et al. [55] gives a good overview of various PLS-based
wrapper methods, i.e. methods that rely on a predetermined learning algorithm
and use its performance to evaluate and determine the features that are
selected.
This paper was the main source of inspiration for the investigation of
feature selection methods. Several of the techniques described therein were
implemented, namely the VIP filter method, Monte Carlo Uninformative Variable
Elimination (MC-UVE), Genetic Algorithm (GA), Backward Variable Elimination
(BVE), Regularized Elimination Procedure (REP), Iterative Predictor Weighting
(IPW), and Interval PLS (iPLS).
All the methods above filter and select variables based on the performance of
a fitted PLS model or on one of the (partial) outcomes (or their combination)
of the PLS fitting process, such as the β regression coefficients or the
loading weights.
The VIP filter method is based on Variable Importance in Projection (VIP), a
measure that estimates the importance of each variable by projecting the PLS
loading weights on each component. According to [56], variables with a VIP
score close to or greater than 1 are considered important. This is why in the
implemented VIP filter method a threshold of 1 was used to select variables.
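One common recipe for computing VIP scores from a fitted scikit-learn PLSRegression model is sketched below; it assumes a single response variable and is not necessarily the exact implementation used in this work:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls: PLSRegression) -> np.ndarray:
    """VIP score per input variable from a fitted PLS model."""
    t = pls.x_scores_    # scores, shape (n_samples, n_components)
    w = pls.x_weights_   # loading weights, shape (n_features, n_components)
    q = pls.y_loadings_  # y loadings, shape (1, n_components)
    p = w.shape[0]
    # Variance of y explained by each latent component
    ssy = np.diag(t.T @ t @ q.T @ q)
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ssy) / ssy.sum())

# selected = np.where(vip_scores(fitted_pls) >= 1.0)[0]
```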
Backward Variable Elimination (BVE) [57] is a wrapper method where variables
are filtered based on an importance coefficient (such as VIP or the β
regression coefficients) that must be above a certain threshold for each
variable. A new model is then fitted on the subset of variables. This process
is repeated until maximum model performance is achieved.
In the implementation of this algorithm, variables were filtered based on VIP
scores that needed to be greater than 1.
Similarly to BVE, in Regularized Elimination Procedure (REP) [58]
variables are sorted based on some threshold value such as VIP or β coeffi-
cients. Of those below the threshold, a fraction is discarded and the remaining
variables are used to fit a new model. This procedure is repeated until all the
filter coefficients are above the threshold.
In the implemented REP method, VIP scores needed to be above 0.9 and at
each iteration 75% of the variables below the threshold were discarded.
In the Monte Carlo Uninformative Variable Elimination (MC-UVE) method [59],
variables are assigned a stability value. Repeated splitting of the data is
used to divide the samples into training and test sets. By using
cross-validation on the training set to fit different PLS models, the
stability value of each variable can be found. These values are used to
filter out the variables which are less stable.
In the implementation used, the data was split 5 times and 10-fold
cross-validation was used to calculate the stability coefficients.
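The stability criterion can be sketched as below: the mean of each variable's PLS coefficient over repeated random subsets, divided by its standard deviation. This is a simplified illustration of the procedure described above, omitting the inner cross-validation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

def mcuve_stability(X, y, n_splits=5, n_components=15):
    """Stability of each variable: mean/std of its PLS coefficient over splits."""
    coefs = []
    for seed in range(n_splits):
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2,
                                            random_state=seed)
        pls = PLSRegression(n_components=n_components).fit(X_tr, y_tr)
        coefs.append(pls.coef_.ravel())
    coefs = np.asarray(coefs)
    # Unstable variables have coefficients that fluctuate around zero
    return np.abs(coefs.mean(axis=0)) / coefs.std(axis=0)
```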
Genetic algorithms (GA) are a well-known family of algorithms inspired by
evolution theory, where the variables yielding the best-performing regression
model have a higher probability to "survive" and be included in the next
generation of selected variables. Over successive generations, the selected
variables "evolve" towards an optimal subset. In Genetic Algorithm PLS
(GA-PLS) [24], the selection process begins by selecting a population of
variable sets at random. After fitting a PLS model to each set and evaluating
their performance via cross-validation, the most promising variables of each
set are selected for the next generation. At this point crossover (where
selected variables are exchanged between sets to generate offspring) and
mutation (randomly selecting or discarding some of the variables in the next
generation with a small probability) are applied to the population. This
process is then repeated for a selected number of iterations, where at each
iteration the surviving and mutated sets are used to create a new population.
In the implementation used for this study, the number of generations
(iterations) was set to 100, a population size of 250 and a 15% mutation
probability were used.
The Iterative Predictor Weighting PLS (IPW-PLS) [26] method uses a variable
importance measure to iteratively discard the non-informative variables that
fall below a certain threshold after fitting a PLS model. The crucial step of
IPW-PLS is to multiply the variables by their importance in the cyclic
repetition of PLS regression. The importance of variable $i$ is defined as

$$z_i = \frac{|\beta_i| s_i}{\sum_{i=1}^{n} |\beta_i| s_i}$$
where $\beta_i$ and $s_i$ are the regression coefficient and the standard
deviation of variable $i$. After a few cycles the algorithm converges to an
optimal subset of variables.
In the implementation used, three variable importance thresholds were tested,
namely 0.01, 0.001 and 0.0001.
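A minimal sketch of this cycle using scikit-learn's PLSRegression is shown below; the function name and convergence handling are assumptions for illustration, not the reference implementation of [26]:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def ipw_pls_select(X, y, n_components=15, threshold=0.001, n_cycles=10):
    """Iteratively reweight variables by importance, dropping those below threshold."""
    Xw = X.copy()
    keep = np.arange(X.shape[1])
    for _ in range(n_cycles):
        pls = PLSRegression(n_components=min(n_components, Xw.shape[1]))
        pls.fit(Xw, y)
        beta = pls.coef_.ravel()            # regression coefficients
        z = np.abs(beta) * Xw.std(axis=0)   # importance z_i = |beta_i| * s_i
        z = z / z.sum()                     # normalize as in the formula above
        mask = z > threshold
        if mask.all():                      # nothing left to drop: converged
            break
        keep = keep[mask]
        # Crucial IPW step: multiply surviving variables by their importance
        Xw = Xw[:, mask] * z[mask]
    return keep
```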
Interval PLS (iPLS) is a technique which was first designed for wavelength
selection of NIR spectra by [25]. The algorithm splits the spectrum into
several intervals and then fits a PLS model to each interval. The interval
corresponding to the model with the lowest prediction error is then selected.
By then fitting PLS models to the union of the interval(s) selected in the
previous iteration(s) and each of the remaining intervals, the next best
interval is found in an iterative manner. The algorithm stops when adding new
intervals does not improve the prediction error or the maximum number of
iterations is reached.
The main parameter of this method is the number of intervals used to divide
the spectrum. In the implementation used, the numbers of intervals tested
were 10, 50, 100 and 500.
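The scoring step of the first iPLS iteration can be sketched as follows; this is a simplified illustration, as the full algorithm then repeats this scoring on unions of the already-selected intervals with each remaining one:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def best_interval(X, y, n_intervals=100, n_components=5, cv=10):
    """Return (start, stop) column indices of the best-scoring interval."""
    bounds = np.linspace(0, X.shape[1], n_intervals + 1, dtype=int)
    rmse = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        pls = PLSRegression(n_components=min(n_components, hi - lo))
        score = cross_val_score(pls, X[:, lo:hi], y, cv=cv,
                                scoring="neg_root_mean_squared_error")
        rmse.append(-score.mean())
    best = int(np.argmin(rmse))
    return bounds[best], bounds[best + 1]
```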
In order to avoid using only PLS-based methods, Random Forest, whose ability
to select important variables is well known in the data science community
[60], [61], was also employed as a feature selector. To the best of my
knowledge this was the first attempt to use Random Forest as a variable
selection method for LIBS regression analysis.
Random Forest can be used to select variables based on variance: the more a
variable decreases the variance in a tree, the more important that feature
is. By averaging across trees it is possible to determine the importance of a
variable. Variables are selected if their importance is greater than the mean
importance of all features.
In the implementation used, a Random Forest composed of 1000 trees was
employed for selecting variables.
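A sketch of this selection rule with scikit-learn, whose impurity-based importances for regression trees correspond to the variance reduction described above, could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_select(X, y, n_trees=1000, random_state=0):
    """Keep variables whose importance exceeds the mean importance."""
    rf = RandomForestRegressor(n_estimators=n_trees,
                               random_state=random_state).fit(X, y)
    importances = rf.feature_importances_  # averaged over all trees
    return np.where(importances > importances.mean())[0]
```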
After identifying the different variable subsets using the various feature
selection methods, these were fed as input to some of the regression methods
of the previous full spectrum analysis to calculate the concentrations of the
element of interest.
The same evaluation procedure as in the full spectrum analysis, i.e. 15 runs
using different random splits together with cross-validation for
hyperparameter tuning, was used to assess the performance of the various
methods.

Chapter 4

Results and Discussion

Generally speaking, predictions on LIBS data are said to be good if the R2
value obtained for the element of interest is greater than 0.9 (or 90%). To
compare the results obtained using the various methods in this study,
univariate calibration analysis on the same datasets was performed in
parallel by Swerim experts. The scores obtained by the traditional univariate
method were used as a benchmark, i.e. performances to exceed using the
proposed multivariate approach. Tables 4.1 and 4.2 show the performance
scores obtained for the different elements using the univariate approach on
the aluminium and slag matrix samples respectively.

Table 4.1: Univariate analysis scores on aluminium data.

Element Si Mg Zn Mn Cu
R2 0.976 0.992 0.995 0.937 0.992
RMSE 0.548 0.133 0.123 0.110 0.147

Table 4.2: Univariate analysis scores on slag data

Element Si Mg Cr Mn Fe
R2 0.913 0.756 0.561 0.880 0.864
RMSE 2.272 2.648 2.831 0.506 4.277

4.1 Full spectrum results
After preprocessing of the data, the various regression methods mentioned in
3.5 were tested and evaluated on the two datasets. A first test was made
using preprocessed spectra without the application of normalization.
Table 4.3 shows the average and standard deviation of the R2 and RMSE values
over the 15 runs using different seeds for the train/test split on the
aluminium matrix dataset, for prediction of the manganese (Mn) concentration
using the whole spectral information, while Table 4.4 shows the average
scores for silicon (Si) prediction on the slag dataset using all emission
lines of non-normalized spectra.
Results showed that the best multivariate methods were Partial Least Squares
and Elastic Net, followed by Ordinary Least Squares and Random Forest. PLS
and Elastic Net consistently showed a higher averaged R2 and lower averaged
RMSE compared to the other methods, no matter the element chosen. The fact
that the standard deviations of both R2 and RMSE were lower than those of the
other models was also a signal that these two models were less prone to
fluctuations in score values and therefore more robust in their predictions.
Nonlinear kernel methods, namely kernel-PCR and kernel-SVR, showed instead
the worst performance on this type of data, no matter the kernel type used.
This suggests that linear methods are better suited for LIBS predictions.
This can be expected, as in ideal conditions the relationship between line
intensity and element concentration is linear.

Table 4.3: Manganese averaged regression scores and relative standard devi-
ation (σ) on aluminium dataset using the whole spectra. Here the kernel used
for both k-PCR and k-SVR was a second order polynomial kernel, which was
the one that showed the best performance.

Method   OLS     PCR     PLS     EN      RF      k-PCR   k-SVR
R2       0.839   0.390   0.906   0.905   0.780   0.434   0.301
σR2      0.14    1.25    0.08    0.10    0.32    1.02    1.38
RMSE     0.144   0.209   0.110   0.112   0.218   0.218   0.239
σRMSE    0.06    0.11    0.04    0.03    0.06    0.11    0.11

Table 4.4: Silicon averaged scores and relative standard deviations on slag
data using the whole spectra. The kernel used for both k-PCR and k-SVR was
a second order polynomial kernel, which was the one that showed the best
performance.

Method   OLS     PCR     PLS     EN      RF      k-PCR    k-SVR
R2       0.795   0.514   0.805   0.737   0.774   -4.249   0.409
σR2      0.16    0.50    0.18    0.21    0.16    3.09     0.66
RMSE     3.463   5.255   3.359   4.036   3.637   18.441   5.681
σRMSE    1.20    2.35    1.29    1.35    0.541   5.81     2.55

A second test was made using the various normalization techniques during the
preprocessing of the raw spectra, comparing the results obtained on each of
the 5 dataframes using the different normalization techniques with those
obtained without the application of normalization. Tables 4.5 and 4.6 show
how total intensity normalization and the SNV transform improved the
predictions of the PLS and Elastic Net methods for silicon on slag data in
the full spectrum scenario; the results obtained using the three other
normalization techniques are omitted for brevity. The use of normalization
had a positive impact, as it improved the R2 score by a few percentage points
no matter the element of interest or the method used. However, there was no
strategy that always outperformed the others as, even across elements of the
same dataset, different elements "preferred" different normalization
techniques. For example, in the slag dataset the best normalization for Si
prediction was total intensity normalization, while for Mn and Cr it was the
SNV transform. Nonetheless, the most promising normalization strategies
appeared to be SNV and total intensity normalization, which were those with
which the lowest errors were achieved for most of the elements.
It was also interesting to notice that the optimal λ shrinkage coefficient of
the Elastic Net model changed in accordance with the normalization used: λ
was much higher without normalization, going from an optimal value of 10–100
for unnormalized spectra to $10^{-2}$–$10^{-4}$ with the application of
normalization methods. This is probably due to the scale of the intensity
lines under the different normalizations, going from several thousands of
digital counts for unnormalized spectra to values close to or much smaller
than 1, as in reference line normalization and total intensity normalization
respectively.
For what concerns PLS, the best performance was found using a number of
latent components between 5 and 15. Using fewer than 5 latent components
resulted in unstable models, while using more than 20 led to overfitting on
the training set and worse generalization ability.

Table 4.5: Full spectrum scores of PLS using the different normalization
strategies for prediction of the Si content of slag samples.

Normalization   None    SNV     Total int.   Unit norm   Max int.   Ref. line
R2              0.805   0.866   0.926        0.896       0.885      0.876
σR2             0.18    0.09    0.05         0.08        0.12       0.09
RMSE            3.359   3.211   2.198        2.549       2.574      2.801
σRMSE           1.29    1.12    0.80         0.94        0.92       0.90

Table 4.6: Full spectrum scores of Elastic Net using the different
normalization strategies for prediction of the Si content of slag samples.

Normalization   None    SNV     Total int.   Unit norm   Max int.   Ref. line
R2              0.737   0.918   0.926        0.913       0.919      0.913
σR2             0.21    0.07    0.03         0.04        0.04       0.06
RMSE            4.036   2.543   2.261        1.901       1.891      2.317
σRMSE           1.35    1.02    0.39         0.53        0.66       0.64

4.2 Feature selection results


The whole spectrum results obtained with the multivariate methods,
particularly with PLS and Elastic Net, were promising even though they still
did not match the performance of the classical univariate method. A possible
explanation for this worse performance is that some spectral regions
downgraded the predictive ability of the model when the whole spectrum was
used. To test this hypothesis and try to further improve the predictive
ability of the various regression models, feature selection methods were
applied.
In this second batch of experiments, only the most promising multivariate
methods of the full spectrum experiments were used, namely OLS, PLS, Elastic
Net and Random Forest.
Indeed, the application of feature selection methods before regression proved
to be effective both for the a priori peak-interval method and for the other
feature selection methods that did not require preliminary sample knowledge.
A first attempt with the peak-interval method before the application of the
regression models already showed improved predictions on both datasets. Table
4.7 shows the performance of PLS and Elastic Net after using the interval
feature selection approach on preprocessed data with SNV as the normalization
step.
Smaller interval sizes such as 0.2 and 0.1 nm appeared to work better than
bigger ones, even though the optimal interval value also depended on the
element of interest: for example, for silicon and manganese smaller intervals
resulted in better performance, but for magnesium the best results were found
using larger interval sizes such as 0.5–1 nm.
The results using this prior-knowledge method appeared to be competitive with
the classical univariate approach and in fact already showed better
performance for some of the elements.
This was a good signal that using multiple peaks was indeed mitigating matrix
effects and interference factors.
However, this method was later abandoned, as the scope of this work was to
find methods that required the least intervention and domain knowledge from
the analyst.

Table 4.7: Peak-interval selection scores of PLS and Elastic Net obtained using
the best interval value for different elements of the aluminium dataset using
SNV normalized data.

Element    Mn                  Si                  Mg
Interval   0.2                 0.1                 1
Method     PLS    Elastic Net  PLS    Elastic Net  PLS    Elastic Net
R2         0.976  0.981        0.974  0.966        0.965  0.972
σR2        0.03   0.01         0.02   0.03         0.03   0.02
RMSE       0.052  0.053        0.516  0.614        0.167  0.153
σRMSE      0.01   0.02         0.20   0.21         0.09   0.08

To avoid the need for prior knowledge, the various PLS-based feature
selection methods, together with the Random Forest method employed as
variable selector, were tested on both datasets. All the PLS-based variable
selection methods implemented were based on the fitting of a PLS model with
15 latent components. This number appeared to be the best compromise between
generalization and stability. For what concerns Random Forest, the algorithm
employed 1000 trees as hyperparameter for variable selection.
Also in the variable selection approach the best methods appeared to be
Partial Least Squares and Elastic Net. PLS was the best regression method in
the majority of cases, but this could also be due to the fact that the
majority of the feature selection algorithms used were PLS-based, so it is
not surprising that this method works well with the emission lines selected
by them. Table 4.8 presents the results of PLS regression on the Mn
concentration of the aluminium samples. In the table, the model is trained on
the emission lines selected by the various feature selection methods using
SNV-normalized spectra.
Similar scores were also obtained with Elastic Net which, together with PLS,
again proved its effectiveness for regression on LIBS data. Even though
variable selection improved the predictions of all models, these two methods
were clearly the best performing ones.
The fact that predictions were getting more accurate was an indicator that
the various feature selection algorithms were able to identify the most
important emission lines for regression of the element of interest, even
without prior knowledge of the sample composition.

Table 4.8: Mn scores of PLS obtained using different feature selection algo-
rithms. The aluminium spectra used were normalized using SNV transform.

Method   VIP     iPLS    MC-UVE   REP     BVE     GA      IPW     RF
R2       0.954   0.991   0.250    0.947   0.940   0.964   0.976   0.951
σR2      0.05    0.01    0.68     0.08    0.12    0.05    0.04    0.07
RMSE     0.072   0.033   0.309    0.079   0.075   0.065   0.050   0.068
σRMSE    0.02    0.01    0.14     0.03    0.03    0.02    0.03    0.02

Among all the methods investigated, IPW-PLS and iPLS proved to be the best
feature selection algorithms. Both of these methods were very effective at
recognising which wavelengths carried the useful information for the element
of interest, and outperformed all the others in predictive ability, showing
the highest R2 values and lowest RMSEs.
The R2 scores obtained with these methods on some of the elements of the
aluminium and slag datasets can be found in Tables 4.9 and 4.10 respectively.

Table 4.9: PLS best results for prediction of different elements of the
aluminium dataset. "Var sel*" stands for the variable selection method that
gave the best performance. Bold values indicate where the method outperformed
the univariate approach.

Element    Si        Mg      Zn        Mn      Cu
Var sel*   IPW-PLS   iPLS    IPW-PLS   iPLS    iPLS
R2         0.979     0.981   0.993     0.991   0.970
σR2        0.02      0.01    0.01      0.01    0.06
RMSE       0.430     0.120   0.151     0.033   0.134
σRMSE      0.16      0.03    0.07      0.02    0.05

Table 4.10: PLS best results for prediction of different elements of the slag
dataset. "Var sel*" stands for the variable selection method that gave the
best performance. Bold values indicate where the method outperformed the
univariate approach.

Element    Si        Mg        Cr      Mn      Fe
Var sel*   IPW-PLS   IPW-PLS   iPLS    iPLS    iPLS
R2         0.973     0.592     0.826   0.906   0.967
σR2        0.02      0.27      0.11    0.08    0.02
RMSE       1.336     2.944     1.647   0.379   1.645
σRMSE      0.43      0.71      0.50    0.17    0.50

It is interesting to note that the two algorithms take quite different
approaches when it comes to selecting wavelengths.
IPW-PLS, with its variable importance metric, was able to find the strongest
emission lines, which are usually also those used in the univariate approach.
The optimal threshold value appeared to be between 0.001 and 0.01 for most
elements. However, the number of selected variables was always very low,
usually between 1 and 20, and depended on the importance threshold value: the
higher the threshold, the fewer variables were chosen. The algorithm was able
to successfully identify some of the strongest emission lines of the element
of interest even with very restrictive threshold values. For example, using a
threshold of 0.1, the 8 pixels identified as important for Mn prediction
(among them those at 403.07, 403.19, 403.36, 403.42, 403.48, and 403.53 nm)
were all near the Mn emission peak at 403.25 nm.
However, by always selecting a very reduced set of important variables,
models risked being somewhat more unstable.
iPLS, on the other hand, follows an "interval approach" by dividing the
spectrum into several regions and selecting those that yield the lowest
prediction error. By doing this, the algorithm usually selected emission
lines in different regions of the spectrum. This resulted in an overall
higher number of selected variables, in which the strong emission lines
(commonly used in the univariate method) did not always show up. This was
interpreted as a signal that the best results do not always come from using
the strongest peaks. However, some of the element peak emission lines were
still selected by the algorithm.
The number of wavelengths selected by iPLS varied a lot with the number of
intervals, going from about 100–200 lines with 50 intervals to about 20 when
the number of intervals was set to 500. The best number of intervals appeared
to be 500. A higher number of intervals meant that each PLS model was fitted
on a smaller spectral region, which probably allowed the model to pinpoint
the informative intensity lines more accurately.
Another interesting finding regarding baseline correction was that this
preprocessing step did not always improve the quality of the data and in some
cases instead downgraded it, at least for spectra recorded using the
aluminium dataset LIBS setup. This was true in both the full spectrum and the
feature selection tests. An example of how baseline correction downgraded
predictions in both scenarios can be found in Tables 4.11 and 4.12, showing
the different performance of both Elastic Net and PLS on Mn prediction on
aluminium samples with and without the application of baseline correction.
This was probably due to the drPLS baseline correction algorithm removing
part of the information in the spectrum and therefore reducing the quality of
the signal. My hypothesis to explain this phenomenon is that the time-gating
between laser and spectrometer in the aluminium setup produced spectra of
higher quality that presented almost no baseline, so that applying baseline
correction instead cut away part of the spectral information.
However, on the slag setup dataset, where time-gating was not used and the
baseline is much more visible, drPLS showed a positive effect in improving
data quality. Using baseline correction improved the R2 score by a few
percentage points (up to 3%) in about 50% of the cases, depending on the
element of interest, while in the other half the results obtained using
baseline-corrected data
Table 4.11: Full spectrum results obtained on Mn prediction with and without
the application of the baseline correction step in the preprocessing phase.

Baseline  Not removed      Removed
Method    PLS    EN        PLS    EN
R2        0.934  0.910     0.906  0.905
σR2       0.07   0.14      0.07   0.11
RMSE      0.087  0.107     0.110  0.112
σRMSE     0.03   0.02      0.03   0.05

Table 4.12: iPLS results obtained on Mn prediction with and without the
application of the baseline correction step in the preprocessing phase.

Baseline  Not removed      Removed
Method    PLS    EN        PLS    EN
R2        0.991  0.982     0.977  0.978
σR2       0.01   0.02      0.03   0.04
RMSE      0.033  0.041     0.051  0.050
σRMSE     0.02   0.02      0.02   0.02

were almost identical to those obtained using uncorrected spectra. An example
of this can be observed in Table 4.13, which shows the results of chromium
(Cr) prediction on the slag dataset using the IPW-PLS method before
regression. The normalization used in the pipeline for this experiment was
total intensity normalization.
The results showing that baseline correction did not improve predictions were
not expected and go against the literature. Probably the use of time-gating
in the aluminium setup allowed much cleaner spectra to be recorded that did
not necessitate baseline correction. This could explain why applying baseline
correction to the aluminium dataset was instead detrimental. This is why one
of my conclusions is that the use of time-gating in a LIBS setup could be a
valid countermeasure to overcome, or at least mitigate, the presence of
baseline in the spectrum, while avoiding the risk of removing part of the
information in the spectrum with baseline correction algorithms.
Even though the univariate method was not always beaten in terms of per-
formance, the scores achieved using PLS for some of the elements, such as

Table 4.13: IPW-PLS results obtained on Cr prediction (slag dataset) with and
without the application of the baseline correction step in the preprocessing
phase.

Baseline  Not removed      Removed
Method    PLS    EN        PLS    EN
R2        0.746  0.761     0.805  0.784
σR2       0.22   0.31      0.19   0.39
RMSE      2.095  1.753     1.728  1.533
σRMSE     1.19   0.83      0.69   0.55

Mn in the aluminium dataset or Si and Fe in the slag dataset, were sometimes
higher by several percentage points. This can be observed in Tables 4.14 and
4.16, which compare the results obtained using the variables chosen with the
iPLS, IPW-PLS and peak-interval methods, those using the full spectrum, and
the univariate method results on Mn and Si prediction respectively.
A similar comparison using the Elastic Net model for regression can be found
in Tables 4.15 and 4.17. In all four tables the averaged score values are in
bold if they are superior to the score obtained with the univariate method.

Table 4.14: Comparison of PLS results on Mn prediction (aluminium dataset)
using iPLS, IPW-PLS, peak-interval and the whole spectral information,
compared to the univariate calibration method.

Method  iPLS    IPW-PLS   Peak-interval   Full spectrum   Univariate
R2      0.991   0.976     0.977           0.925           0.937
σR2     0.01    0.02      0.03            0.08            -
RMSE    0.033   0.050     0.050           0.091           0.110
σRMSE   0.02    0.03      0.01            0.03            -

Table 4.15: Comparison of Elastic Net results on Mn prediction (aluminium
dataset) using iPLS, IPW-PLS, peak-interval and the whole spectral
information, compared to the univariate calibration method.

Method  iPLS    IPW-PLS   Peak-interval   Full spectrum   Univariate
R2      0.982   0.968     0.981           0.910           0.937
σR2     0.02    0.04      0.02            0.12            -
RMSE    0.041   0.055     0.047           0.107           0.110
σRMSE   0.02    0.02      0.02            0.04            -

Table 4.16: Comparison of PLS results on Si prediction (slag dataset) using
iPLS, IPW-PLS, peak-interval and the whole spectral information, compared to
the univariate calibration method.

Method  iPLS    IPW-PLS   Peak-interval   Full spectrum   Univariate
R2      0.973   0.972     0.919           0.886           0.913
σR2     0.02    0.01      0.05            0.09            -
RMSE    1.336   1.375     2.346           3.211           2.272
σRMSE   0.42    0.30      0.54            1.11            -

Table 4.17: Comparison of Elastic Net results on Si prediction (slag dataset)
using iPLS, IPW-PLS, peak-interval and the whole spectral information,
compared to the univariate calibration method.

Method  iPLS    IPW-PLS   Peak-interval   Full spectrum   Univariate
R2      0.978   0.964     0.854           0.918           0.913
σR2     0.02    0.02      0.06            0.07            -
RMSE    1.165   1.513     3.152           2.543           2.272
σRMSE   0.35    0.37      0.63            1.02            -

Chapter 5

Conclusions

In this work I presented a pipeline that goes from raw LIBS measurements to
output concentrations. The first step consists in preprocessing the raw data
by intensity filtering, application of baseline correction, normalization and
averaging across measurements. After that, the application of feature
selection methods such as iPLS or IPW-PLS allows choosing the most important
variables to be used for regression. Finally, by fitting multivariate methods
such as Elastic Net or PLS on this subset of variables, the output
concentrations are predicted.
Regarding normalization, the best results were obtained using the SNV
transformation and total intensity normalization. Normalizing the spectra
increased the R2 value by at least a couple of percentage points, no matter
the element of interest. This proved to be true using either the whole
spectra or the subsets found using feature selection methods.
For what concerns baseline correction, the results show that its application
brings little or no benefit to the quantitative analysis and that it can
sometimes even downgrade predictions. This latter scenario happened when
using baseline correction on the aluminium dataset, as the correction was
probably cutting part of the information carried by the signal. It must also
be noted that the LIBS setup of this dataset already produced very clean
spectra with almost no baseline thanks to time-gating, which may be the
reason why baseline correction did not show its effectiveness on the
aluminium dataset. On the slag measurements the application of baseline
correction appeared to have a better effect on predictions, which showed an
improvement of 1–3% in the R2 value in about 50% of the cases, while the
other half of the time the results were almost identical.
PLS and Elastic Net were the methods that proved to be the best suited for
regression on LIBS data, both in the full spectrum and the variable selection
scenarios.
PLS is a technique especially suited to data with high multicollinearity and
to cases where the number of features is much bigger than the number of
samples. These two factors together make PLS one of the best methods for
regression on LIBS data, where the number of features is usually much higher
than the number of samples. Elastic Net is well suited to LIBS data as it
retains the sparsity of lasso regression and the stability of ridge
regression in the case of data with a low number of samples and a high number
of features.
The way PLS models are constructed made it possible to use this method to
create several PLS-based feature selection algorithms. These algorithms
consistently proved their ability to identify the important regions in the
spectrum and were competitive in performance with the peak-interval method
based on prior knowledge of the sample composition. Among these, iPLS and
IPW-PLS were those which showed the best variable selection ability. Another
interesting finding was that the variables selected using the various
PLS-based feature selection methods of this study improved the performance of
the multivariate approach, no matter which regression method was used. This
again proves that the variables found using these variable selection
algorithms were indeed those carrying the most relevant information for
prediction of the element of interest.
The proposed pipeline composed of preprocessing, feature selection and
regression answers the research questions of 1.1. The application of
normalization, in particular SNV and total intensity normalization, mitigated
the effect of fluctuations in laser energy across measurements and improved
the quality of the data. iPLS and IPW-PLS were the feature selection methods
that proved best at selecting the most informative emission lines to be used
in the regression phase and improved predictions the most. Elastic Net and
Partial Least Squares were the regression techniques best suited to this type
of data and showed the best predictive ability. By combining these techniques
in the proposed pipeline, it was possible to create a robust method that
consistently yielded very good predictions without requiring specific domain
knowledge.

5.1 Future Work


For future research it would be interesting to study Wavelet Transform
methods for baseline correction and evaluate whether such a transformation
can further improve the quality of the data before regression. However, this
technique requires a solid understanding of advanced signal theory, so domain
knowledge in this field is needed to study its effect on LIBS data.
Another aspect that could expand this research is the investigation of
denoising methods, to see if their application further improves data quality
and leads to better predictions.
A deep learning approach could also be a valid alternative to linear
regression methods. The "only" requirement of this approach is a dataset of
much larger sample size. Bigger datasets could also improve the robustness of
the models currently trained on smaller datasets.
Therefore, the creation of a large dataset with a sample size of at least
1000–2000 data points, which could be used as a benchmark to test deep
networks and classical linear models, is of paramount importance.
Nonetheless, the most promising direction for further research on LIBS
regression is that of feature selection. Variable selection techniques are
widely used in machine learning and new methods are discovered each year in
this fast-evolving field. Therefore, I believe many new techniques from
machine learning applied to other fields could be adapted to enable feature
selection on LIBS data. One possible way to do this would be to use
model-based variable selection methods that make use of ridge regression
instead of Partial Least Squares.

Bibliography

[1] B. Noharet, T. Irebo, and H. Karlsson, “Compact industrial libs systems
can assist aluminum recycling”, Laser Focus World, vol. 50, no. 10,
pp. 50–52, 2014.
[2] Z. Qin, Y. Peng, F. Meiling, and S. Junqing, “On quantitative analysis
method of detecting the content of molten alloy steel element based
on libs technology”, 2017 36th Chinese Control Conference (CCC),
pp. 3072–3076, 2017.
[3] T. Zhang, S. Wu, J. Dong, J. Wei, K. Wang, H. Tang, X. Yang, and H.
Li, “Quantitative and classification analysis of slag samples by laser in-
duced breakdown spectroscopy (libs) coupled with support vector ma-
chine (svm) and partial least square (pls) methods”, Journal of Analyt-
ical Atomic Spectrometry, vol. 30, no. 2, pp. 368–374, 2015.
[4] E. Bellou, N. Gyftokostas, D. Stefas, O. Gazeli, and S. Couris, “Laser-
induced breakdown spectroscopy assisted by machine learning for olive
oils classification: The effect of the experimental parameters”, Spec-
trochimica Acta Part B: Atomic Spectroscopy, vol. 163, p. 105 746,
2020.
[5] C. L. Goueguel, A. Soumare, C. Nault, and J. Nault, “Direct determi-
nation of soil texture using laser-induced breakdown spectroscopy and
multivariate linear regressions”, Journal of Analytical Atomic Spec-
trometry, vol. 34, no. 8, pp. 1588–1596, 2019.
[6] R. B. Anderson, S. M. Clegg, J. Frydenvang, R. C. Wiens, S. McLen-
nan, R. V. Morris, B. Ehlmann, and M. D. Dyar, “Improved accuracy in
quantitative laser-induced breakdown spectroscopy using sub-models”,
Spectrochimica Acta Part B: Atomic Spectroscopy, vol. 129, pp. 49–57,
2017.

[7] J. El Haddad, L. Canioni, and B. Bousquet, “Good practices in libs anal-
ysis: Review and advices”, Spectrochimica Acta Part B: Atomic Spec-
troscopy, vol. 101, pp. 171–182, 2014.
[8] T. Takahashi and B. Thornton, “Quantitative methods for compensa-
tion of matrix effects and self-absorption in laser induced breakdown
spectroscopy signals of solids”, Spectrochimica Acta Part B: Atomic
Spectroscopy, vol. 138, pp. 31–42, 2017.
[9] D. V. Babos, A. I. Barros, J. A. Nóbrega, and E. R. Pereira-Filho, “Cal-
ibration strategies to overcome matrix effects in laser-induced break-
down spectroscopy: Direct calcium and phosphorus determination in
solid mineral supplements”, Spectrochimica Acta Part B: Atomic Spec-
troscopy, vol. 155, pp. 90–98, 2019.
[10] C. Sun, Y. Tian, L. Gao, Y. Niu, T. Zhang, H. Li, Y. Zhang, Z. Yue, N.
Delepine-Gilon, and J. Yu, “Machine learning allows calibration mod-
els to predict trace element concentration in soils with generalized libs
spectra”, Scientific reports, vol. 9, no. 1, pp. 1–18, 2019.
[11] Y. Lu, H. Guo, T. Shen, W. Wang, Y. He, and F. Liu, “Quantitative anal-
ysis of cadmium and zinc in algae using laser-induced breakdown spec-
troscopy”, Analytical Methods, vol. 11, no. 48, pp. 6124–6135, 2019.
[12] B. Zhang, P. Ling, W. Sha, Y. Jiang, and Z. Cui, “Univariate and multi-
variate analysis of phosphorus element in fertilizers using laser-induced
breakdown spectroscopy”, Sensors, vol. 19, no. 7, p. 1727, 2019.
[13] NIST LIBS Database, https://physics.nist.gov/PhysRefData/ASD/LIBS/libs-form.html.
[14] AtomTrace Elements Database, https://www.atomtrace.com/elements-database.
[15] P. Yaroshchyk, D. Death, and S. Spencer, “Comparison of principal
components regression, partial least squares regression, multi-block par-
tial least squares regression, and serial partial least squares regression
algorithms for the analysis of fe in iron ore using libs”, Journal of An-
alytical Atomic Spectrometry, vol. 27, no. 1, pp. 92–98, 2012.
[16] T. F. Boucher, M. V. Ozanne, M. L. Carmosino, M. D. Dyar, S. Mahade-
van, E. A. Breves, K. H. Lepore, and S. M. Clegg, “A study of machine
learning regression methods for major elemental analysis of rocks us-
ing laser-induced breakdown spectroscopy”, Spectrochimica Acta Part
B: Atomic Spectroscopy, vol. 107, pp. 1–10, 2015.

[17] M. Dyar, M. Carmosino, E. Breves, M. Ozanne, S. Clegg, and R. Wiens,
“Comparison of partial least squares and lasso regression techniques
as applied to laser-induced breakdown spectroscopy of geological sam-
ples”, Spectrochimica Acta Part B: Atomic Spectroscopy, vol. 70, pp. 51–
67, 2012.
[18] T. Zhang, L. Liang, K. Wang, H. Tang, X. Yang, Y. Duan, and H. Li,
“A novel approach for the quantitative analysis of multiple elements
in steel based on laser-induced breakdown spectroscopy (libs) and ran-
dom forest regression (rfr)”, Journal of Analytical Atomic Spectrome-
try, vol. 29, no. 12, pp. 2323–2329, 2014.
[19] G. Yang, X. Han, C. Wang, Y. Ding, K. Liu, D. Tian, and L. Yao, “The
basicity analysis of sintered ore using laser-induced breakdown spec-
troscopy (libs) combined with random forest regression (rfr)”, Analyti-
cal Methods, vol. 9, no. 36, pp. 5365–5370, 2017.
[20] H. Duan, L. Han, and G. Huang, “Quantitative analysis of major met-
als in agricultural biochar using laser-induced breakdown spectroscopy
with an adaboost artificial neural network algorithm”, Molecules, vol. 24,
no. 20, p. 3753, 2019.
[21] L. Chengxu, W. Bo, X. Jiang, J. Zhang, N. Kang, and Y. Yanwei, “De-
tection of k in soil using time-resolved laser-induced breakdown spec-
troscopy based on convolutional neural networks”, Plasma Science and
Technology, vol. 21, no. 3, p. 034 014, 2018.
[22] A. K. Myakalwar, N. Spegazzini, C. Zhang, S. K. Anubham, R. R.
Dasari, I. Barman, and M. K. Gundawar, “Less is more: Avoiding the
libs dimensionality curse through judicious feature selection for explo-
sive detection”, Scientific reports, vol. 5, p. 13 169, 2015.
[23] J. Guezenoc, L. Bassel, A. Gallet-Budynek, and B. Bousquet, “Vari-
ables selection: A critical issue for quantitative laser-induced break-
down spectroscopy”, Spectrochimica Acta Part B: Atomic Spectroscopy,
vol. 134, pp. 6–10, 2017.
[24] K. Hasegawa, Y. Miyashita, and K. Funatsu, “Ga strategy for variable
selection in qsar studies: Ga-based pls analysis of calcium channel an-
tagonists”, Journal of Chemical Information and Computer Sciences,
vol. 37, no. 2, pp. 306–310, 1997.

[25] L. Nørgaard, A. Saudland, J. Wagner, J. P. Nielsen, L. Munck, and
S. B. Engelsen, “Interval partial least-squares regression (i pls): A com-
parative chemometric study with an example from near-infrared spec-
troscopy”, Applied Spectroscopy, vol. 54, no. 3, pp. 413–419, 2000.
[26] M. Forina, C. Casolino, and C. Pizarro Millan, “Iterative predictor weight-
ing (ipw) pls: A technique for the elimination of useless predictors in re-
gression problems”, Journal of Chemometrics: A Journal of the Chemo-
metrics Society, vol. 13, no. 2, pp. 165–184, 1999.
[27] Sustainable Development Goals, https://www.undp.org/content/undp/en/home/sustainable-development-goals.html.
[28] Sustainable Development Goal 12: Responsible consumption and production, http://www.undp.org/content/undp/en/home/sustainable-development-goals/goal-12-responsible-consumption-and-production.html.
[29] L. Deng and D. Yu, “Deep learning: Methods and applications”, Foun-
dations and trends in signal processing, vol. 7, no. 3–4, pp. 197–387,
2014.
[30] A. Ciucci, M. Corsi, V. Palleschi, S. Rastelli, A. Salvetti, and E. Tognoni,
“New procedure for quantitative elemental analysis by laser-induced
plasma spectroscopy”, Applied spectroscopy, vol. 53, no. 8, pp. 960–
964, 1999.
[31] A. Bengtson, “Laser induced breakdown spectroscopy compared with
conventional plasma optical emission techniques for the analysis of metals–
a review of applications and analytical performance”, Spectrochimica
Acta Part B: Atomic Spectroscopy, vol. 134, pp. 123–132, 2017.
[32] Y. Guo, L. Deng, X. Yang, J. Li, K. Li, Z. Zhu, L. Guo, X. Li, Y. Lu,
and X. Zeng, “Wavelet-based interference correction for laser-induced
breakdown spectroscopy”, Journal of Analytical Atomic Spectrometry,
vol. 32, no. 12, pp. 2401–2406, 2017.
[33] A. M. Legendre, Nouvelles méthodes pour la détermination des orbites
des comètes. F. Didot, 1805.
[34] K. Pearson, “On lines and planes of closest fit to systems of points in
space”, The London, Edinburgh, and Dublin Philosophical Magazine and Journal
of Science, vol. 2, pp. 559–572, 1901.
[35] M. G. Kendall, “A course in multivariate analysis: London”, Charles
Griffin & Co, 1957.

[36] H. Wold, “Estimation of principal components and related models by
iterative least squares”, Multivariate analysis, pp. 391–420, 1966.
[37] H. Zou and T. Hastie, “Regularization and variable selection via the
elastic net”, Journal of the royal statistical society: series B (statistical
methodology), vol. 67, no. 2, pp. 301–320, 2005.
[38] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Jour-
nal of the Royal Statistical Society: Series B (Methodological), vol. 58,
no. 1, pp. 267–288, 1996.
[39] A. N. Tikhonov and V. Y. Arsenin, “Solutions of ill-posed problems”,
New York, pp. 1–30, 1977.
[40] T. K. Ho, “Random decision forests”, Proceedings of 3rd international
conference on document analysis and recognition, vol. 1, pp. 278–282,
1995.
[41] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[42] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for
optimal margin classifiers”, Proceedings of the fifth annual workshop
on Computational learning theory, pp. 144–152, 1992.
[43] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component anal-
ysis as a kernel eigenvalue problem”, Neural computation, vol. 10, no. 5,
pp. 1299–1319, 1998.
[44] E. Képeš, P. Pořízka, J. Klus, P. Modlitbová, and J. Kaiser, “Influence of
baseline subtraction on laser-induced breakdown spectroscopic data”,
Journal of Analytical Atomic Spectrometry, vol. 33, no. 12, pp. 2107–
2115, 2018.
[45] Z.-M. Zhang, S. Chen, and Y.-Z. Liang, “Baseline correction using
adaptive iteratively reweighted penalized least squares”, Analyst, vol. 135,
no. 5, pp. 1138–1146, 2010.
[46] X. Liu, Z. Zhang, P. F. Sousa, C. Chen, M. Ouyang, Y. Wei, Y. Liang, Y.
Chen, and C. Zhang, “Selective iteratively reweighted quantile regres-
sion for baseline correction”, Analytical and bioanalytical chemistry,
vol. 406, no. 7, pp. 1985–1998, 2014.
[47] X. Liu, Z. Zhang, Y. Liang, P. F. Sousa, Y. Yun, and L. Yu, “Base-
line correction of high resolution spectral profile data based on expo-
nential smoothing”, Chemometrics and Intelligent Laboratory Systems,
vol. 139, pp. 97–108, 2014.

[48] X. Fu, F.-J. Duan, T.-T. Huang, L. Ma, J.-J. Jiang, and Y.-C. Li, “A
fast variable selection method for quantitative analysis of soils using
laser-induced breakdown spectroscopy”, Journal of Analytical Atomic
Spectrometry, vol. 32, no. 6, pp. 1166–1176, 2017.
[49] G. A. Pearson and R. I. Walter, “Deconvolution of broad-line nmr spec-
tra containing overlapping modulation sidebands”, Journal of Magnetic
Resonance (1969), vol. 16, no. 2, pp. 348–353, 1974.
[50] Y.-Z. Liang, A. K.-M. Leung, and F.-T. Chau, “A roughness penalty
approach and its application to noisy hyphenated chromatographic two-
way data”, Journal of Chemometrics: A Journal of the Chemometrics
Society, vol. 13, no. 5, pp. 511–524, 1999.
[51] H. F. Boelens, R. J. Dijkstra, P. H. Eilers, F. Fitzpatrick, and J. A. West-
erhuis, “New background correction method for liquid chromatogra-
phy with diode array detection, infrared spectroscopic detection and ra-
man spectroscopic detection”, Journal of chromatography A, vol. 1057,
no. 1-2, pp. 21–30, 2004.
[52] S.-J. Baek, A. Park, Y.-J. Ahn, and J. Choo, “Baseline correction using
asymmetrically reweighted penalized least squares smoothing”, Ana-
lyst, vol. 140, no. 1, pp. 250–257, 2015.
[53] S. He, W. Zhang, L. Liu, Y. Huang, J. He, W. Xie, P. Wu, and C. Du,
“Baseline correction for raman spectra using an improved asymmetric
least squares method”, Analytical Methods, vol. 6, no. 12, pp. 4402–
4407, 2014.
[54] D. Xu, S. Liu, Y. Cai, and C. Yang, “Baseline correction method based
on doubly reweighted penalized least squares”, Applied optics, vol. 58,
no. 14, pp. 3913–3920, 2019.
[55] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, “A review of vari-
able selection methods in partial least squares regression”, Chemomet-
rics and Intelligent Laboratory Systems, vol. 118, pp. 62–69, 2012.
[56] I.-G. Chong and C.-H. Jun, “Performance of some variable selection
methods when multicollinearity is present”, Chemometrics and intelli-
gent laboratory systems, vol. 78, no. 1-2, pp. 103–112, 2005.
[57] I. E. Frank, “Intermediate least squares regression method”, Chemo-
metrics and Intelligent Laboratory Systems, vol. 1, no. 3, pp. 233–242,
1987.

[58] T. Mehmood, H. Martens, S. Sæbø, J. Warringer, and L. Snipen, “A par-
tial least squares based algorithm for parsimonious variable selection”,
Algorithms for Molecular Biology, vol. 6, no. 1, p. 27, 2011.
[59] W. Cai, Y. Li, and X. Shao, “A variable selection method based on
uninformative variable elimination for multivariate calibration of near-
infrared spectra”, Chemometrics and intelligent laboratory systems, vol. 90,
no. 2, pp. 188–194, 2008.
[60] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, “Variable selection using
random forests”, Pattern recognition letters, vol. 31, no. 14, pp. 2225–
2236, 2010.
[61] Y. Qi, “Random forest for bioinformatics”, Ensemble machine learning,
pp. 307–323, 2012.

Appendix A

Dataset element concentration

The tables below show the element concentrations of the samples composing the
aluminium and slag datasets.
Table A.1 shows the concentration values of Silicon (Si), Iron (Fe),
Magnesium (Mg), Zinc (Zn), Copper (Cu), Manganese (Mn), and Aluminium (Al) of
the aluminium matrix samples.
Table A.2 shows the concentration values of Magnesium (Mg), Aluminium (Al),
Silicon (Si), Iron (Fe), Manganese (Mn), Chromium (Cr), and Calcium (Ca) of
the slag matrix samples.

Table A.1: Element concentration values for the aluminium dataset (in mass
percentage)

Sample Si Fe Mg Zn Cu Mn Al
1 0.07 0.10 0.41 0.18 0.06 1.45 97.6
2 0.18 0.35 2.03 6.08 1.35 0.45 89.4
3 7.46 0.53 0.04 0.14 0.15 0.10 91.4
4 3.00 0.80 0.57 0.15 4.29 0.04 90.5
5 0.12 0.30 3.04 4.93 1.85 0.06 89.3
6 7.22 0.14 0.36 0.08 0.12 0.05 91.8
7 0.78 0.08 0.98 1.03 0.84 0.48 95.8
8 6.16 0.00 0.36 0.00 0.01 0.21 93.3
9 0.16 0.20 4.54 0.05 0.05 0.38 94.5
10 6.12 0.00 0.32 0.00 0.00 0.02 93.5
11 0.35 0.27 2.78 5.37 1.60 0.29 89.1
12 0.26 0.42 1.26 0.02 0.00 1.16 96.8
13 8.75 0.46 1.71 0.03 2.00 0.06 85.3
14 0.53 0.23 0.77 0.03 0.01 0.02 98.4
15 0.36 0.60 1.10 0.07 0.20 0.83 96.6
16 0.18 0.50 1.11 0.05 0.15 1.26 96.7
17 0.18 0.20 2.48 5.44 1.60 0.08 89.8
18 12.53 0.31 0.02 0.05 0.06 0.03 86.8
19 9.50 0.00 0.39 0.00 0.06 0.02 90.0
20 0.16 0.31 0.88 0.10 0.11 1.14 97.2
21 9.46 1.19 0.39 0.16 3.10 0.26 84.8
22 8.56 0.00 0.29 0.89 2.64 0.33 86.6
23 0.18 0.23 2.94 0.03 0.06 0.28 95.9
24 9.14 1.01 0.20 0.42 3.60 0.41 84.8
25 0.53 0.18 3.57 5.06 1.90 0.13 88.2
26 9.19 0.00 0.34 0.00 0.01 0.01 90.5
27 9.36 0.86 0.06 1.01 3.48 0.39 84.2

Table A.2: Element concentration values for the slag dataset (in mass percent-
age)

Sample Mg Al Si Fe Mn Cr Ca
1 25.00 9.61 13.20 18.30 1.99 0.20 37.98
2 18.00 5.04 14.20 18.90 2.10 0.28 44.72
3 21.10 6.14 10.60 20.20 5.82 0.61 38.72
4 15.50 4.07 12.80 26.70 3.35 0.52 38.44
5 17.10 4.14 14.80 21.70 1.72 0.20 43.38
6 15.90 4.34 16.50 18.90 1.29 0.10 45.53
7 14.30 4.12 16.20 19.00 1.21 0.09 46.35
8 13.00 4.05 16.10 21.00 1.26 0.10 45.29
9 10.60 7.30 14.30 33.20 2.75 0.42 32.81
10 19.30 6.50 14.70 17.40 2.72 0.31 42.64
11 11.20 5.02 14.00 26.10 2.84 2.11 38.44
12 17.40 4.31 12.80 27.70 2.86 2.59 36.41
13 7.90 3.20 33.30 1.30 3.00 3.70 45.00
14 7.90 3.80 35.30 1.30 2.00 3.90 42.50
15 9.60 2.80 35.30 0.90 1.30 2.70 45.70
16 8.90 3.20 29.00 1.30 1.60 11.00 44.10
17 8.40 3.70 34.30 1.70 1.30 9.30 40.30
18 10.50 3.20 34.50 1.00 1.10 3.20 45.30
19 9.10 3.70 32.50 2.00 2.00 4.50 45.20
20 7.70 4.80 30.90 1.70 3.30 7.60 41.50
21 6.80 4.10 27.80 2.50 3.50 12.90 38.90
22 6.10 2.80 25.30 1.70 2.80 17.80 39.00
23 14.10 10.90 24.30 2.45 4.39 15.40 27.24
24 11.80 7.91 21.50 3.87 1.36 6.98 43.43
25 7.20 6.70 5.85 35.00 3.79 2.05 35.12
26 18.40 5.72 25.40 3.46 0.49 1.08 43.81
27 19.90 9.52 28.00 2.86 0.70 3.00 35.25
28 14.90 8.57 13.90 12.80 5.32 6.99 35.39
29 8.90 26.90 11.60 0.27 0.00 0.15 51.40
30 8.94 25.50 11.80 0.14 0.00 0.00 53.00
31 19.90 6.05 28.50 1.39 0.42 0.92 42.55
32 14.00 28.40 5.99 0.45 0.00 0.15 50.10
33 20.00 6.79 27.50 1.37 0.41 0.89 42.53
34 24.40 22.10 7.54 0.47 0.00 0.00 44.50

TRITA-EECS-EX-2020:677
www.kth.se
