
MACHINE LEARNING PROJECT STEPS

 Foreword ________________________________________________
The steps that are going to be discussed in this article are the same for
any machine learning project but the suitable tools to use vary
depending on the data and the project objectives. So if you are working
on customer churn, you cannot use the tools designed for image
classification. If you are doing financial time series forecasting, you
cannot use logistic regression to predict future prices.
Tools and techniques are grouped into different categories, just like
there might be different screwdrivers and spanners in one toolbox.
Some tools perform better than others for the same job but engineers
are advised to choose tools based on the data and what works for the
project, not what they fancy. A knife can be used to cut a small wire,
but the more appropriate tool in the toolbox for the job is a pair of pliers.
Likewise, there are various algorithms for feature selection and clustering,
and various cost functions for regression, but you cannot just pick anything
you like to build and deploy your model.
The keyword here is toolbox. Over 80% of data science projects fail
because many inexperienced, self-styled “data scientists” never receive
comprehensive education and training in the craft. To put it bluntly,
machine learning is very hard. Machine learning as a career is also
relatively new, and experienced ML engineers who can provide mentorship
will remain scarce until traditional academia catches up. To truly get into
machine learning today, with or without certifications, you have to combine
domain knowledge, software engineering, data analysis, and applied
mathematics.
 What is data? ___________________________________________
Statistics is the science of understanding data. Data consists of units of
information that are organized into a certain format or structure. There
are two main branches of data (or data types) in statistics and
software:
- Categorical data
This is data that consists of labels. There are two branches of categorical
data:
 Nominal data: These comprise discrete names such as nouns,
addresses, and classes.
 Ordinal data: These comprise labels with a ranking or order.
Examples are star ratings, experience and difficulty levels, and
traffic colour codes.
- Numerical data
This is data that consists of numbers. There are two branches of
numerical data:
 Discrete values: These comprise discrete numbers that are part of
an entity but do not necessarily indicate measurement. Examples
are the numbers on a die, HTTP status codes, and a number
generated as a hash.
 Continuous values: These comprise numbers that indicate
measurement, range or extent. Examples are length, pressure,
temperature, population, sum of money, and exam scores.

There are three main types of digital data formats: structured, semi-
structured and unstructured data.
 Structured data: This is tabular data. It constitutes a set of values
organized into rows and columns. The rows are normally indicated
by indexing and the columns are marked by labels called features
or attributes. Every row is a record of an entity’s features’ values
at the time of acquisition. Examples are pandas dataframes, SQL
tables and Excel spreadsheets.
 Semi-structured data: This is organized to form a tree structure
with records separated by delimiters or tags. Examples are JSON,
CSV, XML and HTML data.
 Unstructured data: This is arranged in sequences or series. The
next value depends on previous values. Examples are binary data,
encrypted data and media data such as audio, text, image and
videos.
What do I need to be good at to get into machine learning? ______
Machine learning combines four key subjects. To succeed, you must
know these subjects well beyond the average level. I don’t mean you need
knowledge up to the MSc or PhD level; it should just be above that of the
average graduate.
The four subjects are:
 Domain knowledge
 Software engineering
 Data analysis
 Applied mathematics
- Domain Knowledge
Amateur data scientists take a dataset on any topic and think they can
do a portfolio project on it. In reality, you ought to talk to
experts about the data. If it is a project about forecasting disease spread,
for example, you ought to consult medical professionals and
environmental health experts. You also need to read publications and
case studies to get a better idea of how your project is going to
impact the community. Research skills are vital for progress in
machine learning.
- Software Engineering
You should know how to leverage data structures, algorithms, coding
patterns and architectural designs. Python programming is your primary
coding language here. You must know the necessary Python libraries and
frameworks that are relevant for your project. Of course if you want to
integrate and deploy models into applications and software services, you
must be a software developer.
ML engineers also team up with other engineers from other IT domains
such as cloud and DevOps engineers who can aid in model deployment.
This is why some firms desire ML engineers who have some experience
using tools like Docker, Apache services, cloud services like Azure and
AWS in their tech stack.
- Data Analysis
Contrary to popular notion, the bulk of machine learning is not
modelling, it is data analysis. In fact it is hard to become a full-fledged
engineer without gaining some data analysis certifications or work
experience. The good news is that data analysis is generally an entry-
level role and every growing company wants at least one data analyst in
its workforce. The primary task of a data analyst in the workplace is to
make reports. Reports consist of tables, dashboards, visuals and KPIs
that are to be presented or published. Companies store their data in
relational databases, so it is important for a data analyst to know how to
write SQL statements to query the needed data. There is no way you can
make progress in machine learning without knowing how to write
queries. There is also no way you can advance without knowing how to
read and prepare the content of a report. A regular data analyst writes
reports in a Word document, an Excel workbook, or a PowerPoint slide.
But as you step into the world of machine learning, interactive Python
notebooks will become your favourite platform for reporting and
documentation.
- Applied Mathematics
You don’t need to be a great mathematician to enter machine learning
just like you don’t need to graduate with a music degree to become a
pop star. You just have to master “mathematical logic”, because most
algorithms in the machine learning world are presented as mathematical
formulae rather than pseudocode. The reason is the need to compute
values that represent the items in a data structure such as a vector or
tensor. By mathematical logic, I mean understanding the common
mathematical notations used in the aggregation and manipulation of
figures. Without this skill, you will not know how to do summarization,
feature engineering and evaluation.
The three main branches of mathematics you have to learn are:
- Statistics
 Aggregation: Sums, products, averages, and variances are used in
aggregation of data. They form the basic building blocks for most
algorithms in machine learning.
 Probability: Probability laws, Shannon entropy, likelihood and
odds ratios are essential to understand how predictions work.
 Estimation statistics: Interval metrics, combinatorics, power
analysis and Bayesian formulations are essential to understand how the
accuracy of predictions works.
 Visualizations: Statistical reports are displayed as visualizations.
Python libraries like matplotlib, seaborn, and plotly are used in making
plots like line charts, bar charts, pie charts, histograms and cluster maps.
- Linear algebra
 Variables and Scalars: Features and values can be written as
variables which can be used to form equations or formulae that take
dimensions and figures into account. Single values that are used to
modify vectors are called scalars.
 Sets and Vectors: In machine learning, a linear collection of values
is used to represent one feature. This is indicated mathematically as a
list or set. But in matrix terminology, this linear collection is called a
vector.
 Euclidean geometry: Euclidean geometry is necessary to
understand the effect of distances between data points.
 Matrices and Tensors: If a vector is a 1D array, then a 2D array is
called a matrix. Our dataframes and tables are actually matrices our
algorithms would manipulate. An array with more than two dimensions is
called a tensor.
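To make these terms concrete, here is a minimal NumPy sketch (with arbitrary illustrative values) of a scalar, a vector, a matrix and a higher-dimensional tensor:

```python
import numpy as np

scalar = 2.0                          # a single value used to modify a vector
vector = np.array([1.0, 2.0, 3.0])    # 1D array: one feature's values
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])  # 2D array: rows and columns, like a dataframe
tensor = np.zeros((2, 3, 4))          # an array with more than two dimensions

print(scalar * vector)                # scaling a vector -> [2. 4. 6.]
print(matrix.shape)                   # (2, 3)
print(tensor.ndim)                    # 3
```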
- Calculus
 Functions: Calculus extends algebra and functions to the
measurement of rates of change. Your models are actually created using
function approximators. The model should take an input and return a
deterministic or stochastic output. Composite functions are the basis of
the layers of the neural networks used in deep learning.
 Differentiation: Principles like chain rule and partial derivatives
are the basis of gradient descent. Gradient descent is the predominant
algorithm that enables machines to learn patterns during the training
process.
 Integration: Integration is the basis of some calculations such as
continuous probability distributions.
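To tie differentiation to training, here is a tiny, hedged gradient descent sketch that fits a one-parameter model y ≈ w·x by stepping against the gradient of the mean squared error (the data and learning rate are arbitrary toy choices):

```python
import numpy as np

# Toy data where y is roughly 3 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0      # the parameter to be learned
lr = 0.01    # learning rate (a hyperparameter set by the programmer)

for step in range(200):
    y_pred = w * x
    # Loss L = mean((y_pred - y)^2), so dL/dw = mean(2 * (y_pred - y) * x)
    grad = np.mean(2 * (y_pred - y) * x)
    w -= lr * grad                     # step against the gradient

print(round(float(w), 3))              # converges close to 3.0
```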

Now to the main topic


 Machine Learning Project stages _____________________________
The steps are:
1. Project Definition
2. Data Acquisition
3. Data Preprocessing
4. Feature Engineering
5. Model Training
6. Model Evaluation
7. Model Deployment

One of the textbook ways newbies go through these steps, in one or two
Python files, is like this:
1. Project definition: To show a tutorial or to learn machine learning and
add something to one’s portfolio.
2. Data acquisition: Download some dataset, usually a CSV file from
Kaggle. Then load the file locally using pandas in an IDE or a Jupyter
notebook.
3. Data preprocessing: Delete seemingly irrelevant columns and fill empty
spaces with default values. Do some data visualization to observe the
distribution of data points.
4. Feature engineering: Apply one-hot encoding to the categorical columns
and min-max scaling to the numerical columns.
5. Training: Split the data into training and test sets, pick a column to
serve as the target, then pick a suitable training algorithm and fit the
model.
6. Evaluation: Use existing accuracy metrics or residual metrics to
see how well the model performs on the test set.
6. Evaluation: Then use existing accuracy metrics or residual metrics to
see how well the model performs on the test sets.
7. Deployment: If the results are satisfactory, save the model object into
a serializable file, then build a Streamlit or Gradio application and
integrate the model into the code. The user will input data and the app
will output the model’s predictions. Then host the app on a cloud service
like Streamlit Cloud, Render or Heroku.
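For illustration only, here is a hedged sketch of that textbook workflow in scikit-learn, assuming a hypothetical churn.csv with a customer_id column and a binary churned target:

```python
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn.csv")                      # hypothetical Kaggle-style dataset
df = df.drop(columns=["customer_id"])              # delete a seemingly irrelevant column
df = df.fillna(0)                                  # fill empty spaces with a default value

X = pd.get_dummies(df.drop(columns=["churned"]), dtype=float)   # one-hot encode categoricals
X = (X - X.min()) / (X.max() - X.min())            # crude min-max scaling of every column
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

joblib.dump(model, "churn_model.joblib")           # serialized file for a Streamlit/Gradio app
```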
This simplistic and superficial workflow is fine to try out if you are
a newbie, and it works in friendly practice conditions. However, it is not
good enough to win a top competition or get a job in a top tech firm
today. It also lacks features such as pipelines, scalability and reusable
mechanisms, which are needed for sustainable development of the model
and the app built around it. The steps to building a sustainable end-to-end
workflow are what we will address.

PROJECT DEFINITION
Your project definition should answer the following questions so as to be
clear about what to expect and how to provide solutions.
1. What is the overall objective?
2. Are we doing business, research or a hobby?
3. What tools are we going to use for the development?
4. What would be the source(s) of our data?
5. Do we have the resources to acquire and store enough data?
6. Should we build a new model or should we use a pre-trained model?
7. Where should we deploy the model?
8. Do we have the resources to update and maintain the model?

DATA ACQUISITION
Data is the raw material used in building ML models. Acquiring it is
an expensive and technical process.
Sources of Data
 Drive storage: Storage media like local or cloud storage are the
most direct ways for storing and accessing data.
 Online repository: You can get datasets from public repositories
like Kaggle or DatasetList. Note that datasets from these places are not
verified against ALCOA standards and are not recommended for
production projects.
 Surveys: Data about people can be acquired via online
questionnaires, feedback forms or surveys.
 Databases: Companies and institutions store data in databases.
Having access to databases can provide huge amounts of structured
data. It is recommended to use an ORM like SQLAlchemy to indirectly
interface with a database for big data operations.
 Cookies: If you have a website, you can use cookies to gain
information about the clients who visit your website.
 Sensors or monitors: These gadgets are used to collect public data
from entities such as clicks, passing vehicles or animals.
 APIs: APIs are one of the best sources of data. Many informative
websites and services offer access to their APIs for a fee (a minimal
request sketch appears at the end of this section).
 Web scraping: You can get data from web pages using tools like
Selenium and other web crawlers. Web scraping is considered a
controversial technique. Web servers are equipped with blockers to
restrict web crawling activity.
 Data lakes and data warehouses: With the paid help of a data
engineer, you can get large quantities of data from data warehouses like
SnowFlake, AWS Redshift, and BigQuery. Data lakes like GCS and AWS S3
store lots of unstructured data.
 Data vendors: You can subscribe to an agent or vendor who
would source the fresh data you seek.
 Synthesis and augmentation: Synthetic data is data that is
artificially generated. Augmentation is the expansion of original data by
mixing it with synthetic data or applying other forms of feature
engineering to enlarge it. While it may not be original, synthetic data is
used to make a dataset fit a particular distribution better so that it is
more useful to train on.
While there are many ways of storing and acquiring data, keep in mind
concerns like storage, cost, accuracy, legality and security of the data
before making a decision.
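As a hedged illustration of programmatic acquisition (the URL, parameters and fields are hypothetical), here is how records pulled from a JSON API can be loaded into pandas and persisted before preprocessing:

```python
import requests
import pandas as pd

# Hypothetical endpoint; most real APIs also require an API key or token
url = "https://api.example.com/v1/transactions"
response = requests.get(url, params={"limit": 1000}, timeout=30)
response.raise_for_status()                      # fail loudly on HTTP errors

records = response.json()                        # assumes the API returns a list of JSON records
df = pd.DataFrame(records)
df.to_csv("transactions_raw.csv", index=False)   # persist the raw data before touching it
print(df.shape)
```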

DATA PREPROCESSING
Once the desired dataset is obtained, the first thing is to load the data
into memory and preprocess it. The reason we preprocess data is
that fresh data from the real world is not perfect. If the data is not
treated first, it will corrupt the model. Preprocessing is also called data
cleansing or data wrangling. It is the removal of erroneous values or
inconsistencies and features that are either irrelevant or detrimental to
the model. The goal is to have a clean dataset that can be perfectly
feature engineered before training. Thus data preprocessing consists of
these activities:
- Removal of blank values
Also called null handling. Empty values will endanger the computations
during the feature engineering and training stages, so you have to fill the
empty spaces. Now, you just cannot manually fill the blank spaces with
arbitrary values, you have to apply imputation techniques, or use an
appropriate default value. You can also delete rows or records that have
nulls but you risk altering the distribution. If the entire dateset or feature
is overwhelmed with blank spaces, it would not be useful and have to be
discarded.
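A minimal pandas sketch of null handling on a small, made-up dataframe: inspect the blanks first, then drop or fill them.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "city":   ["Accra", "Lagos", None, "Accra"],
    "income": [1200.0, 980.0, np.nan, 1500.0],
})

print(df.isna().sum())                            # count the blanks per feature

df = df.dropna(subset=["city"])                   # drop rows where a key label is missing
df["age"] = df["age"].fillna(df["age"].median())  # impute with a robust default
df["income"] = df["income"].fillna(df["income"].mean())
```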
- Removal of erroneous values
Apart from nulls, there are occasions where some features may contain
misspellings, displacements among records or values having wrong
types. It is essential to summarize the data and observe these
inconsistencies and correct them otherwise they will confuse the model.
- Exploratory data analysis
Without good EDA, data preprocessing will go bad. This is the
examination of the features to understand the distribution and
relationship between the categorical and numerical data points using
visualizations or other descriptive and inferential techniques. To prevent
confusion, exploratory data analysis should be performed in three levels:
 Univariate analysis: Each feature is analysed independently.
 Bivariate analysis: Pairwise analysis of features.
 Multivariate analysis: Analysis of multiple features or groups of
features.
- Handling outliers
Outliers are data points that lie so far from the median of the
distribution that they skew the mean and variance of the distribution,
leading to inaccurate judgement of the results. Outliers can be deleted,
shrunk using feature scaling, or suppressed during training using
regularization algorithms.
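One common (but not the only) strategy is the IQR rule; a hedged sketch on a made-up numeric column:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 300]})   # 300 is an obvious outlier

q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

trimmed = df[df["price"].between(lower, upper)]            # option 1: delete the outliers
df["price_clipped"] = df["price"].clip(lower, upper)       # option 2: shrink them to the boundary
```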
- Handling imbalanced data
Sometimes there could be obvious imbalances among the distribution of
categorical values. If the machine learning model is to produce fair and
less biased outputs, imbalances ought to be handled. The Python library
imblearn provides solutions for handling imbalanced data. Before you
take on this venture, you should study statistical sampling techniques to
learn how to sample data correctly whilst taking the actual population
distribution into account, otherwise you would have errors. There are
two ways of treating imbalanced data:
 Oversampling: That is augmenting the quantities of the minority
class.
 Undersampling: That is reducing the quantities of the majority
class.
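A hedged imblearn sketch of both strategies on a synthetic, deliberately imbalanced dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic 9:1 dataset purely for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Oversampling: augment the minority class
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))

# Undersampling: reduce the majority class
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))
```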
- Feature selection techniques
Feature selection is the removal of irrelevant or detrimental features,
keeping the ones that have a significant impact on training the model.
Feature selection is first performed manually, based on the parameters
requested by the project manager or client, or on the cardinality of the
feature values. If a feature has no duplicate values, such as one used as a
primary key, it only adds confusion and high entropy to the model, so you
have to remove it. Likewise, a feature that has only one homogeneous
value throughout is redundant and should be removed too.
However, there are cases where you have to select features based on
the statistics of their values, and that should not be done manually but
algorithmically. There are three methods of performing algorithmic
feature selection.
 Filter methods: These apply inferential statistical methods to the
features and select the ones that show the most effect. Examples
are the use of a correlation matrix, the F-test, the chi-square test, mutual
information, and a variance threshold (a sketch of filter and wrapper
methods follows this list).
 Wrapper methods: These are algorithms that train and test the
features on an existing machine learning algorithm and score the
features based on their performance. Examples are recursive feature
elimination, sequential feature elimination, and exhaustive feature
elimination.
 Embedded methods: These methods are built into training
algorithms and exist as part of the underlying training process.
Examples are L1 and L2 regularization in regression and feature
importance metrics in decision trees.
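Here is a hedged scikit-learn sketch of a filter method (SelectKBest) next to a wrapper method (recursive feature elimination), using a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)       # scale so the wrapper's estimator converges

# Filter method: keep the 10 features with the strongest ANOVA F-scores
X_filtered = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper method: recursively eliminate features based on a model's coefficients
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)    # both (569, 10)
```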

FEATURE ENGINEERING
Feature engineering is the transformation of the data into arrays of
numerical values called feature vectors. The reason feature engineering
must come before training is that machine learning algorithms have
to see the data as arrays of numbers that can be passed into special
functions. This basically means that strings and other non-numerical
values must be transformed into numerical representations.
This is also why preprocessing comes before feature engineering. If the
data is not cleaned (it has redundant values, blanks, outliers and
other defects), the same problems will appear in the feature vectors and
lead to new problems during and after training.
Note that there can be a subjective overlap between preprocessing and
feature engineering, and both terms are sometimes used
interchangeably because they both involve modifying a dataset before
training. If you get these two stages wrong, your model will fail, so
understand the data you are working on well such that you will have an
idea on what your model should look like at the end, and how it ought to
behave.
There are different feature engineering techniques and algorithms
depending on the data and the project. Let’s look at the predominant
techniques:
- Imputation
This is the use of special algorithms called imputers to automatically fill
empty spaces with synthetic values that are intended to fit the
distribution.
- Feature extraction
Feature extraction should not be confused with feature selection. The
latter involves selecting and discarding existing features; the former
involves creating or deriving new features.
Feature extraction can be done manually by applying column-wise
calculations like the ones done in Excel. Binning or discretization is
another technique: the division of continuous values into ranges that
are mapped to nominal values called bins.
Algorithmically, feature extraction is any method that derives new
features computationally from existing ones.
- Categorical encoding
This is the translation of categorical values into discrete numerical
representations. Examples of algorithms used in categorical encoding
are one-hot encoder, label encoder, and ordinal encoder.
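A minimal scikit-learn sketch contrasting one-hot encoding of nominal labels with ordinal encoding of ranked labels (the values are made up):

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colours = [["red"], ["green"], ["blue"], ["green"]]      # nominal data
ratings = [["low"], ["medium"], ["high"], ["medium"]]    # ordinal data

onehot = OneHotEncoder(sparse_output=False).fit_transform(colours)
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit_transform(ratings)

print(onehot)     # one binary column per category
print(ordinal)    # low=0, medium=1, high=2
```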
- Text vectorization
This is the translation of textual data or documents (called corpus) into
vectors based on the frequency and order of the words and other
symbols. There are underlying processes that occur during the
vectorization such as the extraction of tokens, lemmas and stop words
but that is beyond the scope of this note. There are basic, bag-of-words
style vectorizers such as the count vectorizer, TF-IDF vectorizer, hash
vectorizer and dictionary vectorizer. More advanced vectorizers such as
Word2Vec and GloVe learn dense embeddings, an idea carried further by
the learned token embeddings inside large language models.
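A minimal sketch of count and TF-IDF vectorization on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the model learns from data",
    "data must be cleaned before the model trains",
]

counts = CountVectorizer().fit_transform(corpus)   # bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(corpus)    # frequency-weighted vectors

print(counts.toarray())
print(tfidf.shape)                                 # (2, number of unique tokens)
```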
- Feature scaling
Feature scaling is sometimes called normalization or standardization
because it bears resemblance to the scenario where values of different
units are rescaled to fit into one unit. It is the transformation of the
original numerical values of a feature so that they fall into a narrower,
more comparable range with lower variance. In other words, the ranges
of the values are brought onto a comparable scale. This is necessary so
that the model is not dominated by features with large numeric ranges.
Common feature scaling algorithms include the min-max scaler, z-score
(standard) scaler, robust scaler, IQR scaler and log transformation.
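A hedged sketch comparing min-max scaling with z-score (standard) scaling on a small column that contains one very large value:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[10.0], [20.0], [30.0], [1000.0]])

print(MinMaxScaler().fit_transform(values).ravel())    # squeezed into the [0, 1] range
print(StandardScaler().fit_transform(values).ravel())  # zero mean, unit variance
```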
- Audio, image and signal processing
Algorithms such as Fourier transforms are used in processing signals and
waveform data. Rolling calculation windows are used to smooth
indicators of financial time series. The content of image data, such as pixel
values, textures, and edges, is translated into feature vectors using
libraries such as OpenCV and torchvision before training with
convolutional neural networks.
- Dimensionality reduction
Dimensionality reduction is the compression or coalescence of several
features to form fewer features, usually two or three. This is done such
that a lot of information about the original features is retained. This
makes it easier for the machine learning algorithm to train on multiple
features with fewer parameters. Most dimensionality reduction
algorithms are unsupervised (except LDA). There are two types of
dimensionality reduction techniques:
 Linear techniques: These assume that the relationships between
features are linear. Examples are PCA, LDA, and MDS.
 Non-linear techniques: These make no assumptions about linearity
or non-linearity among the features. They are useful for datasets
where features have very high dimensions. Examples are T-SNE,
UMAP, LLE, autoencoders and Isomap.
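A minimal PCA sketch that compresses a 30-feature dataset down to two components for plotting or faster training:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                              # (569, 2)
print(pca.explained_variance_ratio_)           # information retained per component
```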
There are also feature engineering techniques that are embedded into
some training algorithms such as the convolutional layers of a CNN and
the attention heads of a transformer (in LLMs).

TRAINING THE MODEL


Correcting misconceptions first: What truly is a model?
When laypeople and amateurs first encounter the term ‘model’ in machine
learning jargon, most confuse it with the training algorithm.
Others confuse it with the overlying app or the outputs.
The word ‘model’ contextually bears similarity to the words ‘blueprint’,
‘framework’ or ‘method’. If I ask you to make a model that describes an
object, I expect you to make a 2D or 3D representation of the whole
object’s appearance and design. If I ask you to make a model of an
object that performs a task, what should come to your intelligent mind
are the possible materials, components and designs that must work
together to do the task. The components, designs, and materials, when
working together, make a model.
In machine learning, a model comprises all of the datasets or data
structures, algorithms, and metrics working together to achieve a
desired intelligent outcome.
A model’s operation comprises these five steps:
1. It receives raw inputs
2. It applies some segregation, encoding or transformation on the input
values based on the feature engineering that was integrated into it or
learned.
3. Then it applies those encoded values to a learned function that serves
as estimator, regressor or classifier.
4. The learned function would generate a probabilistic or definite result.
5. The result is decoded by an inverse transformer and sent to the output
source.
Thus, the model is a program that receives inputs, processes them, and
gives an output. But unlike traditional programs, where the outputs are
deterministic and follow directly from the input because the parameters
were hard-coded by the programmer, machine learning models are
programmed to learn these parameters. However, the parameters are
initialized by hyperparameters provided by the programmer.
How does learning or training work?
To train a model is to provide hyperparameters, and the machine learns
by using the hyperparameter values to generate parameters for
approximation.
Mathematically, algorithms are used to apply one or more function
approximations to the arrays or feature vectors. As the term
‘approximation’ implies, the algorithm attempts to construct the best
function that will generalize all of the data points and their relative
positions. In other words, it solves the problem of finding the shortest
and most efficient pathways or demarcations across all the data points
that make up the distribution. The approximations reach their limit and
stop when this function is discovered. But there is a problem: how the
approximations proceed depends on the data and the provided
hyperparameters. If the data was not properly wrangled and encoded, or
the wrong hyperparameter values were set, the wrong parameters will be
learned. The result is a bad function that cannot fit well, and subsequent
erroneous outputs.
Training models is not easy so do not take any of the requirements
lightly!
Categorizing models in machine learning
There are different ways of categorizing models in machine learning
methodology jargon.
- Types of learning
There are two main types of learning: supervised and unsupervised
learning. However there are other approaches like reinforcement
learning and semi-supervised learning.
 Supervised learning: This is using features as predictor variables
to predict known target vectors. Most regressor and classifier
algorithms are supervised.
 Unsupervised learning: This is discovering relationships or
patterns among features without specifying a target vector. Instead the
results of the training are used to make estimations or comparisons,
hence the predictions. Many clustering algorithms and dimensionality
reduction algorithms use this approach.
- Method of approximation
There are two methods of learning when it comes to approaching the
data and making approximations: model-based learning and instance-
based learning.
 Model-based learning: This is the method of learning that is
closest to the explanation in the section “How does learning or training
work?” and it is the most preferred method today. Many classifiers and
regressors use this method. In this method, the model is formed after
training on the data and works independently of the data afterwards. The
most preferred approximating algorithm used to train the model is
called gradient descent. Gradient descent tries to minimize something
called the loss function by reducing the error margins between the actual
values and the predicted values. The training time is often long but
the prediction time is short. However, the model can only be updated by
training afresh on new data, which is as good as building a new model.
 Instance-based learning: Also called non-parametric learning
because there are no internal parameters to approximate. Many cluster-
based algorithms and some dimensionality reduction algorithms are
instance based. In this method, the dataset is part of the model because
its data points are used to provide inference through similarity and
distance measures. To make predictions, the data points of the test
data are compared with those of the training data, and the distance
measure estimates how similar or close they are. Hence, instance-based
methods have shorter training time but longer prediction time. This
makes them useful in situations where the training data is regularly
updated, because the burden of training new models is avoided. Projects
involving anomaly and fault detection are often better suited to
instance-based models.
- Prediction type
There are two main types of algorithms for predictive analysis in
machine learning: classification and regression. In scikit-learn
terminology, the algorithms used for predictions are called classifiers
and regressors respectively. In pipeline terminology, both are called
estimators.
 Classification: Classification is predicting categorical outcomes or
labels. Classifiers generally work by drawing a line or plane to demarcate
data points so that related data points would have their own space and
labels.
 Regression: Regression is the estimation of continuous values. The
philosophy of regression is based on statistical linear regression, except
that in machine learning it is expanded to include curves and more
complex shapes, with possible regularization, so long as sufficient
predictor variables are provided. Thus, most regression is supervised.
- The main groups of machine learning algorithms
There are several groups of machine learning algorithms, discussed
below. Note that a table fully classifying these algorithms will be
provided in a different document:
 Linear models: They use linear methods to understand patterns in
the data and make predictions. Examples are the family of GLM variants,
logistic regression, MDS, LDA, PCA, and SVM.
 Clustering: They group data points into distinct clusters. Examples
of clustering algorithms include K-Means, GMM, OPTICS, and
hierarchical clustering.
 Manifold Learning: They are mainly used for dimensionality
reduction but they are non-linear techniques. Examples are T-SNE,
UMAP, LLE, and Isomap.
 Bayesian models: They are based on Bayes’ rule and as a result are
probabilistic techniques. A popular algorithm is Naive Bayes.
 Decision trees: The decision tree is an algorithm that tries to split
the data using a binary tree structure. It is used both for classification and
regression. On its own the decision tree is limited but can be expanded
into ensemble models like random forest and variants of gradient
boosts.
 Neural networks: A popular complex algorithm used in AI
projects. It consists of a graph of layers of perceptrons that connect an
input layer to an output layer. A neural network trains by running
gradient descent, using a combination of two iterative processes called
forward propagation and backpropagation. There are many types of
neural networks. Examples are RBFNN, FFNN, RNN, CNN, SOM, RBM,
autoencoders and transformers.
 Ensemble models: These combine collections of simpler models or
algorithms into an ensemble to make predictions.
Bagging and boosting are the two main methods for forming ensembles.
Random forest and boosting variants like AdaBoost, XGBoost,
LightGBM and CatBoost are elegant examples of ensembles built from
several decision trees.
 Stack generalization: Another ensemble method is model stacking
and blending where different algorithms train on the same training
dataset in parallel and their predictions form a new dataset. This
new dataset is then used to train another algorithm, forming a ‘meta-model’
that can make complex yet accurate predictions.
 Reinforcement learning: Reinforcement learning is rewarding a
model when it makes a good prediction during training and penalizing it
for making bad predictions so that over time, it only outputs what the
user or programmer desires and avoids unwanted predictions.
Reinforcement learning is used to train artificial intelligence applications
to promote safety and ethics. An example of a reinforcement learning
method is Q-learning.
- Splitting the data
The general procedure for preparing a model immediately after getting
the data is to split it into three datasets: training, validation, and testing
sets.
 Training dataset: A large chunk of the data, up to 70 to 80%, is
used for training. This is the set that we work with from preprocessing
up to training. It prepares a prototype model for evaluation.
 Validation dataset: This is a portion of the training set that is held
out for checking the accuracy of the model and tuning it during and
after training.
 Testing dataset: The test set is a small portion of the original data
that is kept raw and untouched. It is used to finally test every
metric of the prototype model.
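A hedged sketch of a three-way split using two calls to train_test_split (roughly 70% train, 15% validation, 15% test; the exact ratios are a judgment call):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Carve out the untouched test set first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Then split a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```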
- Challenges of modelling
Even though I stated that training is the least of the work, that does not
mean it is easy. However, if the preceding stages were implemented
correctly, a lot of problems can be avoided.
 Overfitting and underfitting: Together these reflect the bias-variance
tradeoff. The aim of training is to approximate the optimum
parameters that minimize the error. It is like a tailor who
makes a dress for you: the entire outfit should fit well if it
was sewn correctly. Overfitting is the situation where the
parameters fit the training data too closely, including its noise, and
fail to achieve generalization. As a result the model fails on the
test data and is said to exhibit high variance. In underfitting, on
the other hand, the model exhibits low variance and very high bias:
the training is not thorough, and too few parameters are learned
too quickly.
 Data leakage: This is the situation where some implicit or explicit
information from your testing data finds its way into the training
data. During training, the model will use that information to
cheat and acquire seemingly good results. The outcome is that the
model relies on those leaked values to make predictions
instead of actually generalizing. To avoid this, fit preprocessing
steps on the training data only (then apply them to the test data)
and remove features that would cause leakage.
 Distribution drift: This happens when the distribution of the
training, validation and test data do not align. As a result the
model makes errors due to drift. It is wise to perform power
analysis on all datasets after splitting to avoid this problem.
 Multicollinearity: This is the situation where many features are
highly correlated with one another. This can cause leakage and
bias-variance problems. Add less correlated features or use binning
to mitigate the problem.
 High cardinality: This is the situation where there are very many
distinct categorical values, leading to high entropy during training.
Use ensemble models, stacked models or regularization techniques
to overcome this problem.
- Optimization techniques
Optimization techniques aim to solve the problems previously discussed.
They intend to improve model performance. As a newbie, it is alright to
practice using one-off shallow training and testing. You could still get
good results but in the real world, datasets are large, diverse and messy,
so simplistic training would not work.
 Regularization: This is the automatic penalizing or constraining of
some parameters during training to prevent overfitting.
Regularization speeds up training because it helps to weed
out unnecessary parameters and focus on the ones that are more
significant to the model. Examples are pruning in decision trees,
L1 and L2 (and elastic net) regularization in regression and neural
networks, and early stopping, dropout and batch normalization in
deep learning. Deep learning models often suffer from exploding
and vanishing gradient problems and so they need a lot of
regularization.
 Cross-validation: This is the automatic segmentation of the
training data into different folds for training and testing. The
algorithm will iterate through the folds several times hence
ensuring that every single segment of the data is trained on and
validated properly. Examples are K-Fold, stratified K-Fold and
leave-P-out cross-validation (a combined sketch with hyperparameter
tuning follows this list).
 Hyperparameter tuning: It is difficult for programmers and
engineers to know the right hyperparameter values to set before
training. It gets worse if there are so many different
hyperparameters in the model they want to build. The automatic
solution to this is called hyperparameter optimization or
hyperparameter tuning. The programmer will list various
hyperparameter values and train the model on them using search
utilities like GridSearchCV and RandomizedSearchCV. Once training
is complete, the optimizer displays the set of hyperparameter values
that performed best. GridSearchCV is slower because it puts all the
hyperparameter combinations on a grid and runs them one by one.
RandomizedSearchCV is not as slow as GridSearchCV, but it may
not give better results because it randomly picks hyperparameter
combinations and trains on only a subset of them.
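A hedged scikit-learn sketch combining both ideas: cross_val_score runs plain K-fold validation, and GridSearchCV runs the same cross-validation over every combination in a small hyperparameter grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation: every segment is used for both training and validation
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy:", scores.mean())

# Grid search: try every listed hyperparameter combination, each with 5-fold CV
grid = {"n_estimators": [50, 100], "max_depth": [None, 5, 10]}
search = GridSearchCV(model, grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```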
- Transfer learning
This is the reuse of already deployed pre-trained models to train on new
data. Pre-trained models like GPT, BERT, VGG and R-CNN are provided
by both open-source and proprietary services, with access to the model’s
weights and hyperparameters, for programmers who do not have the
resources to train a new model from scratch.
MODEL EVALUATION
Before you train a model, it is wise to set a baseline or accuracy
threshold based on the findings obtained during EDA. If the trained
model falls widely short of this baseline or widely exceeds it, you can be
fairly sure that you have a bias-variance trade-off problem. Note that as far
as statistical principles are concerned, if the accuracy of your model is
absolutely 100%, then it is overfit (or leaking data).
After training, the next stage is evaluation of the model’s performance. If
you did not use cross-validation, you should have used a holdout
validation set to evaluate the model’s parameters. Recall that the
parameters are what are learned through approximation algorithms like
gradient descent and similarity measures. Given that the predicted
output may come from classification or regression, it is essential to know
the right metrics to use once the training process ends.
- R-squared
R-squared belongs to a family of statistical metrics used in evaluating
the cost function of regression analysis and machine learning. These
metrics are built from residuals, where a residual is the actual value
subtracted from the predicted value. To the degree that the residuals are
close to zero, we have confidence that the model learned well. The loss
function is typically the square or absolute value of a residual, and the
cost function is the aggregation of the loss function over all residuals.
Examples of cost functions for various regression tasks include MSE,
RMSE, MAE, MAPE, Huber loss and log-cosh loss. (Hinge loss, by
contrast, is a primary loss function for classification in SVMs.)
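A minimal sketch of these residual-based metrics in scikit-learn, on made-up actual and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])

mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse)
print("RMSE:", np.sqrt(mse))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R-squared:", r2_score(y_true, y_pred))
```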
- Entropy-based metrics
The cost functions used for classification tasks are mainly based on
calculating entropy or information gain. Examples are multi-class cross-
entropy loss, binary cross-entropy loss (log loss), KL divergence, and
mutual information.
- Similarity and distance measures
Clustering is so essential in machine learning and spatial analysis
because the primary visual used to map data points is the scatter plot.
Indeed, it is one of the first visuals you will use as a beginner. There are
clustering models for instance-based learning and prediction analysis.
Most dimensionality reduction processes lead to cluster formations.
Examples of similarity metrics used to measure the properties or
qualities of clusters include cosine similarity, the Dunn index, and
silhouette scores. Formulae for calculating distances between points or
clusters include the Euclidean, Manhattan, Minkowski and Mahalanobis
distances.
- Classification report
The classification report is a family of formulae used in evaluating the
accuracy of a binary classification model on the test set.
A classification report is built from four counts:
 True positives: The count of correctly predicted positive
values.
 True negatives: The count of correctly predicted negative
values.
 False positives: The count of negative values incorrectly predicted
as positive.
 False negatives: The count of positive values incorrectly predicted
as negative.
The most preferred visual used for displaying these values is the
confusion matrix. Ideally, you want far more true positives and
true negatives than their false counterparts, but it depends on the
project objectives. The basic formulae used for judging the rates of
correct values include precision, recall or sensitivity, and specificity.
More robust formulae that attempt to take all of these concerns into
account include the accuracy score, F1 score, ROC-AUC, and MCC
(the Matthews correlation coefficient).
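A hedged sketch that prints the confusion matrix, the classification report and ROC-AUC for a simple binary classifier on a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)        # fit on training data only to avoid leakage
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_test))

print(confusion_matrix(y_test, y_pred))       # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred))  # precision, recall, F1 and accuracy
y_prob = clf.predict_proba(scaler.transform(X_test))[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```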
- AIC and BIC
AIC stands for Akaike information criterion while BIC stands for Bayesian
information criterion. Both are based on maximum likelihood
estimation in probability. They are used to judge the complexity of
models by applying penalties to the parameters, with BIC applying
stronger penalties to more complex models. They are used in a lot of
applications. Among them:
 Linear regression: Lasso (L1) and Ridge (L2) regression are
regularization techniques; AIC and BIC can be used to select their
regularization strength during training. Elastic Net regression is a
combination of both Lasso and Ridge regression.
 Feature selection: Some wrapper methods use AIC or BIC to weed
out problematic features.
 Polynomial regression: They act as hyperparameter optimizers for
polynomial regression. They aim to find the best degree that would plot
the curve.
 Gaussian mixture models: They also act as hyperparameter
optimizers for GMMs. They aim to find the optimum number of clusters
that would generalize the distribution.
- Model calibration
This is the measure of how well the model’s predicted values (or
probabilities) match the actual values of the test data. Calibration is
necessary to detect distribution drift so that future wrong predictions
can be avoided. You can use it to check if your predictions fall around
the set baseline.
Methods used for calibration include Platt scaling, isotonic regression
and temperature scaling. Metrics used to measure calibration include
Brier score, ECE, NLL and calibration curve.

MODEL DEPLOYMENT
After building a model, you ought to deploy it. Inexperienced
developers get stuck in Jupyter environments and find it difficult to
deploy and integrate their models. Model deployment is the end-
to-end integration of models into computer applications or
infrastructure to give them analytical, artificial intelligence, or any
machine learning capabilities. Recall that the model is a program. But
that program simulates reason. Unlike traditional software where every
internal component can be understood since they were hard coded,
machine learning models are often perceived to be black boxes due to a
factor called model complexity. A model’s complexity comprises
properties like the size and dimensions of the feature vector, the
number of features learned, and the number of parameters and
hyperparameters involved. As a result machine learning models can
scale from being as small as a simple linear regression model that needs
just three features of a house to predict its price to incredibly huge
models like DALL-E 3 or GPT-4. It becomes overwhelming when you
try to imagine the scale of artificial general intelligence.
Today there are existing frameworks and services that have been
provided to deploy models. Due to the complexity of applications and
infrastructure today, it is difficult and cumbersome to work solo. You
have to team up with other engineers to get the job done. This is the
reason why MLOps engineering is considered the most senior role
within the machine learning community: it extends ML
engineering to include other senior infrastructure roles.
- Distributed data processing
Surely, we cannot start talking about model deployment without
discussing scalable data acquisition and processing. In this section we
will briefly discuss data ingestion and big data processing.
Data ingestion is the technical process of segmenting and processing
incoming data from a live source. There are two types of ingestion
processes: batch and stream processes.
 Batch processing: This is the segmentation and processing of
incoming data in batches or chunks at regular intervals. This is
done to save memory and account for latency. Tools like Apache
Spark and Airflow are used for scalable batch processing.
 Stream processing: This is the processing of incoming data as it
arrives, as a continuous stream. Live feeds are examples of streaming
processes. Stream processing can be achieved using tools like
Apache Kafka.
Alternative frameworks for distributed data processing include Ray and
Dask. But Dask fits better within the Python ecosystem. It is specifically
designed to bring scalable and parallel data processing where
foundational libraries like numpy and pandas fall short.
- Pipelines
A pipeline is a concept associated with ETL and CI/CD in end-to-end
terminology. It is a framework of mechanical stages where data or code
is ingested, stored and processed (or vice versa) before reaching its
subscriber or client. In ML engineering, models can be programmed as
pipelines or integrated into an MLOps workflow. Contextually, there are
two kinds of pipelines:
 Model pipelines: This is the structured chain of stages data has to
pass through, from loading through transformation and training,
before serialization. In this context, it is not just the trained data
that becomes the model; the entire chain is the model and can be
serialized. The scikit-learn library offers modules for building
unified model pipelines and transformers (a minimal sketch follows
this list).
 E2E pipelines: E2E stands for end-to-end in DevOps terminology.
In this case, it is a broader term describing the sequence of CI/CD
operations from ingestion to model training to serving.
Pipelines are built to automate deployment processes at scale.
Engineers design them such that they can be improved and
maintained.
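A hedged sketch of a unified model pipeline in scikit-learn, where the preprocessing step and the estimator travel as one object:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# The whole chain (scaling plus estimator) is the model, not just the classifier
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))   # raw inputs pass through the same transformations automatically
```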
- Model serialization
Serialization is the conversion of a model object into a binary file
format. Serialization makes it easier to integrate models and move them
across environments. Common file formats for serializing models include
ONNX, pickle, joblib, HDF5, and TorchScript.
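A minimal sketch of serializing and reloading a fitted model (or an entire pipeline) with joblib; pickle works the same way for simple objects:

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, "model.joblib")        # write the fitted model to a binary file

restored = joblib.load("model.joblib")    # reload it in another process or environment
print(restored.predict(X[:3]))
```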
- Model registry and experiment tracking
In the real world, many machine learning projects are just experimental,
and many do not make it to production. You can build more models for
the same project, and you would need to track the versions, pipelines,
metrics and performance of your model experiments. You cannot do this
effectively by hand with tools like spreadsheets and git; model registry
platforms are built for the job. Such a platform should be able to do the
following:
 Experimentation: The platform should be able to track instances
of model runs, including the metrics, hyperparameters, and
artifacts.
 Dataset and model versioning: The platform may support
versioning of models and datasets. Git-based platforms like DVC
support dataset versioning.
 Pipeline integration: The platform should support the automation
and deployment of either model pipelines or E2E pipelines or
both.
 Model registry: The platform should have a repository for models,
including their metadata and track their lifecycle.
 Reproducibility: The platform should ensure that the model can
be retrained and used with the same conditions.
 Collaboration: The registry should allow other team members to
access the models and to share and update them. Ensure that there
is an added layer of security over the registry.
 Deployment: The platform must provide options for serialization
and deployment.
There are many registry and experiment logging platforms. Examples
include MLflow, DVC, ZenML, ClearML, Comet, Neptune, Seldon and
Weights & Biases (W&B). All of these frameworks have their benefits and
limitations. Research well before picking one for your project.
- Model serving and monitoring
Model serving is the hosting of models on designated server platforms
for subscribers to use. Monitoring and maintenance systems are
also implemented to provide security and automation. A model serving
platform should offer these key features:
 Inference: The platform must be able to collect user requests and
quickly return model outputs using provided accelerators and
APIs.
 Auto-scaling: The platform must be able to adjust the number of
server instances based on demand. Not all platforms have native
support for this.
 Multi-framework support: The platform should support one or
more frameworks.
 Model versioning: The platform should be able to track model
versions.
 Caching support: The platform must provide a caching service. It is
unwise to deploy models without cache because it would lead to
inefficiencies for the user.
 Monitoring and logging: The platform must be able to monitor
and log the model's activities, and ensure consistency of its
performance.
 Deployment environment: The platform must be based on a
deployment environment. The most preferred platforms are
Docker, Kubernetes and any cloud service such as AWS, Azure and
GCP.
Examples of the many model serving providers include BentoML, Seldon
Core, KServe, Nvidia Triton Inference Server, Amazon SageMaker,
TorchServe, and TensorFlow Serving. Each of these platforms has its
advantages and limitations. Research well before you use one for your
project.
- Deployment on web applications
Below are a few open-source frameworks for deploying and hosting
projects on the web. You could do a solo portfolio project using these
frameworks.
 Streamlit: A low-code framework designed for AI and analytical
applications. There is no need to reinvent the front-end. The app can
be deployed on the Streamlit cloud service. Streamlit comes with its
own caching components, so you do not need to bring in an
external caching system.
 Gradio: Another low-code framework designed for building machine
learning demo applications, commonly deployed on Hugging Face.
 Dash: A full-stack web framework built on React and Flask and
provided by Plotly Inc. Dash is specifically designed for building
front-end analytical and geo-analytical applications using visuals
from the Plotly library. Dash supports multiple caching methods
including memoization. There is the proprietary version called
Dash Enterprise that comes with advanced theming features. The
open source version has been extended to interface with
bootstrap, mantine and other community frameworks. Dash does
not have its own native place for hosting apps. But Heroku and
Render are some of the recommended places.
 Summary and key takeaways _______________________________
1. Machine Learning is hard.
2. Machine learning algorithms are represented as formulae, not
pseudocode.
3. Pick the appropriate tools, and not just any tool for your project.
4. There are two types of data: categorical and numerical.
5. There are three basic data formats: structured, semi-structured
and unstructured data.
6. To get deep into machine learning, you ought to have domain
knowledge, software development and data analysis skills. You
must also embrace mathematics.
7. It is wise to comply with ALCOA standards when sourcing data.
8. Preprocessing is data wrangling or data cleansing.
9. A good EDA will guide your decisions and give you a good
picture of what your model should look like.
10. EDA is in three levels: univariate, bivariate, and multivariate
analysis.
11. A scalar is a single value that is used to modify the items in an
array.
12. A 1D array is a vector, a 2D array is a matrix. Beyond 2D is called a
tensor.
13. Feature engineering transforms the dataset into feature vectors
before training.
14. There are two types of learning: supervised and unsupervised
learning.
15. There are two approaches to estimation or approximation:
model-based and instance-based learning.
16. There are two types of predictions: classification and regression.
17. A model is a program where the parameters are learned on the
data using function approximation.
18. Highly complex models are black boxes.
19. Hyperparameters are set before training begins.
Hyperparameters influence the model’s parameters.
20. A model stops training after it has found the optimum parameters
that generalize the data.
21. Training can be improved using regularization, cross-validation
and hyperparameter tuning techniques.
22. Cost functions, classification report, and calibration are basic
metrics for evaluating models after training.
23. MLOps is the most senior role, comprising ecosystems, the machine
learning life cycle and end-to-end development.
24. Models themselves can be serializable pipelines, and can be
integrated into CI/CD or ETL pipelines or workflows.
