Steps ML
Foreword ________________________________________________
The steps discussed in this article are the same for any machine learning
project, but the suitable tools vary depending on the data and the project
objectives. If you are working on customer churn, you cannot use the tools
designed for image classification. If you are doing financial time series
forecasting, you cannot use logistic regression to predict future prices.
Tools and techniques are grouped into different categories, just as there
might be different screwdrivers and spanners in one toolbox. Some tools
perform better than others for the same job, but engineers are advised to
choose tools based on the data and what works for the project, not what
they fancy. A knife can be used to cut a small wire, but the more
appropriate tool in the toolbox is a pair of pliers. Likewise, there are
various algorithms for feature selection, clustering, and cost functions
for regression, but you cannot just pick anything you like to build and
deploy your model.
The keyword here is toolbox. It is often claimed that over 80% of data
science projects fail because inexperienced, self-styled "data scientists"
never receive comprehensive education and training in how to go about the
craft. To put it bluntly, machine learning is very hard. Machine learning
as a career is also relatively new, so experienced ML engineers who can
provide mentorship will remain scarce until traditional academia catches
up. To truly get into machine learning today, beyond certifications, you
have to combine domain knowledge, software engineering, data analysis, and
applied mathematics.
What is data? ___________________________________________
Statistics is the science of understanding data. Data consists of units of
information organized into a certain format or structure. In statistics and
software there are two main types of data:
- Categorical data
This is data that consists of labels. There are two branches of categorical
data:
Nominal data: This comprises discrete names such as nouns,
addresses, and classes.
Ordinal data: This comprises labels with rankings or ordering.
Examples are star ratings, experience and difficulty levels, and
traffic colour codes.
- Numerical data
This is data that consists of numbers. There are two branches of
numerical data:
Discrete values: These comprise discrete numbers that are part of
an entity but do not necessarily indicate measurement. Examples
are the numbers on a die, HTTP status codes, and a number
generated as a hash.
Continuous values: These comprise numbers that indicate
measurement, range or extent. Examples are length, pressure,
temperature, population, sum of money, and exam scores.
There are three main types of digital data formats: structured, semi-
structured and unstructured data.
Structured data: This is tabular data. It constitutes a set of values
organized into rows and columns. The rows are normally indicated
by indexing and the columns are marked by labels called features
or attributes. Every row is a record of an entity’s features’ values
at the time of acquisition. Examples are pandas dataframes, SQL
tables and Excel spreadsheets.
Semi-structured data: This is organized into a tree or nested
structure with records separated by delimiters or tags. Examples are
JSON, XML and HTML data; delimiter-separated formats like CSV are
sometimes grouped here as well.
Unstructured data: This is data without a predefined schema, often
arranged as sequences or streams where the next value depends on
previous values. Examples are binary data, encrypted data and media
data such as audio, text, images and videos.
What do I need to be good at to get into machine learning? ______
Machine learning combines four key subjects. To succeed, you must know
many concepts well beyond the average level. I don't mean knowledge up to
the MSc or PhD level; it should just be above that of the average
graduate.
The four subjects are:
Domain knowledge
Software engineering
Data analysis
Applied mathematics
- Domain Knowledge
Amateur data scientists take datasets on any topic and think they can
do portfolio projects on that topic. In reality, you ought to talk to
experts about the data. If it is a project about forecasting disease
spread, for example, you ought to consult medical professionals and
environmental health experts. You also need to read publications and case
studies to get a better idea of how your project is going to impact the
community. Research skills are vital for progress in machine learning.
- Software Engineering
You should know how to leverage data structures, algorithms, coding
patterns and architectural designs. Python programming is your primary
coding language here. You must know the necessary Python libraries and
frameworks that are relevant for your project. Of course if you want to
integrate and deploy models into applications and software services, you
must be a software developer.
ML engineers also team up with other engineers from other IT domains
such as cloud and DevOps engineers who can aid in model deployment.
This is why some firms want ML engineers who have some experience with
tools like Docker, Apache services, and cloud platforms like Azure and
AWS in their tech stack.
- Data Analysis
Contrary to popular notion, the bulk of machine learning is not
modelling; it is data analysis. In fact it is hard to become a full-fledged
engineer without gaining some data analysis certifications or work
experience. The good news is that data analysis is generally considered an
entry-level role, and every growing company wants at least one data
analyst in its workforce. The primary task of a data analyst in the
workplace is to make reports. Reports consist of tables, dashboards,
visuals and KPIs that are to be presented or published. Companies store
their data in relational databases, so it is important for a data analyst
to know how to write SQL statements to query the needed data. There is no
way you can make progress in machine learning without knowing how to write
queries. There is also no way you can advance without knowing how to read
and prepare the content of a report. A regular data analyst writes reports
in a Word document, an Excel workbook or a PowerPoint slide, but as you
step into the world of machine learning, interactive Python notebooks will
become your favourite platform for reporting and documentation.
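To make that concrete, here is a minimal sketch of querying a relational
database from Python and loading the result into a pandas dataframe for a
report. The database file, table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical SQLite database with a "sales" table (region, amount, sale_date).
conn = sqlite3.connect("company.db")

# A typical reporting query: monthly revenue per region.
query = """
SELECT region,
       strftime('%Y-%m', sale_date) AS month,
       SUM(amount) AS revenue
FROM sales
GROUP BY region, month
ORDER BY month;
"""

report = pd.read_sql(query, conn)  # load the query result into a dataframe
conn.close()

print(report.head())  # the dataframe can now feed tables, charts or KPIs
```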
- Applied Mathematics
You don't need to be a great mathematician to enter machine learning,
just as you don't need to graduate with a music degree to become a pop
star. You just have to master "mathematical logic", because most
algorithms in the machine learning world are presented as mathematical
formulae rather than pseudocode. The reason is the need to compute values
that represent the items in a data structure such as a vector or tensor.
By mathematical logic, I mean understanding the common mathematical
notations used in the aggregation and manipulation of figures. Without
this skill, you would not know how to do summarization, feature
engineering and evaluation.
The three main branches of mathematics that you have to learn include:
- Statistics
Aggregation: Sums, products, averages, and variances are used in
aggregating data. They form the basic building blocks for most
algorithms in machine learning.
Probability: Probability laws, Shannon entropy, likelihood and
odds ratios are essential to understand how predictions work.
Estimation statistics: Interval metrics, combinatorics, power
analysis and Bayesian formulations are essential to understand how the
accuracy of predictions works.
Visualizations: Statistical reports are displayed as visualizations.
Python libraries like matplotlib, seaborn, and plotly are used in making
plots like line charts, bar charts, pie charts, histograms and cluster maps.
- Linear algebra
Variables and Scalars: Features and values can be written as
variables, which can be used to form equations or formulae that take
dimensions and figures into account. Single values that are used to
modify vectors are called scalars.
Sets and Vectors: In machine learning, a linear collection of values
is used to represent one feature. This is indicated mathematically as a
list or set. But in matrix terminology, this linear collection is called a
vector.
Euclidean geometry: Euclidean geometry is necessary to
understand the effect of distances between data points.
Matrices and Tensors: If a vector is a 1D array, then a 2D array is
called a matrix. Our dataframes and tables are actually the matrices our
algorithms will manipulate. An array with more than two dimensions is
called a tensor.
- Calculus
Functions: Calculus extends the structures of linear algebra with
the study of rates of change. Your models are actually created using
function approximators. A model takes an input and returns a
deterministic or stochastic output. Composite functions are the basis of
the layers of the neural networks used in deep learning.
Differentiation: Principles like the chain rule and partial
derivatives are the basis of gradient descent, the predominant algorithm
that enables machines to learn patterns during the training process (a
small numerical sketch follows this list).
Integration: Integration is the basis of some calculations such as
continuous probability distributions.
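As a taste of how differentiation turns into learning, below is a small
numerical sketch of gradient descent fitting a straight line y = w*x + b by
minimizing the mean squared error. The data is synthetic and the learning
rate and iteration count are arbitrary choices.

```python
import numpy as np

# Synthetic data roughly following y = 3x + 2 with some noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)

w, b = 0.0, 0.0  # initial parameters
lr = 0.01        # learning rate (step size)

for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    # Partial derivatives of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step against the gradient to reduce the loss.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach 3 and 2
```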
One of the textbook ways newbies go through these steps, typically in one
or two Python files, is as follows (a minimal sketch follows the list):
1. Project definition: To follow a tutorial or to learn machine learning
and add something to one's portfolio.
2. Data acquisition: Download a dataset, usually a CSV file from Kaggle.
Then load the file locally using pandas in an IDE or a Jupyter notebook.
3. Data preprocessing: Delete seemingly irrelevant columns and fill empty
spaces with default values. Do some data visualization to observe the
distribution of data points.
4. Feature engineering: Apply one-hot encoding to the categorical columns
and min-max scaling to the numerical columns.
5. Training: Pick a column to serve as the target, split the data into
training and test sets, then pick a suitable algorithm and train the
model.
6. Evaluation: Use accuracy metrics or residual metrics to see how well
the model performs on the test set.
7. Deployment: If the results are satisfactory, save the model object to
a serialized file, then build a Streamlit or Gradio application and
integrate the model into the code. The user inputs data and the app
outputs the model's predictions. Then host the app on a cloud service
like Streamlit Cloud, Render or Heroku.
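A minimal sketch of this textbook workflow, assuming a hypothetical
churn.csv with a customer_id column, a categorical plan column, a
numerical monthly_usage column and a binary churned target:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 2. Data acquisition: load a downloaded CSV (hypothetical file and columns).
df = pd.read_csv("churn.csv")

# 3. Data preprocessing: drop a seemingly irrelevant column, fill blanks.
df = df.drop(columns=["customer_id"])
df = df.fillna({"plan": "basic", "monthly_usage": df["monthly_usage"].median()})

# 4. Feature engineering: one-hot encode categoricals, min-max scale numericals.
df = pd.get_dummies(df, columns=["plan"])
X = df.drop(columns=["churned"])
y = df["churned"]

# 5. Training: split, scale, fit a model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = MinMaxScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# 6. Evaluation: accuracy on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(scaler.transform(X_test))))

# 7. Deployment: serialize the model (and scaler) for a Streamlit/Gradio app.
joblib.dump({"scaler": scaler, "model": model}, "churn_model.joblib")
```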
This simplistic workflow is worth trying out if you are a newbie, and it
works fine under ideal conditions. However, it is not good enough to win a
top competition or get a job at a top tech firm today. It also lacks
features such as pipelines, scalability and reusable mechanisms, which are
needed for sustainable model development and for the resulting app. The
steps to building a sustainable end-to-end workflow are what we will
address.
PROJECT DEFINITION
Your project definition should answer the following questions so as to be
clear about what to expect and how to provide solutions.
1. What is the overall objective?
2. Are we doing business, research or a hobby?
3. What tools are we going to use for the development?
4. What would be the source(s) of our data?
5. Do we have the resources to acquire and store enough data?
6. Should we build a new model or should we use a pre-trained model?
7. Where should we deploy the model?
8. Do we have the resources to update and maintain the model?
DATA ACQUISITION
Data is the raw material used in building ML models. Acquiring them is
an expensive and technical process.
Sources of Data
Drive storage: Storage media like local or cloud storage are the
most direct ways for storing and accessing data.
Online repositories: You can get datasets from public repositories
like Kaggle or DatasetList. Note that the originality of the datasets from
these places is not verified against ALCOA standards, so they are not
recommended for production projects.
Surveys: Data about people can be acquired via online
questionnaires, feedback forms or surveys.
Databases: Companies and institutions store data in databases.
Having access to databases can provide huge amounts of structured
data. It is recommended to use an ORM like SQLAlchemy to interface with a
database indirectly for big data operations (a minimal sketch follows at
the end of this section).
Cookies: If you have a website, you can use cookies to gain
information about the clients who visit it.
Sensors or monitors: These devices are used to collect data from the
physical world, such as clicks, passing vehicles or animals.
APIs: APIs are one of the best sources of data. Many informative
websites and services offer access to their APIs for a fee.
Web scraping: You can get data from web pages using tools like
Selenium and other web crawlers. Web scraping is considered a
controversial technique. Web servers are equipped with blockers to
restrict web crawling activity.
Data lakes and data warehouses: With the help of a data engineer,
you can get large quantities of data from data warehouses like
Snowflake, AWS Redshift, and BigQuery. Data lakes like GCS and AWS S3
store lots of unstructured data.
Data vendors: You can subscribe to an agent or vendor who will
source the fresh data you seek.
Synthesis and augmentation: Synthetic data is data that is
artificially generated. Augmentation is the expansion of an original
dataset by mixing it with synthetic data or using other forms of feature
engineering to enlarge it. While not original, synthetic data is used to
make a dataset fit a particular distribution better so that it is more
useful to train on.
While there are many ways of storing and acquiring data, keep in mind
concerns like storage, cost, accuracy, legality and security of the data
before making a decision.
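Here is a minimal sketch of pulling data from a relational database
through SQLAlchemy, reading it in chunks so that large tables do not
exhaust memory. The connection string, table and columns are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical PostgreSQL connection string and "transactions" table.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")

query = "SELECT customer_id, amount, created_at FROM transactions"

# Read in chunks of 50,000 rows instead of loading the whole table at once.
chunks = pd.read_sql(query, engine, chunksize=50_000)

total = 0.0
for chunk in chunks:
    total += chunk["amount"].sum()  # aggregate chunk by chunk

print("total transaction volume:", total)
```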
DATA PREPROCESSING
Once the desired dataset is obtained, the first step is to load the data
into memory and preprocess it. We preprocess data because fresh data from
the real world is not perfect; if the data is not treated first, it will
corrupt the model. Preprocessing is also called data cleansing or data
wrangling. It is the removal of erroneous values, inconsistencies and
features that are either irrelevant or detrimental to the model. The goal
is to have a clean dataset that can be properly feature engineered before
training. Data preprocessing consists of these activities:
- Removal of blank values
Also called null handling. Empty values will break the computations
during the feature engineering and training stages, so you have to fill
the empty spaces. You cannot just fill the blanks manually with arbitrary
values; you have to apply imputation techniques or use an appropriate
default value. You can also delete rows or records that have nulls, but
you risk altering the distribution. If the entire dataset or feature is
overwhelmed with blanks, it will not be useful and has to be discarded.
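A minimal sketch of basic null handling with pandas, assuming a
hypothetical dataframe with a numerical age column and a categorical city
column:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31, None],
    "city": ["Accra", "Kumasi", None, "Accra", "Tamale"],
})

print(df.isna().sum())  # count blanks per feature

df["age"] = df["age"].fillna(df["age"].median())      # impute with the median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # impute with the mode

# Alternatively, drop rows that are mostly empty (keep rows with >= 2 non-nulls).
df = df.dropna(thresh=2)
```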
- Removal of erroneous values
Apart from nulls, there are occasions where some features may contain
misspellings, displacements among records or values with the wrong type.
It is essential to summarize the data, observe these inconsistencies and
correct them, otherwise they will confuse the model.
- Exploratory data analysis
Without good EDA, data preprocessing will go wrong. EDA is the
examination of the features to understand the distribution of, and
relationships between, the categorical and numerical data points using
visualizations or other descriptive and inferential techniques. To prevent
confusion, exploratory data analysis should be performed at three levels
(a minimal sketch follows this list):
Univariate analysis: Each feature is analysed independently.
Bivariate analysis: Pairwise analysis of features.
Multivariate analysis: Analysis of multiple features or groups of
features.
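A minimal sketch of the three levels with pandas and seaborn, assuming a
hypothetical houses.csv with numerical price and area columns and a
categorical region column:

```python
import pandas as pd
import seaborn as sns

df = pd.read_csv("houses.csv")  # hypothetical dataset

# Univariate: each feature on its own.
print(df["price"].describe())
print(df["region"].value_counts())
sns.histplot(df["price"])

# Bivariate: pairwise relationships.
sns.boxplot(data=df, x="region", y="price")
print(df[["price", "area"]].corr())

# Multivariate: groups of features together.
print(df.groupby("region")[["price", "area"]].mean())
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
```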
- Handling outliers
Outliers are data points that lie so far from the centre of the
distribution that they skew its mean and variance, leading to inaccurate
judgement of the results. Outliers can be deleted, shrunk using feature
scaling, or have their influence reduced during training using
regularization algorithms.
- Handling imbalanced data
Sometimes there are obvious imbalances in the distribution of categorical
values. If the machine learning model is to produce fair and less biased
outputs, imbalances ought to be handled. The Python library imblearn
provides solutions for handling imbalanced data (a minimal sketch follows
this list). Before you take on this venture, you should study statistical
sampling techniques to learn how to sample data correctly while taking the
actual population distribution into account, otherwise you will introduce
errors. There are two ways of treating imbalanced data:
Oversampling: That is, increasing the quantity of the minority
class.
Undersampling: That is, reducing the quantity of the majority
class.
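A minimal sketch with imblearn on a synthetic imbalanced dataset:

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Oversampling: replicate minority-class records until the classes balance.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("after oversampling:", Counter(y_over))

# Undersampling: drop majority-class records until the classes balance.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```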
- Feature selection techniques
Feature selection is the removal of irrelevant or detrimental features
while keeping the ones that have a significant impact on training the
model. Feature selection is first performed manually, based on the
parameters requested by the project manager or client, or on the
cardinality of the feature values. A feature whose values are all unique,
such as one used as a primary key, only adds confusion and high entropy to
the model, so you have to remove it. Likewise, a feature that has only one
value throughout is redundant and should be removed too.
However, there are cases where you have to select features based on the
statistics of their values, and that should not be done manually but
algorithmically. There are three methods of performing algorithmic feature
selection (a minimal sketch follows the list):
Filter methods: These apply inferential statistical methods to the
features and select the ones that show the strongest effect. Examples
are the correlation matrix, F-test, chi-square test, mutual
information, and variance threshold.
Wrapper methods: These algorithms train and test subsets of
features on an existing machine learning algorithm and score the
features based on their performance. Examples are recursive feature
elimination, sequential feature selection, and exhaustive feature
selection.
Embedded methods: These methods are built into training
algorithms and exist as part of the underlying training process.
Examples are L1 and L2 regularization in regression and feature
importance metrics in decision trees.
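A minimal sketch of the three approaches with scikit-learn, on a synthetic
classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=42)

# Filter method: keep the 4 features with the highest F-statistic.
X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a logistic regression.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("RFE kept features:", rfe.support_)

# Embedded method: feature importances learned inside a random forest.
forest = RandomForestClassifier(random_state=42).fit(X, y)
print("importances:", forest.feature_importances_.round(3))
```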
FEATURE ENGINEERING
Feature engineering is the transformation of the data into arrays of
numerical values called feature vectors. Feature engineering must come
before training because machine learning algorithms have to see the data
as arrays of numbers that are passed into special functions. This
basically means that strings and other non-numerical values must be
transformed into numerical representations.
This is also why preprocessing comes before feature engineering. If the
data is not cleaned (it has redundant values, blanks, outliers and other
defects), the same problems will appear in the feature vectors and lead to
new problems during and after training.
Note that there can be a subjective overlap between preprocessing and
feature engineering, and both terms are sometimes used interchangeably
because they both involve modifying a dataset before training. If you get
these two stages wrong, your model will fail, so understand the data you
are working on well enough to have an idea of what your model should look
like at the end, and how it ought to behave.
There are different feature engineering techniques and algorithms
depending on the data and the project. Let’s look at the predominant
techniques:
- Imputation
This is the use of special algorithms called imputers to automatically fill
empty spaces with synthetic values that are intended to fit the
distribution.
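A minimal sketch using scikit-learn imputers on a small numerical array
with missing values:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Fill blanks with the column median.
print(SimpleImputer(strategy="median").fit_transform(X))

# Fill blanks from the values of the nearest neighbouring rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```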
- Feature extraction
Feature extraction should not be confused with feature selection. The
latter involves selecting and discarding features, while the former
involves creating or deriving new features.
Feature extraction can be done manually by applying column-wise
calculations like the ones done in Excel. Binning or discretization is
another technique: it is the division of a range of continuous values into
intervals that are mapped to labels called bins.
Algorithmically, feature extraction is any method that derives new
features computationally from existing ones.
- Categorical encoding
This is the translation of categorical values into discrete numerical
representations. Examples of algorithms used in categorical encoding
are one-hot encoder, label encoder, and ordinal encoder.
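A minimal sketch of one-hot and ordinal encoding with scikit-learn, on
hypothetical colour and size columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colours = np.array([["red"], ["green"], ["blue"], ["green"]])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# One-hot encoding: one binary column per category (nominal data).
onehot = OneHotEncoder(handle_unknown="ignore")
print(onehot.fit_transform(colours).toarray())

# Ordinal encoding: ranked integers that respect the order (ordinal data).
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(ordinal.fit_transform(sizes))
```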
- Text vectorization
This is the translation of textual data or documents (called a corpus)
into vectors based on the frequency and order of the words and other
symbols. There are underlying processes that occur during vectorization,
such as the extraction of tokens, lemmas and stop words, but they are
beyond the scope of this note. There are basic text vectorizers such as
the count (bag-of-words) vectorizer, TF-IDF vectorizer, hashing vectorizer
and dictionary vectorizer. When it comes to the more advanced
representations used in modern language models, there are word embeddings
such as Word2Vec and GloVe.
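A minimal sketch of count and TF-IDF vectorization with scikit-learn on a
tiny hypothetical corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the model learns from data",
    "data must be cleaned before the model learns",
    "feature vectors feed the model",
]

# Count vectorizer: raw word frequencies per document.
counts = CountVectorizer().fit_transform(corpus)
print(counts.toarray())

# TF-IDF vectorizer: frequencies weighted down for words common to all documents.
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))
```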
- Feature scaling
Feature scaling is sometimes called normalization or standardization
because it resembles the scenario where values of different units are
rescaled to fit one unit. It is the transformation of the original
numerical values of a feature so that they fall within a common range or
follow a more regular distribution. In other words, the range of the
values is streamlined to form a smoother distribution. This is necessary
so that the model is not skewed by the large variation amongst raw
numbers. Common feature scaling algorithms include the min-max scaler,
z-score (standard) scaler, robust scaler, IQR scaler and log
transformation.
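A minimal sketch comparing three common scalers (plus a log transform) on
a small skewed column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# A skewed feature with one large outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # squeeze values into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # z-scores: mean 0, unit variance
print(RobustScaler().fit_transform(X).ravel())    # centre on median, scale by IQR
print(np.log1p(X).ravel())                        # log transform to tame the outlier
```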
- Audio, image and signal processing
Algorithms such as Fourier transforms are used in processing signals and
waveform data. Rolling calculation windows are used to smooth indicators
of financial time series. The content of image data, such as pixel
values, textures, and edges, is translated into feature vectors using
libraries such as OpenCV and torchvision before training with
convolutional neural networks.
- Dimensionality reduction
Dimensionality reduction is the compression or coalescence of several
features to form fewer features, usually, two or three. This is done such
that a lot of information about the original features is retained. This
makes it easier for the machine learning algorithm to train on multiple
features with fewer parameters. Most dimensionality reduction
algorithms are unsupervised (except LDA). There are two types of
dimensionality reduction techniques:
Linear techniques: These assume that the relationships between
features are linear. Examples are PCA, LDA, and MDS.
Non-linear techniques: These do not assume linear relationships
among the features. They are useful for datasets where features
have very high dimensions. Examples are t-SNE, UMAP, LLE,
autoencoders and Isomap.
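A minimal sketch of linear dimensionality reduction with PCA, compressing
a 10-feature synthetic dataset down to two components:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=10, random_state=42)

# Scale first, then project onto the two directions of greatest variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (300, 2)
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```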
There are also feature engineering techniques that are embedded into
some training algorithms such as the convolutional layers of a CNN and
the attention heads of a transformer (in LLMs).
MODEL DEPLOYMENT
After building a model, you ought to deploy it. Inexperienced developers
get stuck in Jupyter environments and find it difficult to deploy and
integrate their models. Model deployment is the end-to-end integration of
models into computer applications or infrastructure to give them
analytical, artificial intelligence, or other machine learning
capabilities. Recall that the model is a program, but a program that
simulates reasoning. Unlike traditional software, where every internal
component can be understood because it was hard-coded, machine learning
models are often perceived as black boxes due to a factor called model
complexity. A model's complexity comprises properties like the size and
dimensions of the feature vector, the number of features learned, and the
number of parameters and hyperparameters involved. As a result, machine
learning models range from something as small as a simple linear
regression model that needs just three features of a house to predict its
price, to incredibly huge models like DALL-E 3 or GPT-4. It becomes
overwhelming when it comes to imagining the scale of artificial general
intelligence.
Today there are existing frameworks and services for deploying models.
Due to the complexity of applications and infrastructure today, it is
difficult and cumbersome to work solo; you have to team up with other
engineers to get the job done. This is why MLOps engineering is considered
the most senior role within the machine learning community: it extends ML
engineering to include other senior infrastructure roles.
- Distributed data processing
Surely, we cannot start talking about model deployment without
discussing scalable data acquisition and processing. In this section we
will briefly discuss data ingestion and big data processing.
Data ingestion is the technical process of segmenting and processing
incoming data from a live source. There are two types of ingestion
processes: batch processing and stream processing.
Batch processing: This is the segmentation and processing of
incoming data in batches or chunks at regular intervals. This is
done to save memory and account for latency. Tools like Apache
Spark and Airflow are used for scalable batch processing.
Stream processing: This is the processing of incoming data as it
arrives, in the form of streams. Live feeds are examples of streaming
processes. Stream processing can be achieved using tools like
Apache Kafka.
Alternative frameworks for distributed data processing include Ray and
Dask, but Dask fits better within the Python ecosystem. It is specifically
designed to bring scalable and parallel data processing where
foundational libraries like numpy and pandas fall short.
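A minimal sketch of out-of-core batch processing with Dask, assuming a
hypothetical folder of CSV files too large to fit in memory:

```python
import dask.dataframe as dd

# Lazily read every CSV in the folder as one logical dataframe (hypothetical path).
df = dd.read_csv("data/transactions_*.csv")

# Build the computation graph: average amount per region.
summary = df.groupby("region")["amount"].mean()

# Nothing runs until compute() is called; Dask then processes partitions in parallel.
print(summary.compute())
```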
- Pipelines
A pipeline is a concept associated with ETL and CI/CD in end-to-end
terminology. It is a framework of mechanical stages where data or code is
ingested, stored and processed before reaching its subscriber or client.
In ML engineering, models can be programmed as pipelines or integrated
into an MLOps workflow. Contextually, there are two kinds of pipelines:
Model pipelines: This is the structured chain of stages data has to
pass through, from loading through transformation and training to
serialization. In this context, it is not just the trained estimator
that becomes the model; the entire chain is the model and can be
serialized. The scikit-learn library offers modules for building
unified model pipelines and transformers (a minimal sketch follows
this list).
E2E pipelines: E2E stands for end-to-end in DevOps terminology.
In this case, it is a broader term describing the sequence of CI/CD
operations beginning from ingestion through model training to serving.
Pipelines are built to automate deployment processes at scale.
Engineers design them so that they can be improved and
maintained.
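A minimal sketch of a scikit-learn model pipeline that bundles
preprocessing, feature engineering and the estimator into one serializable
object. The CSV file and column names are hypothetical:

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("churn.csv")  # hypothetical dataset
X, y = df.drop(columns=["churned"]), df["churned"]

numeric = ["monthly_usage", "tenure"]
categorical = ["plan", "region"]

# Column-wise preprocessing: impute + scale numericals, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# The whole chain, preprocessing plus estimator, is the model.
model = Pipeline([("preprocess", preprocess),
                  ("classifier", LogisticRegression(max_iter=1000))])

model.fit(X, y)
joblib.dump(model, "churn_pipeline.joblib")  # serialize the entire pipeline
```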
- Model serialization
Serialization is the conversion of information into a storable binary
file format. Serialization makes it easier to integrate models and move
them across environments. Common file formats for serializing models
include ONNX, pickle, joblib, HDF5, and TorchScript.
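A minimal sketch of serializing and reloading a trained model with joblib
(pickle works the same way via pickle.dump and pickle.load):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")      # write the model to a binary file

restored = joblib.load("model.joblib")  # reload it in another process/environment
print(restored.predict(X[:5]))
```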
- Model registry and experiment tracking
In the real world, many machine learning projects are just experimental,
and many do not make it to production. You may build several models for
the same project, and you will need to track the versions, pipelines,
metrics and performance of your model experiments. Doing this manually
with spreadsheets and ad-hoc git commits quickly becomes impractical;
model registry platforms are built for the job. Such a platform should be
able to do the following:
Experimentation: The platform should be able to track instances
of model runs, including the metrics, hyperparameters, and
artifacts.
Dataset and model versioning: The platform may support
versioning of models and datasets. Git-based platforms like DVC
support dataset versioning.
Pipeline integration: The platform should support the automation
and deployment of either model pipelines or E2E pipelines or
both.
Model registry: The platform should have a repository for models,
including their metadata and track their lifecycle.
Reproducibility: The platform should ensure that the model can
be retrained and used with the same conditions.
Collaboration: The registry should allow other team members to
access, share and update the models. Ensure that there is an added
layer of security over the registry.
Deployment: The platform must provide options for serialization
and deployment.
There are many registry and experiment logging platforms. Examples
include MLflow, DVC, ZenML, ClearML, Comet, Neptune, Seldon and
Weights & Biases. Note that all of these frameworks have their benefits
and limitations; research well before picking one for your project.
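A minimal sketch of experiment tracking with MLflow, logging the
hyperparameters, a metric and the model artifact of a single run; the
experiment name and local tracking store are assumptions:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-experiments")  # hypothetical experiment name

with mlflow.start_run():
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_param("n_estimators", n_estimators)  # hyperparameter
    mlflow.log_metric("accuracy", acc)              # metric
    mlflow.sklearn.log_model(model, "model")        # model artifact
```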
- Model serving and monitoring
Model serving is the hosting of models on designated server platforms
for subscribers to use. Monitoring and maintenance systems are also
implemented to provide security and automation. A model serving
platform should offer these key features:
Inference: The platform must be able to collect user requests and
quickly return model outputs using provided accelerators and
APIs.
Auto-scaling: The platform must be able to adjust the number of
server instances based on demand. Not all platforms have native
support for this.
Multi-framework support: The platform should support one or
more frameworks.
Model versioning: The platform should be able to track model
versions.
Caching support: The platform must provide a caching service. It is
unwise to deploy models without a cache because it leads to
inefficiencies for the user.
Monitoring and logging: The platform must be able to monitor
and log the model's activities, and ensure consistency of its
performance.
Deployment environment: The platform must be based on a
deployment environment. The most preferred platforms are
Docker, Kubernetes and any cloud service such as AWS, Azure and
GCP.
Examples of the many model serving providers include BentoML, Seldon
Core, KServe, NVIDIA Triton Inference Server, Amazon SageMaker,
TorchServe, and TensorFlow Serving. Each of these platforms has its
advantages and limitations; research well before you use them for your
project.
- Deployment on web applications
Below are a few open-source frameworks for deploying and hosting
projects on the web. You could do a solo portfolio project using these
frameworks.
Streamlit: A low-code framework designed for AI and analytical
applications; there is no need to reinvent the front end. The app can
be deployed on Streamlit's cloud service. Streamlit comes with its
own caching components, so you do not need to bring in an external
caching system (a minimal sketch follows this list).
Gradio: Another low-code framework designed for building machine
learning demos, including computer vision applications, commonly
deployed on Hugging Face Spaces.
Dash: A full-stack web framework built on React and Flask and
provided by Plotly Inc. Dash is specifically designed for building
front-end analytical and geo-analytical applications using visuals
from the Plotly library. Dash supports multiple caching methods,
including memoization. There is a proprietary version called Dash
Enterprise that comes with advanced theming features. The open-source
version has been extended to interface with Bootstrap, Mantine and
other community frameworks. Dash does not have a native hosting
service, but Heroku and Render are some of the recommended places.
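A minimal sketch of a Streamlit app that serves a serialized model; the
model path and input features are hypothetical:

```python
import joblib
import pandas as pd
import streamlit as st

st.title("House price estimator")

@st.cache_resource  # Streamlit's own caching: load the model only once
def load_model():
    return joblib.load("house_price_pipeline.joblib")  # hypothetical serialized pipeline

model = load_model()

area = st.number_input("Area (square metres)", min_value=10.0, value=120.0)
bedrooms = st.number_input("Bedrooms", min_value=1, value=3)
age = st.number_input("Age of house (years)", min_value=0, value=10)

if st.button("Predict price"):
    features = pd.DataFrame([{"area": area, "bedrooms": bedrooms, "age": age}])
    st.write(f"Estimated price: {model.predict(features)[0]:,.0f}")
```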
Summary and key takeaways _______________________________
1. Machine Learning is hard.
2. Machine learning algorithms are represented as formulae, not
pseudocode.
3. Pick the appropriate tools, and not just any tool for your project.
4. There are two types of data: categorical and numerical.
5. There are three basic data formats: structured, semi-structured
and unstructured data.
6. To get deep into machine learning, you ought to have domain
knowledge, software development and data analysis skills. You
must also embrace mathematics.
7. It is wise to comply with ALCOA standards when sourcing data.
8. Preprocessing is data wrangling or data cleansing.
9. A good EDA will guide your decisions and give you a good
picture of what your model should look like.
10. EDA is in three levels: univariate, bivariate, and multivariate
analysis.
11. A scalar is a single value that is used to modify the items in an
array.
12. A 1D array is a vector, a 2D array is a matrix. Beyond 2D is called a
tensor.
13. Feature engineering transforms the dataset into feature vectors
before training.
14. There are two types of learning: supervised and unsupervised
learning.
15. There are two approaches to estimation or approximation:
model-based and instance-based learning.
16. There are two types of predictions: classification and regression.
17. A model is a program where the parameters are learned on the
data using function approximation.
18. Highly complex models are black boxes.
19. Hyperparameters are set before training begins.
Hyperparameters influence the model’s parameters.
20. A model stops training after it has found the optimum parameters
that generalize to the data.
21. Training can be improved using regularization, cross-validation
and hyperparameter tuning techniques.
22. Cost functions, classification report, and calibration are basic
metrics for evaluating models after training.
23. MLOps is the most senior role, comprising ecosystems, the machine
learning life cycle and end-to-end development.
24. Models themselves can be serializable pipelines, and can be
integrated into CI/CD or ETL pipelines or workflows.