#UpSkillWithKalpesh | Day 16
Data Science Unlocked: From Zero to Data Hero
Kalpesh Pathade (@DataSimplified)
Machine Learning Introduction - Part 1
I. What Is Machine Learning?
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables
computers to learn from data and make decisions without being explicitly
programmed. Instead of following predefined rules, ML models identify patterns
from past data and generalize them to new inputs.
There are three main types of machine learning:
1. Supervised Learning – The model learns from labeled data, where each input
has a corresponding output. Example: Spam detection in emails.
2. Unsupervised Learning – The model finds hidden patterns in unlabeled data.
Example: Customer segmentation in marketing.
3. Reinforcement Learning – The model learns by interacting with the
environment and receiving rewards for correct actions. Example: Self-driving
cars.
Machine Learning Example: Email Spam Filter
Step 1: The Problem
You receive hundreds of emails every day. Some are important emails from
friends or work, and some are spam—like unwanted ads or promotions. You donʼt
want to manually go through every email to figure out which ones are spam.
Step 2: Teaching the System
Hereʼs where machine learning comes in: imagine you have a special spam filter
that can automatically sort emails. But how does it know which emails are spam
and which ones are not? You need to teach it.
At first, the filter doesnʼt know anything about spam. So, you start by giving it
examples of emails that are spam (e.g., "Buy now!" or "Congratulations, you
won a prize!") and emails that are not spam (e.g., "Meeting at 10 AM" or
"Family get-together this weekend").
You mark a few emails as spam and others as not spam. The filter looks at
patterns—like the words in the email, the subject, who sent it, etc.
Step 3: Learning the Patterns
After looking at many examples, the spam filter starts to learn the difference
between spam and non-spam emails.
For example, it might learn that emails with words like "free," "discount," or
"offer" are often spam. It may also notice that spam emails often come from
unknown senders.
Step 4: Making Predictions
Once the system has learned enough from the examples, it can start
predicting on its own. Now, when you get a new email, the spam filter looks at
it, checks for the patterns it learned (like certain words or the senderʼs email),
and decides if itʼs spam or not spam.
Step 5: Continuous Improvement
The filter gets better over time. As you keep marking new emails as spam or
not spam, it gets more examples to learn from and becomes smarter at
identifying what is and isnʼt spam.
In Summary: Machine learning is like teaching a computer system how to make
decisions based on patterns it learns from examples. In the case of the spam filter,
you showed it examples of spam and non-spam emails, and it learned how to
classify new emails on its own.
Just like how you get better at recognizing spam emails the more you see them,
the machine learning system improves as it gets more examples!
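To make this concrete, here is a minimal sketch of the same workflow in Python using scikit-learn. The four toy emails and their labels are invented for illustration; a real filter would train on thousands of examples.

```python
# A minimal sketch of the spam-filter workflow described above,
# using scikit-learn. The tiny toy dataset is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 2: labeled examples (1 = spam, 0 = not spam)
emails = [
    "Buy now! Limited offer",
    "Congratulations, you won a prize!",
    "Meeting at 10 AM tomorrow",
    "Family get-together this weekend",
]
labels = [1, 1, 0, 0]

# Step 3: turn the words into counts and learn the patterns
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)

# Step 4: predict on a new, unseen email
new_email = ["Claim your free discount offer now"]
print(model.predict(vectorizer.transform(new_email)))  # [1] -> spam
```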
II. Why Use Machine Learning?
ML is used because traditional programming methods struggle with complex tasks
that require adaptability and pattern recognition. Some key reasons to use ML
include:
1. Automation of Complex Tasks – ML enables automation in areas where rule-
based programming fails, such as fraud detection.
2. Improved Decision Making – Data-driven insights help businesses make
accurate predictions and decisions.
3. Handling Large-Scale Data – ML models can analyze vast amounts of data
quickly and efficiently.
4. Self-Improvement Over Time – ML models improve as they learn from more
data, increasing accuracy.
5. Customization and Personalization – AI-powered recommendations (like
Netflix and YouTube) enhance user experience.
III. Examples of Applications
1. Healthcare – Disease diagnosis, personalized medicine, and patient
monitoring using ML algorithms.
2. Finance – Fraud detection, stock market predictions, and automated trading.
3. E-commerce – Product recommendations, customer segmentation, and
chatbots.
4. Social Media – Content recommendations, fake news detection, and sentiment
analysis.
5. Self-Driving Cars – Autonomous vehicles use ML for navigation and obstacle
detection.
6. Cybersecurity – Intrusion detection, malware classification, and threat
prediction.
7. Manufacturing – Predictive maintenance and quality control in production
lines.
IV. Types of Machine Learning Systems
Machine Learning systems can be categorized based on how they learn and make
predictions. The three primary ways to classify ML systems are:
1. Based on Training Supervision (Supervised, Unsupervised, Semi-supervised,
Reinforcement Learning)
2. Based on Learning Method (Batch Learning vs. Online Learning)
3. Based on Generalization Approach (Instance-Based vs. Model-Based
Learning)
1. Training Supervision
This classification is based on how a machine learning model learns from data.
a) Supervised Learning
The model is trained on labeled data, meaning each input has a corresponding
correct output.
The goal is to learn a mapping function from inputs to outputs.
Examples:
Regression (Predicting house prices based on features like size, location)
Classification (Email spam detection – classifying emails as spam or not
spam)
Real-Life Example: Classifying Fruits 🍎🍌🍇
Scenario:
Imagine you are teaching a child to recognize different fruits.
How It Works:
You show the child an apple and say, "This is an apple."
You show a banana and say, "This is a banana."
You repeat this process with many labeled fruits (grapes, oranges, etc.).
Over time, the child learns the patterns—apples are red and round, bananas are
yellow and long.
Machine Learning Version:
A computer is given labeled data:
Input: Pictures of apples, bananas, and grapes.
Output: The correct fruit label.
Once trained, the model can predict new fruits based on what it has learned.
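Here is a hedged sketch of that idea in Python with scikit-learn. The numeric features (a color code and a length in centimeters) are assumptions made up for this illustration; a real system would extract features from images.

```python
# A sketch of supervised learning on the fruit example.
# The features [color_code, length_cm] are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Input: simple features for each fruit; Output: the correct label
X = [[0, 7], [1, 18], [2, 2], [0, 8], [1, 20], [2, 3]]
y = ["apple", "banana", "grape", "apple", "banana", "grape"]

model = DecisionTreeClassifier().fit(X, y)

# Once trained, the model predicts the label of a new fruit
print(model.predict([[1, 19]]))  # ['banana']
```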
b) Unsupervised Learning
The model is trained on unlabeled data, meaning there are no predefined
outputs.
It identifies patterns, structures, or groupings in the data.
Examples:
Clustering (Customer segmentation in marketing)
Dimensionality Reduction (Reducing the number of variables in large
datasets)
Real-Life Example: Grouping Fruits Without Labels 🍎🍌🍇
Scenario:
Now, imagine you give the same child a basket full of different fruits but donʼt tell
them the names.
How It Works:
The child groups similar-looking fruits together (red round ones, long yellow
ones, small purple ones).
Even though they donʼt know the names, they naturally cluster them based on
patterns like color, shape, and size.
This technique is used in:
Customer Segmentation (Grouping customers based on shopping habits).
Anomaly Detection (Detecting fraud in credit card transactions).
Machine Learning Version:
A computer is given unlabeled data:
Input: Pictures of apples, bananas, and grapes—without labels.
What happens? The model groups similar-looking fruits together but doesnʼt
assign names.
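A minimal sketch of the same idea with scikit-learn's KMeans. Note that only inputs are given: the made-up features stand in for color and size, and the model returns group numbers, never names.

```python
# A minimal sketch of unsupervised clustering with KMeans.
# The features [color_code, size_cm] are invented for illustration;
# there are no labels, only inputs.
from sklearn.cluster import KMeans

X = [[0, 7], [1, 18], [2, 2], [0, 8], [1, 20], [2, 3]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 1 2 0 1 2]: groups, but no names
```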
c) Semi-Supervised Learning
A mix of supervised and unsupervised learning, where only a small portion of
the dataset is labeled, and the rest is unlabeled.
Example: Identifying fraudulent transactions in banking with a small labeled
dataset.
Real-Life Example: Learning with Few Labeled Examples 📸
Scenario:
Imagine you have a huge photo album, but only a few pictures are labeled (like "Birthday Party," "Vacation," etc.), while the rest have no labels.
How It Works:
You show a few labeled pictures to an AI.
The AI learns from those examples and then tries to guess the labels of the
remaining photos based on similarities.
Machine Learning Version:
Used in medical diagnosis, where doctors label a few X-rays, and the AI helps
classify the rest.
Used in speech recognition, where only some audio clips are labeled, and the
AI learns to recognize words.
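A small sketch of this setup using scikit-learn's SelfTrainingClassifier, where unlabeled examples are marked with -1. The one-feature toy data is invented for illustration.

```python
# A hedged sketch of semi-supervised learning: the classifier learns
# from two labeled points and propagates labels to the rest.
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

X = [[0.1], [0.2], [0.9], [1.0], [0.15], [0.95]]
y = [0, -1, 1, -1, -1, -1]  # -1 means "unlabeled"; only two points are labeled

model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict([[0.12], [0.97]]))  # guesses labels for new points
```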
d) Reinforcement Learning
The model learns by interacting with an environment and receiving rewards or
penalties for its actions.
Used in game-playing AI, robotics, and self-driving cars.
Example: Training an AI to play chess using rewards for winning moves.
Real-Life Example: Training a Dog 🐶🎾
Scenario:
Imagine you are teaching your dog a new trick, like fetching a ball.
How It Works:
If the dog fetches the ball correctly, you give it a treat (reward).
If the dog runs in the wrong direction, you give no treat (penalty).
Over time, the dog learns that fetching the ball gets a reward, so it keeps
improving.
Machine Learning Version:
A computer (or robot) learns through trial and error:
Example: A self-driving car gets rewarded for staying in the lane and
penalized for drifting off-road.
Example: A chess-playing AI gets points for winning and loses points when it
makes bad moves.
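For a feel of the reward loop, here is a toy tabular Q-learning sketch in Python: a tiny one-dimensional world where the agent earns a reward only for reaching the rightmost cell. This is a minimal illustration of the idea, not a production reinforcement learning setup.

```python
# Toy Q-learning: trial and error with rewards, as described above.
import random

n_states, actions = 5, [-1, +1]  # the agent can step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # explore sometimes, otherwise pick the best-known action
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda act: Q[(s, act)]))
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0  # reward only at the goal
        # standard Q-learning update
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

print(max(actions, key=lambda act: Q[(0, act)]))  # learned first move: +1
```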
e) Self-Supervised Learning
The model generates its own labels from raw data without human intervention.
Used in natural language processing (NLP), computer vision, and
representation learning.
Example: Predicting the next word in a sentence—like how chatbots and
autocomplete features learn by filling in missing words from large amounts of
text.
Real-Life Example: Filling in Missing Words in a Sentence (Like a Puzzle) 🧩
Imagine you're reading a book, and you come across a sentence with a missing
word:
👉 "The cat sat on the ___."
Even without being told, you can guess the missing word is "mat" based on your
experience with language.
How This Relates to Self-Supervised Learning:
Instead of labeling data manually, the model removes parts of the data itself
and tries to predict the missing information.
Over time, it learns patterns and improves its predictions.
Machine Learning Version:
Used in NLP (Natural Language Processing): AI models like ChatGPT and Googleʼs BERT learn by predicting missing words in huge amounts of text.
Used in Computer Vision: AI can predict missing parts of an image, like how Photoshopʼs "Content-Aware Fill" works.
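As a toy illustration of the self-labeling idea (far simpler than BERT or ChatGPT), the sketch below builds its own (previous word, next word) training pairs from raw text and uses the counts to guess a missing word. The sentence data is invented for illustration.

```python
# Self-supervised in miniature: the "labels" (next words) are generated
# from the raw text itself, with no human annotation.
from collections import Counter

text = "the cat sat on the mat the dog sat on the rug".split()

# Self-generated training pairs: (previous word -> next word)
pairs = Counter(zip(text, text[1:]))

def predict_next(word):
    candidates = {nxt: c for (prev, nxt), c in pairs.items() if prev == word}
    return max(candidates, key=candidates.get)

print(predict_next("the"))  # guesses a likely word after "the"
```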
2. Batch Learning vs. Online Learning
This classification is based on how frequently the model learns from new data.
a) Batch Learning (Offline Learning)
The model is trained on the entire dataset at once and does not update
frequently.
It requires retraining from scratch when new data is available.
Example: Training a recommendation system using historical user data every
few months.
Pros:
✅ More stable and accurate since it learns from large datasets.
✅ Suitable for models that donʼt need frequent updates.
Cons:
❌ Expensive to retrain on new data.
❌ Cannot adapt to real-time changes.
b) Online Learning
The model updates continuously as new data becomes available.
Suitable for applications that require real-time learning.
Example: Stock price prediction models updating with new market data every
second.
Pros:
✅ Can adapt to real-time changes.
✅ Works well with streaming data and dynamic environments.
Cons:
❌ Risk of learning incorrect patterns due to noisy data.
❌ Requires careful tuning to prevent forgetting old data (catastrophic forgetting).
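As a concrete illustration, scikit-learn's SGDClassifier supports this style of learning through partial_fit, which updates the model one small batch at a time. The simulated data stream below is invented for illustration.

```python
# A minimal sketch of online learning with SGDClassifier.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared for partial_fit

# Simulate data arriving in small batches over time
for _ in range(10):
    X_batch = np.random.randn(20, 3)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)  # incremental update

print(model.predict(np.array([[1.0, 0.0, 0.0]])))  # likely [1]
```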
3. Instance-Based vs. Model-Based Learning
This classification is based on how the model generalizes from training data.
a) Instance-Based Learning (Memory-Based Learning)
The model memorizes examples and uses similarity-based techniques to
make predictions.
It does not create an explicit mathematical model.
Example:
K-Nearest Neighbors (KNN): Predicts an output by looking at the most
similar past examples.
Pros:
✅ Simple and interpretable.
✅ Works well with small datasets.
Cons:
❌ Slow for large datasets because it searches the entire training set during
predictions.
❌ Requires large storage since it memorizes all examples.
b) Model-Based Learning
The model creates a mathematical representation (model) from training data
and uses it to make predictions on new data.
Example:
Linear Regression: Finds the best-fitting line for predicting a numerical
value.
Pros:
✅ Faster predictions since the model generalizes from past data.
✅ More efficient with large datasets.
Cons:
❌ Requires careful selection of the right model and hyperparameters.
❌ May not perform well if the model assumptions do not match the data.
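The toy sketch below puts the two approaches side by side: KNeighborsRegressor keeps the training points and averages its nearest stored neighbors, while LinearRegression compresses the data into a line. The data follows y = 2x purely for illustration.

```python
# Instance-based vs. model-based on the same toy data.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)  # memorizes the examples
lin = LinearRegression().fit(X, y)                  # keeps only a fitted line

print(knn.predict([[5]]))  # ~[7.]: average of the nearest stored neighbors
print(lin.predict([[5]]))  # ~[10.]: extrapolates from the learned line
```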
Key Points
Understanding these types of ML systems helps in selecting the right approach
for a given problem.
If labeled data is available → Use Supervised Learning
If patterns need to be found in unlabeled data → Use Unsupervised Learning
If real-time updates are needed → Use Online Learning
If a quick memory-based approach is required → Use Instance-Based Learning
If a mathematical model is preferable → Use Model-Based Learning
V. Main Challenges of Machine Learning
1. Insufficient Quantity of Training Data
Definition:
Insufficient training data refers to a situation where the amount of data
available for training the machine learning model is too small to accurately
capture the underlying patterns and relationships in the data.
Impact:
Model Performance: Insufficient data leads to poor generalization. The model
may not learn the underlying data distribution, leading to inaccurate
predictions.
Overfitting Risk: With too few samples, the model might memorize the training
data (overfitting) rather than learning to generalize.
Solutions:
Data Augmentation: Using techniques such as random transformations
(flipping, cropping, rotating) in image data or adding noise to audio data.
Synthetic Data Generation: Generating new data samples using methods like
GANs (Generative Adversarial Networks).
Transfer Learning: Leveraging pre-trained models on large datasets and fine-
tuning them on smaller datasets.
Active Learning: An iterative approach where the model requests labels for
the most uncertain instances.
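As one concrete example, the data-augmentation idea above can be sketched with torchvision transforms (assuming torchvision is installed; the input file name is hypothetical).

```python
# A minimal image-augmentation sketch: random flips, rotations, and
# color jitter create varied copies of each image, enlarging a small dataset.
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2),
])

image = Image.open("cat.jpg")  # hypothetical input image
augmented = [augment(image) for _ in range(5)]  # five new training samples
```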
2. Nonrepresentative Training Data
Definition:
Nonrepresentative training data refers to a situation where the data used to
train a model does not accurately reflect the real-world problem it is trying to
solve. This can happen due to biases, sampling errors, or data being skewed.
Impact:
Bias in Predictions: Models trained on nonrepresentative data will likely
exhibit biased or skewed predictions that do not reflect the target population.
Lack of Generalization: The model will not generalize well to unseen data
from the real world, leading to poor performance when deployed.
Solutions:
Balanced Sampling: Ensuring the dataset includes a wide range of examples
across different classes or subpopulations.
Stratified Sampling: Using methods that ensure representative distribution
across various groups.
Data Collection from Diverse Sources: Collecting data from different
demographics, environments, or conditions to ensure comprehensive
coverage.
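As a small illustration of stratified sampling, scikit-learn's train_test_split accepts a stratify argument that preserves class proportions in both splits; the imbalanced toy labels below are invented for illustration.

```python
# Stratified splitting keeps the 80/20 class ratio in both halves.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 80% class 0, 20% class 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
print(y_train, y_test)  # each split keeps roughly the 80/20 ratio
```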
3. Poor-Quality Data
Definition:
Poor-quality data refers to data that is noisy, incomplete, incorrect, or
otherwise unreliable. This can include missing values, duplicate records, or
incorrectly labeled data.
Impact:
Inaccurate Predictions: Poor-quality data directly affects the accuracy of the
machine learning model. The model learns from faulty data, which results in
poor performance.
Training Instability: Poor data quality can cause the model to become
unstable during training, leading to erratic learning curves or convergence to
suboptimal solutions.
Solutions:
Data Cleaning: Handling missing values, removing duplicates, correcting
erroneous labels, and ensuring consistency across data entries.
Imputation: Filling in missing data through techniques like mean imputation or
using machine learning models (e.g., k-NN imputation).
Outlier Detection: Identifying and handling outliers that may distort the
learning process.
Noise Filtering: Using algorithms or pre-processing methods to remove
irrelevant noise.
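A minimal data-cleaning sketch with pandas, covering duplicate removal and simple imputation; the tiny table is invented for illustration.

```python
# Cleaning a small DataFrame: drop duplicates, then impute missing values.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, None, 40],
    "income": [50000, 50000, 62000, None],
})

df = df.drop_duplicates()                              # remove duplicate records
df["age"] = df["age"].fillna(df["age"].mean())         # mean imputation
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```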
4. Irrelevant Features
Definition:
Irrelevant features refer to input variables in the dataset that do not contribute
to predicting the target variable. Including such features can degrade model
performance.
Impact:
Increased Complexity: Irrelevant features add noise and unnecessary
complexity, making the model harder to train and interpret.
Overfitting Risk: Including irrelevant features increases the risk of overfitting,
as the model may begin to "memorize" noise rather than focusing on relevant
patterns.
Solutions:
Feature Selection: Using statistical tests, algorithms like Recursive Feature
Elimination (RFE), or models like Random Forests to identify and remove
irrelevant features.
Principal Component Analysis (PCA): A dimensionality reduction technique
that helps reduce the number of features by transforming them into orthogonal
components.
Domain Knowledge: Incorporating expert knowledge to identify which
features are most likely to be relevant.
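As one concrete route, here is a hedged sketch of Recursive Feature Elimination with scikit-learn on a synthetic dataset where only a few features are informative.

```python
# RFE repeatedly drops the weakest feature until only the best remain.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)  # True marks the features RFE decided to keep
```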
5. Overfitting the Training Data
Definition:
Overfitting occurs when the model learns the details and noise in the training
data to the point where it negatively impacts the model's performance on new
data. The model becomes too complex and too specific to the training data,
losing its ability to generalize.
Impact:
Poor Generalization: The model performs exceptionally well on training data
but fails to generalize to new, unseen data.
High Variance: Overfitting leads to high variance in the model, meaning small
changes in the input data can lead to large changes in predictions.
Solutions:
Cross-Validation: Using k-fold cross-validation to ensure the modelʼs
performance is stable across different subsets of the data.
Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization
penalize overly complex models and prevent them from fitting noise in the
data.
Pruning: In decision trees and neural networks, pruning refers to removing
nodes or layers that add complexity without improving performance.
Early Stopping: Monitoring the modelʼs performance on a validation set during
training and stopping when performance starts to deteriorate.
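Two of these tools, k-fold cross-validation and L2 (Ridge) regularization, can be sketched together in a few lines with scikit-learn; the synthetic regression data is generated purely for illustration.

```python
# Cross-validated Ridge regression: the L2 penalty discourages overly
# complex fits, and 5-fold CV checks stability across data subsets.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

model = Ridge(alpha=1.0)  # alpha controls the strength of the L2 penalty
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())  # stable scores across folds suggest no wild overfitting
```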
6. Underfitting the Training Data
Definition:
Underfitting occurs when the model is too simple to capture the underlying
patterns in the data. It may have high bias and fails to learn from the training
data sufficiently, leading to poor performance both on training and test data.
Impact:
Poor Training Performance: The model performs poorly even on the training
data, meaning it hasnʼt learned enough from the data.
High Bias: Underfitting is a result of high bias, where the model makes strong
assumptions that are too restrictive.
Solutions:
Model Complexity: Increasing the complexity of the model (e.g., adding more
layers in a neural network or increasing the depth of a decision tree).
Feature Engineering: Adding or transforming features to provide more
information to the model.
Reducing Regularization: If regularization is too strong, it may prevent the
model from learning adequately. Reducing regularization can help the model
fit the data better.
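As a closing illustration, the sketch below shows underfitting being fixed by increasing model complexity: a plain line underfits quadratic data, while the same linear model with added polynomial features captures the curve.

```python
# Fixing underfitting with richer features (synthetic quadratic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2  # quadratic pattern

simple = LinearRegression().fit(X, y)                 # a line: underfits
richer = make_pipeline(PolynomialFeatures(degree=2),
                       LinearRegression()).fit(X, y)  # fits the curve

print(simple.score(X, y), richer.score(X, y))  # low R^2 vs. ~1.0
```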