ML Module 1
INTRODUCTION TO MACHINE LEARNING
3. Advanced Algorithms: The development of complex algorithms, especially in deep
learning, enables more powerful machine learning applications.
The Knowledge Pyramid
1. Data: Basic facts and raw numbers. Organizations store vast amounts of data from
sources like databases and warehouses.
2. Information: Processed data revealing patterns or associations. For instance,
analyzing sales data to determine the best-selling product.
3. Knowledge: Condensed information, such as historical patterns and future trends.
Extracting knowledge from data is crucial for decision-making.
4. Intelligence: Applied knowledge. It represents actionable insights, such as strategies
derived from knowledge.
5. Wisdom: The highest level, where intelligence evolves into maturity and sound judgment.
Machine learning aims to move up this pyramid: it extracts knowledge from data and applies it to automatically predict unknown outcomes.
Learning System
• Human Learning (Fig. 1.2a): Humans make decisions based on experience.
• Machine Learning (Fig. 1.2b): Machines create models from data patterns and
use these models for prediction, akin to human experience.
Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine Learning
• The quality of data directly impacts the quality of experience and, ultimately, the
quality of learning systems.
Statistical Learning
• In statistical learning, the relationship between input x and output y is modeled as:
y = f(x)
o f is the learning function mapping inputs to outputs.
• In machine learning, this is referred to as the mapping of input to output.
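As a concrete, purely illustrative sketch of learning such a mapping, the snippet below fits a straight line to a handful of made-up (x, y) pairs with NumPy; the fitted coefficients play the role of the learned function f.

```python
import numpy as np

# A few observed (x, y) pairs; here y was generated as roughly 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# Learn an approximation of f as a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(x, y, deg=1)

# Use the learned mapping to predict the output for an unseen input.
print(f"f(x) ~ {slope:.2f}*x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)
```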
Model in Machine Learning
• A model is a summary of raw data, structured into a representation for decision-
making.
• Models can be in different forms, such as:
1. Mathematical equations: e.g., linear regression formula
2. Relational diagrams: like decision trees or graphs
3. Logical rules: like if/else rules (e.g., rule-based spam filters)
4. Clusters: for grouping data (e.g., k-means clustering)
Patterns vs. Models
• Pattern: Local; applicable to specific parts of the data.
• Model: Global; fits the entire dataset.
o Example: A model trained to detect spam can be used to predict whether an email is
spam or not.
Tom Mitchell’s Definition of Machine Learning
Tom Mitchell’s famous definition:
“A computer program is said to learn from experience E, with respect to task
T and some performance measure P, if its performance on T, measured by P,
improves with experience E.”
• Experience (E): Data used to learn (e.g., thousands of images to train an object
detection model).
• Task (T): The job the machine does (e.g., detecting objects in images).
• Performance measure (P): How well the machine performs the task (e.g., precision,
recall).
Example:
• Task (T): Detecting an object in images.
• Machine Experience: Gained through data processing and model building:
1. Data Collection: Gathering data from the environment (e.g., images, text, or
numbers).
2. Abstraction: Extracting key features from the data (e.g., identifying basic features of
an elephant: trunk, ears).
3. Generalization: Turning abstraction into an actionable form, like forming rules
(heuristics) from past experiences.
▪ Example: A self-driving car generalizes rules about stopping at red lights.
4. Heuristics: Actionable “rules of thumb” that guide decisions based on prior
experience.
▪ Example: A heuristic rule: If you see a red light, stop.
▪ Heuristics can sometimes fail but are typically effective.
Figure 1.3: Relationship of AI with Machine Learning
AI has gone through periods known as AI winters, where enthusiasm and funding declined. AI's resurgence came with the rise of data-driven systems—models that learn by finding patterns in data.
Big data is part of data science and refers to massive volumes of data generated by
companies like Facebook, Twitter, and YouTube. It deals with three key
characteristics:
1. Volume: The sheer amount of data being generated.
2. Variety: Data comes in many forms—text, images, videos, etc.
3. Velocity: The speed at which data is generated and processed.
Big data is essential for machine learning because many algorithms rely on large
datasets for training. For example, deep learning (a subfield of ML) uses big data
for tasks like image recognition and language translation.
Data Mining:
Data mining originally came from business applications. It’s like "mining" for
valuable information hidden in large datasets. While data mining and machine learning share many techniques, data mining is aimed at uncovering hidden patterns in data for business analytics.
Figure 1.4: Relationship of Machine Learning with Other Major Fields
1.3.3 Machine Learning and Statistics
1. Statistics:
• Definition: A branch of mathematics focused on analyzing and interpreting data to
uncover patterns and relationships.
• Key Features:
o Hypothesis-driven: Starts with a hypothesis and tests it through experiments.
2. Machine Learning:
• Key Features:
o Flexibility: Works well with large, complex datasets; adaptable to different scenarios.
o Goal: Makes predictions based on learned patterns, often without needing detailed
statistical knowledge.
1.2 TYPES OF MACHINE LEARNING
What does the word ‘learn’ mean? Learning, like adaptation, occurs as the result of
interaction of the program with its environment. There are four types of machine learning
as shown in Figure 1.5.
Figure 1.5: Types of Machine Learning
Labelled and Unlabelled Data: Data is a raw fact. Normally, data is represented in the form
of a table. Data also can be referred to as a data point, sample, or an example. Each row of the table
represents a data point. Features are attributes or characteristics of an object. Normally, the
columns of the table are attributes. Out of all attributes, one attribute is important and is called a
label. Label is the feature that we aim to predict. Thus, there are two types of data – labelled and
unlabelled.
Labelled Data To illustrate labelled data, let us take one example dataset called the Iris flower dataset or Fisher's Iris dataset. The dataset has 150 samples of Iris flowers (50 from each class), with four attributes: the length and width of sepals and petals. The target variable is called class. There are three classes – Iris setosa, Iris virginica, and Iris versicolor.
The partial data of Iris dataset is shown in Table 1.1.
Table 1.1: Iris Flower Dataset
S.No.  Length of Petal  Width of Petal  Length of Sepal  Width of Sepal  Class
A dataset need not be always numbers. It can be images or video frames. Deep neural
networks can handle images with labels. In the following Figure 1.6, the deep neural
network takes images of dogs and cats with labels for classification. In unlabelled data, there
are no labels in the dataset.
Figure 1.6: (a) Labelled Dataset (b) Unlabelled Dataset
Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher component in supervised learning. A supervisor provides labelled data so that a model can be constructed and then evaluated on test data.
In supervised learning algorithms, learning takes place in two stages. In layman terms, during the first stage, the teacher communicates the information to the student that the student is supposed to master. The student receives the information and understands it. During this stage, the teacher has no knowledge of whether the information is grasped by the student.
This leads to the second stage of learning. The teacher then asks the student a set of questions to find out how much information has been grasped by the student. Based on these questions, the student is tested, and the teacher informs the student about his assessment. This kind of learning is typically called supervised learning.
Supervised learning has two methods:
1. Classification
2. Regression
Classification
1. Training Stage:
o The algorithm is given a dataset that includes both the features (input) and their
correct labels (output).
o The algorithm learns from this data and creates a model.
2. Testing Stage:
o The model is tested on new, unseen data (input), and it predicts the label (output).
o For example, if you input an image of a dog or cat that the model hasn’t seen before, the
model will assign the correct label based on what it has learned.
Example:
In the Iris dataset, if you input data like (6.3, 2.9, 5.6, 1.8, ?), the model will predict the
missing label. This process of assigning a label to new data is called classification.
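A minimal sketch of these two stages using scikit-learn's bundled copy of the Iris dataset; note that scikit-learn's feature order (sepal length, sepal width, petal length, petal width) is assumed for the query below, which may differ from the column order of Table 1.1.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Training stage: learn a model from labelled examples.
iris = load_iris()
model = DecisionTreeClassifier().fit(iris.data, iris.target)

# Testing stage: predict the missing label for a new, unseen sample.
sample = [[6.3, 2.9, 5.6, 1.8]]          # (sepal len, sepal wid, petal len, petal wid)
predicted = model.predict(sample)[0]
print("Predicted class:", iris.target_names[predicted])
```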
Applications of Classification:
Image Recognition: Classifying images of animals, plants, or even medical conditions like
cancer.
Types of Classification Models:
Classification models can be grouped into two categories:
1. Generative Models: Focus on how the data is generated and its distribution (e.g., Naïve Bayes).
2. Discriminative Models: Focus only on distinguishing between different classes (e.g., Support Vector Machines).
Key Classification Algorithms:
• Decision Tree
• Random Forest
• Support Vector Machines (SVM)
• Naïve Bayes
• Artificial Neural Networks (ANN) and Deep Learning (e.g., Convolutional Neural
Networks - CNN)
Regression Models
Regression is another type of supervised learning, similar to classification, but
instead of predicting categories (labels), it predicts continuous values, like
U
numbers.
Key Difference:
• Classification predicts a discrete category or label, whereas regression predicts a continuous numeric value (for example, a house price rather than a class).
Similarities Between Regression and Classification:
• Both are supervised learning methods, meaning they require a labeled training
dataset.
• Both involve a training stage (where the model learns from data) and a testing stage
(where the model is used to make predictions on new data).
Main Difference:
• The output of classification is a class label, while the output of regression is a real-valued number.
One of the most common regression algorithms is linear regression, which fits a
straight line to the data.
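A small illustrative sketch of linear regression with scikit-learn; the experience-versus-salary numbers below are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Labelled training data: years of experience -> salary (illustrative numbers).
X = np.array([[1], [2], [3], [4], [5]])             # input feature
y = np.array([30000, 35000, 41000, 45000, 50000])   # continuous target

model = LinearRegression().fit(X, y)

# Predict a continuous value for an unseen input.
print("Predicted salary for 6 years:", model.predict([[6]])[0])
```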
Unsupervised Learning
Unsupervised learning is a type of learning where there is no supervisor or teacher guiding the process. Instead, the algorithm learns by itself using a trial and error approach.
How Unsupervised Learning Works:
• In this method, the algorithm is given data without any labels.
• The algorithm looks at the data and tries to find patterns or groupings on its own.
• The goal is to group similar objects together based on their characteristics.
Example of Unsupervised Learning:
Clustering
• Clustering is a common unsupervised learning technique.
• It groups objects into different clusters, where each cluster contains objects that are
similar to each other.
• The objects in one cluster are different from those in other clusters.
For example, if you have a set of images of dogs and cats, a clustering algorithm
will automatically group them into two clusters: one for dogs and one for cats,
without needing any labels to tell it which is which.
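A short sketch of this idea with k-means from scikit-learn; the 2-D points below are made-up stand-ins for image features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: two informal "groups" of 2-D points (no labels given).
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])

# Ask for two clusters; the algorithm groups similar points on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", kmeans.labels_)
print("Cluster centres:\n", kmeans.cluster_centers_)
```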
Applications of Clustering:
• Image Segmentation: Grouping parts of an image, like separating a region of interest
(e.g., identifying a tumor in a medical image).
• Gene Analysis: Finding groups of similar genes in a database.
In summary, unsupervised learning helps the algorithm discover patterns in data without any explicit instructions. Cluster analysis and dimensionality reduction are key types of unsupervised learning.
Supervised learning assigns categories or labels, whereas unsupervised learning performs a grouping process such that similar objects are placed in one cluster.
1.4.2 Semi-supervised Learning
There are circumstances where the dataset has a huge collection of unlabelled data and only some labelled data. Labelling is a costly process and difficult for humans to perform.
Semi-supervised algorithms use unlabelled data by assigning a pseudo-label. Then, the
labelled and pseudo-labelled dataset can be combined.
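A minimal sketch of this pseudo-labelling idea; the data and the choice of a k-nearest-neighbour base model are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# A small labelled set and a larger unlabelled set (1-D feature for simplicity).
X_labelled = np.array([[1.0], [1.2], [8.0], [8.3]])
y_labelled = np.array([0, 0, 1, 1])
X_unlabelled = np.array([[0.9], [1.1], [7.8], [8.5], [8.1]])

# Step 1: train on the labelled data only.
base = KNeighborsClassifier(n_neighbors=1).fit(X_labelled, y_labelled)

# Step 2: assign pseudo-labels to the unlabelled data.
pseudo_labels = base.predict(X_unlabelled)

# Step 3: combine labelled and pseudo-labelled data and retrain.
X_all = np.vstack([X_labelled, X_unlabelled])
y_all = np.concatenate([y_labelled, pseudo_labels])
final_model = KNeighborsClassifier(n_neighbors=3).fit(X_all, y_all)
print("Prediction for x = 7.9:", final_model.predict([[7.9]])[0])
```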
1.4.3 Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns by
interacting with its environment. The agent performs actions and receives feedback in
the form of rewards or penalties, and its goal is to maximize the total reward over time.
Key Concepts:
• Agent: The learner or decision maker.
• Environment: Everything the agent interacts with.
• Action: A move the agent can make in a given state.
• Reward/Penalty: Feedback from the environment; the agent's goal is to maximize the total reward over time.
Figure 1.10: A Grid Game
In this grid game, the gray tile indicates the danger, black is a block, and the tile with diagonal lines is the goal. The aim is to start, say, from the bottom-left grid and, using the actions left, right, top and bottom, reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment
to get experience. In the above case, the agent tries to create a model by simulating many
paths and finding rewarding paths. This experience helps in constructing a model.
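The sketch below shows one common way such experience can be turned into a model: tabular Q-learning on a small hypothetical 3x3 grid. The layout, rewards, and hyperparameters are illustrative assumptions and are not the book's exact Figure 1.10.

```python
import random

# States 0..8 form a 3x3 grid (row = state // 3, col = state % 3).
# State 8 is the goal (+10), state 2 is a danger tile (-10); every other move costs -1.
GOAL, DANGER = 8, 2
ACTIONS = 4            # 0 = left, 1 = right, 2 = up, 3 = down

def step(state, action):
    r, c = divmod(state, 3)
    if action == 0: c = max(c - 1, 0)
    if action == 1: c = min(c + 1, 2)
    if action == 2: r = min(r + 1, 2)
    if action == 3: r = max(r - 1, 0)
    nxt = r * 3 + c
    if nxt == GOAL:
        return nxt, 10, True
    if nxt == DANGER:
        return nxt, -10, True
    return nxt, -1, False

Q = [[0.0] * ACTIONS for _ in range(9)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    state = 0                                 # start from the bottom-left tile
    for _ in range(100):                      # cap the episode length
        if random.random() < epsilon:         # explore
            action = random.randrange(ACTIONS)
        else:                                 # exploit current estimates
            action = max(range(ACTIONS), key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge Q(s, a) towards reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt
        if done:
            break

best = max(range(ACTIONS), key=lambda a: Q[0][a])
print("Learned best first action from the start state:", ["left", "right", "up", "down"][best])
```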
1.5 CHALLENGES OF MACHINE LEARNING
Machine learning allows computers to solve certain types of problems much
better than humans, especially tasks involving computation. For instance,
computers can quickly calculate the square root of large numbers or win games
like chess and Go against professional players.
However, humans are still better than machines at tasks like recognition, though
modern machine learning systems, especially deep learning, are improving
rapidly. For example, machines can recognize human faces instantly. But there are
still challenges in machine learning, mainly due to the need for high-quality data.
Key Challenges in Machine Learning:
1. Well-Posed Vs Ill-Posed Problems:
o Machine learning works well with well-posed problems, where the problem is clearly
defined and has enough information to find a solution.
o In ill-posed problems, there may be multiple possible answers, making it hard to find
the correct one. For example, in a simple dataset (as shown in Table 1.3), several models
could fit the data (e.g., multiplication or division). To solve such problems, more data is
needed to narrow down the correct model.
Table 1.3: An Example
x1  x2  y
1    1   1
2    1   2
3    1   3
4    1   4
5    1   5
Can a model for this data be multiplication, that is, y = x1 * x2? Yes, it fits. But it is equally true that y = x1 / x2 or y = x1 ^ x2 also fit. So, there are at least three functions that fit the data.
This means that the problem is ill-posed. To solve this problem, one needs more examples to check the model. Puzzles and games that do not have sufficient specification may become ill-posed problems, and scientific computation has many ill-posed problems.
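The point can be checked mechanically; the short sketch below verifies that all three candidate functions reproduce Table 1.3 exactly.

```python
# Three different functions all fit the examples in Table 1.3 exactly,
# so the data alone cannot tell us which model is "correct".
data = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

candidates = {
    "multiplication y = x1 * x2": lambda x1, x2: x1 * x2,
    "division       y = x1 / x2": lambda x1, x2: x1 / x2,
    "power          y = x1 ** x2": lambda x1, x2: x1 ** x2,
}

for name, f in candidates.items():
    fits = all(f(x1, x2) == y for (x1, x2), y in data)
    print(name, "fits all examples:", fits)
```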
2. Need for Huge, Quality Data:
o Machine learning requires large amounts of high-quality data. The data must be
complete, without missing or incorrect values. Poor-quality data can lead to inaccurate
models.
3. High Computational Power:
o With the growth of Big Data, machine learning tasks require powerful computers with
specialized hardware like GPUs or TPUs to handle the high computational load. The
increasing complexity of tasks has made high-performance computing essential.
4. Complexity of Algorithms:
o Choosing the right machine learning algorithm, explaining how it works, applying it
correctly, and comparing different algorithms are now critical skills for data scientists.
This makes the selection and evaluation of algorithms a significant challenge.
5. Bias-Variance Trade-off:
o Overfitting: When a model performs well on training data but fails on test data, it’s
called overfitting. This means the model has learned the training data too well but lacks
generalization to new data.
o Underfitting: When a model fails to perform well on both training and test data, it’s
called underfitting. The model is too simple to capture the patterns in the data.
o Balancing between overfitting and underfitting is a major challenge for machine
learning algorithms.
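A small illustrative sketch of this trade-off: polynomials of increasing degree are fitted to a few noisy samples of a sine curve (the data and the chosen degrees are assumptions for the demonstration). The low-degree model underfits; the high-degree model fits the training data almost perfectly but does worse on unseen points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)  # noisy samples
x_test = np.linspace(0.03, 0.97, 50)              # unseen inputs from the same range
y_test = np.sin(2 * np.pi * x_test)               # true underlying values

for degree in (1, 3, 9):                          # underfit, reasonable, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```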
1.6 MACHINE LEARNING PROCESS
The emerging process model for data mining solutions for business organizations is CRISP-DM. Since machine learning is like data mining, except for the aim, this process can be used for machine learning. CRISP-DM stands for Cross Industry Standard Process – Data Mining. This process involves six steps. The steps are listed below in Figure 1.11.
1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is
enough for giving the solution. This step also involves the formulation of the problem
statement for the data mining process.
2. Understanding the data – It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns to the
selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the
raw data and preparation of data for the data mining process. The missing values may
cause problems during both training and testing phases. Missing data forces classifiers to produce inaccurate results. This is a perennial problem for the classification models. Hence, suitable strategies should be adopted to handle the missing data.
4. Modelling – This step plays a role in the application of data mining algorithm for the data to obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined by
evaluating the accuracy of the classifier. The process of classification is a fuzzy issue.
For example, classification of emails requires extensive domain knowledge and requires
domain experts. Hence, performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.
1. Sentiment analysis – This is an application of natural language processing that assigns sentiment to text, handling informal text and emoticons effectively. For movie reviews or product reviews, five stars or one star are automatically attached using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases possible. For example, Amazon recommends related books or books bought by people who have the same taste as you, and Netflix suggests shows or related movies of your taste. The recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are all examples of machine learning
which offer to locate and navigate shortest paths to reduce time.
The machine learning applications are enormous. The following Table 1.4 summarizes some of the machine learning applications.
Table 1.4: Applications’ Survey Table
7. Games – Game programs for Chess, GO, and Atari video games
8. Natural Language Translation – Google Translate, text summarization, and sentiment analysis
9. Web Analysis and Services – Identification of access patterns, detection of e-mail spam and viruses, personalized web services, search engines like Google, detection of promotion of user websites, and finding loyalty of users after web page layout modification
11. Multimedia and Security – Face recognition/identification, biometric projects like identification of a person from a large image or video database, and applications involving multimedia retrieval
12. Scientific Domain – Discovery of new galaxies, identification of groups of houses based on house type/geographical location, identification of earthquake epicenters, and identification of similar land use
Key Terms:
• Machine Learning – A branch of AI concerned with enabling machines to learn automatically without being explicitly programmed.
• Data – A raw fact.
• Predictive Modelling – A technique of developing models and making a prediction of unseen data.
• Deep Learning – A branch of machine learning that deals with constructing models using neural
networks.
• Data Science – A field of study that encompasses capturing of data to its analysis covering all stages of data management.
• Data Analytics – A field of study that deals with analysis of data.
• Big Data – A study of data that has characteristics of volume, variety, and velocity.
• Statistics – A branch of mathematics that deals with learning from data using statistical methods.
• Hypothesis – An initial assumption of an experiment.
• Learning – Adapting to the environment that happens because of interaction of an agent with the
environment.
• Label – A target attribute.
• Labelled Data – A data that is associated with a label.
• Unsupervised Learning – A type of machine learning that uses unlabelled data and groups it into clusters using a trial and error approach.
• Cluster Analysis – A type of unsupervised approach that groups the objects based on attributes so that similar objects or data points form a cluster.
• Semi-supervised Learning – A type of machine learning that uses limited labelled and large unlabelled data. It first labels unlabelled data using labelled data and combines it for learning purposes.
• Reinforcement Learning – A type of machine learning that uses agent and environment interaction for creating labelled data for learning.
• Well-posed Problem – A problem that has well-defined specifications. Otherwise, the problem is called
ill-posed.
• Bias/Variance – The inability of the machine learning algorithm to predict correctly due to lack of generalization is called bias. Variance is the error of the model for training data. This leads to problems called overfitting and underfitting.
• Model Deployment – A method of deploying machine learning algorithms to improve the existing
business processes for a new situation.
• In computer systems, these facts are encoded in bits, allowing
machines to process and store them.
• Directly interpretable data: Numbers or text, like "John is 25
years old."
• Diffused data: Data like images or videos that require computers
to interpret, like identifying objects in a photo.
Types of Data Sources
1. Flat files: Simple files like CSV or text files.
2. Databases: Systems that store structured data.
3. Data warehouses: Centralized repositories for large volumes of
data.
Operational vs. Non-operational Data
The 6 Vs of Big Data
1. Volume:
o Refers to the size of data.
o Big Data is often measured in petabytes (PB) or exabytes (EB),
much larger than the gigabytes or terabytes of traditional data.
2. Velocity:
o The speed at which data is generated and processed.
o Thanks to IoT devices and the Internet, data arrives rapidly, often in real-time.
3. Variety:
o The diversity of data formats:
▪ Form: Data comes as text, audio, video, graphs, etc.
• Precision is defined as the closeness of repeated measurements. Often, standard deviation is used to measure the precision.
• Bias is a systematic error due to erroneous assumptions of the algorithms or procedures.
• Accuracy refers to the closeness of measurements to the true value of the quantity. Normally, the significant digits used to store and manipulate a value indicate the accuracy of the measurement.
1. Structured Data
Definition: Structured data is organized and stored in a
predefined format, such as a table in a database. This data is easy
to search, retrieve, and analyze using tools like SQL.
Types of Structured Data:
• Record Data:
o A dataset consists of a collection of measurements.
o Rows (entities, cases, or records) represent objects.
o Columns (attributes, features, or fields) represent measurements
for each object.
o A label refers to individual observations in the dataset.
• Data Matrix:
o A variation of record data in which all the attributes are numeric, so the dataset can be viewed as a matrix with rows as objects and columns as attributes.
o Matrix operations can be applied to analyze this data.
• Graph Data:
o Represents relationships between objects.
o Example: In a web graph, nodes are web pages, and edges
(hyperlinks) connect them.
• Ordered Data:
o Objects have attributes with an implicit order.
o Examples of ordered data:
▪ Temporal data: Attributes associated with time, e.g., customer
purchase patterns during festivals.
▪ Sequence data: Sequence of elements without timestamps, e.g.,
DNA sequences (A, T, G, C).
3. Semi-Structured Data
Semi-structured data falls between structured and
unstructured data. While it does not conform to a strict
structure, it contains tags or markers that make it easier to
organize.
• Examples of semi-structured data include:
o XML/JSON files: Contain data with embedded tags or fields.
o RSS feeds: Often follow a hierarchical structure, but not as rigid as a database.
Flat Files These are the simplest and most commonly available data
source. It is also the cheapest way of organizing the data. These flat
files are the files where data is stored in plain ASCII or EBCDIC format.
Minor changes of data in flat files affect the results of the data mining
algorithms.
Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated value files where the
values are separated by commas. These are used by spreadsheet and
database applications. The first row may have attributes and the rest
of the rows represent the data.
• TSV files – TSV stands for Tab separated values files where values
are separated by Tab. Both CSV and TSV files are generic in nature and
can be shared. There are many tools like Google Sheets and Microsoft
Excel to process these files.
Database System It normally consists of database files and a
database management system (DBMS). Database files contain original
data and metadata. DBMS aims to manage data and improve operator
performance by including various tools like database administrator,
query processing, and transaction manager. A relational database
consists of sets of tables. The tables have rows and columns. The
columns represent the attributes and rows represent tuples. A tuple
corresponds to either an object or a relationship between objects. A
user can access and manipulate the data in the database using SQL.
Different types of databases are listed below:
1. A transactional database is a collection of transactional records.
2. XML (eXtensible Markup Language) – A markup-based data format that is used to represent data that needs to be shared across the platforms.
Data Stream It is dynamic data, which flows in and out of the
observing environment. Typical characteristics of data stream are
huge volume of data, dynamic, fixed order movement, and real-time
constraints.
RSS (Really Simple Syndication) It is a format for sharing instant
feeds across services.
JSON (JavaScript Object Notation) It is another useful data
interchange format that is often used for many machine learning
algorithms.
2.2 BIG DATA ANALYTICS AND TYPES OF ANALYTICS
The primary aim of data analysis is to assist business organizations in taking decisions. There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive Analytics
Descriptive analytics is about summarizing the main features of the
data you've collected. It tells you what has happened by using
historical data and statistical techniques. The goal is to organize,
describe, and present this data in an understandable way. Think of it as a store summarizing last month's sales to see what happened.
Key Point: It doesn't explain why the sales were high or low—just
tells you what the data shows.
Diagnostic Analytics
Diagnostic analytics answers the question "Why did this happen?"
It's about understanding the root cause of an event. By examining the
data closely, we look for patterns, trends, and relationships that
explain the cause of an outcome.
Example: If the store's sales drop one month, diagnostic analytics
would investigate why the drop happened. Maybe it’s due to bad
weather, a competitor’s sale, or a new product that didn’t perform
well. The analysis focuses on finding and explaining the reasons
behind the drop.
Key Point: It's all about cause and effect—identifying the reasons
behind the data patterns.
Predictive Analytics
Predictive analytics looks into the future and answers the question
"What will happen?" Using historical data and advanced algorithms
(like machine learning), it predicts future trends and outcomes.
Example: The store uses data from previous years to predict what the
sales will be in the upcoming holiday season. Algorithms analyze
patterns like past holiday sales, customer behavior, and current
market trends to make predictions.
Key Point: It focuses on forecasting future events based on current
and past data.
Prescriptive Analytics
Prescriptive analytics goes a step further and asks "What should we
do?" It not only predicts the future but also recommends actions to
take. This type of analytics provides decision-making support by
suggesting the best course of action to achieve desired outcomes.
Example: After predicting that sales will be low in the next quarter,
prescriptive analytics suggests specific actions the store can take, such
as launching a promotion, adjusting prices, or stocking more popular products.
Data is collected for the application as desired by the domain knowledge engineer. Data sources include open or public data, social media data, and multimodal data.
1. Open or public data source – It is a data source that does not have
any stringent copyright
rules or restrictions. Its data can be primarily used for many purposes.
Government census data are good examples of open data:
• Digital libraries that have huge amount of text data as well as
document images
• Scientific domains with a huge collection of experimental data like
genomic data and biological data
Dirty data is data that contains errors, missing information, or inconsistencies that can affect the results of the analysis.
Common Problems with Dirty Data:
1. Incomplete Data: When certain values are missing from the
dataset.
2. Inaccurate Data: Data that has incorrect values or errors.
3. Outliers: Data points that are significantly different from the rest
of the data, often due to errors or unusual circumstances.
4. Missing Values: Data entries where certain attributes are not
provided.
Example: Consider a small patient table with attributes such as Name, DoB, Age, and Salary.
1. Missing Data:
o For patients John, Andre, and Raju, the Date of Birth (DoB) is
missing. This is an example of missing values.
2. Inaccurate Data:
o David's age is recorded as 5, but his DoB is 10/10/1980, which
makes his real age much older than 5. This is inconsistent data.
o Raju's age is recorded as 136, which is not realistic. This might be
a typographical error or an outlier.
3. Outliers:
o Raju’s age of 136 is an outlier, as it is an unrealistic value when
compared to normal human lifespans. Outliers are often caused by
data entry errors.
4. Noisy Data:
o John’s salary is recorded as -1500, which is not possible. Salary
cannot be negative, making this an example of noisy data.
o The entry for David’s salary is simply blank (" "), which is another
instance of missing data.
5. Inconsistent Values:
o In the salary column, Andre and Raju both have ‘Yes’ recorded,
which doesn’t make sense in the context of salary data. A salary should
be a numeric value, not a text response.
How to Address These Issues?
1. Missing Data:
o Ignore the Tuple: If a lot of values are missing in a row, you may
choose to ignore or remove that row from the dataset.
o Fill Values Manually: Domain experts can manually fill missing
values, but this is time-consuming.
o Use Global Constants: Fill missing values with a placeholder like
‘Unknown’ or ‘0’.
o Use Average/Mean Values: Replace missing numeric values (like
salary) with the average value of that column.
2. Inaccurate Data:
o Correct entries by referring to other reliable data sources or
consult domain experts. For example, David's age should be corrected
based on his actual DoB.
3. Handling Outliers:
o Investigate outliers to determine if they are errors or legitimate
data. Raju’s age of 136 may be a typo and can be corrected if the
correct age is known.
4. Noisy Data:
o Noisy data, like John's negative salary (-1500), can be corrected by
setting a minimum limit (e.g., salary cannot be below 0). In this case,
either correct or remove the invalid entry.
5. Inconsistent Values:
o Standardize the format for fields like salary. For Andre and Raju,
change the text entries (‘Yes’) to numeric values or fill in missing data
using estimation techniques.
Methods for Handling Missing Data:
1. Ignoring the Tuple: Discard rows with missing data (not ideal
when a lot of data is missing).
2. Filling Manually: Domain experts analyze and fill the missing
values.
3. Global Constant: Fill missing values with ‘Unknown’ or ‘None’.
4. Attribute Mean: Replace missing numerical values with the
average for that attribute.
5. Class-based Mean: Use the mean value of the same class or group
to fill missing data.
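A short pandas sketch of methods 1, 3, and 4 on a tiny made-up table resembling the patient example above.

```python
import numpy as np
import pandas as pd

# A tiny table resembling the dirty-data example: some salaries are missing.
df = pd.DataFrame({
    "Name":   ["John", "Andre", "Raju", "David"],
    "Age":    [25, 30, 136, 5],
    "Salary": [-1500, np.nan, 20000, np.nan],
})

# 1. Ignoring the tuple: drop rows that have any missing value.
dropped = df.dropna()

# 3. Global constant: fill missing salaries with a placeholder value.
filled_const = df.fillna({"Salary": 0})

# 4. Attribute mean: replace missing salaries with the column mean.
filled_mean = df.fillna({"Salary": df["Salary"].mean()})

print(filled_mean)
```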
Removal of Noisy or Outlier Data
In data analysis, noise refers to random errors or variations in
the data that can distort the results of analysis. Noise can affect
data accuracy and, if not removed, can lead to misleading
conclusions. Therefore, it's important to clean noisy data before
applying any analysis or machine learning algorithms.
What is Noise?
• Noise is random error or variance in measured values.
• It can appear as outliers, missing values, or inconsistent data.
• Noise reduction is an essential step in data cleaning to improve
the quality of analysis.
Techniques for Removing Noise:
1. Smoothing by Means:
o First, sort the data and partition it into bins (groups) of equal size.
o Replace all values in the bin with the mean (average) of the bin
values.
Example:
o Given data: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}
o First, divide into bins of size 3:
▪ Bin 1: {12, 14, 19}
▪ Bin 2: {22, 24, 26}
▪ Bin 3: {28, 31, 34}
o Now apply smoothing by means (replace all values with the bin's
mean):
▪ Bin 1 (mean = 15): {15, 15, 15}
▪ Bin 2 (mean = 24): {24, 24, 24}
▪ Bin 3 (mean = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the mean of the
bin to smooth the data.
2. Smoothing by Medians:
o Replace all values in the bin with the median of the bin values (the
middle value when the data is sorted).
Example:
o Given the same data and bins:
▪ Bin 1 (median = 14): {14, 14, 14}
▪ Bin 2 (median = 24): {24, 24, 24}
▪ Bin 3 (median = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the median,
which reduces the effect of outliers or extreme values.
3. Smoothing by Bin Boundaries:
o Replace each value in the bin with the closest boundary value
(minimum or maximum value in the bin).
Example:
o Given the same data and bins:
▪ Bin 1 (boundary values: 12 and 19): {12, 12, 19}
▪ Bin 2 (boundary values: 22 and 26): {22, 22, 26}
▪ Bin 3 (boundary values: 28 and 34): {28, 34, 34}
o Explanation: For each bin, values are replaced by the closest
boundary value (either the minimum or maximum of that bin).
o Example: In Bin 1, the original data was {12, 14, 19}. The boundaries are 12 and 19, so the value 14 is closer to 12, and it's
replaced by 12.
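A small sketch that reproduces all three smoothing schemes on the data S above. For values exactly halfway between the bin boundaries, this code breaks ties towards the lower boundary, whereas the text assigns 31 to the upper boundary in Bin 3.

```python
import numpy as np

S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]   # bins of size 3

for b in bins:
    mean_smoothed = [int(np.mean(b))] * len(b)       # every value -> bin mean
    median_smoothed = [int(np.median(b))] * len(b)   # every value -> bin median
    # boundary smoothing: each value snaps to the nearer of the bin's min/max
    lo, hi = min(b), max(b)
    boundary_smoothed = [lo if abs(v - lo) <= abs(v - hi) else hi for v in b]
    print(b, "->", mean_smoothed, median_smoothed, boundary_smoothed)
```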
Data Normalization
Two widely used normalization procedures are:
1. Min-Max
2. z-Score
Min-Max Procedure It is a normalization technique where each
variable V is normalized by its difference with the minimum value
divided by the range to a new range, say 0–1. Often, neural networks
require this kind of normalization. The formula to implement this
normalization is given as:
v' = ((v - min) / (max - min)) * (new_max - new_min) + new_min    (2.1)
Here max-min is the range. Min and max are the minimum and
maximum of the given data, and new min and new max are the minimum
and maximum of the target range, say 0 and 1.
Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply Min-Max
procedure and map the marks to a new range 0–1.
Solution: The minimum of the list V is 88 and maximum is 94. The
new min and new max are 0 and 1, respectively. The mapping can be
done using Eq. (2.1) as:
88 → (88 - 88)/6 = 0;  90 → (90 - 88)/6 = 0.33;  92 → (92 - 88)/6 = 0.67;  94 → (94 - 88)/6 = 1
So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67, 1}. Thus, the Min-Max normalization range is between 0 and 1.
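A minimal sketch of the Min-Max procedure applied to the same marks; note that (92 - 88)/6 rounds to 0.67.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # Map each value from [min, max] into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print([round(v, 2) for v in min_max([88, 90, 92, 94])])   # -> [0.0, 0.33, 0.67, 1.0]
```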
z-Score Normalization This procedure works by taking the difference between each value and the mean, and dividing it by the standard deviation:
z = (v - m) / s
Here, s is the standard deviation of the list V and m is the mean of the
list V.
Example 2.3: Consider the mark list V = {10, 20, 30}, convert the
marks to z-score.
Solution: The mean and Sample Standard deviation (s) values of the
list V are 20 and 10, respectively. So the z-scores of these marks are
z(10) = (10 - 20)/10 = -1;  z(20) = (20 - 20)/10 = 0;  z(30) = (30 - 20)/10 = 1
Hence, the z-score of the marks 10, 20, 30 are -1, 0 and 1, respectively.
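A minimal sketch of z-score normalization for Example 2.3, using the sample standard deviation as in the solution above.

```python
import statistics

def z_scores(values):
    m = statistics.mean(values)
    s = statistics.stdev(values)        # sample standard deviation, as in Example 2.3
    return [(v - m) / s for v in values]

print(z_scores([10, 20, 30]))           # -> [-1.0, 0.0, 1.0]
```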
Data Reduction
Data reduction reduces data size but produces the same results. There
are different ways in which data reduction can be carried out such as
data aggregation, feature selection, and dimensionality reduction.
2.4 DESCRIPTIVE STATISTICS
Every attribute should be associated with a value. This process is
called measurement. The type of attribute determines the data types,
often referred to as measurement scale types.
The data types are shown in Figure 2.1.
Numeric or Quantitative Data It can be divided into two categories. They are interval type and ratio type.
• Interval Data – Interval data is a numeric data for which the
differences between values are meaningful. For example, there is a
difference between 30 degrees and 40 degrees. Only the permissible
operations are + and -.
• Ratio Data – For ratio data, both differences and ratio are meaningful.
The difference between the ratio and interval data is the position of
zero in the scale. For example, take the Centigrade-Fahrenheit
conversion. The zeroes of both scales do not match.
Hence, these are interval data.
2.5 UNIVARIATE DATA ANALYSIS AND VISUALIZATION
Univariate analysis is the simplest form of statistical analysis. As the
name indicates, the dataset has only one variable. A variable can also be called a category. Univariate analysis does not deal with causes or
relationships. The aim of univariate analysis is to describe data and
find patterns. Univariate data description involves finding the
frequency distributions, central tendency measures, dispersion or
variation, and shape of the data.
2.5.1 Data Visualization
Bar Chart A Bar chart (or Bar graph) is used to display the frequency
distribution for variables.
Bar charts are used to illustrate discrete data. The charts can also help
to explain the counts of nominal data. It also helps in comparing the
frequency of different groups. The bar chart for students' marks {45,
60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown below in Figure
2.3.
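A short matplotlib sketch that draws this bar chart; the styling is left at matplotlib defaults.

```python
import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

plt.bar(student_ids, marks)          # one bar per student
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar chart of students' marks")
plt.show()
```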
Pie Chart These are equally helpful in illustrating the univariate data.
The percentage frequency distribution of students' marks {22, 22, 40,
40, 70, 70, 70, 85, 90, 90} is below in Figure 2.4.
Histogram A histogram displays the frequency distribution of continuous data grouped into ranges (bins). For example, for the marks {45, 60, 60, 80, 85}, the frequency of the range 76-100 is 2.
Histogram conveys useful information like nature of data and its
mode. Mode indicates the peak of dataset. In other words, histograms
can be used as charts to show frequency, skewness present in the data,
and shape.
Dot Plots These are similar to bar charts. They are less clustered as
compared to bar charts, as they illustrate the bars only with single
points. The dot plot of English marks for five students with ID as {1, 2,
3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The
advantage
is that by visual inspection one can find out who got more marks.
2.5.2 Central Tendency
Therefore, a condensation or summary of the data is necessary. This makes the data analysis easy and simple. One such summary is called central tendency. Thus, central tendency can explain the characteristics of the data with a single representative value. The common measures of central tendency are mean, median, and mode.
1. Mean – The arithmetic mean is the sum of all the observations divided by the number of observations. For example, the mean of the three numbers 10, 20, and 30 is 20.
In the case of data given as a frequency distribution, mid values of the range are taken for computation. This
is illustrated in the following computation. In weighted mean, the
mean is computed by adding the product of proportion and group
mean. It is mostly used when the sample sizes are unequal.
• Geometric mean – Let x1, x2, … , xN be a set of ‘N’ values or
observations. Geometric mean
is the Nth root of the product of N items. The formula for computing
geometric mean is given as follows:
GM = (x1 × x2 × … × xN)^(1/N)
Here, N is the number of items and xi are the values. For example, if the values are 6 and 8, the geometric mean is √(6 × 8) = √48 ≈ 6.93. For larger datasets, computing the geometric mean directly is difficult. Hence, it is usually calculated as the antilog of the mean of the logarithms:
GM = antilog((log x1 + log x2 + … + log xN) / N)
2. Median – Median is the middle value when the observations are arranged in order. For data given as a frequency distribution, it is computed as:
Median = L1 + ((N/2 - cf) / f) × i
Median class is that class where N/2th item is present. Here, i is the
class interval of the median class and L1 is the lower limit of median
class, f is the frequency of the median class, and cf is the cumulative
frequency of all classes preceding median.
3. Mode – Mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is
called mode.
2.5.3 Dispersion
The spreadout of a set of data around the central tendency (mean,
median or mode) is called dispersion. Dispersion is represented by
various ways such as range, variance, standard deviation, and
standard error. These are second order measures. The most common
measures of the dispersion data are listed below:
• Quartiles and Percentiles – The p-th percentile is the value Xi such that p per cent of the observations fall below Xi. For example, median is the 50th percentile and can be denoted
as Q0.50. The 25th percentile is called first quartile (Q1) and the 75th
percentile is called third quartile (Q3). Another measure that is useful
to measure dispersion is Inter Quartile Range (IQR). The IQR is the
difference between Q3 and Q1.
Inter Quartile Range (IQR) = Q3 – Q1 (2.9)
Outliers are normally the values falling apart at least by the amount
1.5 × IQR above the third quartile or below the first quartile.
The interquartile range can also be written as IQR = Q0.75 – Q0.25. (2.10)
Example 2.4: For patients’ age list {12, 14, 19, 22, 24, 26, 28, 31, 34},
U
find the IQR.
Solution: The median is in the fifth position. In this case, 24 is the
VT
median. The first quartile is the median of the scores below the median, i.e.,
{12, 14, 19, 22}. Hence, it’s the median of the list below 24. In this case,
the median is the average of the second and third values, that is, Q0.25
= 16.5. Similarly, the third quartile is the median of the values above
the median, that is {26, 28, 31, 34}. So, Q0.75 is the average of the seventh and eighth scores. In this case, it is (28 + 31)/2 = 59/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
= Q0.75 – Q0.25
= 29.5-16.5 = 13
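A small sketch that reproduces this hand calculation; the quartiles are computed as the medians of the lower and upper halves, excluding the middle value, to match the method used in the example.

```python
import statistics

ages = sorted([12, 14, 19, 22, 24, 26, 28, 31, 34])
n = len(ages)
median = statistics.median(ages)
lower, upper = ages[:n // 2], ages[(n + 1) // 2:]   # halves, excluding the middle value
q1, q3 = statistics.median(lower), statistics.median(upper)
print("median =", median, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```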
The five-point summary of the data is {12, 16.5, 24, 29.5, 34}, that is, {minimum, Q1, median, Q3, maximum}. Box plots are useful for describing the 5-point summary. The box plot for the set is given in Figure 2.7.
2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the
symmetry/asymmetry and peak location of the dataset.
Skewness
The measures of direction and degree of symmetry are called measures of skewness.
Also, the following measure is more commonly used to measure
skewness. Let X1, X2, …, XN be a set of ‘N’ values or observations then
the skewness can be given as:
Skewness = (1/N) Σ ((Xi - m) / s)^3
Here, m is the population mean and s is the population standard
deviation of the univariate data. Sometimes, for bias correction
instead of N, N - 1 is used.
Kurtosis
Kurtosis also indicates the peakedness of the data. If the data has a high peak, it indicates higher kurtosis and vice versa. Kurtosis is measured using the formula given below:
Kurtosis = (1/N) Σ ((Xi - m) / s)^4
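A small sketch of the moment-based skewness and kurtosis described above, using the population mean and standard deviation (no N - 1 bias correction); the numbers below are an arbitrary illustration.

```python
import numpy as np

def moment_stats(values):
    x = np.asarray(values, dtype=float)
    m, s = x.mean(), x.std()                  # population mean and standard deviation
    skewness = np.mean(((x - m) / s) ** 3)    # third standardized moment
    kurtosis = np.mean(((x - m) / s) ** 4)    # fourth standardized moment
    return skewness, kurtosis

print(moment_stats([13, 11, 2, 3, 4, 8, 9]))
```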
Coefficient of Variation (CV)
Coefficient of variation is used to compare datasets with different
units. CV is the ratio of standard deviation and mean, and %CV is the
percentage of coefficient of variations.
Stem and Leaf Plot In a stem and leaf plot, each data value is split into a 'stem' and a 'leaf'. The last digit is usually the leaf and digits
to the left of the leaf mostly form the stem. For example, marks 45 are
divided into stem 4 and leaf 5 in Figure 2.9. The stem and leaf plot for
the English subject marks, say, {45, 60, 60, 80, 85} is given in Figure
2.9.
It can be seen from Figure 2.9 that the first column is stem and the
second column is leaf. For the given English marks, two students with
60 marks are shown in stem and leaf plot as stem-6 with 2 leaves with
0. The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given below
in Figure 2.10.