Sri Sai Vidya Vikas Shikshana Samithi ®
SAI VIDYA INSTITUTE OF TECHNOLOGY
(Approved by AICTE, New Delhi, Affiliated to VTU, Recognized by Govt. of Karnataka)
Accredited by NBA, New Delhi (CSE, ISE, ECE), NAAC 'A' Grade
DEPARTMENT OF CSE(DATA SCIENCE)
RAJANUKUNTE, BANGALORE 560 064, KARNATAKA
Phone: 080-28468191/96/97/98 * E-mail:
[email protected] * URL www.saividya.ac.in
LECTURE NOTES
ON
Artificial Intelligence and
Machine Learning (BCS501)
2024 – 2025
B.E. VI Semester
PAVITHRA B
Assistant Professor
Department of CSE(DATA SCIENCE)
Module 3
INTRODUCTION TO MACHINE LEARNING & UNDERSTANDING DATA
Introduction to machine learning: Need for Machine Learning, Machine Learning
Explained, and Machine Learning in relation to other fields, Types of Machine Learning.
Challenges of Machine Learning, Machine Learning process, Machine Learning applications.
Understanding Data: What is data, types of data, Big data analytics and types of analytics,
Big data analytics framework, Descriptive statistics, univariate data analysis and visualization
Textbook 2: Chapter 1 and Sections 2.1 to 2.5
1. Need for Machine Learning
o As a sub-discipline of AI, machine learning provides companies with the means to process
large amounts of data and draw conclusions that can be used to make informed decisions. This
technological shift is changing various sectors by doing work faster, better, and with less
human intervention.
o For instance, retail firms are using machine learning to offer personalized experiences to
customers, while financial institutions are using algorithms to assess risk.
o Machine learning provides organizations with the tools to enhance accuracy, automate
processes, and achieve greater operational efficiency.
o Its ability to analyse complex datasets and deliver actionable insights allows businesses to
make smarter decisions while reducing costs. From tailoring customer experiences to scaling
operations with minimal effort, machine learning supports critical business priorities such as
improving workflows, forecasting trends, and optimizing resource allocation.
o Companies leveraging machine learning are better equipped to meet demands and discover
new opportunities.
o Benefits of machine learning in organizations:
o Improved accuracy and insights
o Tailored customer experiences
o Automating routine tasks
o Predictive capabilities for better planning
o Efficiency and cost optimization
o Competitive edge through innovation
o Seamless scalability and flexibility
o Let’s understand the basic terminologies through the following Knowledge pyramid
Fig. 1: The Knowledge Pyramid
o Data: This is the raw, unprocessed facts and figures that are collected from various
sources. Data can be structured or unstructured and may include text, numbers,
images, audio, and video.
o Information: Data becomes information when it is organized, processed, and
interpreted in a meaningful way. The information provides context and relevance to
data and enables decision-making and action.
o Knowledge: Knowledge is the understanding gained from information, through
analysis, interpretation, and synthesis. Knowledge is often based on experience,
expertise, and intuition, and enables more complex decision-making and problem-
solving.
o Intelligence: An actionable form of knowledge is called intelligence; in other
words, intelligence is knowledge applied to take actions.
o Wisdom: Wisdom is the ability to apply knowledge and intelligence with sound
judgment, drawing on experience and values to choose the right course of action
in a given context.
2. Machine Learning Explained
• Machine learning is a branch of artificial intelligence that enables algorithms to uncover
hidden patterns within datasets.
• It allows them to predict new, similar data without explicit programming for each task.
• Machine learning finds applications in diverse fields such as image and speech recognition,
natural language processing, recommendation systems, fraud detection, portfolio
optimization, and automating tasks.
• Arthur Lee Samuel's definition of machine learning is as follows:
“Field of study that gives computers the ability to learn without being explicitly
programmed.”
• Conventional programming:
o Conventional programming, also known as imperative or procedural programming, is a
programming paradigm that uses explicit statements to describe the sequence of actions
the computer must take to accomplish a specific task.
o It is a manual process that requires a programmer to create the rules or logic of the
program. The programmer codes the rules line by line in a conventional procedural
language such as assembly language, or a high-level language such as C, C++, Java,
JavaScript, or Python.
o However, certain applications, especially those requiring real-time processing or
handling large datasets, can face performance bottlenecks. Optimizing code for speed
and efficiency is often necessary but can be complex. A minimal sketch of this
rule-based style is given below.
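The following is a small, hypothetical Python sketch of the conventional style: every
rule is written explicitly by the programmer and nothing is learned from data (the
keyword list is an illustrative assumption, not a real filter).

# A hand-coded, rule-based spam check in the conventional style described
# above. Every rule is written explicitly; nothing is learned from data.
SPAM_KEYWORDS = {"lottery", "winner", "free", "prize"}

def is_spam(email_text: str) -> bool:
    words = set(email_text.lower().split())
    # Rule 1: flag if known spam keywords appear.
    if words & SPAM_KEYWORDS:
        return True
    # Rule 2: flag messages written entirely in capitals.
    if email_text.isupper():
        return True
    return False

print(is_spam("You are a lottery WINNER"))  # True
print(is_spam("Meeting moved to 3 pm"))     # False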
• Expert System:
o An expert system is an artificial intelligence system that emulates the decision-making
ability of a human expert.
o It is designed to solve complex problems by reasoning through bodies of knowledge,
represented mainly as if–then rules rather than through conventional procedural code.
o Expert systems can perform specific tasks with expert-like efficiency by applying
predefined rules to analyse information and generate conclusions.
o They can provide specialist advice or decision-making automation, assist in problem-
solving, and help identify errors or risks.
o For example, MYCIN, an early expert system, helped identify bacterial infections and
recommend antibiotics.
o But, the effectiveness of an expert system depends on the completeness and accuracy of
the knowledge base. If the knowledge is outdated or incomplete, the system’s performance
may be compromised.
• Machine Learning approach:
o Machine learning is important because it allows computers to learn from data and improve
their performance on specific tasks without being explicitly programmed.
o This ability to learn from data and adapt to new situations makes machine learning
particularly useful for tasks that involve large amounts of data, complex decision-making,
and dynamic environments.
o As humans take decisions from experience, computers build models based on patterns
extracted from the input data and then use these models to make predictions and take
decisions. For a computer, the learnt model is the equivalent of human experience. This
is shown in the following figure.
a) A learning system for humans b) A learning system for machine learning
o A learning system summarizes the raw data in a model, where the model is an implicit
description of patterns within the data in one of the following forms:
o Mathematical Equation
o Relational diagrams like trees/graphs
o Logical if/else rules
o Groupings called clusters
o Another definition, by Tom Mitchell, is:
“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”
o For example, in spam filtering, T is classifying emails as spam or not spam, E is a
corpus of labelled emails, and P is the fraction of emails classified correctly.
o The learning in systems happens as described by the following steps:
➢ Collection of Data
➢ Abstract concepts are formed out of the collected data
➢ Generalization converts the abstraction into an actionable form of intelligence
➢ Evaluation checks the thoroughness of the models
3. Machine Learning in relation to other fields
3.1 Machine Learning and Artificial Intelligence
• Artificial Intelligence: AI is the overarching field that aims to create systems capable of
performing tasks that typically require human intelligence, such as reasoning, problem-
solving, and decision-making.
• Machine Learning: ML is a subset of AI that focuses on enabling systems to learn from
data and improve their performance over time without being explicitly programmed. It
includes techniques like supervised learning, unsupervised learning, and reinforcement
learning.
• Deep Learning: Deep learning is a further subset of ML that uses artificial neural
networks to process and learn from large amounts of data.
• The relationship of AI with machine learning is as follows:
Relationship of AI with machine learning
• The key differences between AI and ML are as follows:
o Scope: AI is a broader field focused on creating systems that mimic human intelligence,
including reasoning, decision-making, and problem-solving. ML is a subset of AI that
focuses on teaching machines to learn patterns from data and improve over time without
explicit programming.
o Goal: The main goal of AI is to develop machines that can perform complex tasks
intelligently, similar to how humans think and act. ML focuses on finding patterns in
data and using them to make predictions or decisions, helping systems improve
automatically with experience.
o Breadth: AI systems aim to simulate human intelligence and can perform tasks across
multiple domains. ML focuses on training systems for specific tasks, such as prediction
or classification.
o Autonomy: AI aims to create systems that can think, learn, and make decisions
autonomously. ML aims to create systems that learn from data and improve their
performance on a particular task.
o Application range: AI has a wider application range, including problem-solving,
decision-making, and autonomous systems. ML applications are typically narrower,
focused on tasks like pattern recognition and predictive modeling.
o Human involvement: AI can operate with minimal human intervention, depending on its
complexity and design. ML requires human involvement for data preparation, model
training, and optimization.
o Output: AI produces intelligent behavior, such as driving safely, responding to
customer queries, or diagnosing diseases, and can adapt to changing scenarios. ML
generates predictions or classifications based on data, such as predicting house
prices, identifying objects in images, or categorizing emails.
o Focus: AI involves broader goals, including natural language processing, vision, and
reasoning. ML focuses specifically on building models that identify patterns and
relationships in data.
o Examples: AI includes robotics, virtual assistants like Siri, autonomous vehicles, and
intelligent chatbots; ML includes recommender systems, fraud detection, stock price
forecasting, and social media friend suggestions.
3.2 Machine Learning and Statistics
• Machine learning (ML) and statistics are closely related fields, as both focus on
analyzing and interpreting data to make predictions or decisions. However, they differ
in their goals, approaches, and applications.
• Statistics: Primarily focuses on explaining the data, answering questions like "Why is
this happening?" or "What is the relationship between these variables?". It emphasizes
interpretability and hypothesis testing, often requires assumptions about the data
(for example, that it follows a normal distribution), and favours simpler models with
less computational complexity.
• Machine Learning: Goes beyond explanation to build predictive models whose goal is to
make accurate predictions or decisions, even if the model itself is complex or not
easily interpretable. ML works well with vast, high-dimensional datasets, even when the
data does not conform to predefined patterns, and its algorithms (like neural networks)
can uncover complex, non-linear relationships in data.
3.3 Machine Learning and Data Science, Data Mining and Data Analytics
• Data science
o It is a multidisciplinary area that employs scientific techniques, procedures,
algorithms, and systems to derive insights from structured and unstructured data.
o It combines aspects of mathematics, statistics, computer science, and domain
expertise to interpret and solve complex problems.
o Data science aims to derive actionable insights from data, enabling organizations
to make informed decisions.
• Data analytics
o It examines, cleans, transforms, and interprets data to discover meaningful
patterns, insights, and information that can inform decision-making.
o Data analysts play a crucial role in this process by applying various techniques
and tools to extract valuable insights from data.
o The data analyst's role is closely tied to data analytics: analysts are
responsible for data analysis, exploratory data analysis (EDA), and deriving
actionable insights from data.
• Data mining
o It is commonly a part of the data science pipeline. But unlike data science as
a whole, data mining is more about the techniques and tools used to uncover
previously unknown patterns in data and make data more usable for analysis.
o Data Mining focuses more on exploratory analysis, while Machine Learning
emphasizes predictive capabilities.
• Big Data
o It refers to extremely large and complex datasets that traditional data processing
methods cannot efficiently handle.
o These datasets are generated at unprecedented volume, velocity, and variety,
which are their defining characteristics:
▪ Volume
• Refers to the sheer scale of data generated daily.
• Examples: Petabytes (10¹⁵ bytes) of data from social media, IoT
devices, and online transactions.
▪ Velocity
• The speed at which data is generated, collected, and processed.
• Examples: Real-time data streams from sensors, financial markets,
or user interactions.
▪ Variety
• Refers to the different formats and types of data.
• Examples: Structured data (spreadsheets, databases), unstructured
data (text, images, videos), and semi-structured data (JSON,
XML).
• Pattern recognition
o It is the process of identifying and analysing patterns or structures in data to
extract meaningful information or make decisions.
o It is a critical component of fields like Machine Learning, Artificial Intelligence,
and Computer Vision.
• The relationship is summarized in the following diagram
4. Types of Machine Learning
In Machine Learning, "learning" refers to the process by which a model improves its
performance at a specific task by analysing data. Instead of being explicitly programmed
with fixed instructions, the model "learns" patterns, relationships, and behaviors from the
data it is exposed to.
Types of Machine Learning
4.1 Supervised Learning
• Supervised learning is a type of machine learning method in which we provide sample
labelled data to the machine learning system in order to train it, and on that basis, it
predicts the output.
• The system creates a model from the labelled data to understand the dataset and learn
about each example; once training and processing are done, we test the model by
providing sample data to check whether it predicts the correct output.
• The goal of supervised learning is to map input data to output data.
• Supervised learning is based on supervision, just as a student learns things under the
supervision of a teacher.
• An example of supervised learning is spam filtering.
• Supervised learning is summarized through the following diagram
Supervised Learning Algorithm
• Supervised learning can be grouped further in two categories of algorithms:
▪ Classification:
▪ Classification deals with predicting categorical target variables, which represent
discrete classes or labels.
▪ Classification algorithms learn to map the input features to one of the predefined
classes.
▪ A classification problem is when the output variable is a category, such as 'red' or
'blue', or 'disease' or 'no disease'.
▪ Some of the key classification algorithms are listed below (a short sketch follows
the list):
• Support Vector Machine
• Random Forest
• Decision Tree
• K-Nearest Neighbors (KNN)
• Naive Bayes
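As a minimal illustration (not from the textbook), the sketch below trains K-Nearest
Neighbors, one of the algorithms listed above, on scikit-learn's built-in iris dataset;
the split ratio and value of k are arbitrary choices.

# A minimal supervised-classification sketch using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # features and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # hold out data for testing

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                # learn from labelled examples
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))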
▪ Regression:
▪ A regression problem is when the output variable is a real value, such as 'dollars'
or 'weight'.
▪ Regression deals with predicting continuous target variables, which represent
numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms
learn to map the input features to a continuous numerical value.
▪ Here are some regression algorithms (a short sketch follows the list):
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
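A comparable sketch for regression, again illustrative rather than prescriptive: Linear
Regression fits a synthetic "house size vs. price" relationship, where all the numbers
are made up for the example.

# A minimal regression sketch: a continuous target learned from one feature.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(500, 3500, size=(100, 1))              # house size in sq. ft (made up)
y = 50 * X.ravel() + 20000 + rng.normal(0, 5000, 100)  # price with noise

model = LinearRegression().fit(X, y)
print("Learned slope:", model.coef_[0])                # close to 50
print("Predicted price for 2000 sq. ft:", model.predict([[2000]])[0])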
• Advantages:-
▪ Supervised Learning models can have high accuracy as they are trained on
labelled data.
▪ The process of decision-making in supervised learning models is often
interpretable.
▪ Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.
▪ Helps to optimize performance criteria with the help of experience.
▪ Supervised machine learning helps to solve various types of real-world
computation problems.
• Disadvantages:-
▪ It is limited to the patterns present in the training data and may struggle with
unseen or unexpected patterns.
▪ It can be time-consuming and costly, as it relies on labelled data.
▪ It may generalize poorly to new data.
• Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
▪ Image classification: Identify objects, faces, and other features in images.
▪ Natural language processing: Extract information from text, such as sentiment,
entities, and relationships.
▪ Speech recognition: Convert spoken language into text.
▪ Recommendation systems: Make personalized recommendations to users.
▪ Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
▪ Medical diagnosis: Detect diseases and other medical conditions.
▪ Fraud detection: Identify fraudulent transactions.
▪ Autonomous vehicles: Recognize and respond to objects in the environment.
▪ Email spam detection: Classify emails as spam or not spam.
▪ Quality control in manufacturing: Inspect products for defects.
▪ Credit scoring: Assess the risk of a borrower defaulting on a loan.
▪ Gaming: Recognize characters, analyze player behavior, and create NPCs.
▪ Customer support: Automate customer support tasks.
▪ Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
▪ Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
4.2 Unsupervised Learning
• Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabelled data.
• Unlike supervised learning, unsupervised learning doesn’t involve providing the
algorithm with labelled target outputs.
• The primary goal of Unsupervised learning is often to discover hidden patterns,
similarities, or clusters within the data, which can then be used for various purposes, such
as data exploration, visualization, dimensionality reduction, and more.
• Unsupervised machine learning is often used by researchers and data scientists to identify
patterns within large, unlabelled data sets quickly and efficiently.
• Unsupervised learning is summarized through the following diagram
Unsupervised Learning Algorithm
• There are two main categories of unsupervised learning that are mentioned below:
o Clustering
o Association
• Clustering
o Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in
data without the need for labeled examples.
o Here are some widely used unsupervised algorithms (a short clustering sketch
follows the list):
▪ K-Means Clustering algorithm
▪ Mean-shift algorithm
▪ DBSCAN Algorithm
▪ Principal Component Analysis (strictly a dimensionality-reduction technique)
▪ Independent Component Analysis (strictly a dimensionality-reduction technique)
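A minimal clustering sketch, assuming scikit-learn: K-Means groups synthetic,
unlabelled 2-D points into two clusters; no target labels are supplied at any point.

# A minimal clustering sketch: K-Means on two synthetic, unlabelled blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two well-separated groups of 2-D points; no labels are provided.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels of first 5 points:", kmeans.labels_[:5])
print("Cluster centres:\n", kmeans.cluster_centers_)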
• Association
o Association rule learning is a technique for discovering relationships between
items in a dataset. It identifies rules that indicate the presence of one item implies
the presence of another item with a specific probability.
o Here are some association rule learning algorithms (a sketch of the underlying
support/confidence computation follows the list):
▪ Apriori Algorithm
▪ Eclat
▪ FP-growth Algorithm
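The sketch below illustrates the core idea behind these algorithms, computing support
and confidence for a single rule over toy transactions in plain Python; real
implementations such as Apriori or FP-growth search all frequent itemsets efficiently.

# Support and confidence for the rule {milk} -> {bread} over toy transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_from, rule_to = {"milk"}, {"bread"}
conf = support(rule_from | rule_to) / support(rule_from)
print("support({milk, bread}) =", support(rule_from | rule_to))  # 0.5
print("confidence(milk -> bread) =", conf)                       # ~0.667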
• Advantages of Unsupervised Machine Learning
o It helps to discover hidden patterns and various relationships between the data.
o Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
o It does not require labeled data and reduces the effort of data labeling.
• Disadvantages of Unsupervised Machine Learning
o Without using labels, it may be difficult to predict the quality of the model’s
output.
o Cluster Interpretability may not be clear and may not have meaningful
interpretations.
o Extracting meaningful features from raw data often requires additional
techniques, such as autoencoders and dimensionality reduction.
• Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
o Clustering: Group similar data points into clusters.
o Anomaly detection: Identify outliers or anomalies in data.
o Dimensionality reduction: Reduce the dimensionality of data while preserving
its essential information.
o Recommendation systems: Suggest products, movies, or content to users based
on their historical behavior or preferences.
o Topic modeling: Discover latent topics within a collection of documents.
o Density estimation: Estimate the probability density function of data.
o Image and video compression: Reduce the amount of storage required for
multimedia content.
o Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
o Market basket analysis: Discover associations between products.
o Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
o Image segmentation: Segment images into meaningful regions.
4.3 Semi-supervised Learning
• Semi-supervised learning is a machine learning approach that incorporates elements from
both supervised learning (which uses labelled data) and unsupervised learning (which
uses unlabelled data).
• It is particularly useful when acquiring labelled data is expensive or labour-intensive but
there's an abundance of unlabelled data available.
• In practice, semi-supervised learning algorithms work with a small amount of labelled
data supplemented by a larger amount of unlabelled data.
• The goal is to leverage the structure and distribution of the unlabelled data to better
understand the overall dataset and make more accurate predictions.
• Semi-supervised learning aims to create a model that learns from both the guidance
provided by the labelled data and the freedom to explore and make inferences from the
unlabelled data, as the sketch below illustrates.
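A minimal sketch of this idea using scikit-learn's LabelPropagation (one of several
semi-supervised estimators): roughly 90% of the iris labels are hidden and the model
infers them from the few labelled points plus the structure of the data.

# A minimal semi-supervised sketch: label propagation over mostly-unlabelled data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
y_partial = y.copy()
hidden = rng.random(len(y)) < 0.9      # hide ~90% of the labels
y_partial[hidden] = -1                 # -1 marks an unlabelled sample

model = LabelPropagation().fit(X, y_partial)
print("Accuracy on the hidden points:",
      (model.transduction_[hidden] == y[hidden]).mean())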
• Advantages of Semi-Supervised Learning •
o Simplicity: The algorithms are generally straightforward and user-friendly,
making them easy to grasp. •
o Efficiency: These algorithms can be highly efficient, as they require fewer
labelled instances.
o Problem-Solving: They address certain limitations of both supervised and
unsupervised learning by utilizing a mix of labelled and unlabelled data.
• Disadvantages of Semi-Supervised Learning
o Stability: The results across iterations can be inconsistent, leading to potential
instability in the model's performance.
o Data Limitations: These algorithms are not well-suited for network-level data
which requires different analytical approaches.
o Lower Accuracy: The accuracy of semi-supervised learning may not match that
of fully supervised learning, especially if the labelled data is not representative of
the entire dataset.
• Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
o Image Classification and Object Recognition: Improve the accuracy of models
by combining a small set of labeled images with a larger set of unlabeled images.
o Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a vast
amount of unlabeled text.
o Speech Recognition: Improve the accuracy of speech recognition by leveraging
a limited amount of transcribed speech data and a more extensive set of unlabeled
audio.
o Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions (labeled
data) with a wealth of unlabeled user behavior data.
o Healthcare and Medical Imaging: Enhance medical image analysis by utilizing
a small set of labeled medical images alongside a larger set of unlabeled images.
4.4 Reinforcement Learning
• Reinforcement learning (RL) is a branch of machine learning where an AI agent learns
to make decisions by executing actions and receiving feedback (reward or penalty),
optimizing for a cumulative reward.
• This method stands out because it does not need labelled data. Instead, the agent learns
through the outcomes of its actions, akin to trial and error.
• The RL process parallels human experiential learning, much like how a child learns from
daily interactions.
• For instance, in video games, the agent learns to play better by making moves (actions)
in various situations (states) and receiving scores (rewards or penalties) for those moves.
• Reinforcement learning has diverse applications across fields such as game theory,
operations research, and multi-agent systems.
• Typically, RL problems are framed as Markov Decision Processes, where the agent's
interaction with its environment involves a cycle of states, actions, and feedback,
leading to new states and learning opportunities; the sketch below walks through one
such cycle.
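A self-contained Q-learning sketch of this state-action-reward cycle; the corridor
environment and all constants below are invented for illustration.

# Q-learning in a tiny corridor world: states 0..4, the agent moves left or
# right, and earns a reward of 1 only on reaching the terminal state 4.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # table of learned action values
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        a = rng.integers(n_actions) if rng.random() < epsilon else Q[s].argmax()
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move towards reward plus discounted future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# Non-terminal states should prefer action 1 (right); state 4 is terminal.
print("Greedy action per state:", Q.argmax(axis=1))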
• Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
▪ Positive reinforcement
o Rewards the agent for taking a desired action.
o Encourages the agent to repeat the behavior.
o Examples: Giving a treat to a dog for sitting, providing a point in a game for a
correct answer.
▪ Negative reinforcement
o Removes an undesirable stimulus when the agent takes a desired action.
o Encourages the agent to repeat that behavior so as to avoid the unpleasant
stimulus.
o Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty
by completing a task.
• Advantages of Reinforcement Learning
o Complex Problem Solving: Reinforcement learning is adept at tackling complex,
real-world problems that conventional algorithms may struggle with.
o Human-like Learning: The RL model mimics human learning processes, often
leading to highly accurate and efficient solutions.
o Long-Term Benefits: RL is designed to maximize not just immediate rewards but
also long-term gains, making it effective for strategies that unfold over time.
• Disadvantages of Reinforcement Learning
o Not Suited for Simple Tasks: RL algorithms may be unnecessarily complex for
simple problems, where simpler algorithms could suffice.
o Data and Computation Intensive: These algorithms require substantial amounts
of data and computational power to function effectively.
o Risk of State Overload: Excessive use of reinforcement learning can result in a
state space that is too large to manage effectively, potentially degrading the
performance of the model.
• Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
o Game Playing: RL can teach agents to play games, even complex ones.
o Robotics: RL can teach robots to perform tasks autonomously.
o Autonomous Vehicles: RL can help self-driving cars navigate and make
decisions.
o Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
o Healthcare: RL can be used to optimize treatment plans and drug discovery.
o Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
o Finance and Trading: RL can be used for algorithmic trading.
o Supply Chain and Inventory Management: RL can be used to optimize supply
chain operations.
5. Challenges of Machine Learning
1. Data Quality and Quantity
• Challenge: Data is the foundation of any machine learning model, and the quality and
quantity of the data you have can significantly impact the performance of your models.
Poor-quality data, such as data with missing values, noise, or inconsistencies, can lead to
inaccurate predictions. Additionally, insufficient data can prevent the model from
learning the underlying patterns, leading to overfitting or underfitting.
• How to Overcome: To address data quality issues, it’s essential to invest time in data
cleaning and preprocessing. Techniques such as imputation for missing values, outlier
detection, and normalization can help improve data quality. For situations where you lack
sufficient data, consider data augmentation, synthetic data generation, or using transfer
learning to leverage pre-trained models on similar datasets. Collaborating with domain
experts can also help in understanding the nuances of the data and improving its quality.
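A minimal sketch of the imputation and normalization steps mentioned above, assuming
scikit-learn; the toy table stands in for real records with missing values.

# Filling missing values and normalizing features with scikit-learn.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50000.0],
              [32.0, np.nan],      # a missing salary value
              [np.nan, 61000.0],   # a missing age value
              [45.0, 72000.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill the gaps
X_scaled = StandardScaler().fit_transform(X_imputed)         # zero mean, unit variance
print(X_scaled.round(2))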
2. Choosing the Right Algorithms
• Challenge: With a wide range of machine learning algorithms available, selecting the
right one for your specific problem can be daunting. Different algorithms have varying
strengths, weaknesses, and suitability depending on the nature of the data and the task at
hand. Using an inappropriate algorithm can lead to suboptimal model performance.
• How to Overcome: Start by understanding the problem you’re trying to solve and the
type of data you have. Supervised learning algorithms like decision trees and support
vector machines are suitable for classification tasks, while unsupervised learning
algorithms like k-means clustering are better for finding hidden patterns in unlabeled
data. Experiment with different algorithms and use techniques like cross-validation to
evaluate their performance. Tools like scikit-learn provide user-friendly interfaces for
implementing and comparing multiple algorithms.
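A short sketch of this workflow with scikit-learn: two candidate classifiers are
compared on the same dataset using 5-fold cross-validation (the models and dataset are
arbitrary stand-ins).

# Comparing candidate algorithms with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for model in (DecisionTreeClassifier(random_state=0), SVC()):
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(type(model).__name__, "mean accuracy:", scores.mean().round(3))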
3. Model Interpretability
• Challenge: Machine learning models, especially complex ones like deep neural networks,
are often referred to as “black boxes” because it can be challenging to understand how
they arrive at their predictions. This lack of interpretability can be a significant barrier
when trying to build trust in the model’s decisions, particularly in fields like healthcare
and finance where transparency is critical.
• How to Overcome: To improve model interpretability, consider using simpler models like
decision trees or linear models, which are inherently more interpretable. For more
complex models, techniques such as LIME (Local Interpretable Model-agnostic
Explanations) and SHAP (SHapley Additive exPlanations) can help provide insights into
how the model makes its predictions. Additionally, feature importance scores can help
identify which variables have the most influence on the model’s output.
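A minimal sketch of the feature-importance idea mentioned above, assuming scikit-learn:
a random forest reports how strongly each input variable influenced its predictions.

# Feature importance scores from a random forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)
for name, score in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")   # larger score = more influential feature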
4. Overfitting and Underfitting
• Challenge: Overfitting occurs when a model learns the noise in the training data rather
than the actual underlying patterns, leading to poor generalization on new data.
Underfitting, on the other hand, happens when the model is too simple to capture the
complexity of the data, resulting in poor performance even on the training data.
• How to Overcome: To combat overfitting, consider techniques such as cross-validation,
regularization (e.g., L1 or L2 regularization), and pruning for decision trees. Additionally,
ensuring that your training data is representative of the real-world data the model will
encounter is crucial. For underfitting, try increasing the model complexity by adding
more features, using more powerful algorithms, or tuning the model’s hyperparameters.
Monitoring learning curves can also help you identify and address both overfitting and
underfitting early in the model development process.
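The sketch below makes the contrast concrete: polynomials of degree 1, 4, and 15 are
fitted to the same noisy data, and the train/test errors show underfitting at low
degree and overfitting at high degree (all numbers are synthetic).

# Underfitting vs. overfitting: polynomial fits of increasing degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, (60, 1)), axis=0)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree:2d}: "
          f"train MSE {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")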
5. Bias/Variance
• Bias refers to the error introduced by approximating a real-world problem, which may
be too complex, by a simpler model. High bias typically leads to underfitting, where the
model performs poorly on both training and test data.
Causes:
• Using a model that is too simple for the data (e.g., linear regression for non-
linear data).
• Not having enough features or input variables.
• Variance refers to the error caused by the model being too sensitive to fluctuations in the
training data. High variance typically leads to overfitting, where the model performs well
on training data but poorly on unseen test data.
Causes:
• Using a model that is too complex for the data (e.g., a deep neural network on a
small dataset).
• Training on noisy or insufficient data.
• The Bias-Variance Tradeoff
These two are inherently in a tradeoff. Decreasing bias often increases variance, and vice
versa. The goal is to find the right balance between them for optimal model performance.
6. Machine Learning process
Machine learning commonly uses the CRISP-DM (Cross-Industry Standard Process for Data
Mining) process model. The steps involved in the process are described below:
• Business Understanding: This step involves understanding the problem that needs to be
solved and defining the objectives of the data mining project. This includes identifying
the business problem, understanding the goals and objectives of the project, and defining
the KPIs that will be used to measure success. This step is important because it helps
ensure that the data mining project is aligned with business goals and objectives.
• Data Understanding: This step involves collecting and exploring the data to gain a better
understanding of its structure, quality, and content. This includes understanding the
sources of the data, identifying any data quality issues, and exploring the data to identify
patterns and relationships. This step is important because it helps ensure that the data is
suitable for analysis.
• Data Preparation: This step involves preparing the data for analysis. This includes
cleaning the data to remove any errors or inconsistencies, transforming the data to make
it suitable for analysis, and integrating the data from different sources to create a single
dataset. This step is important because it ensures that the data is in a format that can be
used for modeling.
• Modeling: This step involves building a predictive model using machine learning
algorithms. This includes selecting an appropriate algorithm, training the model on the
data, and evaluating its performance. This step is important because it is the heart of the
data mining process and involves developing a model that can accurately predict
outcomes on new data.
• Evaluation: This step involves evaluating the performance of the model. This includes
using statistical measures to assess how well the model is able to predict outcomes on
new data. This step is important because it helps ensure that the model is accurate and
can be used in the real world.
• Deployment: This step involves deploying the model into the production environment.
This includes integrating the model into existing systems and processes to make
predictions in real-time. This step is important because it allows the model to be used in
a practical setting and to generate value for the organization.
A Machine Learning Process
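As an illustrative mapping of these steps onto code (a sketch, not a full CRISP-DM
project), the snippet below prepares a built-in dataset, models it, evaluates on
held-out data, and stands in for deployment by scoring a new record.

# A compact pipeline loosely following the process steps above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Data understanding / preparation: load and split the data.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Modeling: scaling plus logistic regression in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Evaluation: accuracy on data the model has never seen.
print("Held-out accuracy:", model.score(X_te, y_te))

# Deployment (stand-in): score a new, unseen record.
print("Prediction for first test record:", model.predict(X_te[:1]))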
7. Machine Learning applications
Common applications of machine learning, several of which are detailed under the
individual learning types above, include image and speech recognition, natural language
processing, recommendation systems, fraud detection, medical diagnosis, autonomous
vehicles, weather forecasting, and algorithmic trading.
8. What is data?
• Data is any collection of facts or measurements that can be recorded and analysed.
• It can exist in various formats, such as numbers, text, images, sounds, or videos.
• It's the foundation upon which machine learning models are built and improved.
• Why is data important?
▪ Data helps in making better decisions.
▪ Data helps in solving problems by finding the reason for underperformance.
▪ Data helps one to evaluate performance.
▪ Data helps one improve processes.
▪ Data helps one understand consumers and the market.
• Data is available from different sources such as flat files, databases and data warehouses.
• Operational data: data produced by an organization's day-to-day operations.
Customer, inventory, and purchase data fall into this category.
• Non-operational data: data used for decision making rather than produced by
daily operations.
• Data by itself is meaningless unless it is labelled and processed to generate
information.
• Processed data is called information; it includes patterns, associations, and
relationships among the data.
• Elements of Big Data
▪ Small data: Data whose volume is small enough to be stored and processed by a
single small-scale computer is called small data.
▪ Big data: Big data refers to extremely large and complex data sets that traditional data
processing tools cannot handle efficiently.
▪ Big data is commonly described by a set of characteristics known as the "Five
Vs" (with variability sometimes added as a sixth):
• Volume: The sheer amount of data generated and stored. Big data typically
involves terabytes, petabytes, or even zettabytes of data. For instance, social
media platforms handle billions of posts, likes, and comments daily.
• Velocity: The speed at which data is generated and processed. Big data often
requires real-time or near-real-time processing. Examples include stock market
transactions, sensor data from connected cars, or website clicks.
• Variety: The different types of data, including structured, semi-structured, and
unstructured data such as text, images, audio, and video. It also refers to
heterogeneous sources.
• Veracity: The quality and reliability of the data. Ensuring data accuracy and
trustworthiness is crucial for meaningful analysis. Cleaning and validating big
data is a vital process to ensure meaningful results.
• Value: The potential insights and benefits that can be derived from analysing big
data. The ultimate goal is to extract valuable information that can drive decision-
making and innovation
• Variability: How often the meaning, structure, or availability of the data
changes. Example: you eat the same ice-cream daily, but the taste keeps
changing.
9. Types of data
• Data comes in many different forms, each defined by its unique characteristics,
sources and formats. Understanding these distinctions can allow for more effective
organization and data analysis, as different types of data support different use cases.
• Furthermore, a single data point or data set can fall under multiple categories;
for example, it can be both structured and quantitative, or both unstructured
and qualitative.
• Some of the most common types of data include:
o Quantitative data
o Qualitative data
o Structured data
o Unstructured data
o Semi-structured data
o Metadata
o Big data
• Quantitative data
o Quantitative data consists of values that can be measured numerically.
Examples of quantitative data include discrete data points (such as the
number of products sold) or continuous data points (such as temperature or
revenue figures).
o Quantitative data is often structured, making it easy to analyse using
mathematical tools and algorithms.
o Common use cases of quantitative data include trend forecasting, statistical
analysis, budgeting, pattern identification and performance measurement.
• Qualitative data
o Qualitative data is descriptive and non-numerical, capturing characteristics,
concepts or experiences that numbers cannot measure. Examples include
customer feedback, product reviews and social media comments.
o Qualitative data can be structured (such as coded survey responses) or
unstructured (such as free-text responses or interview transcripts).
o Common use cases for qualitative data include understanding customer
behaviour, market trends and user experiences.
• Structured data
o Structured data is organized in a clear, defined format, often stored in
relational databases or spreadsheets. It can consist of both quantitative (such
as sales figures) and qualitative data (such as categorical labels like “yes or
no”).
o Examples of structured data include customer records and financial reports,
where data fits neatly into rows and columns with predefined fields.
o The highly organized nature of structured data allows for quick querying and
data analysis, making it useful for business intelligence systems and
reporting processes.
o Structured data encountered in machine learning are as follow:
▪ Record data refers to information that is organized and stored in a record
format, often within databases or tables. Each record typically represents
a single entity or event and consists of multiple fields (columns), where
each field contains a specific attribute or property of the entity.
▪ A data matrix is a structured representation of data arranged in rows and
columns, often used in mathematics, statistics, and machine learning to
organize and analyze information. Rows represent observations or instances
(each row corresponds to a single data record); columns represent variables,
features, or attributes of the data.
▪ Graph data represents information in the form of entities (nodes) and
relationships (edges) between them. This structure is widely used to model
complex systems where connections or relationships are significant, such
as social networks, transportation networks, and knowledge graphs.
▪ Ordered data refers to data that is organized in a specific, meaningful
sequence or order. The arrangement of the data is crucial because it
conveys information about the context or relationships between elements.
• Examples of Ordered Data:
o Time Series Data: Time series data refers to a sequence of data points
collected or recorded over time at consistent intervals. This type of data
is widely used in various fields to observe trends, patterns, and
relationships.
o Sequence Data: Sequence data refers to information that is arranged in
a specific order where the arrangement itself is significant. It's
commonly used in contexts where the relationship between data points
depends on their position in the sequence.
o Spatial Data: Spatial data (also known as geospatial data) refers to
information that describes the physical location and shape of objects in
space, as well as the relationships between them. It's widely used in
fields like geography, urban planning, transportation, and
environmental science.
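A small pandas sketch of two of the structured forms described above: record data as a
table of rows (records) and columns (attributes), and time-series data as values
ordered by timestamp. All values below are made up.

# Record data and time-series data with pandas.
import pandas as pd

# Record data: each row is one entity, each column an attribute.
records = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "city": ["Bangalore", "Mysore", "Hubli"],
    "purchases": [5, 2, 7],
})
print(records)

# Time series: the same measurement ordered by timestamp.
sales = pd.Series([120, 135, 128],
                  index=pd.date_range("2024-01-01", periods=3, freq="D"))
print(sales)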
• Unstructured data
o Unstructured data lacks a strictly defined format. It often comes in complex
forms such as text documents, images and videos. Unstructured data can
include both qualitative information (such as customer comments) and
quantitative elements (such as numerical values embedded in text).
o Examples of unstructured data include emails, social media content and
multimedia files.
o Unstructured data doesn’t easily fit into traditional relational databases, and
organizations often use techniques such as natural language processing
(NLP) and machine learning to streamline analysis of unstructured data.
o Unstructured data often plays a key role in sentiment analysis, complex
pattern recognition and other advanced analytics projects.
• Semi-structured data
o Semi-structured data blends elements of structured and unstructured data. It
doesn't follow a rigid format but can include tags or markers that make it
easier to organize and analyze. Examples of semi-structured data include
XML files and JSON objects.
o Semi-structured data is widely used in scenarios such as web scraping and
data integration projects because it offers flexibility while retaining some
structure for search and analysis.
• Metadata
o Metadata is data about data. In other words, it is information about the
attributes of a data point or data set, such as file names, authors, creation
dates or data types.
o Metadata enhances data organization, searchability and management. It is
critical to systems such as databases, digital libraries and content
management platforms because it helps users more easily sort and find the
data they need.
• Big data
o Big data refers to massive, complex data sets that traditional systems can't
handle. It includes both structured and unstructured data from sources such
as sensors, social media and transactions.
o Big data analytics helps organizations process and analyze these large data
sets to systematically extract valuable insights. It often requires advanced
tools such as machine learning.
10. Big data analytics and types of analytics
• Big data analytics is the process of examining, analyzing, and interpreting large and
complex datasets (big data) to uncover patterns, correlations, trends, and insights. It
involves advanced techniques and tools to handle the sheer scale, variety, and velocity of
data that traditional data processing systems cannot manage effectively.
• Big data analytics has applications across industries—predicting customer behavior in
retail, improving healthcare outcomes, optimizing supply chains, and even driving urban
planning.
• The different types of Big data analytics are as follows:
1. Descriptive Analytics (What happened?)
Definition:
Descriptive analytics helps organizations understand past trends and events by
summarizing historical data. It provides a clear picture of what has happened using
dashboards, reports, and visualizations.
How It Works:
• Uses data aggregation and data mining techniques.
• Relies on tools like charts, graphs, and dashboards to present findings.
Examples:
• A company analyzing monthly sales data to see which products sold the most.
• A website tracking the number of visitors per day and their geographical locations.
• A hospital reviewing patient admission trends over the last five years.
Tools Used:
Excel, Google Analytics, Tableau, Power BI
2. Diagnostic Analytics (Why did it happen?)
Definition:
This type of analytics digs deeper into data to find the root cause of past events. It
helps answer "why" something happened by looking for correlations and patterns.
How It Works:
• Uses statistical techniques such as drill-down analysis, data discovery, and
correlations.
• Helps businesses understand the factors affecting their performance.
Examples:
• A company analyzing why customer churn increased last quarter by looking at
customer complaints and satisfaction scores.
• A retail store studying why sales dropped in a particular region by checking
weather patterns, competitor activity, or marketing efforts.
• A hospital investigating why patient readmission rates are rising by analyzing
patient records and treatment histories.
Tools Used:
SQL, Python, R, SAS, Splunk
3. Predictive Analytics (What is likely to happen?)
Definition:
Predictive analytics uses historical data, machine learning, and statistical models to
forecast future trends and outcomes. It helps organizations anticipate events and make
proactive decisions.
How It Works:
• Uses techniques like regression analysis, neural networks, and machine learning
algorithms.
• Requires large datasets and computational power to make accurate predictions.
Examples:
• An e-commerce company predicting which products a customer is likely to buy
next based on browsing history.
• A bank assessing a customer’s likelihood of defaulting on a loan using credit history
and spending patterns.
• A weather forecasting system predicting hurricanes based on atmospheric data.
Tools Used:
Python (Scikit-learn, TensorFlow), IBM Watson, AWS Machine Learning,
RapidMiner
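A toy predictive-analytics sketch along the lines of the loan-default example above,
using scikit-learn; the features, labelling rule, and numbers are all invented for
illustration.

# Predicting default risk for a new applicant from (synthetic) past records.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Features: [credit score, monthly spending ratio]; label: 1 = defaulted.
X = np.column_stack([rng.uniform(300, 850, 200), rng.uniform(0, 1, 200)])
y = ((X[:, 0] < 550) & (X[:, 1] > 0.6)).astype(int)  # toy labelling rule

model = LogisticRegression().fit(X, y)
new_applicant = [[480, 0.8]]
print("Default probability:", model.predict_proba(new_applicant)[0, 1])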
4. Prescriptive Analytics (What should we do about it?)
Definition:
Prescriptive analytics takes predictive insights and provides recommendations on what
actions to take. It is the most advanced form of analytics, using AI and optimization
algorithms to suggest the best course of action.
How It Works:
• Uses AI, machine learning, and optimization techniques to suggest decisions.
• Often involves simulations, scenario analysis, and decision automation.
Examples:
• Google Maps suggesting the fastest route based on real-time traffic conditions.
• An airline adjusting ticket prices dynamically based on demand and weather
conditions.
• A healthcare system recommending personalized treatment plans for patients based
on genetic and medical data.
Tools Used:
IBM Watson, Google AI, SAS, Oracle AI, Apache Spark
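At its core, prescriptive analytics chooses the action that maximizes an objective under constraints. A tiny optimization sketch with SciPy (all numbers hypothetical):
```python
from scipy.optimize import linprog

# Hypothetical: maximize profit 40x + 30y subject to limited machine and labour hours.
# linprog minimizes, so the objective coefficients are negated.
c = [-40, -30]
A_ub = [[1, 1],    # machine-hours used per unit of x and y
        [2, 1]]    # labour-hours used per unit of x and y
b_ub = [100, 160]  # hours available

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # recommended production plan and the profit it yields
```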
11. Big data analytics framework
• Many frameworks have been proposed for the analysis of big data. A four-layer Big Data analytics framework is discussed below:
o Data Collection Layer:
▪ Purpose: This layer collects, stores, and manages raw data from various sources.
Key Components:
▪ Data Sources: Structured (databases), semi-structured (logs, XML), and
unstructured (social media, images, videos).
▪ Data Storage: Traditional databases (SQL), NoSQL databases (MongoDB,
Cassandra), Data Lakes (Hadoop, AWS S3).
▪ Data Processing: Batch processing (Hadoop MapReduce) and real-time
processing (Apache Kafka, Apache Spark).
▪ Example: An e-commerce company collects data from website clicks, customer
transactions, and social media interactions.
o Data Management Layer
▪ Purpose: This layer processes, cleans, and transforms raw data into a usable
format.
Key Components:
▪ ETL (Extract, Transform, Load): Extracts data from different sources, cleans it,
and loads it into a data warehouse (e.g., Apache Nifi, Talend, AWS Glue).
▪ Data Governance: Ensures data quality, security, and compliance (GDPR,
HIPAA).
▪ Data Warehouses: Stores processed data for analysis (Google BigQuery,
Snowflake, Amazon Redshift).
▪ Example: A telecom company cleans call records, removes duplicate data, and
organizes it for billing analytics.
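Mirroring the telecom example, a minimal ETL-style sketch in pandas (the call records are hypothetical, and a CSV stands in for the data warehouse):
```python
import pandas as pd

# Extract: hypothetical raw call records (illustrative only)
calls = pd.DataFrame({
    "caller":       ["98450", "98450", "99010"],
    "duration_sec": [60, 60, 300],
    "date":         ["2024-01-01", "2024-01-01", "2024-01-02"],
})

# Transform: drop duplicate records and normalize types
clean = calls.drop_duplicates().copy()
clean["date"] = pd.to_datetime(clean["date"])

# Load: write the cleaned data to the analytics store (a CSV here for simplicity)
clean.to_csv("clean_calls.csv", index=False)
```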
o Data Analysis Layer
▪ Purpose: This layer applies different analytics techniques to generate insights.
Key Components:
▪ Descriptive Analytics: Dashboards, reports (Tableau, Power BI).
▪ Diagnostic Analytics: Root cause analysis, trend identification (Python, R, SQL).
▪ Predictive Analytics: Machine learning models (TensorFlow, Scikit-learn).
▪ Prescriptive Analytics: AI-driven decision-making (IBM Watson, Google AI).
▪ Example: A bank uses machine learning to predict loan default risks based on
customer data.
o Data Visualization Layer
▪ Purpose: This layer presents insights through reports, dashboards, and business
applications.
Key Components:
▪ Data Visualization: Power BI, Tableau, Looker.
▪ Business Applications: CRM, ERP, AI-powered recommendation systems.
▪ User Interaction: Mobile apps, web dashboards, AI chatbots.
▪ Example: A retailer uses a real-time dashboard to track inventory levels and predict
demand fluctuations.
12. Descriptive statistics
• Descriptive statistics refers to a set of statistical methods used to summarize and present
data in a clear and understandable form. It involves organizing raw data into tables,
charts, or numerical summaries, making it easier to identify patterns, trends, and
anomalies.
• Its primary aim is to define and analyze the fundamental characteristics of a dataset
without making sweeping generalizations or assumptions about the entire data set.
• It helps to organize the data in a more manageable and readable format.
• Dataset and Datatypes:
o The different data types are listed in the following chart
o Categorical or Qualitative data
• Categorical data can be put in groups or categories using names or labels. This
grouping is typically generated using a matching procedure based on data
attributes and similarities between these qualities.
• Each piece of a categorical dataset, also known as qualitative data, may be
assigned to only one category based on its qualities, and each category is mutually
exclusive.
• There are two primary categories of categorical data:
• Nominal data:
• This is the data category that names or labels its categories. It has features
resembling a noun and is occasionally referred to as naming data.
• Example: Colors (red, blue, green), gender (M/F), and types of animals.
• Ordinal data:
• Elements with rankings, orders, or rating scales are included in this category of categorical data. Ordinal data can be ordered and counted, but the differences between values cannot be measured.
• Example: Education level (high school, college, graduate), temperature (high, medium, low)
• Numerical or Quantitative data
• Data expressed in numerical terms rather than in natural language descriptions is called numerical data. It is collected only in numerical form, hence the name. This data type, also referred to as quantitative data, can be used to measure a person’s height, weight, IQ, etc.
• Numerical data can be of two types:
• Discrete Data:
• Countable numerical data are discrete data. In other words, they can be mapped one-to-one to the natural numbers.
• Example: Age, the number of students in a class, the number of candidates in
an election, etc., are a few examples of discrete data in general.
• Continuous Data:
• This is an uncountable numerical data type. Continuous data are represented as a series of intervals on the real number line.
• Example: Student CGPA, height, and other continuous data types are a few
examples.
• Interval Data:
• In ordinal scales, the interval between adjacent values is not constant. For example, the difference in finishing time between the 1st and 2nd place horses need not be the same as that between the 2nd and 3rd place horses.
• An interval scale has a constant interval but lacks a true 0 point. As a result, one
can add and subtract values on an interval scale, but one cannot multiply or
divide units.
• Hence, it is similar to ordinal but the differences or intervals between values or
rankings are equally split. Therefore, the difference between a ranking of a 7 and
a 6 is the same as the difference between a ranking of a 9 and a 10 on a 10-point
scale.
• Ratio Data:
• Ratio data is the most complex of the four scales of measurement, as well as the most preferred scale of measurement.
• It has all the same properties as interval data but possesses a natural zero, meaning there is a point where that measurement, whatever it may be, does not exist.
• A ratio scale has the property of equal intervals but also has a true 0 point. As a
result, one can multiply and divide as well as add and subtract using ratio scales.
• Variables such as height, weight, and duration are ratio data.
o Third way of categorizing data (classification chart not reproduced here).
13. Univariate Data Analysis and Visualization
• Univariate data refers to a type of data in which each observation or data point
corresponds to a single variable.
• It involves the measurement or observation of a single characteristic or attribute for each
individual or item in the dataset.
• Analysing univariate data is the simplest form of analysis in statistics.
• Key points in Univariate analysis:
• No Relationships: Univariate analysis focuses solely on describing and summarizing the
distribution of the single variable. It does not explore relationships between variables or
attempt to identify causes.
• Descriptive Statistics: Descriptive statistics, such as measures of central tendency (mean,
median, mode) and measures of dispersion (range, standard deviation), are commonly
used in the analysis of univariate data.
• Visualization: Histograms, box plots, and other graphical representations are often used
to visually represent the distribution of the single variable
• Types of univariate analyses
The following are the most common types of summary statistics:
o Measures of dispersion: these numbers describe how spread out the values in a dataset are. The range, standard deviation, interquartile range, and variance are some examples.
o Range: the difference between the highest and lowest value in a data set.
o Standard deviation: an average measure of the spread.
o Interquartile range: the spread of the middle 50% of the values.
o Measures of central tendency: these numbers describe the location of the center
point of a data set or the middle value of the data set. The mean, median and mode
are the three main measures of central tendency.
• Data Visualization
o Data visualization is the graphical representation of information and data.
o By using visual elements like charts, graphs, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data.
o It provides an excellent way for employees or business owners to present data to non-
technical audiences without confusion.
o Businesses, researchers, and analysts rely on visualization tools to interpret large
datasets efficiently, detect anomalies, and drive strategic insights.
o By transforming numbers into meaningful visuals, data visualization enhances
comprehension, storytelling, and informed decision-making across industries.
o Why is Data Visualization Important?
o Simplifying Complex Data: Data visualization simplifies large and complex datasets by converting them into graphs, charts, and interactive visuals, making decision-making more efficient and accessible.
o Enhancing Data Interpretation: Visual representation of data enables better
pattern recognition and insight extraction. Businesses can identify
correlations, track performance, and detect anomalies, leading to more
informed strategic planning.
o Saving Time in Decision-Making: Visualization tools accelerate business
intelligence and research insights by providing instant data overviews.
o Improving Communication: Pie charts, trend graphs, and infographics help
professionals present findings in a way that is easier to understand for
stakeholders. Whether in business reports, investor meetings, or research
presentations, visualizations enhance engagement and message retention.
o Strengthening Big Data Analytics: As organizations handle massive data
volumes, visualization becomes essential for processing, filtering, and
analyzing large-scale datasets. AI-powered real-time analytics dashboards
and predictive modeling tools enable businesses to extract actionable insights
from structured and unstructured data, driving efficiency in industries like
finance, healthcare, and e-commerce.
o Advantages of Data Visualization:
o Faster Data Comprehension – Data visualization enables users to grasp
complex information quickly by presenting it in an intuitive format. Instead of
analyzing spreadsheets or raw data, decision-makers can interpret charts, graphs,
and dashboards efficiently.
o Identification of Correlations and Anomalies – Visualizing data helps detect
patterns, relationships, and outliers that might not be apparent in raw datasets.
This is particularly useful in fraud detection, market analysis, and performance
tracking.
o Enhanced User Engagement – Well-designed visualizations make data more
accessible and engaging. Interactive dashboards and AI-driven visualizations
allow users to explore datasets dynamically, improving data-driven storytelling
and communication.
o Disadvantages of Data Visualization:
o Risk of Misinterpretation Due to Poor Design – Incorrect use of chart types,
misleading scales, or color schemes can distort insights, leading to flawed
conclusions. Overly complex visualizations may also confuse users rather than
clarify data.
o Data Bias and Misleading Representations – Visualization tools can
inadvertently amplify biases if data selection is not handled carefully. Cherry-
picked data or improper aggregation may result in skewed narratives that mislead
decision-makers.
o Performance Challenges with Large Datasets – Handling massive datasets in
real-time dashboards requires significant computational power and efficient
algorithms. Poorly optimized visualizations can slow down performance, making
analysis cumbersome for users.
o Data Visualization and Big Data - As organizations generate massive volumes
of data, visualization plays a crucial role in analyzing, interpreting, and extracting
insights from large-scale datasets. Without visual representation, handling big
data can be overwhelming, making it difficult to identify trends and correlations.
Advanced visualization techniques, such as real-time dashboards, heatmaps, and
predictive analytics models, help simplify complex data structures, enabling faster
and more informed decision-making.
o Types of Data Visualizations
▪ Bar Charts – Used to compare categorical data, making them ideal for sales
analysis, survey results, and financial reporting.
▪ Pie Chart: A circular chart divided into wedge-shaped segments (sectors) that shows data as a percentage of a whole.
▪ Histogram: A type of bar chart that splits a continuous measure into different bins to help analyze the distribution.
▪ Dot Plots: The Wilkinson dot plot represents the distribution of continuous data
in the form of individual dots for each value.
▪ Heatmaps – Represent data density using color gradients, commonly used in
website analytics and geographic data analysis.
o Bar Charts
o Bar charts enable us to compare numerical values like integers and
percentages.
o They use the length of each bar to represent the value of each variable. For example, bar charts show variations across categories or subcategories by scaling the height or width of simple, evenly spaced bars (rectangles).
o Bar charts can represent quantitative measures vertically, on the y-axis, or
horizontally, on the x-axis. The style depends on the data and on the
questions the visualization addresses.
o The qualitative dimension will go along the opposite axis of the quantitative
measure.
o Bar charts typically have a baseline of zero. If another starting point is used,
the axis should be clearly labeled to avoid misleading the viewer.
o A good bar chart will follow these rules:
o The base starts at zero
o The axes are labeled clearly
o Colors are consistent and defined
o The bar chart does not display too many bars
o When creating a bar chart, do not:
o Make each bar a different width
o Cram too many bars into subcategories
o Leave the axes unlabeled
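For illustration, a minimal bar chart sketch in Python that follows the rules above (matplotlib is assumed as the plotting library, and the sales figures are hypothetical):
```python
import matplotlib.pyplot as plt

products = ["A", "B", "C", "D"]          # qualitative dimension
units_sold = [120, 95, 150, 80]          # quantitative measure (hypothetical)

plt.bar(products, units_sold, color="steelblue")  # baseline starts at zero by default
plt.xlabel("Product")                             # axes labeled clearly
plt.ylabel("Units sold")
plt.title("Monthly sales by product")
plt.show()
```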
o Pie Chart
o A pie chart is a circular statistical graphic that visually displays data in a circular
graph. It is divided into slices to illustrate numerical proportion, where the arc
length of each slice (and consequently its central angle and area) is proportional
to the quantity it represents.
o Pie charts are commonly used to represent data using the attributes of circles,
spheres, and angular data to represent real-world information.
o Pie Chart Formula
We know that the total value of the pie is always 100%. It is also known that a
circle subtends an angle of 360°. Hence, the total of all the data is equal to 360°.
Based on these, there are two main formulas used in pie charts:
o To calculate the percentage of the given data, we use the formula:
(Frequency ÷ Total Frequency) × 100
o To convert the data into degrees we use the formula:
(Given Data ÷ Total value of Data) × 360°
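Both formulas can be checked with a few lines of Python (a sketch with hypothetical monthly expense data):
```python
# Hypothetical monthly expenses (illustrative data)
frequencies = {"Rent": 9000, "Food": 6000, "Travel": 3000}
total = sum(frequencies.values())   # total of the pie = 100% = 360 degrees

for category, f in frequencies.items():
    percentage = f / total * 100    # (Frequency / Total Frequency) x 100
    degrees = f / total * 360       # (Given Data / Total value of Data) x 360
    print(f"{category}: {percentage:.1f}% -> {degrees:.0f} degrees")
```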
o Histogram
o A histogram is the graphical representation of data where data is grouped into
continuous number ranges and each range corresponds to a vertical bar.
o The horizontal axis displays the number range.
o The vertical axis (frequency) represents the amount of data that is present in each
range.
o The number ranges depend upon the data that is being used
o A histogram shows the distribution of a dataset, it is used for displaying the
continuous (or quantitative) form of data frequency distribution.
o On a histogram, data is shown in intervals, and the height of each bar corresponds to the frequency or count of data points found in that specific interval.
o It allows us to assess where the values are concentrated, what the extremes
are, and whether there are any gaps or anomalous values
o For example: in a hospital, there are 20 newborn babies whose ages in increasing
order are as follows: 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5. This
information can be shown in a frequency distribution table as follows:
Age:       1   2   3   4   5
Frequency: 4   5   8   2   1
Histogram of the Hospital data (figure not reproduced here).
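The same distribution can also be plotted in code (a minimal sketch using matplotlib, an assumed library choice, with the newborn ages above):
```python
import matplotlib.pyplot as plt

ages = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5]

# bins define the continuous number ranges; each bar's height is the frequency
plt.hist(ages, bins=[1, 2, 3, 4, 5, 6], edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of the hospital data")
plt.show()
```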
o Dot plots
o A dot plot is a simple form of data visualization that consists of data points plotted as
dots on a graph with an x- and y-axis.
o These types of charts are used to graphically depict certain data trends or groupings. A
dot plot is similar to a histogram in that it displays the number of data points that fall
into each category or value on the axis, thus showing the distribution of a set of data.
o There are two types of dot plots: the Cleveland and Wilkinson dot plots.
o This type of charting method is commonly used by the Federal Open Market Committee
(FOMC).
o Dot plots are generally arranged with one axis showing the range of values or categories
along which the data points are grouped.
o The second axis shows the number of data points in each group. Dots may be vertically
or horizontally stacked to show how many are in each group for easy visual comparison.
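A small Wilkinson-style dot plot sketch (matplotlib assumed; the data is hypothetical), stacking one dot per observation over its value:
```python
from collections import Counter
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]      # hypothetical observations
counts = Counter(data)

# One dot per observation, stacked vertically over its value
for value, count in counts.items():
    plt.scatter([value] * count, range(1, count + 1), color="steelblue")

plt.xlabel("Value")
plt.ylabel("Count")
plt.yticks(range(1, max(counts.values()) + 1))
plt.title("Wilkinson dot plot")
plt.show()
```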
o Central tendency
▪ A measure of central tendency is a single value that represents the center point of
a dataset. This value can also be referred to as “the central location” of a dataset.
▪ In statistics, there are three common measures of central tendency:
▪ The mean
▪ The median
▪ The mode
▪ Each of these measures finds the central location of a dataset using different
methods. Depending on the type of data you’re analyzing, one of these three
measures may be better to use than the other two.
▪ The main functions of measures of central tendency are as follows:
1) They provide a summary figure with the help of which the central location of the
whole data can be explained. When we compute an average of a certain group we
get an idea about the whole data.
2) A large amount of data can be easily reduced to a single figure. Mean, median, and mode can be computed for large datasets, and a single representative figure derived.
3) When mean is computed for a certain sample, it will help gauge the population
mean.
4) The results obtained from computing measures of central tendency will help in
making certain decisions. This holds true not only to decisions with regard to
research but could have applications in varied areas like policy making, marketing
and sales and so on.
5) Comparison can be carried out based on single figures computed with the help of
measures of central tendency. For example, with regard to performance of
students in mathematics test, the mean marks obtained by girls and the mean
marks obtained by boys can be compared.
o Mean or Arithmetic Mean
o Mean for sample is denoted by symbol ‘M or x̅ (‘x-bar’)’ and mean for population is
denoted by ‘µ’ (mu).
o It is one of the most commonly used measures of central tendency and is often referred to as the average. It can also be termed the most sensitive measure of central tendency, as all the scores in the data are taken into consideration when it is computed.
o Further statistical techniques can be computed based on mean, thus, making it even
more useful. Mean is a total of all the scores in data divided by the total number of
scores.
o For example, if there are 100 students in a class and we want to find the mean or average marks obtained by them in a psychology test, we will add all their marks and divide by 100 (the number of students) to obtain the mean.
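The same computation in one line with Python's statistics module (the marks are hypothetical):
```python
import statistics

marks = [45, 67, 89, 72, 58, 91, 60]   # hypothetical test scores
# Mean = total of all the scores / number of scores
print(statistics.mean(marks))          # 68.857...
```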
o Properties of Mean
o Advantages of Mean
1) The definition of mean is rigid which is a quality of a good measure of central
tendency.
2) It is not only easy to understand but also easy to calculate.
3) All the scores in the distribution are considered when mean is computed.
4) Further mathematical calculations can be carried out on the basis of mean.
5) Fluctuations in sampling are least likely to affect mean.
o Limitations of Mean
1) Outliers or extreme values can have an impact on mean.
2) When there are open ended classes, such as 10 and above or below 5, mean cannot
be computed. In such cases median and mode can be computed. This is mainly
because in such distributions mid point cannot be determined to carry out
calculations.
3) If a score in the data is missing, lost, or unclear, then mean cannot be computed unless the missing score is dropped and the mean is computed for the rest of the data.
4) It is not possible to determine mean through inspection. Further, it cannot be determined from a graph.
5) It is not suitable for data that is skewed or is very asymmetrical as then in such
cases mean will not adequately represent the data.
o Median
o Median is a point in any distribution below and above which lie half of the scores.
Median is also referred to as P50.
o The symbol for median is ‘Md’. As stated by Bordens and Abbott, ‘median is the
middle score in an ordered distribution’.
o If we take the example discussed earlier of the marks obtained by 100 students in a
psychology test, these marks are to be arranged in an order, either ascending or
descending. The middle score in this distribution is then identified as median.
Though this would seem easy for an odd number of scores, in case of even number
of scores a certain procedure is followed that will be discussed when we learn how
to compute median later in this unit.
o Properties of Median
1) When compared to mean, median is less sensitive to extreme scores or outliers.
2) When a distribution is skewed or is asymmetrical median can be adequately used.
3) When a distribution is open ended, that is, actual score at one end of the distribution
is not known, then median can be computed.
o Advantages
▪ The definition of median is rigid which is a quality of a good measure of central
tendency.
▪ It is easy to understand and calculate.
▪ It is not affected by outliers or extreme scores in data.
▪ Unless the median falls in an open ended class, it can be computed for grouped data
with open ended classes.
▪ In certain cases it is possible to identify median through inspection as well as
graphically.
o Disadvantages
▪ Some statistical procedures using median are quite complex. Computation of
median can be time consuming when large data is involved because the data needs
to be arranged in an order before median is computed.
▪ Median cannot be computed exactly when ungrouped data has an even number of scores. In such cases, median is estimated as the mean of the two scores in the middle of the distribution (see the sketch after this list).
▪ It is not based on each and every score in the distribution.
▪ It can be affected by sampling fluctuations and thus can be termed as less stable
than mean.
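To illustrate the even-count case noted in the disadvantages above (a minimal sketch; the scores are hypothetical):
```python
import statistics

scores = [12, 7, 19, 3]            # even number of observations
# statistics.median sorts the data and averages the two middle scores: (7 + 12) / 2
print(statistics.median(scores))   # 9.5
```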
o Mode
o Mode is the value of the observation that has the maximum frequency corresponding to it. In other words, it is the observation that occurs the maximum number of times in a dataset.
o Mode of Ungrouped Data
o Mode of Ungrouped Data can be simply calculated by observing the observation with the
highest frequency. Let’s see an example of the calculation of the mode of ungrouped data.
o For example, if you have a set of numbers: 2, 3, 3, 5, 7, 7, 7, 9, the mode would be 7, as it
appears the most times.
o Mode of Grouped Data
o Formula to find the mode of grouped data is:
Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h
o where,
o l is the lower class limit of the modal class
o h is the class size
o f1 is the frequency of the modal class
o f0 is the frequency of the class that precedes the modal class
o f2 is the frequency of the class that succeeds the modal class
o Properties of Mode
1) Mode can be used with variables that can be measured on a nominal scale.
2) Mode is easier to compute than mean and median. But it is not used often because of its lack of stability from one sample to another, and also because a single set of data may have more than one mode. When there is more than one mode, the modes cannot be said to adequately measure the central location.
3) Mode is not affected by outliers or extreme scores.
o Advantages of Mode
1) It is not only easy to comprehend and calculate but it can also be determined by mere
inspection.
2) It can be used with quantitative as well as qualitative data.
3) It is not affected by outliers or extreme scores.
4) Even if a distribution has one or more open ended class(es), mode can easily be computed.
o Disadvantages of Mode
1) It is sometimes possible that all the scores in the data differ from each other, and in such cases the data may have no mode.
2) Mode cannot be rigidly defined.
3) In case of bimodal, trimodal or multimodal distribution, interpretation and comparison
becomes difficult.
4) Mode is not based on the whole distribution.
5) It may not be possible to compute further mathematical procedures based on mode.
6) Sampling fluctuations can have an impact on mode.
o Dispersion
o Dispersion (or spread) is a means of describing the extent of distribution of data around a central value or point. It aids in understanding data distribution.
o Lower dispersion indicates higher precision in the manufacturing process or data measurements, whereas higher dispersion means lower precision.
o Dispersion means the distance of the scattered data from the central value of the data.
o It gives information regarding the volatility or non-volatility nature of the data set. More
distance from the central point represents a more volatile nature and vice versa.
o In finance, dispersion is inversely proportional to securities' efficiency, yield, or
performance.
o Measure of dispersion can be absolute or relative. Absolute measures have the same unit
of measurement as the given dataset, while relative measures are expressed as ratios and
percentages.
o There are two methods to measure the degree of variation present in the data set:
o Absolute Measure
o Relative Measure
o Range
o Range refers to the difference between the largest and the smallest values in a given
data set. The higher the value of the range, the higher the spread in data.
o R = L − S
o where,
o L = Largest value
o S = Smallest value
o Standard Deviation
o Standard deviation is a fundamental concept in statistics that measures the dispersion of data points: it defines the extent to which data points in a dataset deviate from the mean, providing a clear sense of the variability or spread within the data.
o Measures of deviation tell us about the scatter of the data. A lower degree of deviation tells us that the observations xi are close to the mean value and the dispersion is low. In contrast, a higher degree of deviation tells us that the observations xi are far from the mean value and the dispersion is high.
o There are two standard deviation formulas that are used to find the Standard
Deviation of any given data set. They are,
o Population Standard Deviation Formula: σ = √( Σ(xi − μ)² / N )
o Sample Standard Deviation Formula: s = √( Σ(xi − x̄)² / (n − 1) )
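Both versions are available in Python's statistics module (the data is hypothetical):
```python
import statistics

data = [4, 8, 6, 5, 3]

print(statistics.pstdev(data))   # population SD: divides by N
print(statistics.stdev(data))    # sample SD: divides by (n - 1)
```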
o Interquartile Range
o The Quartiles divide a set of data series into four equal parts. The four parts are namely,
First Quartile (Q1), Second Quartile (Q2), Third Quartile (Q3), and Fourth Quartile
(Q4). We also know Second Quartile (Q2) as the Median of the data series as it also
divides the data into two equal parts.
o First quartile: It divides the data such that one-fourth or the 25% of the values are
below it and the remaining three-fourth or 75% are above it. We also call the first
quartile as a lower quartile. We denote it as Q1.
o Second quartile: It divides the data or observations into two equal parts so that 50%
of the observations are below it and 50% of the observations are above it. We also
know it as Median. We denote it as Q2.
o Third quartile: It divides the series such that three-fourth or 75% of the observation
is below it and the remaining one-fourth or 25% of the observations are above it.
We also call the third quartile as the upper quartile. We denote it as Q3.
o The interquartile range measures the difference between the first quartile (25th
percentile) and third quartile (75th percentile) in a dataset. This represents the spread of
the middle 50% of values.
o We should use the interquartile range when we’re interested in understanding the spread between the 25th and 75th percentiles of a dataset.
o Formula for Interquartile Range
IQR = Q3 – Q1
o Find the inter-quartile range for the following data: 56, 14, 84, 21, 85, 2, 35, 74, 66, 52,
45
o Solution: Arranging the data in ascending order: 2, 14, 21, 35, 45, 52, 56, 66, 74, 84, 85
o Q1=(N+1)/4th term
=(11+1)/4th term
= 3rd term
= 21
o And, Q3 = 3×(N+1)/4 th term
= 3×(11+1)/4 th term
= 9th term
= 74
o Interquartile Range = Q3 – Q1= 74 – 21 = 53
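The same result can be reproduced with NumPy, noting that quantile conventions differ between tools: method="weibull" (NumPy 1.22+) matches the (N + 1)/4 rule used above, while NumPy's default "linear" method gives slightly different quartiles:
```python
import numpy as np

data = [56, 14, 84, 21, 85, 2, 35, 74, 66, 52, 45]

q1 = np.percentile(data, 25, method="weibull")   # matches the (N + 1)/4 rule
q3 = np.percentile(data, 75, method="weibull")
print(q1, q3, q3 - q1)                           # 21.0 74.0 53.0
```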
o Five Point Summary and Box plots
o Five number summary is a part of descriptive statistics and consists of five values and
all these values will help us to describe the data.
1. The minimum value (the lowest value)
2. 25th Percentile or Q1
3. 50th Percentile or Q2 or Median
4. 75th Percentile or Q3
5. Maximum Value (the highest value)
o Box plots (also called box-and-whisker plots or box-whisker plots) give a good
graphical image of the concentration of the data. They also show how far the extreme
values are from most of the data.
o A box plot is constructed from five values: the minimum value, the first quartile, the
median, the third quartile, and the maximum value. We use these values to compare how
close other data values are to them.
o To construct a box plot, use a horizontal or vertical number line and a rectangular box.
o The smallest and largest data values label the endpoints of the axis.
o The first quartile marks one end of the box and the third quartile marks the other end of
the box.
o Approximately the middle 50 percent of the data fall inside the box.
o The "whiskers" extend from the ends of the box to the smallest and largest data values.
o The median or second quartile usually lies between the first and third quartiles, but it can coincide with either or both of them. The box plot gives a good, quick picture of the data.
o Example: plot the box-and-whisker plot for a given dataset (worked example data and plot not reproduced here).
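As a minimal matplotlib sketch (library choice assumed), using the IQR example data from above:
```python
import matplotlib.pyplot as plt

data = [2, 14, 21, 35, 45, 52, 56, 66, 74, 84, 85]

# By default the whiskers extend to the most extreme points within 1.5 x IQR;
# whis=(0, 100) makes them reach the minimum and maximum, as described above.
plt.boxplot(data, vert=False, whis=(0, 100))
plt.xlabel("Value")
plt.title("Box-and-whisker plot")
plt.show()
```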
o Shape
o The shape of distribution provides helpful insights about the distribution. This includes
the distribution’s peaks, symmetry, uniformity, as well as its tendency to lean towards
the left or right corner.
o The shape of the distribution is a helpful feature that easily reflects the frequency of
values within given intervals. When given a distribution and its shape, here are other
helpful details we can learn about a data set from the shape of its distribution:
o Represents how spread out the data is across the range
o Helps identify the range in which the mean of the data set lies
o Highlights the range of a given data set
o Skewness
o Skewness is a measure that tells us how much a dataset deviates from a normal distribution, which is a perfectly symmetrical bell-shaped curve. In simpler terms, it shows whether the data points tend to cluster more on one side.
o Types of Skewness
▪ Positive skewness/Right Skewed
▪ Negative skewness/Left Skewed
o Positive skewness/Right Skewed
o If the distribution’s tail is longer on the right side, the data is positively skewed. This means there are a few unusually high values.
o This type of distribution is called right-skewed. When you measure this skewness, the
number you get is bigger than zero. Imagine looking at a graph of this data: the average
(mean) value is usually the highest, followed by the middle value (median), and then
the most common value (mode).
o While in negative skewness, if the tail is longer on the left side, the data is negatively
skewed. This indicates a few unusually low values.
o Negative skewness/Left Skewed
o A negatively skewed distribution is one where the long tail extends to the left, known
as left-skewed. For such distributions, the skewness value is less than zero.
o The left tail of the distribution is longer or fatter than the right.
o The mean is less than the median, and the mode is greater than both mean and median.
o Higher values are clustered in the “hill” of the distribution, while extreme values are in
the long left tail.
o Formula to compute Skewness
o Pearson’s coefficient of skewness: Sk = 3(Mean − Median) / Standard Deviation
o Another formula, highly influenced by the work of Karl Pearson, is the moment-based formula to approximate skewness. It is more reliable and is given as:
Skewness = Σ(xi − x̄)³ / (n · s³)
o Kurtosis
o Kurtosis focuses more on the height of the distribution. It tells us how peaked or flat our normal (or normal-like) distribution is.
o High kurtosis indicates:
o Sharp peakedness in the distribution’s center.
o More values concentrated around the mean than normal distribution.
o Heavier tails because of a higher concentration of extreme values or
outliers in tails.
o Greater likelihood of extreme events.
o On the other hand, low kurtosis indicates:
o Flat peak.
o Fewer values concentrated around the mean than in a normal distribution.
o Lighter tails.
o Lower likelihood of extreme events.
o Types of Kurtosis are as follows:
o Leptokurtic: a curve having a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
o Mesokurtic: a curve having a peak similar to that of the normal curve. In this curve, there is an even distribution of items around the central value.
o Platykurtic: a curve having a lower, flatter peak than the normal curve. In this curve, there is less concentration of items around the central value.
o Formula to calculate Kurtosis: Kurtosis = Σ(xi − x̄)⁴ / (n · s⁴). Excess kurtosis subtracts 3, so that the normal distribution has excess kurtosis 0.
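Both measures are available in SciPy (a minimal sketch; the data is hypothetical and chosen to have a long right tail):
```python
from scipy.stats import kurtosis, skew

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 18]   # a few unusually high values

print(skew(data))                    # > 0: right-skewed
print(kurtosis(data))                # excess kurtosis (normal distribution = 0)
print(kurtosis(data, fisher=False))  # raw kurtosis (normal distribution = 3)
```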
o Mean Absolute Deviation (MAD)
o Mean Absolute Deviation is one of the metrics of statistics that helps us find the average spread of the data, i.e., it shows the average distance of the observations in a dataset from the mean of the dataset. It is helpful
in the analysis and understanding of data.
o Mean Absolute Deviation is one of the measures of spread, alongside the range, quartiles, interquartile range, standard deviation, and variance.
o Formula to compute MAD: MAD = (1/n) Σ |xi − μ|
where,
xi represents the each observation of the dataset,
μ is the mean of the data set, and
n is the number of observations in the data set.
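Implemented directly from the formula (the data is hypothetical):
```python
data = [3, 8, 10, 17, 24, 27]                      # hypothetical observations

mu = sum(data) / len(data)                         # mean of the dataset
mad = sum(abs(x - mu) for x in data) / len(data)   # average |xi - mu|
print(mad)
```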