Sri Sai Vidya Vikas Shikshana Samithi ®
SAI VIDYA INSTITUTE OF TECHNOLOGY
(Approved by AICTE, New Delhi, Affiliated to VTU, Recognized by Govt. of Karnataka)
Accredited by NBA, New Delhi (CSE, ISE, ECE), NAAC 'A' Grade
DEPARTMENT OF CSE(DATA SCIENCE)
RAJANUKUNTE, BANGALORE 560 064, KARNATAKA
Phone: 080-28468191/96/97/98 * E-mail:
[email protected] * URL www.saividya.ac.in
LECTURE NOTES
ON
Artificial Intelligence and
Machine Learning (BCS501)
2024 – 2025
B.E. VI Semester
PAVITHRA B
Assistant Professor
Department of CSE(DATA SCIENCE)
Module 3
INTRODUCTION TO MACHINE LEARNING & UNDERSTANDING DATA
Introduction to machine learning: Need for Machine Learning, Machine Learning
Explained, and Machine Learning in relation to other fields, Types of Machine Learning.
Challenges of Machine Learning, Machine Learning process, Machine Learning applications.
Understanding Data: What is data, types of data, Big data analytics and types of analytics,
Big data analytics framework, Descriptive statistics, univariate data analysis and visualization
Textbook 2: Chapter 1 and Sections 2.1 to 2.5
1. Need for Machine Learning
o As a sub-discipline of AI, machine learning provides companies with the means to process
large amounts of data and draw conclusions that can be used to make informed decisions. This
technological shift is changing various sectors by doing work faster, better, and with less
human intervention.
o For instance, retail firms are using machine learning to offer personalized experiences to
customers, while financial institutions are using algorithms to assess risk.
o Machine learning provides organizations with the tools to enhance accuracy, automate
processes, and achieve greater operational efficiency.
o Its ability to analyse complex datasets and deliver actionable insights allows businesses to
make smarter decisions while reducing costs. From tailoring customer experiences to scaling
operations with minimal effort, machine learning supports critical business priorities such as
improving workflows, forecasting trends, and optimizing resource allocation.
o Companies leveraging machine learning are better equipped to meet demands and discover
new opportunities.
o Benefits of machine learning in organizations:
o Improved accuracy and insights
o Tailored customer experiences
o Automating routine tasks
o Predictive capabilities for better planning
o Efficiency and cost optimization
o Competitive edge through innovation
o Seamless scalability and flexibility
o Let’s understand the basic terminologies through the following Knowledge pyramid
Fig. 1: The Knowledge Pyramid
o Data: This is the raw, unprocessed facts and figures that are collected from various
sources. Data can be structured or unstructured and may include text, numbers,
images, audio, and video.
o Information: Data becomes information when it is organized, processed, and
interpreted in a meaningful way. The information provides context and relevance to
data and enables decision-making and action.
o Knowledge: Knowledge is the understanding gained from information, through
analysis, interpretation, and synthesis. Knowledge is often based on experience,
expertise, and intuition, and enables more complex decision-making and problem-
solving.
o Intelligence: An actionable form of knowledge is called intelligence; in other
words, intelligence is knowledge applied to take actions.
o Wisdom: Wisdom is the ability to apply knowledge and intelligence with sound
judgment, drawing on experience and values to choose the right course of action
in a given context.
2. Machine Learning Explained
• Machine learning is a branch of artificial intelligence that enables algorithms to uncover
hidden patterns within datasets.
• It allows them to predict new, similar data without explicit programming for each task.
• Machine learning finds applications in diverse fields such as image and speech recognition,
natural language processing, recommendation systems, fraud detection, portfolio
optimization, and automating tasks.
• Arthur Lee Samuel's definition of machine learning is as follows:
“Field of study that gives computers the ability to learn without being explicitly
programmed.”
• Conventional programming:
o Conventional programming, also known as imperative or procedural programming, is a
programming paradigm that uses explicit statements to describe the sequence of actions
the computer must take to accomplish a specific task.
o It is a manual process that requires a programmer to create the rules or logic of the
program. The programmer codes the rules line by line in a conventional procedural
language such as assembly language, or a high-level language such as C, C++, Java,
JavaScript, or Python.
o However, certain applications, especially those requiring real-time processing or
handling large datasets, can face performance bottlenecks. Optimizing code for speed
and efficiency is often necessary but can be complex. A minimal sketch of this
rule-based style is given below.
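The following is a small, hypothetical Python sketch of the conventional style: every
rule is written explicitly by the programmer and nothing is learned from data (the
keyword list is an illustrative assumption, not a real filter).

# A hand-coded, rule-based spam check in the conventional style described
# above. Every rule is written explicitly; nothing is learned from data.
SPAM_KEYWORDS = {"lottery", "winner", "free", "prize"}

def is_spam(email_text: str) -> bool:
    words = set(email_text.lower().split())
    # Rule 1: flag if known spam keywords appear.
    if words & SPAM_KEYWORDS:
        return True
    # Rule 2: flag messages written entirely in capitals.
    if email_text.isupper():
        return True
    return False

print(is_spam("You are a lottery WINNER"))  # True
print(is_spam("Meeting moved to 3 pm"))     # False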
• Expert System:
o An expert system is an artificial intelligence system that emulates the decision-making
ability of a human expert.
o It is designed to solve complex problems by reasoning through bodies of knowledge,
represented mainly as if–then rules rather than through conventional procedural code.
o Expert systems can perform specific tasks with expert-like efficiency by applying
predefined rules to analyse information and generate conclusions.
o They can provide specialist advice or decision-making automation, assist in problem-
solving, and help identify errors or risks.
o For example, MYCIN, an early expert system, helped identify bacterial infections and
recommend antibiotics.
o But, the effectiveness of an expert system depends on the completeness and accuracy of
the knowledge base. If the knowledge is outdated or incomplete, the system’s performance
may be compromised.
• Machine Learning approach:
o Machine learning is important because it allows computers to learn from data and improve
their performance on specific tasks without being explicitly programmed.
o This ability to learn from data and adapt to new situations makes machine learning
particularly useful for tasks that involve large amounts of data, complex decision-making,
and dynamic environments.
o As humans take decisions from experience, computers build models based on patterns
extracted from the input data and then use these models to make predictions and take
decisions. For a computer, the learnt model is the equivalent of human experience. This
is shown in the following figure.
a) A learning system for humans b) A learning system for machine learning
o A learning system summarizes the raw data in a model, where the model is an implicit
description of patterns within the data in one of the following forms:
o Mathematical Equation
o Relational diagrams like trees/graphs
o Logical if/else rules
o Groupings called clusters
o Another definition, by Tom Mitchell, is:
“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”
o For example, in spam filtering, T is classifying emails as spam or not spam, E is a
corpus of labelled emails, and P is the fraction of emails classified correctly.
o The learning in systems happens as described by the following steps:
➢ Collection of Data
➢ Abstract concepts are formed out of the collected data
➢ Generalization converts the abstraction into an actionable form of intelligence
➢ Evaluation checks the thoroughness of the models
3. Machine Learning in relation to other fields
3.1 Machine Learning and Artificial Intelligence
• Artificial Intelligence: AI is the overarching field that aims to create systems capable of
performing tasks that typically require human intelligence, such as reasoning, problem-
solving, and decision-making.
• Machine Learning: ML is a subset of AI that focuses on enabling systems to learn from
data and improve their performance over time without being explicitly programmed. It
includes techniques like supervised learning, unsupervised learning, and reinforcement
learning.
• Deep Learning: Deep learning is a further subset of ML that uses artificial neural
networks to process and learn from large amounts of data.
• The relationship of AI with machine learning is as follows:
Relationship of AI with machine learning
• The key differences between AI and ML are as follows:
o Scope: AI is a broader field focused on creating systems that mimic human intelligence,
including reasoning, decision-making, and problem-solving. ML is a subset of AI that
focuses on teaching machines to learn patterns from data and improve over time without
explicit programming.
o Goal: The main goal of AI is to develop machines that can perform complex tasks
intelligently, similar to how humans think and act. ML focuses on finding patterns in
data and using them to make predictions or decisions, helping systems improve
automatically with experience.
o Breadth: AI systems aim to simulate human intelligence and can perform tasks across
multiple domains. ML focuses on training systems for specific tasks, such as prediction
or classification.
o Autonomy: AI aims to create systems that can think, learn, and make decisions
autonomously. ML aims to create systems that learn from data and improve their
performance on a particular task.
o Application range: AI has a wider application range, including problem-solving,
decision-making, and autonomous systems. ML applications are typically narrower,
focused on tasks like pattern recognition and predictive modeling.
o Human involvement: AI can operate with minimal human intervention, depending on its
complexity and design. ML requires human involvement for data preparation, model
training, and optimization.
o Output: AI produces intelligent behavior, such as driving safely, responding to
customer queries, or diagnosing diseases, and can adapt to changing scenarios. ML
generates predictions or classifications based on data, such as predicting house
prices, identifying objects in images, or categorizing emails.
o Focus: AI involves broader goals, including natural language processing, vision, and
reasoning. ML focuses specifically on building models that identify patterns and
relationships in data.
o Examples: AI includes robotics, virtual assistants like Siri, autonomous vehicles, and
intelligent chatbots; ML includes recommender systems, fraud detection, stock price
forecasting, and social media friend suggestions.
3.2 Machine Learning and Statistics
• Machine learning (ML) and statistics are closely related fields, as both focus on
analyzing and interpreting data to make predictions or decisions. However, they differ
in their goals, approaches, and applications.
• Statistics: Primarily focuses on explaining the data, answering questions like "Why is
this happening?" or "What is the relationship between these variables?". It emphasizes
interpretability and hypothesis testing, often requires assumptions about the data
(for example, that it follows a normal distribution), and favours simpler models with
less computational complexity.
• Machine Learning: Goes beyond explanation to build predictive models whose goal is to
make accurate predictions or decisions, even if the model itself is complex or not
easily interpretable. ML works well with vast, high-dimensional datasets, even when the
data does not conform to predefined patterns, and its algorithms (like neural networks)
can uncover complex, non-linear relationships in data.
3.3 Machine Learning and Data Science, Data Mining and Data Analytics
• Data science
o It is a multidisciplinary area that employs scientific techniques, procedures,
algorithms, and systems to derive insights from structured and unstructured data.
o It combines aspects of mathematics, statistics, computer science, and domain
expertise to interpret and solve complex problems.
o Data science aims to derive actionable insights from data, enabling organizations
to make informed decisions.
• Data analytics
o It examines, cleans, transforms, and interprets data to discover meaningful
patterns, insights, and information that can inform decision-making.
o Data analysts play a crucial role in this process by applying various techniques
and tools to extract valuable insights from data.
o The data analyst's role is closely tied to data analytics: analysts are
responsible for data analysis, exploratory data analysis (EDA), and deriving
actionable insights from data.
• Data mining
o It is commonly a part of the data science pipeline. But unlike data science as
a whole, data mining is more about the techniques and tools used to uncover
previously unknown patterns in data and make data more usable for analysis.
o Data Mining focuses more on exploratory analysis, while Machine Learning
emphasizes predictive capabilities.
• Big Data
o It refers to extremely large and complex datasets that traditional data processing
methods cannot efficiently handle.
o These datasets are generated at unprecedented volume, velocity, and variety,
which are their defining characteristics:
▪ Volume
• Refers to the sheer scale of data generated daily.
• Examples: Petabytes (10¹⁵ bytes) of data from social media, IoT
devices, and online transactions.
▪ Velocity
• The speed at which data is generated, collected, and processed.
• Examples: Real-time data streams from sensors, financial markets,
or user interactions.
▪ Variety
• Refers to the different formats and types of data.
• Examples: Structured data (spreadsheets, databases), unstructured
data (text, images, videos), and semi-structured data (JSON,
XML).
• Pattern recognition
o It is the process of identifying and analysing patterns or structures in data to
extract meaningful information or make decisions.
o It is a critical component of fields like Machine Learning, Artificial Intelligence,
and Computer Vision.
• The relationship is summarized in the following diagram
4. Types of Machine Learning
In Machine Learning, "learning" refers to the process by which a model improves its
performance at a specific task by analysing data. Instead of being explicitly programmed
with fixed instructions, the model "learns" patterns, relationships, and behaviors from the
data it is exposed to.
Types of Machine Learning
4.1 Supervised Learning
• Supervised learning is a type of machine learning method in which we provide sample
labelled data to the machine learning system in order to train it, and on that basis, it
predicts the output.
• The system creates a model from the labelled data to understand the dataset and learn
about each example; once training and processing are done, we test the model by
providing sample data to check whether it predicts the correct output.
• The goal of supervised learning is to map input data to output data.
• Supervised learning is based on supervision, just as a student learns things under the
supervision of a teacher.
• An example of supervised learning is spam filtering.
• Supervised learning is summarized through the following diagram
Supervised Learning Algorithm
• Supervised learning can be grouped further in two categories of algorithms:
▪ Classification:
▪ Classification deals with predicting categorical target variables, which represent
discrete classes or labels.
▪ Classification algorithms learn to map the input features to one of the predefined
classes.
▪ A classification problem is when the output variable is a category, such as 'red' or
'blue', or 'disease' or 'no disease'.
▪ Some of the key classification algorithms are listed below (a short sketch follows
the list):
• Support Vector Machine
• Random Forest
• Decision Tree
• K-Nearest Neighbors (KNN)
• Naive Bayes
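As a minimal illustration (not from the textbook), the sketch below trains K-Nearest
Neighbors, one of the algorithms listed above, on scikit-learn's built-in iris dataset;
the split ratio and value of k are arbitrary choices.

# A minimal supervised-classification sketch using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # features and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # hold out data for testing

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                # learn from labelled examples
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))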
▪ Regression:
▪ A regression problem is when the output variable is a real value, such as 'dollars'
or 'weight'.
▪ Regression deals with predicting continuous target variables, which represent
numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms
learn to map the input features to a continuous numerical value.
▪ Here are some regression algorithms (a short sketch follows the list):
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
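A comparable sketch for regression, again illustrative rather than prescriptive: Linear
Regression fits a synthetic "house size vs. price" relationship, where all the numbers
are made up for the example.

# A minimal regression sketch: a continuous target learned from one feature.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(500, 3500, size=(100, 1))              # house size in sq. ft (made up)
y = 50 * X.ravel() + 20000 + rng.normal(0, 5000, 100)  # price with noise

model = LinearRegression().fit(X, y)
print("Learned slope:", model.coef_[0])                # close to 50
print("Predicted price for 2000 sq. ft:", model.predict([[2000]])[0])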
• Advantages:-
▪ Supervised Learning models can have high accuracy as they are trained on
labelled data.
▪ The process of decision-making in supervised learning models is often
interpretable.
▪ Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.
▪ Helps to optimize performance criteria with the help of experience.
▪ Supervised machine learning helps to solve various types of real-world
computation problems.
• Disadvantages:-
▪ It is limited to the patterns present in the training data and may struggle with
unseen or unexpected patterns.
▪ It can be time-consuming and costly, as it relies on labelled data.
▪ It may generalize poorly to new data.
• Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
▪ Image classification: Identify objects, faces, and other features in images.
▪ Natural language processing: Extract information from text, such as sentiment,
entities, and relationships.
▪ Speech recognition: Convert spoken language into text.
▪ Recommendation systems: Make personalized recommendations to users.
▪ Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
▪ Medical diagnosis: Detect diseases and other medical conditions.
▪ Fraud detection: Identify fraudulent transactions.
▪ Autonomous vehicles: Recognize and respond to objects in the environment.
▪ Email spam detection: Classify emails as spam or not spam.
▪ Quality control in manufacturing: Inspect products for defects.
▪ Credit scoring: Assess the risk of a borrower defaulting on a loan.
▪ Gaming: Recognize characters, analyze player behavior, and create NPCs.
▪ Customer support: Automate customer support tasks.
▪ Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
▪ Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
4.2 Unsupervised Learning
• Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabelled data.
• Unlike supervised learning, unsupervised learning doesn’t involve providing the
algorithm with labelled target outputs.
• The primary goal of Unsupervised learning is often to discover hidden patterns,
similarities, or clusters within the data, which can then be used for various purposes, such
as data exploration, visualization, dimensionality reduction, and more.
• Unsupervised machine learning is often used by researchers and data scientists to identify
patterns within large, unlabelled data sets quickly and efficiently.
• Unsupervised learning is summarized through the following diagram
Unsupervised Learning Algorithm
• There are two main categories of unsupervised learning that are mentioned below:
o Clustering
o Association
• Clustering
o Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in
data without the need for labeled examples.
o Here are some widely used unsupervised algorithms (a short clustering sketch
follows the list):
▪ K-Means Clustering algorithm
▪ Mean-shift algorithm
▪ DBSCAN Algorithm
▪ Principal Component Analysis (strictly a dimensionality-reduction technique)
▪ Independent Component Analysis (strictly a dimensionality-reduction technique)
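A minimal clustering sketch, assuming scikit-learn: K-Means groups synthetic,
unlabelled 2-D points into two clusters; no target labels are supplied at any point.

# A minimal clustering sketch: K-Means on two synthetic, unlabelled blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two well-separated groups of 2-D points; no labels are provided.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels of first 5 points:", kmeans.labels_[:5])
print("Cluster centres:\n", kmeans.cluster_centers_)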
• Association
o Association rule learning is a technique for discovering relationships between
items in a dataset. It identifies rules that indicate the presence of one item implies
the presence of another item with a specific probability.
o Here are some association rule learning algorithms (a sketch of the underlying
support/confidence computation follows the list):
▪ Apriori Algorithm
▪ Eclat
▪ FP-growth Algorithm
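The sketch below illustrates the core idea behind these algorithms, computing support
and confidence for a single rule over toy transactions in plain Python; real
implementations such as Apriori or FP-growth search all frequent itemsets efficiently.

# Support and confidence for the rule {milk} -> {bread} over toy transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_from, rule_to = {"milk"}, {"bread"}
conf = support(rule_from | rule_to) / support(rule_from)
print("support({milk, bread}) =", support(rule_from | rule_to))  # 0.5
print("confidence(milk -> bread) =", conf)                       # ~0.667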
• Advantages of Unsupervised Machine Learning
o It helps to discover hidden patterns and various relationships between the data.
o Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
o It does not require labeled data and reduces the effort of data labeling.
• Disadvantages of Unsupervised Machine Learning
o Without using labels, it may be difficult to predict the quality of the model’s
output.
o Cluster Interpretability may not be clear and may not have meaningful
interpretations.
o Extracting meaningful features from raw data often requires additional
techniques, such as autoencoders and dimensionality reduction.
• Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
o Clustering: Group similar data points into clusters.
o Anomaly detection: Identify outliers or anomalies in data.
o Dimensionality reduction: Reduce the dimensionality of data while preserving
its essential information.
o Recommendation systems: Suggest products, movies, or content to users based
on their historical behavior or preferences.
o Topic modeling: Discover latent topics within a collection of documents.
o Density estimation: Estimate the probability density function of data.
o Image and video compression: Reduce the amount of storage required for
multimedia content.
o Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
o Market basket analysis: Discover associations between products.
o Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
o Image segmentation: Segment images into meaningful regions.
4.3 Semi-supervised Learning
• Semi-supervised learning is a machine learning approach that incorporates elements from
both supervised learning (which uses labelled data) and unsupervised learning (which
uses unlabelled data).
• It is particularly useful when acquiring labelled data is expensive or labour-intensive but
there's an abundance of unlabelled data available.
• In practice, semi-supervised learning algorithms work with a small amount of labelled
data supplemented by a larger amount of unlabelled data.
• The goal is to leverage the structure and distribution of the unlabelled data to better
understand the overall dataset and make more accurate predictions.
• Semi-supervised learning aims to create a model that learns from both the guidance
provided by the labelled data and the freedom to explore and make inferences from the
unlabelled data, as the sketch below illustrates.
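A minimal sketch of this idea using scikit-learn's LabelPropagation (one of several
semi-supervised estimators): roughly 90% of the iris labels are hidden and the model
infers them from the few labelled points plus the structure of the data.

# A minimal semi-supervised sketch: label propagation over mostly-unlabelled data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
y_partial = y.copy()
hidden = rng.random(len(y)) < 0.9      # hide ~90% of the labels
y_partial[hidden] = -1                 # -1 marks an unlabelled sample

model = LabelPropagation().fit(X, y_partial)
print("Accuracy on the hidden points:",
      (model.transduction_[hidden] == y[hidden]).mean())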
• Advantages of Semi-Supervised Learning •
o Simplicity: The algorithms are generally straightforward and user-friendly,
making them easy to grasp. •
o Efficiency: These algorithms can be highly efficient, as they require fewer
labelled instances.
o Problem-Solving: They address certain limitations of both supervised and
unsupervised learning by utilizing a mix of labelled and unlabelled data.
• Disadvantages of Semi-Supervised Learning
o Stability: The results across iterations can be inconsistent, leading to potential
instability in the model's performance.
o Data Limitations: These algorithms are not well-suited for network-level data
which requires different analytical approaches.
o Lower Accuracy: The accuracy of semi-supervised learning may not match that
of fully supervised learning, especially if the labelled data is not representative of
the entire dataset.
• Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
o Image Classification and Object Recognition: Improve the accuracy of models
by combining a small set of labeled images with a larger set of unlabeled images.
o Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a vast
amount of unlabeled text.
o Speech Recognition: Improve the accuracy of speech recognition by leveraging
a limited amount of transcribed speech data and a more extensive set of unlabeled
audio.
o Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions (labeled
data) with a wealth of unlabeled user behavior data.
o Healthcare and Medical Imaging: Enhance medical image analysis by utilizing
a small set of labeled medical images alongside a larger set of unlabeled images.
4.4 Reinforcement Learning
• Reinforcement learning (RL) is a branch of machine learning where an AI agent learns
to make decisions by executing actions and receiving feedback (reward or penalty),
optimizing for a cumulative reward.
• This method stands out because it does not need labelled data. Instead, the agent learns
through the outcomes of its actions, akin to trial and error.
• The RL process parallels human experiential learning, much like how a child learns from
daily interactions.
• For instance, in video games, the agent learns to play better by making moves (actions)
in various situations (states) and receiving scores (rewards or penalties) for those moves.
• Reinforcement learning has diverse applications across fields such as game theory,
operations research, and multi-agent systems.
• Typically, RL problems are framed as Markov Decision Processes, where the agent's
interaction with its environment involves a cycle of states, actions, and feedback,
leading to new states and learning opportunities; the sketch below walks through one
such cycle.
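A self-contained Q-learning sketch of this state-action-reward cycle; the corridor
environment and all constants below are invented for illustration.

# Q-learning in a tiny corridor world: states 0..4, the agent moves left or
# right, and earns a reward of 1 only on reaching the terminal state 4.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # table of learned action values
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        a = rng.integers(n_actions) if rng.random() < epsilon else Q[s].argmax()
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move towards reward plus discounted future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# Non-terminal states should prefer action 1 (right); state 4 is terminal.
print("Greedy action per state:", Q.argmax(axis=1))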
• Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
▪ Positive reinforcement
o Rewards the agent for taking a desired action.
o Encourages the agent to repeat the behavior.
o Examples: Giving a treat to a dog for sitting, providing a point in a game for a
correct answer.
▪ Negative reinforcement
o Removes an undesirable stimulus when the agent takes a desired action.
o Encourages the agent to repeat that behavior so as to avoid the unpleasant
stimulus.
o Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty
by completing a task.
• Advantages of Reinforcement Learning
o Complex Problem Solving: Reinforcement learning is adept at tackling complex,
real-world problems that conventional algorithms may struggle with.
o Human-like Learning: The RL model mimics human learning processes, often
leading to highly accurate and efficient solutions.
o Long-Term Benefits: RL is designed to maximize not just immediate rewards but
also long-term gains, making it effective for strategies that unfold over time.
• Disadvantages of Reinforcement Learning
o Not Suited for Simple Tasks: RL algorithms may be unnecessarily complex for
simple problems, where simpler algorithms could suffice.
o Data and Computation Intensive: These algorithms require substantial amounts
of data and computational power to function effectively.
o Risk of State Overload: Excessive use of reinforcement learning can result in a
state space that is too large to manage effectively, potentially degrading the
performance of the model.
• Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
o Game Playing: RL can teach agents to play games, even complex ones.
o Robotics: RL can teach robots to perform tasks autonomously.
o Autonomous Vehicles: RL can help self-driving cars navigate and make
decisions.
o Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
o Healthcare: RL can be used to optimize treatment plans and drug discovery.
o Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
o Finance and Trading: RL can be used for algorithmic trading.
o Supply Chain and Inventory Management: RL can be used to optimize supply
chain operations.
5. Challenges of Machine Learning
1. Data Quality and Quantity
• Challenge: Data is the foundation of any machine learning model, and the quality and
quantity of the data you have can significantly impact the performance of your models.
Poor-quality data, such as data with missing values, noise, or inconsistencies, can lead to
inaccurate predictions. Additionally, insufficient data can prevent the model from
learning the underlying patterns, leading to overfitting or underfitting.
• How to Overcome: To address data quality issues, it’s essential to invest time in data
cleaning and preprocessing. Techniques such as imputation for missing values, outlier
detection, and normalization can help improve data quality. For situations where you lack
sufficient data, consider data augmentation, synthetic data generation, or using transfer
learning to leverage pre-trained models on similar datasets. Collaborating with domain
experts can also help in understanding the nuances of the data and improving its quality.
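A minimal sketch of the imputation and normalization steps mentioned above, assuming
scikit-learn; the toy table stands in for real records with missing values.

# Filling missing values and normalizing features with scikit-learn.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50000.0],
              [32.0, np.nan],      # a missing salary value
              [np.nan, 61000.0],   # a missing age value
              [45.0, 72000.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill the gaps
X_scaled = StandardScaler().fit_transform(X_imputed)         # zero mean, unit variance
print(X_scaled.round(2))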
2. Choosing the Right Algorithms
• Challenge: With a wide range of machine learning algorithms available, selecting the
right one for your specific problem can be daunting. Different algorithms have varying
strengths, weaknesses, and suitability depending on the nature of the data and the task at
hand. Using an inappropriate algorithm can lead to suboptimal model performance.
• How to Overcome: Start by understanding the problem you’re trying to solve and the
type of data you have. Supervised learning algorithms like decision trees and support
vector machines are suitable for classification tasks, while unsupervised learning
algorithms like k-means clustering are better for finding hidden patterns in unlabeled
data. Experiment with different algorithms and use techniques like cross-validation to
evaluate their performance. Tools like scikit-learn provide user-friendly interfaces for
implementing and comparing multiple algorithms.
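A short sketch of this workflow with scikit-learn: two candidate classifiers are
compared on the same dataset using 5-fold cross-validation (the models and dataset are
arbitrary stand-ins).

# Comparing candidate algorithms with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for model in (DecisionTreeClassifier(random_state=0), SVC()):
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(type(model).__name__, "mean accuracy:", scores.mean().round(3))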
3. Model Interpretability
• Challenge: Machine learning models, especially complex ones like deep neural networks,
are often referred to as “black boxes” because it can be challenging to understand how
they arrive at their predictions. This lack of interpretability can be a significant barrier
when trying to build trust in the model’s decisions, particularly in fields like healthcare
and finance where transparency is critical.
• How to Overcome: To improve model interpretability, consider using simpler models like
decision trees or linear models, which are inherently more interpretable. For more
complex models, techniques such as LIME (Local Interpretable Model-agnostic
Explanations) and SHAP (SHapley Additive exPlanations) can help provide insights into
how the model makes its predictions. Additionally, feature importance scores can help
identify which variables have the most influence on the model’s output.
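A minimal sketch of the feature-importance idea mentioned above, assuming scikit-learn:
a random forest reports how strongly each input variable influenced its predictions.

# Feature importance scores from a random forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)
for name, score in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")   # larger score = more influential feature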
4. Overfitting and Underfitting
• Challenge: Overfitting occurs when a model learns the noise in the training data rather
than the actual underlying patterns, leading to poor generalization on new data.
Underfitting, on the other hand, happens when the model is too simple to capture the
complexity of the data, resulting in poor performance even on the training data.
• How to Overcome: To combat overfitting, consider techniques such as cross-validation,
regularization (e.g., L1 or L2 regularization), and pruning for decision trees. Additionally,
ensuring that your training data is representative of the real-world data the model will
encounter is crucial. For underfitting, try increasing the model complexity by adding
more features, using more powerful algorithms, or tuning the model’s hyperparameters.
Monitoring learning curves can also help you identify and address both overfitting and
underfitting early in the model development process.
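The sketch below makes the contrast concrete: polynomials of degree 1, 4, and 15 are
fitted to the same noisy data, and the train/test errors show underfitting at low
degree and overfitting at high degree (all numbers are synthetic).

# Underfitting vs. overfitting: polynomial fits of increasing degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, (60, 1)), axis=0)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree:2d}: "
          f"train MSE {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")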
5. Bias/Variance
• Bias refers to the error introduced by approximating a real-world problem, which may
be too complex, by a simpler model. High bias typically leads to underfitting, where the
model performs poorly on both training and test data.
Causes:
• Using a model that is too simple for the data (e.g., linear regression for non-
linear data).
• Not having enough features or input variables.
• Variance refers to the error caused by the model being too sensitive to fluctuations in the
training data. High variance typically leads to overfitting, where the model performs well
on training data but poorly on unseen test data.
Causes:
• Using a model that is too complex for the data (e.g., a deep neural network on a
small dataset).
• Training on noisy or insufficient data.
• The Bias-Variance Tradeoff
These two are inherently in a tradeoff. Decreasing bias often increases variance, and vice
versa. The goal is to find the right balance between them for optimal model performance.
6. Machine Learning process
Machine learning commonly uses the CRISP-DM (Cross-Industry Standard Process for Data
Mining) process model. The steps involved in the process are described below:
• Business Understanding: This step involves understanding the problem that needs to be
solved and defining the objectives of the data mining project. This includes identifying
the business problem, understanding the goals and objectives of the project, and defining
the KPIs that will be used to measure success. This step is important because it helps
ensure that the data mining project is aligned with business goals and objectives.
• Data Understanding: This step involves collecting and exploring the data to gain a better
understanding of its structure, quality, and content. This includes understanding the
sources of the data, identifying any data quality issues, and exploring the data to identify
patterns and relationships. This step is important because it helps ensure that the data is
suitable for analysis.
• Data Preparation: This step involves preparing the data for analysis. This includes
cleaning the data to remove any errors or inconsistencies, transforming the data to make
it suitable for analysis, and integrating the data from different sources to create a single
dataset. This step is important because it ensures that the data is in a format that can be
used for modeling.
• Modeling: This step involves building a predictive model using machine learning
algorithms. This includes selecting an appropriate algorithm, training the model on the
data, and evaluating its performance. This step is important because it is the heart of the
data mining process and involves developing a model that can accurately predict
outcomes on new data.
• Evaluation: This step involves evaluating the performance of the model. This includes
using statistical measures to assess how well the model is able to predict outcomes on
new data. This step is important because it helps ensure that the model is accurate and
can be used in the real world.
• Deployment: This step involves deploying the model into the production environment.
This includes integrating the model into existing systems and processes to make
predictions in real-time. This step is important because it allows the model to be used in
a practical setting and to generate value for the organization.
A Machine Learning Process
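As an illustrative mapping of these steps onto code (a sketch, not a full CRISP-DM
project), the snippet below prepares a built-in dataset, models it, evaluates on
held-out data, and stands in for deployment by scoring a new record.

# A compact pipeline loosely following the process steps above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Data understanding / preparation: load and split the data.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Modeling: scaling plus logistic regression in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Evaluation: accuracy on data the model has never seen.
print("Held-out accuracy:", model.score(X_te, y_te))

# Deployment (stand-in): score a new, unseen record.
print("Prediction for first test record:", model.predict(X_te[:1]))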
7. Machine Learning applications
Common applications of machine learning, several of which are detailed under the
individual learning types above, include image and speech recognition, natural language
processing, recommendation systems, fraud detection, medical diagnosis, autonomous
vehicles, weather forecasting, and algorithmic trading.
8. What is data?
• Data is any collection of facts or measurements that can be recorded and analysed.
• It can exist in various formats, such as numbers, text, images, sounds, or videos.
• It's the foundation upon which machine learning models are built and improved.
• Why is data important?
▪ Data helps in making better decisions.
▪ Data helps in solving problems by finding the reason for underperformance.
▪ Data helps one to evaluate performance.
▪ Data helps one improve processes.
▪ Data helps one understand consumers and the market.
• Data is available from different sources such as flat files, databases and data warehouses.
• Operational data: data produced by an organization's day-to-day operations.
Customer, inventory, and purchase data fall into this category.
• Non-operational data: data used for decision making rather than produced by
daily operations.
• Data by itself is meaningless unless it is labelled and processed to generate
information.
• Processed data is called information; it includes patterns, associations, and
relationships among the data.
• Elements of Big Data
▪ Small data: Data whose volume is small enough to be stored and processed by a
single small-scale computer is called small data.
▪ Big data: Big data refers to extremely large and complex data sets that traditional data
processing tools cannot handle efficiently.
▪ Big data is commonly described by a set of characteristics known as the "Five
Vs" (with variability sometimes added as a sixth):
• Volume: The sheer amount of data generated and stored. Big data typically
involves terabytes, petabytes, or even zettabytes of data. For instance, social
media platforms handle billions of posts, likes, and comments daily.
• Velocity: The speed at which data is generated and processed. Big data often
requires real-time or near-real-time processing. Examples include stock market
transactions, sensor data from connected cars, or website clicks.
• Variety: The different types of data, including structured, semi-structured, and
unstructured data such as text, images, audio, and video. It also refers to
heterogeneous sources.
• Veracity: The quality and reliability of the data. Ensuring data accuracy and
trustworthiness is crucial for meaningful analysis. Cleaning and validating big
data is a vital process to ensure meaningful results.
• Value: The potential insights and benefits that can be derived from analysing big
data. The ultimate goal is to extract valuable information that can drive decision-
making and innovation
• Variability: How often the meaning, structure, or availability of the data
changes. Example: you eat the same ice-cream daily, but the taste keeps
changing.
9. Types of data
• Data comes in many different forms, each defined by its unique characteristics,
sources and formats. Understanding these distinctions can allow for more effective
organization and data analysis, as different types of data support different use cases.
• Furthermore, a single data point or data set can fall under multiple categories;
for example, it can be both structured and quantitative, or both unstructured
and qualitative.
• Some of the most common types of data include:
o Quantitative data
o Qualitative data
o Structured data
o Unstructured data
o Semi-structured data
o Metadata
o Big data
• Quantitative data
o Quantitative data consists of values that can be measured numerically.
Examples of quantitative data include discrete data points (such as the
number of products sold) or continuous data points (such as temperature or
revenue figures).
o Quantitative data is often structured, making it easy to analyse using
mathematical tools and algorithms.
o Common use cases of quantitative data include trend forecasting, statistical
analysis, budgeting, pattern identification and performance measurement.
• Qualitative data
o Qualitative data is descriptive and non-numerical, capturing characteristics,
concepts or experiences that numbers cannot measure. Examples include
customer feedback, product reviews and social media comments.
o Qualitative data can be structured (such as coded survey responses) or
unstructured (such as free-text responses or interview transcripts).
o Common use cases for qualitative data include understanding customer
behaviour, market trends and user experiences.
• Structured data
o Structured data is organized in a clear, defined format, often stored in
relational databases or spreadsheets. It can consist of both quantitative (such
as sales figures) and qualitative data (such as categorical labels like “yes or
no”).
o Examples of structured data include customer records and financial reports,
where data fits neatly into rows and columns with predefined fields.
o The highly organized nature of structured data allows for quick querying and
data analysis, making it useful for business intelligence systems and
reporting processes.
o Structured data encountered in machine learning are as follow:
▪ Record data refers to information that is organized and stored in a record
format, often within databases or tables. Each record typically represents
a single entity or event and consists of multiple fields (columns), where
each field contains a specific attribute or property of the entity.
▪ A data matrix is a structured representation of data arranged in rows and
columns, often used in mathematics, statistics, and machine learning to
organize and analyze information. Rows represent observations or instances
(each row corresponds to a single data record); columns represent variables,
features, or attributes of the data.
▪ Graph data represents information in the form of entities (nodes) and
relationships (edges) between them. This structure is widely used to model
complex systems where connections or relationships are significant, such
as social networks, transportation networks, and knowledge graphs.
▪ Ordered data refers to data that is organized in a specific, meaningful
sequence or order. The arrangement of the data is crucial because it
conveys information about the context or relationships between elements.
• Examples of Ordered Data:
o Time Series Data: Time series data refers to a sequence of data points
collected or recorded over time at consistent intervals. This type of data
is widely used in various fields to observe trends, patterns, and
relationships.
o Sequence Data: Sequence data refers to information that is arranged in
a specific order where the arrangement itself is significant. It's
commonly used in contexts where the relationship between data points
depends on their position in the sequence.
o Spatial Data: Spatial data (also known as geospatial data) refers to
information that describes the physical location and shape of objects in
space, as well as the relationships between them. It's widely used in
fields like geography, urban planning, transportation, and
environmental science.
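A small pandas sketch of two of the structured forms described above: record data as a
table of rows (records) and columns (attributes), and time-series data as values
ordered by timestamp. All values below are made up.

# Record data and time-series data with pandas.
import pandas as pd

# Record data: each row is one entity, each column an attribute.
records = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "city": ["Bangalore", "Mysore", "Hubli"],
    "purchases": [5, 2, 7],
})
print(records)

# Time series: the same measurement ordered by timestamp.
sales = pd.Series([120, 135, 128],
                  index=pd.date_range("2024-01-01", periods=3, freq="D"))
print(sales)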
• Unstructured data
o Unstructured data lacks a strictly defined format. It often comes in complex
forms such as text documents, images and videos. Unstructured data can
include both qualitative information (such as customer comments) and
quantitative elements (such as numerical values embedded in text).
o Examples of unstructured data include emails, social media content and
multimedia files.
o Unstructured data doesn’t easily fit into traditional relational databases, and
organizations often use techniques such as natural language processing
(NLP) and machine learning to streamline analysis of unstructured data.
o Unstructured data often plays a key role in sentiment analysis, complex
pattern recognition and other advanced analytics projects.
• Semi-structured data
o Semi-structured data blends elements of structured and unstructured data. It
doesn't follow a rigid format but can include tags or markers that make it
easier to organize and analyze. Examples of semi-structured data include
XML files and JSON objects.
o Semi-structured data is widely used in scenarios such as web scraping and
data integration projects because it offers flexibility while retaining some
structure for search and analysis.
• Metadata
o Metadata is data about data. In other words, it is information about the
attributes of a data point or data set, such as file names, authors, creation
dates or data types.
o Metadata enhances data organization, searchability and management. It is
critical to systems such as databases, digital libraries and content
management platforms because it helps users more easily sort and find the
data they need.
• Big data
o Big data refers to massive, complex data sets that traditional systems can't
handle. It includes both structured and unstructured data from sources such
as sensors, social media and transactions.
o Big data analytics helps organizations process and analyze these large data
sets to systematically extract valuable insights. It often requires advanced
tools such as machine learning.
10. Big data analytics and types of analytics
• Big data analytics is the process of examining, analyzing, and interpreting large and
complex datasets (big data) to uncover patterns, correlations, trends, and insights. It
involves advanced techniques and tools to handle the sheer scale, variety, and velocity of
data that traditional data processing systems cannot manage effectively.
• Big data analytics has applications across industries—predicting customer behavior in
retail, improving healthcare outcomes, optimizing supply chains, and even driving urban
planning.
• The different types of Big data analytics are as follows:
1. Descriptive Analytics (What happened?)
Definition:
Descriptive analytics helps organizations understand past trends and events by
summarizing historical data. It provides a clear picture of what has happened using
dashboards, reports, and visualizations.
How It Works:
• Uses data aggregation and data mining techniques.
• Relies on tools like charts, graphs, and dashboards to present findings.
Examples:
• A company analyzing monthly sales data to see which products sold the most.
• A website tracking the number of visitors per day and their geographical locations.
• A hospital reviewing patient admission trends over the last five years.
Tools Used:
Excel, Google Analytics, Tableau, Power BI
2. Diagnostic Analytics (Why did it happen?)
Definition:
This type of analytics digs deeper into data to find the root cause of past events. It
helps answer "why" something happened by looking for correlations and patterns.
How It Works:
• Uses statistical techniques such as drill-down analysis, data discovery, and
correlations.
• Helps businesses understand the factors affecting their performance.
Examples:
• A company analyzing why customer churn increased last quarter by looking at
customer complaints and satisfaction scores.
• A retail store studying why sales dropped in a particular region by checking
weather patterns, competitor activity, or marketing efforts.
• A hospital investigating why patient readmission rates are rising by analyzing
patient records and treatment histories.
Tools Used:
SQL, Python, R, SAS, Splunk
3. Predictive Analytics (What is likely to happen?)
Definition:
Predictive analytics uses historical data, machine learning, and statistical models to
forecast future trends and outcomes. It helps organizations anticipate events and make
proactive decisions.
How It Works:
• Uses techniques like regression analysis, neural networks, and machine learning
algorithms.
• Requires large datasets and computational power to make accurate predictions.
Examples:
• An e-commerce company predicting which products a customer is likely to buy
next based on browsing history.
• A bank assessing a customer’s likelihood of defaulting on a loan using credit history
and spending patterns.
• A weather forecasting system predicting hurricanes based on atmospheric data.
Tools Used:
Python (Scikit-learn, TensorFlow), IBM Watson, AWS Machine Learning,
RapidMiner
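A toy predictive-analytics sketch along the lines of the loan-default example above,
using scikit-learn; the features, labelling rule, and numbers are all invented for
illustration.

# Predicting default risk for a new applicant from (synthetic) past records.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Features: [credit score, monthly spending ratio]; label: 1 = defaulted.
X = np.column_stack([rng.uniform(300, 850, 200), rng.uniform(0, 1, 200)])
y = ((X[:, 0] < 550) & (X[:, 1] > 0.6)).astype(int)  # toy labelling rule

model = LogisticRegression().fit(X, y)
new_applicant = [[480, 0.8]]
print("Default probability:", model.predict_proba(new_applicant)[0, 1])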
4. Prescriptive Analytics (What should we do about it?)
Definition:
Prescriptive analytics takes predictive insights and provides recommendations on what
actions to take. It is the most advanced form of analytics, using AI and optimization
algorithms to suggest the best course of action.
How It Works:
• Uses AI, machine learning, and optimization techniques to suggest decisions.
• Often involves simulations, scenario analysis, and decision automation.
Examples:
• Google Maps suggesting the fastest route based on real-time traffic conditions.
• An airline adjusting ticket prices dynamically based on demand and weather
conditions.
• A healthcare system recommending personalized treatment plans for patients based
on genetic and medical data.
Tools Used:
IBM Watson, Google AI, SAS, Oracle AI, Apache Spark
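At its core, prescriptive analytics chooses the action that maximizes an objective under constraints. A tiny optimization sketch with SciPy (all numbers hypothetical):
```python
from scipy.optimize import linprog

# Hypothetical: maximize profit 40x + 30y subject to limited machine and labour hours.
# linprog minimizes, so the objective coefficients are negated.
c = [-40, -30]
A_ub = [[1, 1],    # machine-hours used per unit of x and y
        [2, 1]]    # labour-hours used per unit of x and y
b_ub = [100, 160]  # hours available

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # recommended production plan and the profit it yields
```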
11. Big data analytics framework
• Many frameworks have been proposed for the analysis of big data. A four-layer Big Data analytics framework is discussed below:
o Data Collection Layer:
▪ Purpose: This layer collects, stores, and manages raw data from various sources.
Key Components:
▪ Data Sources: Structured (databases), semi-structured (logs, XML), and
unstructured (social media, images, videos).
▪ Data Storage: Traditional databases (SQL), NoSQL databases (MongoDB,
Cassandra), Data Lakes (Hadoop, AWS S3).
▪ Data Processing: Batch processing (Hadoop MapReduce) and real-time
processing (Apache Kafka, Apache Spark).
▪ Example: An e-commerce company collects data from website clicks, customer
transactions, and social media interactions.
o Data Management Layer
▪ Purpose: This layer processes, cleans, and transforms raw data into a usable
format.
Key Components:
▪ ETL (Extract, Transform, Load): Extracts data from different sources, cleans it,
and loads it into a data warehouse (e.g., Apache Nifi, Talend, AWS Glue).
▪ Data Governance: Ensures data quality, security, and compliance (GDPR,
HIPAA).
▪ Data Warehouses: Stores processed data for analysis (Google BigQuery,
Snowflake, Amazon Redshift).
▪ Example: A telecom company cleans call records, removes duplicate data, and
organizes it for billing analytics.
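Mirroring the telecom example, a minimal ETL-style sketch in pandas (the call records are hypothetical, and a CSV stands in for the data warehouse):
```python
import pandas as pd

# Extract: hypothetical raw call records (illustrative only)
calls = pd.DataFrame({
    "caller":       ["98450", "98450", "99010"],
    "duration_sec": [60, 60, 300],
    "date":         ["2024-01-01", "2024-01-01", "2024-01-02"],
})

# Transform: drop duplicate records and normalize types
clean = calls.drop_duplicates().copy()
clean["date"] = pd.to_datetime(clean["date"])

# Load: write the cleaned data to the analytics store (a CSV here for simplicity)
clean.to_csv("clean_calls.csv", index=False)
```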
o Data Analysis Layer
▪ Purpose: This layer applies different analytics techniques to generate insights.
Key Components:
▪ Descriptive Analytics: Dashboards, reports (Tableau, Power BI).
▪ Diagnostic Analytics: Root cause analysis, trend identification (Python, R, SQL).
▪ Predictive Analytics: Machine learning models (TensorFlow, Scikit-learn).
▪ Prescriptive Analytics: AI-driven decision-making (IBM Watson, Google AI).
▪ Example: A bank uses machine learning to predict loan default risks based on
customer data.
o Data Visualization Layer
▪ Purpose: This layer presents insights through reports, dashboards, and business
applications.
Key Components:
▪ Data Visualization: Power BI, Tableau, Looker.
▪ Business Applications: CRM, ERP, AI-powered recommendation systems.
▪ User Interaction: Mobile apps, web dashboards, AI chatbots.
▪ Example: A retailer uses a real-time dashboard to track inventory levels and predict
demand fluctuations.
12. Descriptive statistics
• Descriptive statistics refers to a set of statistical methods used to summarize and present
data in a clear and understandable form. It involves organizing raw data into tables,
charts, or numerical summaries, making it easier to identify patterns, trends, and
anomalies.
• Its primary aim is to define and analyze the fundamental characteristics of a dataset
without making sweeping generalizations or assumptions about the entire data set.
• It helps to organize the data in a more manageable and readable format.
• Dataset and Datatypes:
o The different data types are listed in the following chart
o Categorical or Qualitative data
• Categorical data can be put in groups or categories using names or labels. This
grouping is typically generated using a matching procedure based on data
attributes and similarities between these qualities.
• Each piece of a categorical dataset, also known as qualitative data, may be
assigned to only one category based on its qualities, and each category is mutually
exclusive.
• There are two primary categories of categorical data:
• Nominal data:
• This is the data category that names or labels its categories. It has features
resembling a noun and is occasionally referred to as naming data.
• Example: Colors (red, blue, green), gender (M/F), and types of animals.
• Ordinal data:
• Elements with rankings, orders, or rating scales are included in this category of categorical data. Ordinal data can be ordered and counted, but the differences between values cannot be measured.
• Example: Education level (high school, college, graduate), temperature (high, medium, low)
• Numerical or Quantitative data
• Data expressed in numerical terms rather than in natural language descriptions is called numerical data. It is collected only in numerical form, hence the name. This data type, also referred to as quantitative data, can be used to measure a person’s height, weight, IQ, etc.
• Numerical data can be of two types:
• Discrete Data:
• Countable numerical data are discrete data. In other words, they can be mapped one-to-one to the natural numbers.
• Example: Age, the number of students in a class, the number of candidates in
an election, etc., are a few examples of discrete data in general.
• Continuous Data:
• This is an uncountable numerical data type. Continuous data are represented as a series of intervals on the real number line.
• Example: Student CGPA, height, and other continuous data types are a few
examples.
• Interval Data:
• In ordinal scales, the interval between adjacent values is not constant. For example, the difference in finishing time between the 1st and 2nd place horses need not be the same as that between the 2nd and 3rd place horses.
• An interval scale has a constant interval but lacks a true 0 point. As a result, one
can add and subtract values on an interval scale, but one cannot multiply or
divide units.
• Hence, it is similar to ordinal but the differences or intervals between values or
rankings are equally split. Therefore, the difference between a ranking of a 7 and
a 6 is the same as the difference between a ranking of a 9 and a 10 on a 10-point
scale.
• Ratio Data:
• Ratio data is the most complex of the four scales of measurement, as well as the most preferred scale of measurement.
• It has all the same properties as interval data but possesses a natural zero, meaning there is a point where that measurement, whatever it may be, does not exist.
• A ratio scale has the property of equal intervals but also has a true 0 point. As a
result, one can multiply and divide as well as add and subtract using ratio scales.
• Variables such as height, weight, and duration are ratio data.
o Third way of categorizing data (classification chart not reproduced here).
13. Univariate Data Analysis and Visualization
• Univariate data refers to a type of data in which each observation or data point
corresponds to a single variable.
• It involves the measurement or observation of a single characteristic or attribute for each
individual or item in the dataset.
• Analysing univariate data is the simplest form of analysis in statistics.
• Key points in Univariate analysis:
• No Relationships: Univariate analysis focuses solely on describing and summarizing the
distribution of the single variable. It does not explore relationships between variables or
attempt to identify causes.
• Descriptive Statistics: Descriptive statistics, such as measures of central tendency (mean,
median, mode) and measures of dispersion (range, standard deviation), are commonly
used in the analysis of univariate data.
• Visualization: Histograms, box plots, and other graphical representations are often used
to visually represent the distribution of the single variable
• Types of univariate analyses
The following are the most common types of summary statistics:
o Measures of dispersion: these numbers describe how spread out the values in a dataset are. The range, standard deviation, interquartile range, and variance are some examples.
o Range: the difference between the highest and lowest value in a data set.
o Standard deviation: an average measure of the spread.
o Interquartile range: the spread of the middle 50% of the values.
o Measures of central tendency: these numbers describe the location of the center
point of a data set or the middle value of the data set. The mean, median and mode
are the three main measures of central tendency.
• Data Visualization
o Data visualization is the graphical representation of information and data.
o By using visual elements like charts, graphs, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data.
o It provides an excellent way for employees or business owners to present data to non-
technical audiences without confusion.
o Businesses, researchers, and analysts rely on visualization tools to interpret large
datasets efficiently, detect anomalies, and drive strategic insights.
o By transforming numbers into meaningful visuals, data visualization enhances
comprehension, storytelling, and informed decision-making across industries.
o Why is Data Visualization Important?
o Simplifying Complex Data: Data visualization simplifies large and complex datasets by converting them into graphs, charts, and interactive visuals, making decision-making more efficient and accessible.
o Enhancing Data Interpretation: Visual representation of data enables better
pattern recognition and insight extraction. Businesses can identify
correlations, track performance, and detect anomalies, leading to more
informed strategic planning.
o Saving Time in Decision-Making: Visualization tools accelerate business
intelligence and research insights by providing instant data overviews.
o Improving Communication: Pie charts, trend graphs, and infographics help
professionals present findings in a way that is easier to understand for
stakeholders. Whether in business reports, investor meetings, or research
presentations, visualizations enhance engagement and message retention.
o Strengthening Big Data Analytics: As organizations handle massive data
volumes, visualization becomes essential for processing, filtering, and
analyzing large-scale datasets. AI-powered real-time analytics dashboards
and predictive modeling tools enable businesses to extract actionable insights
from structured and unstructured data, driving efficiency in industries like
finance, healthcare, and e-commerce.
o Advantages of Data Visualization:
o Faster Data Comprehension – Data visualization enables users to grasp
complex information quickly by presenting it in an intuitive format. Instead of
analyzing spreadsheets or raw data, decision-makers can interpret charts, graphs,
and dashboards efficiently.
o Identification of Correlations and Anomalies – Visualizing data helps detect
patterns, relationships, and outliers that might not be apparent in raw datasets.
This is particularly useful in fraud detection, market analysis, and performance
tracking.
o Enhanced User Engagement – Well-designed visualizations make data more
accessible and engaging. Interactive dashboards and AI-driven visualizations
allow users to explore datasets dynamically, improving data-driven storytelling
and communication.
o Disadvantages of Data Visualization:
o Risk of Misinterpretation Due to Poor Design – Incorrect use of chart types,
misleading scales, or color schemes can distort insights, leading to flawed
conclusions. Overly complex visualizations may also confuse users rather than
clarify data.
o Data Bias and Misleading Representations – Visualization tools can
inadvertently amplify biases if data selection is not handled carefully. Cherry-
picked data or improper aggregation may result in skewed narratives that mislead
decision-makers.
o Performance Challenges with Large Datasets – Handling massive datasets in
real-time dashboards requires significant computational power and efficient
algorithms. Poorly optimized visualizations can slow down performance, making
analysis cumbersome for users.
o Data Visualization and Big Data - As organizations generate massive volumes
of data, visualization plays a crucial role in analyzing, interpreting, and extracting
insights from large-scale datasets. Without visual representation, handling big
data can be overwhelming, making it difficult to identify trends and correlations.
Advanced visualization techniques, such as real-time dashboards, heatmaps, and
predictive analytics models, help simplify complex data structures, enabling faster
and more informed decision-making.
o Types of Data Visualizations
▪ Bar Charts – Used to compare categorical data, making them ideal for sales
analysis, survey results, and financial reporting.
▪ Pie Chart: A circular chart divided into wedge-shaped segments (sectors) that shows data as a percentage of a whole.
▪ Histogram: A type of bar chart that splits a continuous measure into different bins to help analyze the distribution.
▪ Dot Plots: The Wilkinson dot plot represents the distribution of continuous data
in the form of individual dots for each value.
▪ Heatmaps – Represent data density using color gradients, commonly used in
website analytics and geographic data analysis.
o Bar Charts
o Bar charts enable us to compare numerical values like integers and
percentages.
o They use the length of each bar to represent the value of each variable. For example, bar charts show variations across categories or subcategories by scaling the height or width of simple, evenly spaced bars (rectangles).
o Bar charts can represent quantitative measures vertically, on the y-axis, or
horizontally, on the x-axis. The style depends on the data and on the
questions the visualization addresses.
o The qualitative dimension will go along the opposite axis of the quantitative
measure.
o Bar charts typically have a baseline of zero. If another starting point is used,
the axis should be clearly labeled to avoid misleading the viewer.
o A good bar chart will follow these rules:
o The base starts at zero
o The axes are labeled clearly
o Colors are consistent and defined
o The bar chart does not display too many bars
o When creating a bar chart, do not:
o Make each bar a different width
o Cram too many bars into subcategories
o Leave the axes unlabeled
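For illustration, a minimal bar chart sketch in Python that follows the rules above (matplotlib is assumed as the plotting library, and the sales figures are hypothetical):
```python
import matplotlib.pyplot as plt

products = ["A", "B", "C", "D"]          # qualitative dimension
units_sold = [120, 95, 150, 80]          # quantitative measure (hypothetical)

plt.bar(products, units_sold, color="steelblue")  # baseline starts at zero by default
plt.xlabel("Product")                             # axes labeled clearly
plt.ylabel("Units sold")
plt.title("Monthly sales by product")
plt.show()
```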
o Pie Chart
o A pie chart is a circular statistical graphic that visually displays data in a circular
graph. It is divided into slices to illustrate numerical proportion, where the arc
length of each slice (and consequently its central angle and area) is proportional
to the quantity it represents.
o Pie charts are commonly used to represent data using the attributes of circles,
spheres, and angular data to represent real-world information.
o Pie Chart Formula
We know that the total value of the pie is always 100%. It is also known that a
circle subtends an angle of 360°. Hence, the total of all the data is equal to 360°.
Based on these, there are two main formulas used in pie charts:
o To calculate the percentage of the given data, we use the formula:
(Frequency ÷ Total Frequency) × 100
o To convert the data into degrees we use the formula:
(Given Data ÷ Total value of Data) × 360°
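Both formulas can be checked with a few lines of Python (a sketch with hypothetical monthly expense data):
```python
# Hypothetical monthly expenses (illustrative data)
frequencies = {"Rent": 9000, "Food": 6000, "Travel": 3000}
total = sum(frequencies.values())   # total of the pie = 100% = 360 degrees

for category, f in frequencies.items():
    percentage = f / total * 100    # (Frequency / Total Frequency) x 100
    degrees = f / total * 360       # (Given Data / Total value of Data) x 360
    print(f"{category}: {percentage:.1f}% -> {degrees:.0f} degrees")
```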
o Histogram
o A histogram is the graphical representation of data where data is grouped into
continuous number ranges and each range corresponds to a vertical bar.
o The horizontal axis displays the number range.
o The vertical axis (frequency) represents the amount of data that is present in each
range.
o The number ranges depend upon the data that is being used
o A histogram shows the distribution of a dataset, it is used for displaying the
continuous (or quantitative) form of data frequency distribution.
o On a histogram, data is shown in intervals, and the height of each bar corresponds to the frequency or count of data points found in that specific interval.
o It allows us to assess where the values are concentrated, what the extremes
are, and whether there are any gaps or anomalous values
o For example: in a hospital, there are 20 newborn babies whose ages in increasing
order are as follows: 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5. This
information can be shown in a frequency distribution table as follows:
Age:       1   2   3   4   5
Frequency: 4   5   8   2   1
Histogram of the Hospital data (figure not reproduced here).
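The same distribution can also be plotted in code (a minimal sketch using matplotlib, an assumed library choice, with the newborn ages above):
```python
import matplotlib.pyplot as plt

ages = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5]

# bins define the continuous number ranges; each bar's height is the frequency
plt.hist(ages, bins=[1, 2, 3, 4, 5, 6], edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of the hospital data")
plt.show()
```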
o Dot plots
o A dot plot is a simple form of data visualization that consists of data points plotted as
dots on a graph with an x- and y-axis.
o These types of charts are used to graphically depict certain data trends or groupings. A
dot plot is similar to a histogram in that it displays the number of data points that fall
into each category or value on the axis, thus showing the distribution of a set of data.
o There are two types of dot plots: the Cleveland and Wilkinson dot plots.
o This type of charting method is commonly used by the Federal Open Market Committee
(FOMC).
o Dot plots are generally arranged with one axis showing the range of values or categories
along which the data points are grouped.
o The second axis shows the number of data points in each group. Dots may be vertically
or horizontally stacked to show how many are in each group for easy visual comparison.
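A small Wilkinson-style dot plot sketch (matplotlib assumed; the data is hypothetical), stacking one dot per observation over its value:
```python
from collections import Counter
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]      # hypothetical observations
counts = Counter(data)

# One dot per observation, stacked vertically over its value
for value, count in counts.items():
    plt.scatter([value] * count, range(1, count + 1), color="steelblue")

plt.xlabel("Value")
plt.ylabel("Count")
plt.yticks(range(1, max(counts.values()) + 1))
plt.title("Wilkinson dot plot")
plt.show()
```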
o Central tendency
▪ A measure of central tendency is a single value that represents the center point of
a dataset. This value can also be referred to as “the central location” of a dataset.
▪ In statistics, there are three common measures of central tendency:
▪ The mean
▪ The median
▪ The mode
▪ Each of these measures finds the central location of a dataset using different
methods. Depending on the type of data you’re analyzing, one of these three
measures may be better to use than the other two.
▪ The main functions of measures of central tendency are as follows:
1) They provide a summary figure with the help of which the central location of the
whole data can be explained. When we compute an average of a certain group we
get an idea about the whole data.
2) A large amount of data can be easily reduced to a single figure. Mean, median, and mode can be computed for large datasets, and a single representative figure derived.
3) When mean is computed for a certain sample, it will help gauge the population
mean.
4) The results obtained from computing measures of central tendency will help in
making certain decisions. This holds true not only to decisions with regard to
research but could have applications in varied areas like policy making, marketing
and sales and so on.
5) Comparison can be carried out based on single figures computed with the help of
measures of central tendency. For example, with regard to performance of
students in mathematics test, the mean marks obtained by girls and the mean
marks obtained by boys can be compared.
o Mean or Arithmetic Mean
o Mean for sample is denoted by symbol ‘M or x̅ (‘x-bar’)’ and mean for population is
denoted by ‘µ’ (mu).
o It is one of the most commonly used measures of central tendency and is often referred to as the average. It can also be termed the most sensitive measure of central tendency, as all the scores in the data are taken into consideration when it is computed.
o Further statistical techniques can be computed based on mean, thus, making it even
more useful. Mean is a total of all the scores in data divided by the total number of
scores.
o For example, if there are 100 students in a class and we want to find the mean or average marks obtained by them in a psychology test, we will add all their marks and divide by 100 (the number of students) to obtain the mean.
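The same computation in one line with Python's statistics module (the marks are hypothetical):
```python
import statistics

marks = [45, 67, 89, 72, 58, 91, 60]   # hypothetical test scores
# Mean = total of all the scores / number of scores
print(statistics.mean(marks))          # 68.857...
```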
o Properties of Mean
o Advantages of Mean
1) The definition of mean is rigid which is a quality of a good measure of central
tendency.
2) It is not only easy to understand but also easy to calculate.
3) All the scores in the distribution are considered when mean is computed.
4) Further mathematical calculations can be carried out on the basis of mean.
5) Fluctuations in sampling are least likely to affect mean.
o Limitations of Mean
1) Outliers or extreme values can have an impact on mean.
2) When there are open ended classes, such as 10 and above or below 5, mean cannot
be computed. In such cases median and mode can be computed. This is mainly
because in such distributions mid point cannot be determined to carry out
calculations.
3) If a score in the data is missing, lost, or unclear, then mean cannot be computed unless the missing score is dropped and the mean is computed for the rest of the data.
4) It is not possible to determine mean through inspection. Further, it cannot be determined from a graph.
5) It is not suitable for data that is skewed or is very asymmetrical as then in such
cases mean will not adequately represent the data.
o Median
o Median is a point in any distribution below and above which lie half of the scores.
Median is also referred to as P50.
o The symbol for median is ‘Md’. As stated by Bordens and Abbott, ‘median is the
middle score in an ordered distribution’.
o If we take the example discussed earlier of the marks obtained by 100 students in a
psychology test, these marks are to be arranged in an order, either ascending or
descending. The middle score in this distribution is then identified as median.
Though this would seem easy for an odd number of scores, in case of even number
of scores a certain procedure is followed that will be discussed when we learn how
to compute median later in this unit.
o Properties of Median
1) When compared to mean, median is less sensitive to extreme scores or outliers.
2) When a distribution is skewed or is asymmetrical median can be adequately used.
3) When a distribution is open ended, that is, actual score at one end of the distribution
is not known, then median can be computed.
o Advantages
▪ The definition of median is rigid which is a quality of a good measure of central
tendency.
▪ It is easy to understand and calculate.
▪ It is not affected by outliers or extreme scores in data.
▪ Unless the median falls in an open ended class, it can be computed for grouped data
with open ended classes.
▪ In certain cases it is possible to identify median through inspection as well as
graphically.
o Disadvantages
▪ Some statistical procedures using median are quite complex. Computation of
median can be time consuming when large data is involved because the data needs
to be arranged in an order before median is computed.
▪ Median cannot be computed exactly when ungrouped data has an even number of scores. In such cases, median is estimated as the mean of the two scores in the middle of the distribution (see the sketch after this list).
▪ It is not based on each and every score in the distribution.
▪ It can be affected by sampling fluctuations and thus can be termed as less stable
than mean.
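To illustrate the even-count case noted in the disadvantages above (a minimal sketch; the scores are hypothetical):
```python
import statistics

scores = [12, 7, 19, 3]            # even number of observations
# statistics.median sorts the data and averages the two middle scores: (7 + 12) / 2
print(statistics.median(scores))   # 9.5
```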
o Mode
o Mode is the value of the observation that has the maximum frequency corresponding to it. In other words, it is the observation that occurs the maximum number of times in a dataset.
o Mode of Ungrouped Data
o Mode of Ungrouped Data can be simply calculated by observing the observation with the
highest frequency. Let’s see an example of the calculation of the mode of ungrouped data.
o For example, if you have a set of numbers: 2, 3, 3, 5, 7, 7, 7, 9, the mode would be 7, as it
appears the most times.
o Mode of Grouped Data
o Formula to find the mode of grouped data is:
Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h
o where,
o l is the lower class limit of the modal class
o h is the class size
o f1 is the frequency of the modal class
o f0 is the frequency of the class that precedes the modal class
o f2 is the frequency of the class that succeeds the modal class
o Properties of Mode
1) Mode can be used with variables that can be measured on a nominal scale.
2) Mode is easier to compute than mean and median. But it is not used often because of its lack of stability from one sample to another, and also because a single set of data may have more than one mode. When there is more than one mode, the modes cannot be said to adequately measure the central location.
3) Mode is not affected by outliers or extreme scores.
o Advantages of Mode
1) It is not only easy to comprehend and calculate but it can also be determined by mere
inspection.
2) It can be used with quantitative as well as qualitative data.
3) It is not affected by outliers or extreme scores.
4) Even if a distribution has one or more open ended class(es), mode can easily be computed.
o Disadvantages of Mode
1) It is sometimes possible that all the scores in the data differ from each other, and in such cases the data may have no mode.
2) Mode cannot be rigidly defined.
3) In case of bimodal, trimodal or multimodal distribution, interpretation and comparison
becomes difficult.
4) Mode is not based on the whole distribution.
5) It may not be possible to compute further mathematical procedures based on mode.
6) Sampling fluctuations can have an impact on mode.
o Dispersion
o Dispersion (or spread) is a means of describing the extent of distribution of data around a central value or point. It aids in understanding data distribution.
o Lower dispersion indicates higher precision in the manufacturing process or data measurements, whereas higher dispersion means lower precision.
o Dispersion means the distance of the scattered data from the central value of the data.
o It gives information regarding the volatility or non-volatility nature of the data set. More
distance from the central point represents a more volatile nature and vice versa.
o In finance, dispersion is inversely proportional to securities' efficiency, yield, or
performance.
o Measure of dispersion can be absolute or relative. Absolute measures have the same unit
of measurement as the given dataset, while relative measures are expressed as ratios and
percentages.
o There are two methods to measure the degree of variation present in the data set:
o Absolute Measure
o Relative Measure
o Range
o Range refers to the difference between the largest and the smallest values in a given
data set. The higher the value of the range, the higher the spread in data.
o R = L − S
o where,
o L = Largest value
o S = Smallest value
o Standard Deviation
o Standard deviation is a fundamental concept in statistics that measures the dispersion of data points: it defines the extent to which data points in a dataset deviate from the mean, providing a clear sense of the variability or spread within the data.
o Measures of deviation tell us about the scatter of the data. A lower degree of deviation tells us that the observations xi are close to the mean value and the dispersion is low. In contrast, a higher degree of deviation tells us that the observations xi are far from the mean value and the dispersion is high.
o There are two standard deviation formulas that are used to find the Standard
Deviation of any given data set. They are,
o Population Standard Deviation Formula: σ = √( Σ(xi − μ)² / N )
o Sample Standard Deviation Formula: s = √( Σ(xi − x̄)² / (n − 1) )
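Both versions are available in Python's statistics module (the data is hypothetical):
```python
import statistics

data = [4, 8, 6, 5, 3]

print(statistics.pstdev(data))   # population SD: divides by N
print(statistics.stdev(data))    # sample SD: divides by (n - 1)
```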
o Interquartile Range
o The Quartiles divide a set of data series into four equal parts. The four parts are namely,
First Quartile (Q1), Second Quartile (Q2), Third Quartile (Q3), and Fourth Quartile
(Q4). We also know Second Quartile (Q2) as the Median of the data series as it also
divides the data into two equal parts.
o First quartile: It divides the data such that one-fourth or the 25% of the values are
below it and the remaining three-fourth or 75% are above it. We also call the first
quartile as a lower quartile. We denote it as Q1.
o Second quartile: It divides the data or observations into two equal parts so that 50%
of the observations are below it and 50% of the observations are above it. We also
know it as Median. We denote it as Q2.
o Third quartile: It divides the series such that three-fourth or 75% of the observation
is below it and the remaining one-fourth or 25% of the observations are above it.
We also call the third quartile as the upper quartile. We denote it as Q3.
o The interquartile range measures the difference between the first quartile (25th
percentile) and third quartile (75th percentile) in a dataset. This represents the spread of
the middle 50% of values.
o We should use the interquartile range when we’re interested in understanding the spread between the 25th and 75th percentiles of a dataset.
o Formula for Interquartile Range
IQR = Q3 – Q1
o Find the inter-quartile range for the following data: 56, 14, 84, 21, 85, 2, 35, 74, 66, 52,
45
o Solution: Arranging the data in ascending order: 2, 14, 21, 35, 45, 52, 56, 66, 74, 84, 85
o Q1=(N+1)/4th term
=(11+1)/4th term
= 3rd term
= 21
o And, Q3 = 3×(N+1)/4 th term
= 3×(11+1)/4 th term
= 9th term
= 74
o Interquartile Range = Q3 – Q1= 74 – 21 = 53
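The same result can be reproduced with NumPy, noting that quantile conventions differ between tools: method="weibull" (NumPy 1.22+) matches the (N + 1)/4 rule used above, while NumPy's default "linear" method gives slightly different quartiles:
```python
import numpy as np

data = [56, 14, 84, 21, 85, 2, 35, 74, 66, 52, 45]

q1 = np.percentile(data, 25, method="weibull")   # matches the (N + 1)/4 rule
q3 = np.percentile(data, 75, method="weibull")
print(q1, q3, q3 - q1)                           # 21.0 74.0 53.0
```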
o Five Point Summary and Box plots
o Five number summary is a part of descriptive statistics and consists of five values and
all these values will help us to describe the data.
1. The minimum value (the lowest value)
2. 25th Percentile or Q1
3. 50th Percentile or Q2 or Median
4. 75th Percentile or Q3
5. Maximum Value (the highest value)
o Box plots (also called box-and-whisker plots or box-whisker plots) give a good
graphical image of the concentration of the data. They also show how far the extreme
values are from most of the data.
o A box plot is constructed from five values: the minimum value, the first quartile, the
median, the third quartile, and the maximum value. We use these values to compare how
close other data values are to them.
o To construct a box plot, use a horizontal or vertical number line and a rectangular box.
o The smallest and largest data values label the endpoints of the axis.
o The first quartile marks one end of the box and the third quartile marks the other end of
the box.
o Approximately the middle 50 percent of the data fall inside the box.
o The "whiskers" extend from the ends of the box to the smallest and largest data values.
o The median or second quartile usually lies between the first and third quartiles, but it can coincide with either or both of them. The box plot gives a good, quick picture of the data.
o Example: plot the box-and-whisker plot for a given dataset (worked example data and plot not reproduced here).
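As a minimal matplotlib sketch (library choice assumed), using the IQR example data from above:
```python
import matplotlib.pyplot as plt

data = [2, 14, 21, 35, 45, 52, 56, 66, 74, 84, 85]

# By default the whiskers extend to the most extreme points within 1.5 x IQR;
# whis=(0, 100) makes them reach the minimum and maximum, as described above.
plt.boxplot(data, vert=False, whis=(0, 100))
plt.xlabel("Value")
plt.title("Box-and-whisker plot")
plt.show()
```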
o Shape
o The shape of distribution provides helpful insights about the distribution. This includes
the distribution’s peaks, symmetry, uniformity, as well as its tendency to lean towards
the left or right corner.
o The shape of the distribution is a helpful feature that easily reflects the frequency of
values within given intervals. When given a distribution and its shape, here are other
helpful details we can learn about a data set from the shape of its distribution:
o Represents how spread out the data is across the range
o Helps identify the range in which the mean of the data set lies
o Highlights the range of a given data set
o Skewness
o Skewness is a measure that tells us how much a dataset deviates from a normal distribution, which is a perfectly symmetrical bell-shaped curve. In simpler terms, it shows whether the data points tend to cluster more on one side.
o Types of Skewness
▪ Positive skewness/Right Skewed
▪ Negative skewness/Left Skewed
o Positive skewness/Right Skewed
o If the distribution’s tail is longer on the right side, the data is positively skewed. This means there are a few unusually high values.
o This type of distribution is called right-skewed. When you measure this skewness, the
number you get is bigger than zero. Imagine looking at a graph of this data: the average
(mean) value is usually the highest, followed by the middle value (median), and then
the most common value (mode).
o While in negative skewness, if the tail is longer on the left side, the data is negatively
skewed. This indicates a few unusually low values.
o Negative skewness/Left Skewed
o A negatively skewed distribution is one where the long tail extends to the left, known
as left-skewed. For such distributions, the skewness value is less than zero.
o The left tail of the distribution is longer or fatter than the right.
o The mean is less than the median, and the mode is greater than both mean and median.
o Higher values are clustered in the “hill” of the distribution, while extreme values are in
the long left tail.
o Formula to compute Skewness
o Pearson’s coefficient of skewness: Sk = 3(Mean − Median) / Standard Deviation
o Another formula, highly influenced by the work of Karl Pearson, is the moment-based formula to approximate skewness. It is more reliable and is given as:
Skewness = Σ(xi − x̄)³ / (n · s³)
o Kurtosis
o Kurtosis focuses more on the height of the distribution. It tells us how peaked or flat our normal (or normal-like) distribution is.
o High kurtosis indicates:
o Sharp peakedness in the distribution’s center.
o More values concentrated around the mean than normal distribution.
o Heavier tails because of a higher concentration of extreme values or
outliers in tails.
o Greater likelihood of extreme events.
o On the other hand, low kurtosis indicates:
o Flat peak.
o Fewer values concentrated around the mean than in a normal distribution.
o Lighter tails.
o Lower likelihood of extreme events.
o Types of Kurtosis are as follows:
o Leptokurtic: a curve having a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
o Mesokurtic: a curve having a peak similar to that of the normal curve. In this curve, there is an even distribution of items around the central value.
o Platykurtic: a curve having a lower, flatter peak than the normal curve. In this curve, there is less concentration of items around the central value.
o Formula to calculate Kurtosis: Kurtosis = Σ(xi − x̄)⁴ / (n · s⁴). Excess kurtosis subtracts 3, so that the normal distribution has excess kurtosis 0.
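Both measures are available in SciPy (a minimal sketch; the data is hypothetical and chosen to have a long right tail):
```python
from scipy.stats import kurtosis, skew

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 18]   # a few unusually high values

print(skew(data))                    # > 0: right-skewed
print(kurtosis(data))                # excess kurtosis (normal distribution = 0)
print(kurtosis(data, fisher=False))  # raw kurtosis (normal distribution = 3)
```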
o Mean Absolute Deviation (MAD)
o Mean Absolute Deviation is one of the metrics of statistics that helps us find the average spread of the data, i.e., it shows the average distance of the observations in a dataset from the mean of the dataset. It is helpful
in the analysis and understanding of data.
o Mean Absolute Deviation is one of the measures of spread, alongside the range, quartiles, interquartile range, standard deviation, and variance.
o Formula to compute MAD: MAD = (1/n) Σ |xi − μ|
where,
xi represents the each observation of the dataset,
μ is the mean of the data set, and
n is the number of observations in the data set.
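Implemented directly from the formula (the data is hypothetical):
```python
data = [3, 8, 10, 17, 24, 27]                      # hypothetical observations

mu = sum(data) / len(data)                         # mean of the dataset
mad = sum(abs(x - mu) for x in data) / len(data)   # average |xi - mu|
print(mad)
```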