
(Accredited by NBA – AICTE, New Delhi, NAAC with ‘A’ Grade, affiliated to Anna University, Chennai & Accredited by TATA Consultancy Services)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3301- DATA EXPLORATION AND VISUALIZATION


[ REGULATION – 2021]

STUDY MATERIAL

NAME OF THE STUDENT:…………………………………………………………………….

REGISTER NUMBER:…………………………………………………………………………....

YEAR / SEM:………………………………………………………………………………………...

ACADEMIC YEAR:………………………………………………………………………………

PREPARED BY

Mrs. N. JANCIRANI, AP/AI&DS

Mrs. S. THILAGAVATHI, AP/AI&DS

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


II YEAR / III SEM
AD3301 DATA EXPLORATION AND VISUALIZATION
SYLLABUS
UNIT I
EXPLORATORY DATA ANALYSIS

SYLLABUS: EDA fundamentals – Understanding data science – Significance of


EDA – Making sense of data – Comparing EDA with classical and Bayesian
analysis – Software tools for EDA - Visual Aids for EDA- Data transformation
techniques-merging database, reshaping and pivoting, Transformation
techniques - Grouping Datasets - data aggregation – Pivot tables and cross-
tabulations.

PART A

1. Define Exploratory Data Analysis (EDA). Or What does EDA mean in data?
(NOV/DEC 2023).
 EDA is a process of examining the available dataset to discover patterns,
spot anomalies, test hypotheses, and check assumptions using statistical
measures and other data visualization methods.

2. List the EDA fundamental Steps.(Nov/Dec 2024)


 Exploratory Data Analysis (EDA) fundamentals
 Understanding data science
 The significance of EDA
 Making sense of data
 Comparing EDA with classical and Bayesian analysis
 Software tools available for EDA

3. Define Data Science.


Data science
 Data science is the study of data to extract meaningful insights for business.


 It is a multidisciplinary approach that combines principles and practices


from the fields of mathematics, statistics, artificial intelligence, and
computer engineering to analyze large amounts of data.
 A data scientist is a professional who creates programming code and
combines it with statistical knowledge to create insights from data.

4. List the phases or steps in Data Science Lifecycle.


1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. EDA
6. Modeling and algorithm
7. Data Product
8. Communication

5. List the steps in the significance of EDA (Nov/Dec 2024)


a. Problem definition
b. Data preparation
c. Data analysis
d. Development and representation of the results

6. Define dataset and the types of data.


 A dataset contains many observations about a particular object.
 Each observation can have a specific value for each of these variables.
 Types of dataset
o Numerical data
o Categorical data.
7. Compare EDA with classical and Bayesian analysis. (NOV/DEC 2024)
 Classical analysis: problem definition and data collection are followed by
model development, which is followed by analysis and result communication.
 EDA: follows the same sequence, except that the model imposition and the
data analysis steps are swapped; the main focus is on the data, its
structure, outliers, and visualizations.
 Bayesian analysis: incorporates prior probability distribution knowledge
into the analysis steps.

8. List some Software tools available for EDA (NOV /DEC 2024)
 Python: This is an open source programming language widely used in data
analysis, data mining, and data science
 R programming language: R is an open source programming language that
is widely utilized in statistical computation and graphical data analysis.
 Weka: This is an open source data mining package that involves several EDA
tools and algorithms.
 KNIME: This is an open source tool for data analysis and is based on Eclipse.
9. List the Python packages for EDA.
Python tools and packages:
 NumPy
 pandas
 SciPy
 Matplotlib
 seaborn

10. List some Visual Aids for EDA. (NOV /DEC 2024)
 Line chart
 Bar chart
 Scatter plot
 Area plot and stacked plot
 Pie chart
 Table chart
 Polar chart
 Histogram
 Lollipop chart
 Choosing the best chart
 Other libraries to explore
11. Define Scatter plot
 Scatter plots are also called scatter graphs, scatter charts,
scattergrams, and scatter diagrams.
 Scatter plots can be constructed in the following two situations:


 When one continuous variable is dependent on another variable


 When both continuous variables are independent
 Hence, scatter plots are sometimes referred to as correlation plots.

12. Define Histogram


 Histogram plots are used to depict the distribution of any continuous
variable.
13. List the different types of charts based on the purposes.


14. Define data transformation and list data transformation techniques.


Data Transformation
 Data transformation is a set of techniques used to convert data from one
format or structure to another format or structure.
 The main reason for transforming the data is to get a better representation
such that the transformed data is compatible with other data.
 Data transformation techniques are merging database, reshaping and
pivoting.
15. Define stacking and unstacking.
 Stacking: Stack rotates from any particular column in the data to the
rows.
 Unstacking: Unstack rotates from the rows into the column.
16. List some examples of data transformation techniques.
o Data deduplication
o Key restructuring
o Data cleansing
o Data validation
o Format revisioning
o Data derivation.
o Data aggregation.
o Data integration.
o Data filtering

17. What is incomplete data or missing data? How to handle missing data?
 Handling missing data
 Whenever there are missing values, a NaN value is used, which indicates
that there is no value specified for that particular index.
 Reason for Incomplete data:
o Data is retrieved from an external source.
o Joining two different datasets and some values are not matched.
o Missing values due to data collection errors.
o When the shape of data changes, there are new additional rows or
columns that are not determined.
o Reindexing of data can result in incomplete data.


18. List the Characteristics of missing values


 Characteristics of missing values in the preceding dataframe:
 An entire row can contain NaN values.
 An entire column can contain NaN values.
 Some values in both a row and a column can be NaN.
19. How to drop or remove the missing values?
 dropna() method is used to remove the missing values from rows.
 The dropna() method just returns a copy of the dataframe by dropping the
rows with NaN.
dfx.store4.dropna()
Dropping by columns
dfx.dropna(how='all', axis=1)

20. What is the purpose of thresh in drop function?


 thresh, to specify a minimum number of NaNs that must exist before the
column should be dropped
dfx.dropna(thresh=5, axis=1)

21. How to fill the missing values? List the types of filling.
 Filling missing values
 Use the fillna() method to replace NaN values with any particular
values.

Code:
filledDf = dfx.fillna(0)
filledDf
Types of filling:
Forwardfill
 ffill() – replace the null values with the value from the previous row
or previous column based on axis parameter.
Backwardfill
 bfill() – backward fill the missing values in the data set.
dfx.store4.fillna(method='bfill')
22. Define Discretization and Binning


 It is used to convert continuous data into discrete or interval forms. Each


interval is referred to as a bin.

23. Define Outlier.


 Outliers are data points that diverge from other observations for several
reasons.
 The common task is to detect and filter these outliers.
 The main reason for this detection and filtering of outliers is that the
presence of such outliers can cause serious issues in statistical analysis.

24. List the benefits of data transformation (NOV/DEC 2024)


 Data transformation promotes interoperability between several applications.
 Comprehensibility for both humans and computers is improved.
 Data transformation ensures a higher degree of data quality .
 Data transformation ensures higher performance and scalability for modern
analytical databases and data frames.
25. What are the Challenges of Data Transformation
 It requires a qualified team of experts and state-of-the-art infrastructure.
The cost of attaining these is high.
 Data transformation requires data cleaning before data transformation and
data migration. It is time-consuming.
 The activities of data transformations involve batch processing. It can be
very slow.

26. Define Pivot tables and cross-tabulations.

Pivot tables
 The pandas.pivot_table() function creates a spreadsheet-style pivot table
as a dataframe.
Cross-tabulations
 Used to compute a simple cross-tabulation of two (or more) factors.

27. State the purpose of data aggregation. (NOV/DEC 2023)


 Aggregation is the process of implementing any mathematical operation


on a dataset or a subset of it.
 Aggregation is one of the many techniques in pandas that's
used to manipulate the data in the data frame for data analysis.
 The Dataframe.aggregate() function is used to apply aggregation across
one or more columns. Some of the most frequently used aggregations are
as follows:
 sum: Returns the sum of the values for the requested axis
 min: Returns the minimum of the values for the requested axis
 max: Returns the maximum of the values for the requested
axis
28. State the purpose of pivot tables in data aggregation. NOV/DEC 2024

Pivot tables are used in data aggregation to summarize, analyze, and


organize large datasets, making it easier to extract valuable insights. They allow
users to:

 Summarize Data
 Group Data
 Filter and Slice Data
 Dynamic Analysis
 Visualize Trends

29. What role do visual aids play in EDA? (NOV/DEC 2024)

 In Exploratory Data Analysis (EDA), visual aids play a crucial role in


helping analysts quickly understand, interpret, and communicate the
underlying patterns, trends, and relationships within the data.

 They are one of the primary tools used in EDA for summarizing the data
and guiding further analysis

30. Mention the process of merging database in data analysis.


(NOV/DEC/2024)

Merging Database


 In the example dataset, the first column contains information about
student identifiers and the second column contains their respective
scores in a subject.
 The structure of the dataframes is the same in both cases.
 Two dataframes from each subject:
o Two for the Software Engineering course
o Two for the Introduction to Machine Learning course

31. Define grouping datasets in data analysis.(Nov/Dec 2024)

 During the data analysis phase, categorizing a dataset into multiple


categories or groups is often essential.

 The pandas groupby function is one of the most efficient and time-saving
features for doing this.

 Groupby provides functionalities that allow us to split-apply-combine


throughout the dataframe.

32. What is meant by cross-tabulations in data analysis? (NOV/DEC 2024)

 Cross-tabulations
 Used to compute a simple cross-tabulation of two (or more) factors.
Syntax:
pandas.crosstab(index, columns, values=None, rownames=None,
colnames=None, aggfunc=None, margins=False, margins_name=’All’,
dropna=True, normalize=False)

33. Mention the key responsibilities of a data analyst.


The key responsibilities of a Data Analyst typically include:


 Data Collection and Acquisition

 Data Cleaning and Preparation

 Data Analysis and Interpretation

 Data Visualization

 Reporting and Communication

 Database Management

 Collaboration

 Problem-Solving

 Monitoring KPIs

 Ensuring Data Quality and Integrity

34. Name some of the best tools used for data analysis and data
visualization.

Data Analysis Tools:


1. Microsoft Excel
2. SQL
3. Python
4. R
5. Jupyter Notebook

35. What is the role of histograms in understanding data distributions?


(Apr/May 2025)

1. Visualizing Data Distribution

2. Identifying the Shape of Data

3. Summarizing Large Datasets

4. Supporting Statistical Analysis

36. How can EDA assist in feature selection for machine learning models?
(Apr/May 2025)


1. Detecting Missing Values

2. Identifying Low-Variance Features

3. Understanding Data Distributions

4. Finding Outliers

5. Detecting Multicollinearity

6. Analyzing Feature vs Target Relationship

7. Encoding and Transformation Insights

8. Feature Engineering Opportunities
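A minimal pandas sketch of a few of these checks (the dataframe and its
column names below are illustrative assumptions, not from the source text):

import pandas as pd
import numpy as np

# A tiny illustrative dataframe; 'target' is the assumed prediction target
df = pd.DataFrame({
    'f1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'f2': [7, 7, 7, 7, 7],                  # low-variance feature
    'target': [1.1, 2.0, 2.9, 4.2, 5.1],
})

print(df.isnull().mean())        # 1. share of missing values per feature
numeric = df.select_dtypes('number')
print(numeric.var())             # 2. near-zero variance flags weak features
print(numeric.corr()['target'])  # 6. feature vs target relationship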


PART B

1. Explain in detail about Exploratory Data Analysis (EDA) fundamentals.


 Exploratory Data Analysis (EDA)
o EDA is a process of examining the available dataset to discover patterns,
spot anomalies, test hypotheses, and check assumptions using
statistical measures and other data visualization methods.

 Exploratory Data Analysis (EDA) fundamentals


 Understanding data science
 The significance of EDA
 Making sense of data
 Comparing EDA with classical and Bayesian analysis
 Software tools available for EDA

 Understanding data science


Data science
 Data science is the study of data to extract meaningful insights for
business.
 It is a multidisciplinary approach that combines principles and practices
from the fields of mathematics, statistics, artificial intelligence, and
computer engineering to analyze large amounts of data.
 This analysis helps data scientists to ask and answer questions like what
happened, why it happened, what will happen, and what can be done
with the results.
 A data scientist is a professional who creates programming code and
combines it with statistical knowledge to create insights from data.

Phases of Data Analysis / Data Science Lifecycle/


Steps in Data Science Process
1. Data requirements:
 There can be various sources of data for an organization.
 Data can be pre-existing, newly acquired, or a data repository
downloadable from the internet.


 Data scientists can extract data from internal or external databases,


company CRM software, web server logs, and social media or
purchase it from trusted third-party sources.
2. Data collection:
 Data collected from several sources must be stored in the correct
format and transferred to the right information technology personnel
within a company.
3. Data processing:
 Preprocessing involves the process of pre-curating the dataset before
actual analysis. Common tasks involve correctly exporting the
dataset, placing them under the right tables, structuring them, and
exporting them in the correct format.
4. Data cleaning:
 Preprocessed data must be correctly transformed for an
incompleteness check, duplicates check, error check, and missing
value check.
 The data cleaning stage involves responsibilities such as matching
the correct record, finding inaccuracies in the dataset, understanding
the overall data quality, removing duplicate items, and filling in the
missing values.
5. EDA:
 Key components of exploratory data analysis include summarizing
data, statistical analysis, and visualization of data.
 EDA is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions
using statistical measures and other data visualization methods.
6. Modeling and algorithm:
 Generalized models or mathematical formulas can represent or
exhibit relationships among different variables, such as correlation or
causation.
 These models or equations involve one or more variables that depend
on other variables to cause an event.
 A model always describes the relationship between independent and
dependent variables.
 Inferential statistics deals with quantifying relationships between
particular variables.


 The Judd model for describing the relationship between data, model,
and error still holds true: Data = Model + Error.
7. Data Product:
 Any computer software that uses data as inputs, produces outputs,
and provides feedback based on the output to control the environment
is referred to as a data product.
 A data product is generally based on a model developed during data
analysis, for example, a recommendation model that inputs user
purchase history and recommends a related item that the user is
highly likely to buy.
8. Communication:
 Data visualization deals with information relay techniques such as
tables, charts, summary diagrams, and bar charts to show the
analyzed result.

 The significance of EDA - Steps in EDA


Problem definition:
 The main tasks involved in problem definition are
o defining the main objective of the analysis,
o defining the main deliverables,
o outlining the main roles and responsibilities,
o obtaining the current status of the data,
o defining the timetable,
o performing cost/benefit analysis.
 Based on such a problem definition, an execution plan can be created.
Data preparation:
 Define the sources of data,
 Define data schemas and tables,
 understand the main characteristics of the data,
 clean the dataset,
 delete non-relevant datasets,
 transform the data
 divide the data into required chunks for analysis.
Data analysis:
 This is one of the most crucial steps that deals with descriptive
statistics and analysis of the data.


 The main tasks involve


o summarizing the data,
o finding the hidden correlation and relationships among the data,
o developing predictive models,
o evaluating the models,
o calculating the accuracies.
Development and representation of the results:
 This step involves presenting the dataset to the target audience in the
form of graphs, summary tables, maps, and diagrams.
 Most of the graphical analysis techniques include scattering plots,
character plots, histograms, box plots, residual plots, mean plots, and
others.

 Making sense of data


 A dataset contains many observations about a particular object.
 Each observation can have a specific value for each of these variables.
 For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = [email protected]
Weight = 10
Gender = Female
 Types of dataset
 Numerical data
 Categorical data.
Numerical data
 The measurable data is often referred to as quantitative data in
statistics.
 Example: person's age, height, weight, blood pressure, heart rate,
temperature,
 The numerical dataset can be either discrete or continuous types.
 Discrete data
 This is data that is countable and its values can be listed out.


 A variable that represents a discrete dataset is referred to as a discrete


variable. The discrete variable takes a fixed number of distinct values.
 For example, the Country variable can have values such as Nepal,
India, Norway, and Japan.
 Continuous data
 A variable that can have an infinite number of numerical values
within a specific range is classified as continuous data.
 A variable describing continuous data is a continuous variable.
 For example, Temperature of a city

Categorical data
 This type of data represents the characteristics of an object; for example,
gender, marital status.
 This data is often referred to as qualitative datasets in statistics.
Different types of categorical variables:
 A binary categorical variable can take exactly two values and is also
referred to as a dichotomous variable.
 Polytomous variables are categorical variables that can take more
than two possible values.
 Nominal
 A qualitative nominal variable is a qualitative variable where no
ordering is possible or implied in the levels.
 Frequency is the rate at which a label occurs over a period of
time within the dataset.
 Proportion can be calculated by dividing the frequency by the
total number of events.
 Compute the percentage of each proportion.
 Visualize the nominal dataset, using either a pie chart or a bar
chart.
 Ordinal
o A qualitative ordinal variable is a qualitative variable with an order
implied in the levels.
o For instance, if the severity of road accidents has been measured on
a scale such as light, moderate and fatal accidents, this variable is a
qualitative ordinal variable because there is a clear order in the levels.
 Interval


 In interval scales, both the order and exact differences between the
values are significant.
 Interval scales are widely used in statistics, for example, in the
measure of central tendencies—mean, median, mode, and standard
deviations.
 Ratio
 Ratio scales contain order, exact values, and absolute zero, which
makes it possible to be used in descriptive and inferential statistics.

 Comparing EDA with classical and Bayesian analysis


 Classical data analysis:
For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by
analysis and result communication.
 Exploratory data analysis approach:
For the EDA approach, it follows the same approach as classical data
analysis except the model imposition and the data analysis steps are
swapped. The main focus is on the data, its structure, outliers, models, and
visualizations.
 Bayesian data analysis approach:
The Bayesian approach incorporates prior probability distribution
knowledge into the analysis steps .
 The following figure 1.1 shows three different approaches for data analysis
illustrating the difference in their execution steps:


Figure 1.1 - Three different approaches for data analysis

 Software tools available for EDA (Provide an explanation of the various
EDA tools that are used for data analysis. NOV/DEC 2023, NOV/DEC 2024)
There are several software tools that are available to facilitate EDA.
 Python: This is an open source programming language widely used in data
analysis, data mining, and data science

Python tools and packages:
 NumPy
 pandas
 SciPy
 Matplotlib
 seaborn

 R programming language: R is an open source programming language that


is widely utilized in statistical computation and graphical data analysis.
 Weka: This is an open source data mining package that involves several EDA
tools and algorithms.
 KNIME: This is an open source tool for data analysis and is based on Eclipse.

2. Explain in detail about Visual Aids for EDA.


 Line chart
 Bar chart
 Scatter plot
 Area plot and stacked plot
 Pie chart
 Table chart
 Polar chart
 Histogram
 Lollipop chart
 Choosing the best chart
 Other libraries to explore


 Line chart
 A line chart is used to illustrate the relationship between two or more
continuous variables.
Process of creating the line chart:
1. Load and prepare the dataset.
2. Import the matplotlib library.
import matplotlib.pyplot as plt
3. Plot the graph:
plt.plot(df)
4. Display it on the screen:
plt.show()

Code:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (14, 10)
plt.plot(df)

Output: Refer Figure 1.2

Figure 1.2 – Example Line Chart
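The code above assumes that a dataframe df has already been loaded in step 1;
a minimal sketch of such a dataframe (the column names and values here are
illustrative assumptions) is:

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative time-series data; each column becomes one line on the chart
df = pd.DataFrame(
    {'open': [112, 115, 118, 117, 121], 'close': [114, 117, 116, 120, 123]},
    index=pd.date_range('2021-01-01', periods=5, freq='D'))
plt.plot(df)
plt.show()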


 Bar charts
 Bars can be drawn horizontally or vertically to represent categorical
Variables.
 Bar charts are frequently used to distinguish objects between distinct
collections in order to track variations over time.

Process of creating the bar chart:


1. Import the required libraries:
import random
import numpy as np
import calendar
import matplotlib.pyplot as plt
2. Set up the data.
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200))
for x in range(1,13)]
3. Specify the layout of the figure and allocate space:
figure, axis = plt.subplots()
4. In the x axis, display the names of the months:
plt.xticks(months, calendar.month_name[1:13], rotation=20)
5. Plot the graph:
plot = axis.bar(months, sold_quantity)
6. This step is optional; it displays the data value on the head of each bar:
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width()/2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')
7. Display the graph on the screen:
plt.show()

Output: Refer Figure 1.3


Figure 1.3– Example Bar Chart (Vertical)

Important observations from the preceding visualizations:


 months and sold_quantity are Python lists representing the amount of
Zoloft sold every month.
 The plt.xticks() function allows us to change the x axis tickers from 1 to 12,
whereas calendar.month_name[1:13] converts these numbers into the
corresponding month names from the calendar Python library.
 axis.text() within the for loop annotates each bar with its corresponding
value.
 The values are plotted by taking the x and y coordinates and adding
bar_width/2 to the x coordinate, with 1.002 * height as the y
coordinate.
 Then, using the va and ha arguments, the text is aligned centrally over the
bar. The bars can be drawn either horizontally or vertically.


Horizontal format – Code


months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
figure, axis = plt.subplots()
plt.yticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.barh(months, sold_quantity)
for rectangle in plot:
    width = rectangle.get_width()
    axis.text(width + 2.5, rectangle.get_y() + 0.38, '%d' % int(width),
              ha='center', va='bottom')
plt.show()

Output: Refer Figure 1.4

Figure 1.4– Example Bar Chart (Horizontal)

 Scatter plot
 Scatter plots are also called scatter graphs, scatter charts, scattergrams,
and scatter diagrams.
 They use a Cartesian coordinates system to display values of typically two
variables for a set of data.
 Scatter plots can be constructed in the following two situations:


 When one continuous variable is dependent on another variable


 When both continuous variables are independent
 Hence, scatter plots are sometimes referred to as correlation plots.
Process of creating the Scatterplot:
Import the required libraries and then plot the actual graph.
Next, display the x-label and the y-label.
Code using seaborn to load the dataset:
1. Import seaborn and set some default parameters of matplotlib:
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.dpi'] = 150
2. Use style from seaborn.
sns.set()
3. Load the Iris dataset:
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, "versicolor":
1,"virginica": 2})
4. Create a regular scatter plot:
plt.scatter(x=df["sepal_length"], y=df["sepal_width"], c=df.species)
5. Create the labels for the axes:
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
6. Display the plot on the screen:
plt.show()
Output: Refer Figure 1.5


Figure 1.5– Example Scatterplot

 Bubble chart
 A bubble plot is a manifestation of the scatter plot where each data point
on the graph is shown as a bubble.
 Each bubble can be illustrated with a different color, size, and appearance.
Steps
# Load the Iris dataset
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, "versicolor": 1,
"virginica": 2})
# Create bubble plot
plt.scatter(df.petal_length, df.petal_width,
s=50*df.petal_length*df.petal_width,
c=df.species, alpha=0.3
)
# Create labels for the axes
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
Output – Refer Figure 1.6

Figure 1.6 – Example Bubble Chart


 Area plot and stacked plot


 The stacked plot is used to visualize the cumulative effect of multiple
variables being plotted on the y axis.
Steps: Define the dataset:
# House loan Mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000,
8000, 9000, 9000, 9000,
9000, 8000, 9000, 9000]
# Utilities Bills for a year
utilitiesBills = [4218, 4218, 4218, 4218,
4218, 4218, 4219, 2218,
3218, 4233, 3000, 3000]
# Transportation bill for a year
transportation = [782, 900, 732, 892,
334, 222, 300, 800,
900, 582, 596, 222]
# Car mortgage cost for one year
carMortgage = [700, 701, 702, 703,
704, 705, 706, 707,
708, 709, 710, 711]
Import the required libraries and plot stacked charts:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
months= [x for x in range(1,13)]

# Create placeholders for plot and add required color


plt.plot([],[], color='sandybrown', label='houseLoanMortgage')
plt.plot([],[], color='tan', label='utilitiesBills')
plt.plot([],[], color='bisque', label='transportation')
plt.plot([],[], color='darkcyan', label='carMortgage')

# Add stacks to the plot


plt.stackplot(months, houseLoanMortgage, utilitiesBills,
transportation, carMortgage, colors=['sandybrown', 'tan',
'bisque', 'darkcyan'])


plt.legend()

# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')

# Display on the screen


plt.show()
Output – Refer Figure 1.7

Figure 1.7 – Area Plot Example

 Pie chart
The purpose of the pie chart is to communicate proportions.
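The two code snippets below assume a dataframe named pokemon that is indexed
by category and has an amount column; a minimal sketch of such a dataframe
(the values are illustrative assumptions) is:

import pandas as pd

# Illustrative proportions data for the pie chart examples
pokemon = pd.DataFrame({'amount': [14, 30, 41, 15]},
                       index=['fire', 'water', 'grass', 'electric'])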
Code 1:
import matplotlib.pyplot as plt
plt.pie(pokemon['amount'], labels=pokemon.index, shadow=False,
startangle=90, autopct='%1.1f%%',)
plt.axis('equal')
plt.show()
Output – Refer Figure 1.8


Figure 1.8 – Pie Chart Example

Code 2
pokemon.plot.pie(y="amount", figsize=(20, 10))

Output – Refer Figure 1.9

Figure 1.9– Pie Chart Example


 Table chart
 A table chart combines a bar chart and a table.
Example
Consider standard LED bulbs that come in different wattages. The standard
Philips LED bulb can be 4.5 Watts, 6 Watts, 7 Watts, 8.5 Watts, 9.5 Watts,
13.5 Watts, and 15 Watts.
Let's assume there are two categorical variables, the year and the wattage,
and a numeric variable, which is the number of units sold in a particular
year.
Let's declare variables to hold the years and the available wattage data.
# Years under consideration
years = ["2010", "2011", "2012", "2013", "2014"]
# Available watt
columns = ['4.5W', '6.0W', '7.0W','8.5W','9.5W','13.5W','15W']
unitsSold = [
[65, 141, 88, 111, 104, 71, 99],
[85, 142, 89, 112, 103, 73, 98],
[75, 143, 90, 113, 89, 75, 93],
[65, 144, 91, 114, 90, 77, 92],
[55, 145, 92, 115, 88, 79, 93],
]
# Define the range and scale for the y axis
values = np.arange(0, 600, 100)

colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))


index = np.arange(len(columns)) + 0.3
bar_width = 0.7
y_offset = np.zeros(len(columns))
fig, ax = plt.subplots()
cell_text = []
n_rows = len(unitsSold)

# Build the stacked bars and collect the table's row values
# (this loop was missing in the source text and is reconstructed here)
for row in range(n_rows):
    plt.bar(index, unitsSold[row], bar_width, bottom=y_offset, color=colors[row])
    y_offset = y_offset + unitsSold[row]
    cell_text.append(['%d' % x for x in y_offset])

Add the table to the bottom of the chart:


the_table = plt.table(cellText=cell_text, rowLabels=years,
rowColours=colors, colLabels=columns, loc='bottom')
plt.ylabel("Units Sold")
plt.xticks([])
plt.title('Number of LED Bulb Sold/Year')
plt.show()

Output: Refer Figure 1.10

Figure 1.10 – Table Chart Example


 Polar chart
A polar chart is a diagram that is plotted on a polar axis, its coordinates
are angle and radius, it is also referred to as a spider web plot.
Create the dataset:
1. Let's assume five courses in academic year:
subjects = ["C programming", "Numerical methods", "Operating
system", "DBMS", "Computer Networks"]
2. And planned to obtain the following grades in each subject (the first
grade is repeated at the end so that the plotted polygon closes):
plannedGrade = [90, 95, 92, 68, 68, 90]
3. However, these are the grades actually obtained:
actualGrade = [75, 89, 89, 80, 80, 75]
Steps
1. Import the required libraries:
import numpy as np
import matplotlib.pyplot as plt
2. Prepare the dataset and set up theta:
theta = np.linspace(0, 2 * np.pi, len(plannedGrade))
3. Initialize the plot with the figure size and polar projection:
plt.figure(figsize = (10,6))
plt.subplot(polar=True)
4. Get the grid lines to align with each of the subject names:
(lines,labels) = plt.thetagrids(range(0,360,
int(360/len(subjects))), (subjects))
5. Use the plt.plot method to plot the graph and fill the area under it:
plt.plot(theta, plannedGrade)
plt.fill(theta, plannedGrade, 'b', alpha=0.2)
6. Now, plot the actual grades obtained:
plt.plot(theta, actualGrade)
7. Add a legend and a nice comprehensible title to the plot:
plt.legend(labels=('Planned Grades','Actual Grades'),loc=1)
plt.title("Plan vs Actual grades by Subject")
8. Finally, show the plot on the screen:
plt.show()


Output: Refer Figure 1.11

Figure 1.11 – Polar Chart Example

 Histogram
 Histogram plots are used to depict the distribution of any continuous
variable.
 Import the required libraries and create the dataset:
import numpy as np
import matplotlib.pyplot as plt
# Create data set
yearsOfExperience = np.array([10, 16, 14, 5, 10, 11, 16, 14, 3, 14,
13, 19, 2, 5, 7, 3, 20,
11, 11, 14, 2, 20, 15, 11, 1, 15, 15, 15, 2, 9, 18, 1, 17, 18,
13, 9, 20, 13, 17, 13, 15, 17, 10, 2, 11, 8, 5, 19, 2, 4, 9,
17, 16, 13, 18, 5, 7, 18, 15, 20, 2, 7, 0, 4, 14, 1, 14, 18,
8, 11, 12, 2, 9, 7, 11, 2, 6, 15, 2, 14, 13, 4, 6, 15, 3,
6, 10, 2, 11, 0, 18, 0, 13, 16, 18, 5, 14, 7, 14, 18])
yearsOfExperience

To plot the histogram chart, execute the following steps:


1. Plot the distribution of group experience:


nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins)
2. Add labels to the axes and a title:
plt.xlabel("Years of experience with Python Programming")
plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience in the
vocational training session")
3. Draw a green vertical line in the graph at the average experience:
plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color = 'g')
4. Display the plot:
plt.show()

Output: Refer Figure 1.12

Figure 1.12 – Histogram Example

 Lollipop chart
 A lollipop chart can be used to display ranking in the data.
Steps
1. Load the dataset:
#Read the dataset
carDF = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-withpython/master/Chapter%202/cardata.csv')

2. Group the dataset by manufacturer and take the mean of cty, then sort the
values by cty and reset the index (a minimal sketch; the source text does not
show this code):
processedDF = carDF[['cty', 'manufacturer']].groupby('manufacturer').mean()
processedDF.sort_values('cty', inplace=True)
processedDF.reset_index(inplace=True)

3. Draw the stems and markers of the lollipop chart (also a sketch):
fig, ax = plt.subplots(figsize=(16, 10))
ax.vlines(x=processedDF.index, ymin=0, ymax=processedDF.cty, linewidth=2)
ax.scatter(x=processedDF.index, y=processedDF.cty, s=75)

4. Write the actual mean values in the plot, and display the plot:
# Write the values in the plot
for row in processedDF.itertuples():
ax.text(row.Index, row.cty+.5, s=round(row.cty, 2),
horizontalalignment= 'center', verticalalignment='bottom',
fontsize=14)
# Display the plot on the screen
plt.show()


Output: Refer Figure 1.13

Figure 1.13 – Lollipop Chart Example

Choosing the best chart


o If continuous variables, then a histogram would be a good choice.
o To show ranking, an ordered bar chart would be a good choice.
o Simplicity is best.
o Choose a diagram that does not overload the audience with information.

Table 1.1 shows the different types of charts based on the purposes:


4. Explain in detail about data transformation techniques - merging


database, reshaping and pivoting. (APR/MAY 2024)
Data Transformation
 Data transformation is a set of techniques used to convert data from one
format or structure to another format or structure.
 The main reason for transforming the data is to get a better
representation such that the transformed data is compatible with other
data.


Merging Databases

 In this example, the first column of each dataframe contains information
about student identifiers and the second column contains their respective
scores in a subject.
 The structure of the dataframes is the same in both cases.
 Two dataframes from each subject:
o Two for the Software Engineering course
o Two for the Introduction to Machine Learning course
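The source text shows these four dataframes only as a figure; a minimal sketch
of how they might be created (the scores are illustrative assumptions, while
the dataframe names match the concatenation code below) is:

import pandas as pd

# Software Engineering course, two batches of students
df1SE = pd.DataFrame({'StudentID': [1, 2, 3], 'ScoreSE': [22, 66, 31]})
df2SE = pd.DataFrame({'StudentID': [4, 5, 6], 'ScoreSE': [98, 93, 44]})

# Introduction to Machine Learning course, two batches of students
df1ML = pd.DataFrame({'StudentID': [1, 2, 3], 'ScoreML': [39, 49, 55]})
df2ML = pd.DataFrame({'StudentID': [4, 5, 6], 'ScoreML': [77, 52, 86]})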

b. Concatenate using the pd.concat() method from the pandas library


 Code:
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = pd.concat([dfML, dfSE], axis=1)
df
 Concatenated the dataframes from the Software Engineering course
and the Machine Learning course.
 Then, concatenated the dataframes with axis =1 to place them side by
side.
 The output of the preceding code is as follows:


c. Using df.merge with an inner join


df.merge()
 Merge DataFrame or named Series objects with a database-style join.
 A named Series object is treated as a DataFrame with a single named
column.

Syntax:
DataFrame.merge(right, how='inner', on=None, left_on=None,
right_on=None, left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
Parameters
 right - DataFrame or named Series
 how - {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
 Type of merge to be performed:
o left: use only keys from the left frame
o right: use only keys from the right frame
o outer: use the union of keys from both frames
o inner: use the intersection of keys from both frames
o cross: creates the cartesian product from both frames
 on - label or list; column or index level names to join on
 left_on - label or list, or array-like; column or index level names to join
on in the left DataFrame
 right_on - label or list, or array-like; column or index level names to join
on in the right DataFrame
 left_index - bool, default False
 right_index - bool, default False
 sort - bool, default False
 suffixes - list-like, default is ("_x", "_y")
 copy - bool, default True
 indicator - bool or str, default False
 validate - str, optional

Example
1. Default Merging - inner join
import pandas as pd
d1 = {'Name': ['Pankaj', 'Meghna', 'Lisa'], 'Country': ['India', 'India', 'USA'],
'Role': ['CEO', 'CTO', 'CTO']}
df1 = pd.DataFrame(d1)
print('DataFrame 1:\n', df1)

df2 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Pankaj', 'Anupam', 'Amit']})


print('DataFrame 2:\n', df2)

df_merged = df1.merge(df2)
print('Result:\n', df_merged)

Output:
DataFrame 1:
Name Country Role
0 Pankaj India CEO
1 Meghna India CTO
2 Lisa USA CTO
DataFrame 2:
ID Name


0 1 Pankaj
1 2 Anupam
2 3 Amit
Result:
Name Country Role ID
0 Pankaj India CEO 1

2. Merging DataFrames with Left, Right, and Outer Join


print('Result Left Join:\n', df1.merge(df2, how='left'))
print('Result Right Join:\n', df1.merge(df2, how='right'))
print('Result Outer Join:\n', df1.merge(df2, how='outer'))
Output:
Result Left Join:
Name Country Role ID
0 Pankaj India CEO 1.0
1 Meghna India CTO NaN
2 Lisa USA CTO NaN
Result Right Join:
Name Country Role ID
0 Pankaj India CEO 1
1 Anupam NaN NaN 2
2 Amit NaN NaN 3
Result Outer Join:
Name Country Role ID
0 Pankaj India CEO 1.0
1 Meghna India CTO NaN
2 Lisa USA CTO NaN
3 Anupam NaN NaN 2.0
4 Amit NaN NaN 3.0

3. Merging Data Frame on Specific Columns


import pandas as pd
d1 = {'Name': ['Pankaj', 'Meghna', 'Lisa'], 'ID': [1, 2, 3], 'Country': ['India',
'India', 'USA'], 'Role': ['CEO', 'CTO', 'CTO']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Pankaj', 'Anupam', 'Amit']})


print(df1.merge(df2, on='ID'))
print(df1.merge(df2, on='Name'))
Output:
Name_x ID Country Role Name_y
0 Pankaj 1 India CEO Pankaj
1 Meghna 2 India CTO Anupam
2 Lisa 3 USA CTO Amit

Name ID_x Country Role ID_y


0 Pankaj 1 India CEO 1

4. Specify Left and Right Columns for Merging DataFrame Objects


import pandas as pd
d1 = {'Name': ['Pankaj', 'Meghna', 'Lisa'], 'ID1': [1, 2, 3], 'Country': ['India',
'India', 'USA'], 'Role': ['CEO', 'CTO', 'CTO']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame({'ID2': [1, 2, 3], 'Name': ['Pankaj', 'Anupam', 'Amit']})
print(df1.merge(df2))
print(df1.merge(df2, left_on='ID1', right_on='ID2'))

Output;
Name ID1 Country Role ID2
0 Pankaj 1 India CEO 1

Name_x ID1 Country Role ID2 Name_y


0 Pankaj 1 India CEO 1 Pankaj
1 Meghna 2 India CTO 2 Anupam
2 Lisa 3 USA CTO 3 Amit

Merging on index
 Sometimes the keys for merging dataframes are located in the dataframes
index. In such a situation, can pass left_index=True or right_index=True to
indicate that the index should be accepted as the merge key.
 Merging on index is done in the following steps:
1. Consider the following two dataframes:


left1 = pd.DataFrame({'key': ['apple','ball','apple', 'apple','ball', 'cat'],


'value': range(6)})
right1 = pd.DataFrame({'group_val': [33.4, 5]}, index=['apple','ball'])
Output:

The keys in the first dataframe are apple, ball, and cat. In the second
dataframe, have group values for the keys apple and ball.

Merging using an inner join


Code:
df = pd.merge(left1, right1, left_on='key', right_index=True)
df
Output:

The output is the intersection of the keys from these dataframes.


Since there is no cat key in the second dataframe, it is not included
in the final table.

Merging using an outer join:


Code:
df = pd.merge(left1, right1, left_on='key', right_index=True,
how='outer')
df


Output

The last row includes the cat key. This is because of the outer join.

RESHAPING AND PIVOTING


To rearrange data in a dataframe in some consistent manner can be done
with hierarchical indexing using two actions:
 Stacking: Stack rotates from any particular column in the data to
the rows.
 Unstacking: Unstack rotates from the rows into the column.

1. Create a dataframe that records the rainfall, humidity, and wind
conditions of five different cities in Norway:
Code:
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen',
'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1

Output:


2. Using the stack() method can pivot the columns into rows to produce a
series:
Code:
stacked = dframe1.stack()
stacked
Output:

3. Unstacked in the variable can be rearranged into a dataframe using the


unstack() method:
Code
stacked.unstack()
This should revert the series into the original dataframe.
Unstack the concatenated frame:
Code
series1 = pd.Series([000, 111, 222, 333], index=['zeros','ones','twos',
'threes'])
series2 = pd.Series([444, 555, 666], index=['fours', 'fives','sixes'])
frame2 = pd.concat([series1, series2], keys=['Number1', 'Number2'])


frame2.unstack()
Output:

 Since in series1, there are no fours, fives, and sixes, their values are stored
as NaN during the unstacking process. There are no ones, twos, and zeros
in series2, so the corresponding values are stored as NaN.

5. Explain in detail about data transformation techniques.


Data Transformation
 Data transformation is a set of techniques used to convert data from one format or
structure to another format or structure.
 The main reason for transforming the data is to get a better representation such
that the transformed data is compatible with other data.
 Examples of transformation activities:
o Data deduplication involves the identification of duplicates and their removal.
o Key restructuring involves transforming any keys with built-in meanings to
the generic keys.
o Data cleansing involves deleting out-of-date, inaccurate, and incomplete
information from the source data, without altering its meaning, in order to
enhance the accuracy of the source data.
o Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.
o Format revisioning involves converting from one format to another.
o Data derivation consists of creating a set of rules to generate more information
from the data source.
o Data aggregation involves searching, extracting, summarizing, and preserving
important information in different types of reporting systems.
o Data integration involves converting different data types and merging them
into a common structure or schema.
o Data filtering involves identifying information relevant to any particular user.
o Data joining involves establishing a relationship between two or more tables.

 Performing data deduplication


 If a dataframe contains duplicate rows, removing them is essential to
enhance the quality of the dataset.

 Steps:
1. Let's consider a simple dataframe:
Code:
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3
Output: - creates a simple dataframe with two columns

2. The pandas dataframe has duplicated() method that returns a Boolean


series stating which of the rows are duplicates:
frame3.duplicated()
Output:

The rows with True are the ones that contain duplicated data.
3. Can drop these duplicates using the drop_duplicates() method:
Code:
frame4 = frame3.drop_duplicates()
frame4


Output:

The rows 1, 4, and 6 are removed. Basically, both the duplicated() and
drop_duplicates() methods consider all of the columns for comparison.

 Replacing values
 It is essential to find and replace some values inside a dataframe.
 Steps:
Use the replace method:
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786.,
3000., 234., 444., -786., 332., 3332. ], 'column 2': range(9)})
replaceFrame.replace(to_replace =-786, value= np.nan)
The output of the preceding code is as follows:

Replaces one value with the other values.


 Can also replace multiple values at once.
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786.,
3000., 234., 444., -786., 332., 3332. ], 'column 2': range(9)})
replaceFrame.replace(to_replace =[-786, 0], value= [np.nan, 2])


There are two replacements. All -786 values will be replaced by NaN
and all 0 values will be replaced by 2.
 Handling missing data
 Whenever there are missing values, a NaN value is used, which indicates
that there is no value specified for that particular index.
 Reason for Incomplete data:
o Data is retrieved from an external source.
o Joining two different datasets and some values are not matched.
o Missing values due to data collection errors.
o When the shape of data changes, there are new additional rows or
columns that are not determined.
o Reindexing of data can result in incomplete data.
 Characteristics of missing values in the preceding dataframe:
 An entire row can contain NaN values.
 An entire column can contain NaN values.
 Some values in both a row and a column can be NaN.
 Example Code
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes',
'mango'], columns=['store1', 'store2', 'store3'])
dfx

Output

Code - Add some missing values to dataframe:


dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx['store4']['apple'] = 20.


dfx

Output

 isnull() function from the pandas library - to identify NaN values


dfx.isnull() - the function will indicate True for the values which are null.

Output:

dfx.notnull() - the function will indicate True for the values which
are not null.
Output:


The sum() method is used to count the number of NaN values in each
store.
dfx.isnull().sum()
Output:
store1 1
store2 1
store3 1
store4 5
store5 7
dtype: int64

To find the total number of missing values


dfx.isnull().sum().sum()
Output:
15 - This indicates that there are 15 missing values in total.

 Dropping missing values


 dropna() method is used to remove the missing values from rows.
 The dropna() method just returns a copy of the dataframe by dropping
the rows with NaN.
dfx.store4.dropna()
Output:
apple 20.0
watermelon 18.0
Name: store4, dtype: float64

dfx.dropna(how='all')
 drops only rows whose entire values are NaN


Output:

 The orange is removed because those entire row contains NaN


values.

 Dropping by columns
dfx.dropna(how='all', axis=1)
Output - store5 is dropped from the dataframe.

 thresh, to specify a minimum number of NaNs that must exist


before the column should be dropped
dfx.dropna(thresh=5, axis=1)
Output

 Filling missing values


 Use the fillna() method to replace NaN values with any particular
values.


Code:
filledDf = dfx.fillna(0)
filledDf

Output

Types of filling:
Forwardfill
 ffill() – replace the null values with the value from the previous row
or previous column based on axis parameter.
dfx.store4.fillna(method='ffill')
Output:
apple 20.0
banana 20.0
kiwi 20.0
grapes 20.0
mango 20.0
watermelon 18.0
oranges 18.0
Name: store4, dtype: float64
 In forward-filling technique, the last known value is 20 and hence
the rest of the NaN values are replaced by it.
Backwardfill
 bfill() – backward fill the missing values in the data set.
dfx.store4.fillna(method='bfill')
Output:
apple 20.0
banana 18.0
kiwi 18.0
grapes 18.0
mango 18.0


watermelon 18.0
oranges NaN
Name: store4, dtype: float64
 Discretization and binning
 It is used to convert continuous data into discrete or interval forms. Each
interval is referred to as a bin.
Example
Let the heights of a group of students as follows:
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145,
141,132]
To convert that dataset into intervals of 118 to 125, 126 to 135, 136 to
160, and finally 160 and higher, use the cut() method in pandas:
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)
category
Output:
[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ...,
(125, 135], (160, 200], (135, 160], (135, 160], (125, 135]] Length:
12 Categories (4, interval[int64]): [(118, 125] < (125, 135] <
(135, 160] < (160, 200]]
 A parenthesis indicates that the side is open.
 A square bracket means that it is closed or inclusive.
 Example - (118, 125] means the left-hand side is open and the
right-hand side is closed.
 This is mathematically denoted as follows:
(118, 125] = {x | 118 < x <= 125}
 Hence, 118 is not included, but anything greater than 118 is


included, while 125 is included in the interval.
 Can set a right=False argument to change the form of interval, the
results are in the form of right-closed, left-open.
 Example : [118, 126)
Can check the number of values in each bin by using the
pd.value_counts() method:
pd.value_counts(category)
Output:
(118, 125] 5


(135, 160] 3
(125, 135] 3
(160, 200] 1
dtype: int64
 Outlier detection and filtering
 Outliers are data points that diverge from other observations for several
reasons.
 Common task is to detect and filter these outliers.
 The main reason for this detection and filtering of outliers is that the
presence of such outliers can cause serious issues in statistical analysis.
Example

# TotalTransaction is assumed to be a numeric Series computed from the dataset
# (for example, quantity multiplied by unit price)
df[np.abs(TotalTransaction) > 3000000]

 Displays all the columns and rows from the preceding table for which the
total price is greater than 3,000,000; any such value is treated as an
outlier.
Output:
2 3711433
7 3965328
13 4758900
15 5189372
17 3989325
...
9977 3475824
9984 5251134
9987 5670420
9991 5735513
9996 3018490


Name: TotalPrice, Length: 2094, dtype: int64
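Instead of a hand-picked cut-off, the threshold can also be derived from the
data itself, for example with the interquartile-range (IQR) rule; a minimal
sketch (the column names here are illustrative assumptions) is:

import pandas as pd

# Illustrative transaction data with one unusually large order
df = pd.DataFrame({'Quantity': [2, 3, 1, 500, 4],
                   'UnitPrice': [10.0, 12.5, 9.0, 11.0, 10.5]})
TotalTransaction = df.Quantity * df.UnitPrice

# Flag values above Q3 + 1.5 * IQR
q1, q3 = TotalTransaction.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
print(df[TotalTransaction > upper_fence])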

 Permutation and random sampling


 NumPy's numpy.random.permutation() function, randomly select or permute
a series of rows in a dataframe.
sampler = np.random.permutation(10)
sampler
Output:
array([1, 5, 3, 6, 2, 4, 9, 0, 7, 8])

 Random sampling without replacement


1. To perform random sampling without replacement, first create a
permutation array.
2. Next, slice off the first n elements of the array where n is the desired size
of the subset want to sample.
3. Then use the df.take() method to obtain actual samples:
df.take(np.random.permutation(len(df))[:3])

 Random sampling with replacement


1. Generate a random sample with replacement using the
numpy.random.randint() method and drawing random integers:
sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size = 10)
sampler
2. And can draw the required samples:
draw = sack.take(sampler)
draw

 Computing indicators/dummy variables


 To convert a categorical variable into some dummy matrix
Example: A dataframe with data on gender and votes,
df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male',
'female'], 'votes': range(6, 12, 1)})
df
Output:


 To encode these values in a matrix form with 1 and 0 values, using


get_dummies() function:
pd.get_dummies(df['gender'])
Output:

There are six values in the original dataframe with three unique values
of male, female, and unknown. Each unique value is transformed into a
column and each original value into a row.

 Benefits of data transformation


 Data transformation promotes interoperability between several applications.
 Comprehensibility for both humans and computers is improved.
 Data transformation ensures a higher degree of data quality and protects
applications from several computational challenges such as null values,
unexpected duplicates, and incorrect indexing.
 Data transformation ensures higher performance and scalability for modern
analytical databases and data frames.
 Challenges of Data Transformation
 It requires a qualified team of experts and state-of-the-art infrastructure.
The cost of attaining these is high.


 Data transformation requires data cleaning before data transformation and


data migration. It is time-consuming.
 The activities of data transformations involve batch processing. It can be
very slow.

6. Explain in detail about Grouping Datasets. Or fundamentals of grouping


techniques and how doing this can improve data analysis.
Groupby ()
 During the data analysis phase, categorizing a dataset into multiple categories
or groups is often essential.
 The pandas groupby function is one of the most efficient and time-saving
features for doing this.
 Groupby provides functionalities that allow us to split-apply-combine
throughout the data frame.
Groupby mechanics
 Grouping by features, hierarchically
 Aggregating a dataset by groups
 Applying custom aggregation functions to groups
 Transforming a dataset groupwise

Example
 The groupby() function lets us group this dataset on the basis of the department
column:
df.groupby('department').groups.keys()

Output:
dict_keys(['AI&DS', 'CSBS', 'CSE', 'IT'])

 To get values from a group


# Group the dataset by the column department
style = df.groupby('department')
# Get the items from the group with the value AI&DS
style.get_group("AI&DS")

Selecting a subset of columns


 To form groups based on multiple categories, specify the column names in


the groupby() function.


 Example
double_grouping = df.groupby(["department","gender"])

Groupby Functions

max(), min(), mean(), first(), and last()

Example
style = df.groupby('department')
style['subject1'].max()
style['subject1'].min()

# mean() will print the mean of the numerical columns in each group


style.mean()
style.get_group("AI&DS").mean()

# count() gets the number of records in each group


style['department'].count()
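A minimal, self-contained sketch of these groupby operations (the student records below are illustrative and only mirror the column names used above):

import pandas as pd

df = pd.DataFrame({
    'department': ['AI&DS', 'CSE', 'AI&DS', 'IT', 'CSE', 'CSBS'],
    'gender':     ['F', 'M', 'M', 'F', 'F', 'M'],
    'subject1':   [78, 85, 91, 66, 73, 88],
})

style = df.groupby('department')
print(style.groups.keys())              # the group labels
print(style.get_group('AI&DS'))         # rows belonging to one group
print(style['subject1'].max())          # per-group maximum
print(style['subject1'].mean())         # per-group mean
print(style['department'].count())      # number of records in each group

double_grouping = df.groupby(['department', 'gender'])
print(double_grouping['subject1'].count())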

Data Aggregation.
 Aggregation is the process of implementing any mathematical operation on a
dataset or a subset of it.
 The Dataframe.aggregate() function is used to apply aggregation across one or
more columns.
 Can apply aggregation in a DataFrame, df, as df.aggregate() or df.agg().
 Aggregation only works with numeric type columns.
Example
# new dataframe that consists of length, width, height, curb-weight and price
new_dataset = df.filter(["length", "width", "height", "curb-weight", "price"], axis=1)
new_dataset
Output


Code
# applying single aggregation for mean over the columns
new_dataset.agg("mean", axis="rows")

Output:
length 0.837102
width 0.915126
height 53.766667
curb-weight 2555.666667
price 13207.129353
dtype: float64
Code
# applying aggregation sum and minimum across all the
columns
new_dataset.agg(['sum', 'min'])

Output:


Group-wise transformations

 Performing a transformation on a group or a column returns an object that


is indexed by the same axis length as itself.
 It is an operation that's used in conjunction with groupby().
 The aggregation operation has to return a reduced version of the data,
whereas the transformation operation can return a transformed version of
the full data.
 Example (transform() must be given a function; here each price is centred on
its group mean, and the grouping column is illustrative):
df["price"] = df.groupby("body-style")["price"].transform(lambda x: x - x.mean())

7. Explain in detail about Pivot tables and cross-tabulations. Or


What is cross-tabulation and Pivot table? How to Build Pivot Table and
Cross Table Reports? (NOV/DEC 2023) (APR/ MAY 2024)
Pivot tables
 The pandas.pivot_table() function creates a spreadsheet-style pivot table as a
dataframe.

Syntax
pandas.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All',
observed=False, sort=True)

Parameters
 Data - DataFrame
 Values - list-like or scalar, optional
 Index - column, Grouper, array, or list of the previous
 Columns - column, Grouper, array, or list of the previous
 Aggfunc - function, list of functions, dict, default numpy.mean
 fill_value - scalar, default None
 margins - bool, default False
 dropna - bool, default True
 margins_name- str, default ‘All’
 Observed - bool, default False
 Sort - bool, default True


Example
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
... "bar", "bar", "bar", "bar"],
... "B": ["one", "one", "one", "two", "two",
... "one", "one", "two", "two"],
... "C": ["small", "large", "large", "small",
... "small", "large", "small", "small",
... "large"],
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
Output
A B C D E
0 foo one small 1 2
1 foo one large 2 4
2 foo one large 2 5
3 foo two small 3 5
4 foo two small 3 6
5 bar one large 4 6
6 bar one small 5 8
7 bar two small 6 9
8 bar two large 7 9
Example
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
>>> table

Output
C large small
A B
bar one 4.0 5.0
two 7.0 6.0
foo one 4.0 1.0
two NaN 6.0


Example
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
Output
C large small
A B
bar one 4 5
two 7 6
foo one 4 1
two 0 6

 Cross-tabulations
 Used to compute a simple cross-tabulation of two (or more) factors.
Syntax:
pandas.crosstab(index, columns, values=None, rownames=None,
colnames=None, aggfunc=None, margins=False, margins_name=’All’,
dropna=True, normalize=False)
Arguments:
 index: Series, or list of arrays/Series, Values to group by in the rows.
 columns: Series, or list of arrays/Series, Values to group by in the columns.
 values: array of values to aggregate according to the factors.
 rownames: sequence, default None, If passed, must match number of row
arrays passed.
 colnames: sequence, default None, If passed, must match number of column
arrays passed.
 aggfunc: function, optional,
 margins: bool, default False,
 margins_name: str, default ‘All’.
 dropna: bool, default True,

Example
# importing packages
import pandas
import numpy
# creating some data


a = numpy.array(["foo", "foo", "foo", "foo",


"bar", "bar", "bar", "bar",
"foo", "foo", "foo"],
dtype=object)
b = numpy.array(["one", "one", "one", "two",
"one", "one", "one", "two",
"two", "two", "one"],
dtype=object)
c = numpy.array(["dull", "dull", "shiny",
"dull", "dull", "shiny",
"shiny", "dull", "shiny",
"shiny", "shiny"],
dtype=object)
# form the cross tab
pandas.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])

Output
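For reference, the cross-tabulation above produces the following table of counts:

b    one        two
c    dull shiny dull shiny
a
bar     1     2    1     0
foo     2     2    1     2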

8. Discuss descriptive statistics in exploratory analysis.

Descriptive Statistics in Exploratory Data Analysis (EDA)


Descriptive statistics play a crucial role in Exploratory Data Analysis
(EDA). They help summarize and describe the main features of a dataset,
providing insight into its structure and patterns before performing more
complex analyses.
Descriptive Statistics
Descriptive statistics involve methods for organizing, displaying, and
describing data using:
 Measures of central tendency
 Measures of dispersion (spread)
 Measures of shape and distribution
 Tabular and graphical summaries
Key Components of Descriptive Statistics:


1. Measures of Central Tendency


These show where the data is centered.
 Mean: The average of the data.
 Median: The middle value (less affected by outliers).
 Mode: The most frequent value.
2. Measures of Dispersion (Spread)
These indicate the variability or spread in the data.
 Range: Difference between the maximum and minimum.
 Variance: Average squared deviation from the mean.
 Standard deviation: Square root of variance; indicates how much data
deviates from the mean.
 Interquartile Range (IQR): Difference between Q3 and Q1, shows
middle 50% of the data.
3. Measures of Shape
These help understand the distribution of data.
 Skewness: Shows whether data is symmetric or skewed (left/right).
 Kurtosis: Measures the "tailedness" of the distribution.
Graphical Tools in Descriptive Statistics:
Used for visualizing and understanding data patterns.
 Histogram: Shows frequency distribution.
 Box plot: Displays median, quartiles, and outliers.
 Bar chart: Good for categorical data.
 Scatter plot: Shows relationships between two variables.
 Pie chart: Illustrates proportion of categories.
Descriptive Statistics Important in EDA
1. Understanding Data Quality: Detects missing values, anomalies, or
errors.
2. Identifying Trends: Highlights dominant patterns in data.
3. Detecting Outliers: Finds extreme values that may distort analysis.
4. Comparing Groups: Enables comparison across different categories or
groups.
5. Data Reduction: Summarizes large datasets into meaningful figures.
Example:
If you’re analyzing the sales data of a company:
 Mean Sales: ₹25,000
 Median Sales: ₹24,500


 Standard Deviation: ₹5,000


 Skewness: Positive (means a few very high sales)
This would suggest that most sales are clustered around ₹24,500–₹25,000,
but a few high values increase the mean.
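A hedged sketch of how these summary measures can be computed with pandas (the sales figures below are illustrative, chosen only to mimic the example):

import pandas as pd

sales = pd.Series([22000, 24500, 25000, 23800, 26000, 24900, 41000])

print(sales.describe())                                  # count, mean, std, min, quartiles, max
print("Median  :", sales.median())
print("IQR     :", sales.quantile(0.75) - sales.quantile(0.25))
print("Skewness:", sales.skew())                         # positive => a few very high sales
print("Kurtosis:", sales.kurt())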

9. Explain in detail about Comparative Statistics in exploratory analysis.

Comparative Statistics in Exploratory Data Analysis (EDA)


Comparative statistics is a critical aspect of exploratory data analysis
(EDA) that focuses on comparing different groups or categories within a
dataset to uncover patterns, trends, and relationships. The goal is to
understand how variables differ across groups (e.g., by gender, age, region,
treatment type) and identify significant differences or similarities.
Purpose of Comparative Statistics:
1. Understand group differences
o E.g., Do male and female customers have different average purchase
amounts?
2. Identify trends or patterns
o E.g., Is there a seasonal trend in sales across regions?
3. Generate hypotheses
o Helps suggest relationships or effects worth testing in formal statistical
analysis.
4. Support decision making
o E.g., Which customer segment is more profitable?
Common Techniques in Comparative Statistics:
1. Descriptive Statistics by Group
 Compare mean, median, standard deviation across different categories.
 Example:
Mean income:
- Group A: $45,000
- Group B: $60,000
2. Group-wise Distribution Analysis
 Histograms, boxplots, or density plots help visualize differences in
distribution.
 Example: Compare income distribution by education level.


3. Cross-tabulations / Contingency Tables


 Used for categorical data.
 Shows frequency distribution of variables.
 Example: Number of users by gender and subscription plan.
4. Bar Charts / Side-by-side Bar Charts
 Great for comparing categorical variables across groups.
 Example: Count of customers per product type across regions.
5. Boxplots
 Visualizes median, interquartile range, and outliers for numeric data across
categories.
 Ideal for quick comparison of spread and central tendency.
6. Grouped Summary Tables
 Aggregated stats like:
o Mean age by department
o Total sales by region and quarter
7. Standardized Measures
 Use of Z-scores or normalized values to compare different scale metrics across
groups.

Example Case: Comparing Customer Spending by Age Group


Suppose we have a dataset with customer age and spending:
Age Group    Mean Spending ($)    Std Dev
18–25        220                  50
26–35        340                  70
36–45        410                  80
Insights:
 Spending increases with age.
 36–45 group is the highest spending segment.
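A hedged sketch of how such a group-wise comparison could be computed with pandas (the column names age_group and spending are assumptions made for illustration):

import pandas as pd

df = pd.DataFrame({
    'age_group': ['18-25', '18-25', '26-35', '26-35', '36-45', '36-45'],
    'spending':  [200, 240, 300, 380, 390, 430],
})

# Mean and standard deviation of spending per age group
print(df.groupby('age_group')['spending'].agg(['mean', 'std']))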

Best Practices:
 Always include visualizations to support numerical comparisons.
 Normalize values when comparing across different scales.
 Be cautious of outliers that may distort comparisons.
 Use statistical tests (e.g., t-test, ANOVA) for rigorous comparison after initial
EDA.


10. Examine the role that distributions play in Bayesian analysis when EDA is
being performed. What impact do they have on the conclusions that can be
drawn from the data? (Apr/May 2025)

Role of Distributions in Bayesian Analysis during EDA

1. Understanding the Prior Beliefs

 In Bayesian analysis, prior distributions represent our beliefs about


parameters before seeing the data.
 During EDA, we analyze the shape, spread, skewness, modality, and outliers
of data distributions.
 This helps guide the selection of appropriate priors, e.g.:
o Gaussian distribution for symmetric data.
o Beta distribution for probabilities.
o Gamma or Exponential for skewed positive data.

2. Informing the Likelihood Model

 EDA helps identify the data-generating distribution (likelihood function).


 For example:
o Count data → Poisson or Negative Binomial
o Time-to-event data → Exponential or Weibull
 Misidentifying this can lead to poor posterior inference.

3. Detecting Outliers and Skewness

 Outliers can bias Bayesian posteriors, especially with non-robust priors.


 EDA tools (histograms, boxplots, density plots) help decide whether to use
robust distributions (e.g., Student-t instead of Normal).

4. Checking Assumptions for Conjugacy

 Some Bayesian models use conjugate priors for mathematical convenience.


 EDA helps determine if assumptions like normality or independence are valid,
or if more flexible priors are needed.


Impact of Distributions on Bayesian Conclusions

1. Posterior Distribution Shapes


o The posterior is shaped by the interaction of prior and likelihood.
o An incorrect assumption about the data’s distribution can lead to
misleading posterior inferences.
2. Credible Intervals and Decision Making
o EDA-derived distributions impact the credible intervals (Bayesian
equivalent of confidence intervals).
o This affects how certain we are about estimates (e.g., mean, variance).
3. Model Comparison and Predictive Accuracy
o Bayesian model selection (e.g., via Bayes Factors) is sensitive to the
assumed distribution.
o Good distributional assumptions lead to better predictive
performance.
4. Robustness and Interpretability
o Properly understanding distributions during EDA increases
trustworthiness of Bayesian conclusions, especially in real-world
applications (e.g., medicine, finance)

11. Explore the capabilities of Tableau as an enterprise data analysis tool.


Within the context of user experience and data exploration, how does its drag
and drop interface impact these aspects? (Apr/May 2025)

Tableau is a leading enterprise data analysis and visualization tool, widely


recognized for its user-friendly interface, robust data connectivity, and interactive
dashboards. Let's explore its capabilities and specifically how its drag-and-drop
interface enhances user experience and data exploration.

Capabilities of Tableau as an Enterprise Data Analysis Tool

1. Data Connectivity

 Connects to a wide range of sources: Excel, SQL, Google BigQuery, Snowflake,


etc.
 Supports real-time and in-memory data analysis.


2. Interactive Visualizations

 Allows creation of dynamic and interactive charts, graphs, maps, and


dashboards.
 Built-in chart types: bar, line, scatter, treemaps, heatmaps, Gantt charts, etc.

3. Dashboards & Storytelling

 Combine multiple visualizations into dashboards.


 Use "Story" feature to narrate data insights step by step.

4. Real-Time Collaboration and Sharing

 Tableau Server and Tableau Online allow publishing, sharing, and


collaboration.
 Security features support enterprise-level governance.

5. Advanced Analytics

 Built-in statistical models (trend lines, forecasting).


 Integration with R and Python for custom analytics and machine learning.

Impact of Drag-and-Drop Interface on User Experience and Data Exploration

1. Intuitive and Non-Technical Friendly

 No coding required: Business users, analysts, and even non-technical


stakeholders can create reports.
 Visual elements (dimensions, measures) are simply dragged to rows, columns,
filters, and marks.

2. Rapid Prototyping & Insight Discovery

 Immediate feedback: Visuals are updated in real-time as fields are dragged.


 Encourages experimentation and exploration of data patterns without writing
SQL or code.


3. Contextual Suggestions

 Tableau offers visual recommendations based on the data type (Show Me


panel).
 Helps users choose the most appropriate chart type for the data they are
analyzing.

4. Improved User Engagement

 The interface promotes interaction (clicking on a point filters or drills down).


 Users stay engaged and can ask questions directly through the visual
interface.

5. Reduces Learning Curve

 Especially useful in enterprise settings where users vary in technical skill.


 Onboarding new users becomes faster due to the visual learning model.

Comparison: Traditional Tools vs Tableau

Feature            Traditional Tools (e.g., Excel, SQL)    Tableau
Coding needed      Often yes                               No (drag-and-drop)
Visual feedback    Limited                                 Real-time
Exploration        Manual                                  Interactive
User engagement    Low                                     High
Scalability        Medium                                  High (with Server/Online)


IMPORTANT QUESTIONS
PART – A
1. Define Exploratory Data Analysis (EDA). Or What does EDA mean in data?
(NOV/DEC 2023).
2. State the purpose of data aggregation. (NOV/DEC 2023).
PART -B
1. Provide an explanation of the various EDA tools that are used for data
analysis. (NOV/DEC 2023).
2. Explain in detail about Pivot tables and cross-tabulations. Or What is
cross-tabulation and Pivot table? How to Build Pivot Table and Cross
Table Reports? (NOV/DEC 2023).


(Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai,


Accredited by National Board of Accreditation (NBA), Accredited by NAAC with “A” Grade &
Accredited by TATA Consultancy Services (TCS), Chennai)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

II YEAR / III SEM

AD3301 DATA EXPLORATION AND VISUALIZATION

SYLLABUS

UNIT II

VISUALIZING USING MATPLOTLIB

SYLLABUS: Importing Matplotlib – Simple line plots – Simple scatter plots –


visualizing errors – density and contour plots – Histograms – legends – colors –
subplots – text and annotation – customization – three dimensional plotting -
Geographic Data with Basemap - Visualization with Seaborn.

PART A
1. What is a Matplotlib? Or What is Matplotlib used for? (NOV/DEC 2023)
 Matplotlib is a low-level graph plotting library in python that serves as a
visualization utility.
 Matplotlib was created by John D. Hunter.
 Matplotlib is open source and can be used freely.
 Matplotlib is mostly written in Python; a few segments are written in C,
Objective-C and JavaScript for platform compatibility.

2. What is a simple line plot?


 A line plot is created by connecting the values in the input data with straight
lines.
 Line plots are used to determine the relation between two datasets. A dataset
is a collection of values.
 Each dataset is plotted along an axis ie., x and y axis.


3. List down the steps involved in line plot.


i. Determine the independent and dependent variables.
ii. Set the independent variable on the horizontal axis and the dependent
variable on the vertical axis with appropriate units.
iii. Choose the horizontal and vertical axis scale depending upon the minimum
and maximum value of the independent and dependent variables,
respectively.
iv. Use a kink if actual data is far away from the starting value zero.
v. Plot the data points.
vi. Connect the data points with straight lines.

4. What is a Simple scatterplot?


 Scatter plots are the graphs that present the relationship between two
variables in a data-set.
 It represents data points on a two-dimensional plane.
 The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis.

5. How do you visualize error?


 By incorporating error bars into visualizations, such as bar charts, line
graphs, or scatter plots, viewers can better understand the uncertainty or
variability of the data and make more informed interpretations of the results.

6. What is a Contour plot?


 A contour plot is a graphical technique for representing a 3-dimensional
surface by plotting constant z slices, called contours, on a 2-dimensional
format.
 That is, given a value for z, lines are drawn for connecting the (x,y) coordinates
where that z value occurs.

7. How will you create a histogram in Matplotlib?


 In Matplotlib, the hist() function is used to create histograms.
 The hist() function will use an array of numbers to create a histogram, the
array is sent into the function as an argument.

8. What is a Subplot?
 The matplotlib.pyplot.subplots method provides a way to plot multiple plots
on a single figure. Given the number of rows and columns, it returns a tuple
(fig, ax), giving a single figure fig with an array of axes ax.


9. How to create Subplots in Python using Matplotlib?


 plt.subplot() creates a single subplot within a grid.
 This function takes three integer arguments—the number of rows, the number
of columns, and the index of the plot to be created in this scheme, which runs
from the upper left to the bottom right

10. How do you annotate text and graph?


 The annotate() function in pyplot module of matplotlib library is used
to annotate the point xy with text s.
 Can add texts over matplotlib charts by using the text and figtext
functions

11. Brief three dimensional plotting in Python using Matplotlib.


 Enables three-dimensional plots by importing the mplot3d toolkit, included with
the main Matplotlib installation:
from mpl_toolkits import mplot3d
 Once this submodule is imported, can create a three-dimensional axes by passing
the keyword projection='3d' to any of the normal axes creation routines:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')

12. What is Geographic data with Basemap?


 One common type of visualization in data science is that of geographic
data.
 Matplotlib's main tool for this type of visualization is the Basemap toolkit,
which is one of several Matplotlib toolkits which lives under the
mpl_toolkits namespace.

13. What is Visualization with Seaborn?


 Seaborn is a library for making statistical graphics in Python. It builds on
top of matplotlib and integrates closely with pandas data structures.
 Seaborn is a library that uses Matplotlib underneath to plot graphs. It will
be used to visualize random distributions.
 Seaborn is an amazing visualization library for statistical graphics plotting
in Python.
 It provides beautiful default styles and color palettes to make statistical
plots more attractive.


14. Define a KDE plot.


 KDE plot is a Kernel Distribution Estimation Plot which depicts the
probability density function of the continuous or non-parametric data
variables i.e. can plot for the univariate or multiple variables altogether.

15. Differentiate Bivariate and Univariate data using seaborn and pandas.

16. What is a Boxplot?


 A Box Plot is also known as Whisker plot is created to display the summary
of the set of data values having properties like minimum, first quartile,
median, third quartile and maximum.

17. What are the different map projections available?


 There are three types of map projections: azimuthal projections, conformal
projections, and equal-area projections.
 Each projection differs from the others by which of the four characteristics
of a map projection (i.e., area, shape, distance, and direction) it preserves
and which ones it compromises.

18. Define Customization.


 The meaning of customize is to build, fit, or alter according to individual
specifications.


19. Distinguish plot Vs. Scatter.


 Scatter Plot displays relationships between vital data variables. On the
other hand, a Line Graph shows trends and changes in variables.

20. How do you install Matplotlib?


Matplotlib is installed with pip:
pip install matplotlib
Importing matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt
21. List the software and hardware components required for data
visualization (APR /MAY 2024)

Software:

 Data Visualization Software

 Spreadsheet Software

 Open Source Libraries

 Web-based Platforms

 Business Intelligence (BI) Tools

Hardware:

 Computers: Process and display data visualizations.

 Monitors/Displays: Show visualizations.

 Projectors: Project visualizations on large screens.

 Interactive Displays: Allow for interactive exploration of visualizations.

 Input Devices: Keyboards and mice for interacting with the software and
visualizations.

22. Why is it important to label axes and provide a title in a line plot? (APR/MAY 2025)

 Clarifies the Data


 Provides Context
 Improves Interpretation
 Avoids Confusion
 Professional Presentation


23. A confusion matrix is used in error visualization: What is the relevance


of this tool? (APR /MAY 2025)

A confusion matrix is a vital tool in classification problems for


evaluating the performance of a machine learning model. It provides a
detailed breakdown of correct and incorrect predictions, allowing for better
understanding and analysis of errors made by the model.

Relevance of the Confusion Matrix:

1. Breaks Down Performance Beyond Accuracy

2. Reveals Types of Errors

3. Supports Better Metric Calculation

4. Visual Diagnostic Tool

5. Guides Model Improvement


PART B

1. Explain in detail about importing Matplotlib.(NOV/DEC 2024)


Matplotlib
 Matplotlib is a low level graph plotting library in python that serves as a
visualization utility.
 Matplotlib was created by John D. Hunter.
 Matplotlib is open source and can use it freely.
 Matplotlib is mostly written in python, a few segments are written in C,
Objective-C and Javascript for Platform compatibility.

 Importing matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt

 Setting Styles
 Choose appropriate styles for the figures.
 Set the classic style, to ensure that the plots created using the
classic Matplotlib style:
plt.style.use('classic')

 Plotting from a script


 plt.show() starts an event loop, looks for all currently active figure objects,
and opens one or more interactive windows that display the figure or
figures.
# ------- file: myplot.py ------
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()

$ python myplot.py


 Saving Figures to File


 Save a figure using the savefig() command.
 For example, to save the previous figure as a PNG file, can run this:
fig.savefig('my_figure.png')
a file called my_figure.png is stored in the current working directory:
from IPython.display import Image
Image('my_figure.png')

Two Interfaces for Matplotlib


 Matplotlib has dual interfaces:
o a convenient MATLAB-style state-based interface,
o a more powerful object-oriented interface.

MATLAB-style interface
 Matplotlib was a Python alternative for MATLAB user.
 The MATLAB-style tools are contained in the pyplot (plt) interface.
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1)
# (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));

Output – Refer Figure 2.1


Figure 2.1 - Subplots using the MATLAB-style interface

Object-oriented interface
 In the object-oriented interface the plotting functions are methods.
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2) # Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));

Figure 2.2 - Subplots using the object-oriented interface

2. Explain in detail about Simple Line Plots using Matplotlib. (NOV/DEC


2024),(APR/MAY 2024).
Simple Line Plots
 A line plot is created by connecting the values in the input data with straight
lines.
 Line plots are used to determine the relation between two datasets. A dataset
is a collection of values.
 Each dataset is plotted along an axis ie., x and y axis.
Steps in plotting the line plot
o Determine the independent and dependent variables.
o Set the independent variable on the horizontal axis and the dependent
variable on the vertical axis with appropriate units.
o Choose the horizontal and vertical axis scale depending upon the minimum
and maximum value of the independent and dependent variables,
respectively.


o Use a kink if actual data is far away from the starting value zero.
o Plot the data points.
o Connect the data points with straight lines.

 Setting up the notebook for plotting and importing the functions


%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
 A figure and axes can be created as follows (Figure 2.3):
fig = plt.figure()
ax = plt.axes()

Figure 2.3 -An empty gridded axes

 In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of
as a single container that contains all the objects representing axes, graphics,
text, and labels.
 The axes (an instance of the class plt.Axes) is a bounding box with ticks and
labels, which will eventually contain the plot elements
 Once created an axes, use the ax.plot function to plot some data.
Example - A simple sinusoid (Figure 2.4):
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));


Figure 2.4-A simple sinusoid

 To create a single figure with multiple lines, call the plot function multiple
times (Figure 2.5):
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));

Figure 2.5 - Over-plotting multiple lines

Adjusting the Plot: Line Colors and Styles


 The plt.plot() function takes additional arguments that can be used to
specify the color, using the color keyword, which accepts a string argument
representing virtually any imaginable color.
 Refer Figure 2.6:
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')
# Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse');
# all HTML color names supported


Figure 2.6 - Controlling the color of plot elements


 If no color is specified, Matplotlib will automatically cycle through a set of
default colors for multiple lines.
 Similarly, can adjust the line style using the linestyle keyword (Figure 2.7):
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted

Figure 2.7 - Example of various line styles

These linestyle and color codes can be combined into a single nonkeyword
argument to the plt.plot() function (Figure 2.8):
plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black


plt.plot(x, x + 3, ':r'); # dotted red

Figure 2.8 - Controlling colors and styles with the shorthand syntax
 These single-character color codes reflect the standard abbreviations in the
RGB(Red/Green/Blue) and CMYK (Cyan/Magenta/Yellow/blacK) color
systems, commonly used for digital color graphics.

Adjusting the Plot: Axes Limits


 The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim()
methods (Figure 2.9):
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);

Figure 2.9 - Example of setting axis limits

 Reverse the order of the arguments (Figure 2.10):


plt.plot(x, np.sin(x))
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);

Figure 2.10 - Example of reversing the y-axis

 The plt.axis() method allows to set the x and y limits with a single call, by
passing a list that specifies [xmin, xmax, ymin, ymax] (Figure 2.11):
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);

Figure 2.10 - Example of plt.axis()

 Allows higher-level specifications, one unit in x is equal to one unit in y


(Figure 2.11):
plt.plot(x, np.sin(x))
plt.axis('equal');


Figure 2.11 - Example of an “equal” layout

Labeling Plots (Figure 2.12)


plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");

Figure 2.12 - Examples of axis labels and title

 When multiple lines are shown within a single axes, a plot legend labels
each line type.
Refer Figure 2.13:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')
plt.legend();


Figure 2.13. Plot legend example

3. Explain in detail about scatterplots with suitable code in python using


Matplotlib. (NOV/DEC 2024)
Scatterplot
 Scatter plots are the graphs that present the relationship between two
variables in a data-set. It represents data points on a two-dimensional
plane. The independent variable or attribute is plotted on the X-axis, while
the dependent variable is plotted on the Y-axis.

 The scatter() function plots one dot for each observation. It needs two
arrays of the same length, one for the values of the x-axis, and one for
values on the y-axis
Simple Scatter Plots
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np

Scatter Plots with plt.plot - Figure 2.14


x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');


Figure 2.14 - Scatter plot example

 The third argument in the function call is a character that represents the
type of symbol used for the plotting.

 Refer Figure 2.15


import matplotlib.pyplot as plt
import numpy as np

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

plt.scatter(x, y)
plt.show()

Figure 2.15 - Demonstration of point numbers


Scatter Plots with plt.scatter – Figure 2.17


plt.scatter(x, y, marker='o');

Figure 2.17 - A simple scatter plot

 The primary difference of plt.scatter from plt.plot is that it can be used to


create scatter plots where the properties of each individual point (size, face
color, edge color, etc.) can be individually controlled or mapped to data.

Example – Refer Figure 2.18


rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar(); # show color scale

Figure 2.18 - Changing size, color, and transparency in scatter points


Line plot vs Scatterplot
 A scatter plot displays the relationship between two data variables.
 A line graph shows trends and changes in a variable.


4. How to visualize Errors using Matplotlib in Python?(NOV/DEC 2024)


Visualize error
 By incorporating error bars into visualizations, such as bar charts, line
graphs, or scatter plots, viewers can better understand the uncertainty or
variability of the data and make more informed interpretations of the
results.

Basic Errorbars
 A basic errorbar can be created with a single Matplotlib function call (Refer
Figure 2.19 & 2.20)
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np

x = np.linspace(0, 10, 50)


dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');

Figure 2.19 - An errorbar example

 the fmt is a format code controlling the appearance of lines and points

plt.errorbar(x, y, yerr=dy, fmt='o', color='black',


ecolor='lightgray', elinewidth=3, capsize=0);


Figure 2.20 - Customizing errorbars

Continuous Errors
 To show errorbars on continuous quantities (Figure 2.21).
from sklearn.gaussian_process import GaussianProcess
# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit


gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4,
thetaU=1E-1,
random_start=100)
gp.fit(xdata[:, np.newaxis], ydata)
xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE) # 2*sigma ~ 95% confidence region
plt.plot(xdata, ydata, 'or')

plt.plot(xfit, yfit, '-', color='gray')


plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
color='gray', alpha=0.2)
plt.xlim(0, 10);


Figure 2.21 - Representing continuous uncertainty with filled regions

 The fill_between function: pass an x value, then the lower y-bound, then the
upper y-bound, and the result is that the area between these regions is filled.
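Note that the GaussianProcess class used above has been removed from recent scikit-learn releases; a hedged sketch of the same idea with the current GaussianProcessRegressor API (the RBF kernel choice here is illustrative) is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Fit the Gaussian process and predict with an uncertainty estimate
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), n_restarts_optimizer=10)
gp.fit(xdata[:, np.newaxis], ydata)
xfit = np.linspace(0, 10, 1000)
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std                      # ~95% confidence region

plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit, color='gray', alpha=0.2)
plt.xlim(0, 10);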

5. How to represent Density and Contour plots in Python using Matplotlib? Or


How do you visualize a Three-Dimensional Function in python? Illustrate
with a code. (NOV/DEC 2023),(NOV/DEC 2024)
Density and Contour Plots
 Contour plots also called level plots are a tool for doing multivariate analysis
and visualizing 3-D plots in 2-D space.
 A contour plot is a graphical technique for representing a 3-dimensional
surface by plotting constant z slices, called contours, on a 2-dimensional
format. That is, given a value for z, lines are drawn for connecting the (x,y)
coordinates where that z value occurs.
 There are three Matplotlib functions that can be used:
o plt.contour for contour plots,
o plt.contourf for filled contour plots,
o plt.imshow for showing images.
 Setting up the notebook for plotting and importing the functions:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np

plt.contour() function
 The matplotlib.pyplot.contour() are usually useful when Z = f(X, Y) i.e Z
changes as a function of input X and Y.


def f(x, y):


return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
 A contour plot can be created with the plt.contour function.
 It takes three arguments:
a grid of x values,
a grid of y values,
a grid of z values.
 The x and y values represent positions on the plot, and the z values will be
represented by the contour levels.
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.contour(X, Y, Z, colors='black');

(Refer Figure 2.22)

Figure 2.22 - Visualizing three-dimensional data with contours

 By default when a single color is used, negative values are represented by


dashed lines, and positive values by solid lines.
 Alternatively, can color-code the lines by specifying a colormap with the cmap
argument.
plt.contour(X, Y, Z, 20, cmap='RdGy');
 Refer Figure 2.23


Figure 2.23 - Visualizing three-dimensional data with colored contours

plt.contourf() function.
 A contourf()allows to draw filled contours.
 A plt.colorbar() command, automatically creates an additional axis with
labeled color information for the plot (Refer Figure 2.24):
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();

Figure 2.24 Visualizing three-dimensional data with filled contours


 The colorbar makes it clear that the black regions are “peaks,” while the red
regions are “valleys.”
 Drawbacks - The color steps are discrete rather than continuous, which is not
always what is desired.

plt.imshow() function
 The plt.imshow() function, interprets a two-dimensional grid of data as an image.
Figure 2.25 shows the result of the following code:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy')
plt.colorbar()


plt.axis(aspect='image');

Figure 2.25 - Representing three-dimensional data as an image

There are a few potential gotchas with imshow(),


• plt.imshow() doesn’t accept an x and y grid, so manually specify the
extent [xmin, xmax, ymin, ymax] of the image on the plot.
 plt.imshow() by default follows the standard image array definition where
the origin is in the upper left, not in the lower left as in most contour
plots. This must be changed when showing gridded data.
 plt.imshow() will automatically adjust the axis aspect ratio to match the
input data;

6. Write a python program in Matplotlib to display histograms, binnings and


Density.
Histograms, Binnings, and Density
Histogram (NOV/DEC 2024)
 A histogram is a graph showing frequency distributions.
 It is a graph showing the number of observations within each given interval.
 The hist() function will use an array of numbers to create a histogram, the
array is sent into the function as an argument.
 Refer Figure 2.26

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);


Figure 2.26 - A simple histogram

 The hist() function has many options (Refer Figure 2.27):


plt.hist(data, bins=30, normed=True, alpha=0.5, histtype='stepfilled',
color='steelblue', edgecolor='none');

Figure 2.27 - A customized histogram

 The plt.hist doc string of histtype='stepfilled' along with some transparency


alpha is useful when comparing histograms of several distributions
(Figure 2.28):

x1 = np.random.normal(0, 0.8, 1000)


x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, normed=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);


Figure 2.28-Over-plotting multiple histograms

 Count the number of points in a given bin.


counts, bin_edges = np.histogram(data, bins=5)
print(counts)
Output:
[12 190 468 301 29]
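Note: in recent Matplotlib releases the normed keyword used in the examples above has been replaced by density, so the equivalent call is, for example:

plt.hist(data, bins=30, density=True, alpha=0.5, histtype='stepfilled',
         color='steelblue', edgecolor='none');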
Two-Dimensional Histograms and Binnings
 Create histograms in two dimensions by dividing points among two
dimensional bins.
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d: Two-dimensional histogram (Figure 2.29):
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')

Figure 2.29 - A two-dimensional histogram with plt.hist2d


plt.hexbin: Hexagonal binnings


 The two-dimensional histogram creates a tessellation of squares across the
axes.
 Matplotlib provides the plt.hexbin routine, which represents a two-
dimensional dataset binned within a grid of hexagons (Figure 2.30):
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')

Figure 2.30 - A two-dimensional histogram with plt.hexbin

7. Define Legend and write a python code to customize the plot legends.
Matplotlib.pyplot.legend()
 A legend is an area describing the elements of the graph.
 In the matplotlib library, there’s a function called legend() which is used to
place a legend on the axes.
 The legend function has an attribute called loc which is used to denote the
location of the legend.
 The default value of attribute loc is upper left. Other string, such as upper
right, lower left, lower right can also be used.
 Attributes of legend function are:
o fontsize: This is used to denote the font of the legend. It has a numeric
value.
o facecolor: This is used to denote the legend’s background color.
o edgecolor: This is used to denote the legends patch edge color.
o shadow: It is a boolean variable to denote whether to display shadow
behind the legend or not.
o title: This attribute denotes the legend’s title.


o numpoints: The number of marker points in the legend when creating


an entry for a Line2D (line).

Customizing Plot Legends


 The simplest legend can be created with the plt.legend() command, which
automatically creates a legend for any labeled plot elements (Figure
2.31):
import matplotlib.pyplot as plt
plt.style.use('classic')

%matplotlib inline
import numpy as np

x = np.linspace(0, 10, 1000)


fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.axis('equal')
leg = ax.legend();

Figure 2.31 - A default plot legend

 Can specify the location and turn off the frame (Figure 2.32):
ax.legend(loc='upper left', frameon=False)
fig


Figure 2.32 - A customized plot legend

 Use the ncol command to specify the number of columns in the legend
(Figure 2.33):
ax.legend(frameon=False, loc='lower center', ncol=2)
fig

Figure 2.33 - A two-column plot legend

Multiple Legends
 Use the lower-level ax.add_artist() method to manually add the second
artist to the plot (Figure 2.34):

fig, ax = plt.subplots()
lines = []
styles = ['-', '--', '-.', ':']
x = np.linspace(0, 10, 1000)
for i in range(4):
lines += ax.plot(x, np.sin(x - i * np.pi / 2),
styles[i], color='black')
ax.axis('equal')


# specify the lines and labels of the first legend


ax.legend(lines[:2], ['line A', 'line B'],
loc='upper right', frameon=False)

# Create the second legend and add the artist manually.


from matplotlib.legend import Legend
leg = Legend(ax, lines[2:], ['line C', 'line D'],
loc='lower right', frameon=False)
ax.add_artist(leg);

Figure 2.34 - A split plot legend

8. Write a python code to customize Color bar using Matplotlib. (NOV/DEC 2024)

 A colorbar is a separate axes that can provide a key for the meaning of
colors in a plot.
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
 The simplest colorbar can be created with the plt.colorbar function
(Figure 2.35):
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])
plt.imshow(I)
plt.colorbar();


Figure 2.35 - A simple colorbar legend

Customizing Colorbars

 Specify the colormap using the cmap argument to the plotting function

plt.imshow(I, cmap='gray');

Figure 2.36 - A grayscale colormap

Choosing the colormap

Three different categories of colormaps:


 Sequential colormaps
These consist of one continuous sequence of colors (e.g., binary or
viridis).
 Divergent colormaps
These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).
 Qualitative colormaps


These mix colors with no particular sequence (e.g., rainbow or jet).

Discrete colorbars

 Use the plt.cm.get_cmap() function, and pass the name of a suitable


colormap along with the number of desired bins (Figure 2.37):
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);

Figure 2.37 - A discretized colormap

9. Write a python program to create a subplot in python using Matplotlib.


(NOV/DEC 2024)
 Subplots:
 Groups of smaller axes that can exist together within a single figure.
 These subplots might be insets, grids of plots, or other more complicated
layouts.
 The matplotlib. pyplot. subplots method provides a way to plot multiple
plots on a single figure.
 Given the number of rows and columns , it returns a tuple ( fig , ax ),
giving a single figure fig with an array of axes ax .
 Notebook for plotting and importing the functions
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')


import numpy as np
 plt.axes: Subplots
 The most basic method of creating an axes is to use the plt.axes function.
 plt.axes also takes an optional argument that is a list of four numbers in
the figure coordinate system.
 These numbers represent [bottom, left, width, height] in the figure
coordinate system, which ranges from 0 at the bottom left of the figure to
1 at the top right of the figure.
ax1 = plt.axes() # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
 Output – Refer Figure 2.38

Figure 2.38 - Example of an inset axes

 For example, create an inset axes at the top-right corner of another axes
by setting the x and y position to 0.65 (that is, starting at 65% of the width
and 65% of the height of the figure) and the x and y extents to 0.2 (that is,
the size of the axes is 20% of the width and 20% of the height of the figure).

 plt.subplot: Simple Grids of Subplots


 plt.subplot(), creates a single subplot within a grid.
 This function takes three integer arguments—the number of rows, the
number of columns, and the index of the plot to be created in this scheme,
which runs from the upper left to the bottom right (Figure 2.39):
for i in range(1, 7):
plt.subplot(2, 3, i)
plt.text(0.5, 0.5, str((2, 3, i)),fontsize=18,
ha='center')


Figure 2.39 - A plt.subplot() example


 plt.subplots_adjust
 The command plt.subplots_adjust can be used to adjust the spacing
between these plots.
fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i in range(1, 7):
ax = fig.add_subplot(2, 3, i)
ax.text(0.5, 0.5, str((2, 3, i)),fontsize=18, ha='center')
 Output – Refer Figure 2.40

Figure 2.40 - plt.subplot() with adjusted margins

 hspace and wspace arguments of plt.subplots_adjust, which specify the


spacing along the height and width of the figure, in units of the subplot
size.

 plt.subplots:


 plt.subplots() creates a full grid of subplots in a single line, returning them


in a NumPy array.
 The arguments are the number of rows and number of columns, along with
optional keywords sharex and sharey.
 Will create a 2×3 grid of subplots, where all axes in the same row share
their y-axis scale, and all axes in the same column share their x-axis scale
by specifying sharex and sharey (Figure 2.41):
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

Figure 2.41 - Shared x and y axis in plt.subplots()

 plt.GridSpec:
 The plt.GridSpec() object is an interface that is recognized by the
plt.subplot() command.
 For example, a gridspec for a grid of two rows and three columns with
some specified width and height space is:
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);


Figure 2.42 - Irregular subplots with plt.GridSpec

10. Explain in detail about text and annotation in python using Matplotlib.
 Adding texts in matplotlib with text and figtext (APR/MAY 2024), (NOV/DEC 2024), (Apr/May 2025)
 Can add texts over matplotlib charts by using

the text and figtext functions as shown in figure 2.43 a & b.

 The main difference between these two functions is that the first can
be used to add texts inside the plot axes while the second can be used
to add text to the figure.

Figure 2.43a – Graph with text Figure 2.43b – Graph with figtext
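A minimal sketch of the two functions (the coordinates and strings are illustrative): text places a label in data coordinates inside the axes, while figtext positions a label in figure-fraction coordinates:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.text(4, 0.5, 'text inside the axes (data coordinates)')                    # axes text
plt.figtext(0.5, 0.01, 'figtext placed relative to the figure', ha='center')   # figure text
plt.show()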

 Annotations and arrows with annotate

 The annotate function provides parameters to annotate specific parts

of the plot with arrows. Refer Figure 2.44.


Figure 2.44 – Graph with annotations

matplotlib.pyplot.annotate() Function
 The annotate() function in pyplot module of matplotlib library is used
to annotate the point xy with text s.
matplotlib.pyplot.annotate(text, xy, xytext=None, xycoords='data', textcoords=None,
arrowprops=None, annotation_clip=None, **kwargs)
Parameters:
 Text - str
o The text of the annotation.
 xy - (float, float)
o The point (x, y) to annotate. The coordinate system is
determined by xycoords.
 Xytext - (float, float), default: xy
o The position (x, y) to place the text at. The coordinate system is
determined by textcoords.
 Xycoords - single or two-tuple of str

 Transforms and Text Position


There are three predefined transforms:
 ax.transData
Transform associated with data coordinates
 ax.transAxes
Transform associated with the axes
 fig.transFigure
Transform associated with the figure


Example - Figure 2.45


fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);

Figure 2.45 - Comparing Matplotlib’s coordinate systems

 By default, the text is aligned above and to the left of the specified
coordinates;
 the “.” at the beginning of each string will approximately mark the given
coordinate location.
 The transData coordinates give the usual data coordinates associated with
the x- and y-axis labels.
 The transAxes coordinates give the location from the bottom-left corner of
the axes (here the white box) as a fraction of the axes size.
The transFigure coordinates are similar, but specify the position from the
bottom left of the figure (here the gray box) as a fraction of the figure size.
 If the axes limits are changed, it is only the transData coordinates that
will be affected, while the others remain stationary

 Arrows and Annotation


%matplotlib inline
fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)


ax.plot(x, np.cos(x))
ax.axis('equal')
ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
arrowprops=dict(facecolor='black', shrink=0.05))
ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6),
arrowprops=dict(arrowstyle="->",
connectionstyle="angle3,angleA=0,angleB=-90"));
 Output – Refer Figure 2.46

Figure 2.46 - Annotation examples

11. Explain in detail about customizing matplotlib - Configurations and


Stylesheets in python. (NOV/DEC 2024)
Plot Customization
Default Histogram – Refer Figure 2.47

import matplotlib.pyplot as plt


plt.style.use('classic')
import numpy as np
%matplotlib inline
x = np.random.randn(1000)
plt.hist(x);


Figure 2.47 - A histogram in Matplotlib’s default style

Changing the Defaults: rcParams


 Each time Matplotlib loads, it defines a runtime configuration (rc)
containing the default styles for every plot element.
 Can adjust this configuration at any time using the plt.rc convenience
routine.
IPython_default = plt.rcParams.copy()
 Can use the plt.rc function to change some of these settings:
from matplotlib import cycler
colors = cycler('color',['#EE6666', '#3388BB', '#9988DD',
'#EECC55', '#88BB44', '#FFBBBB'])
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
axisbelow=True, grid=True, prop_cycle=colors)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('xtick', direction='out', color='gray')
plt.rc('ytick', direction='out', color='gray')
plt.rc('patch', edgecolor='#E6E6E6')
plt.rc('lines', linewidth=2)
plt.hist(x);


Figure 2.48 - A customized histogram using rc settings

Stylesheets
 Style module, includes a number of new default stylesheets, as well as the
ability to create and package own styles.
 These stylesheets are formatted and named with a .mplstyle extension.
 The available styles are listed in plt.style.available
plt.style.available[:5]

Output
['fivethirtyeight',
'seaborn-pastel',
'seaborn-whitegrid',
'ggplot',
'grayscale']

 The basic way to switch to a stylesheet is to call:


plt.style.use('stylename')
 Use the style context manager, which sets a style temporarily:
with plt.style.context('stylename'):
    make_a_plot()

 Create a function that will make two basic types of plot:


def hist_and_lines():
    np.random.seed(0)
    fig, ax = plt.subplots(1, 2, figsize=(11, 4))
    ax[0].hist(np.random.randn(1000))
    for i in range(3):
        ax[1].plot(np.random.rand(10))
    ax[1].legend(['a', 'b', 'c'], loc='lower left')

hist_and_lines()

Output – Refer Figure 2.49

Figure 2.49 - A customized histogram using rc settings

FiveThirtyEight style
 The FiveThirtyEight style is typified by bold colors, thick lines, and
transparent axes.
with plt.style.context('fivethirtyeight'):
    hist_and_lines()
 Output – Refer Figure 2.50

Figure 2.50 - The FiveThirtyEight style

ggplot


 The ggplot style mimics the default styles of the ggplot package in the R
language (Figure 2.51):
with plt.style.context('ggplot'):
    hist_and_lines()

Figure 2.51 - The ggplot style

Dark background
 The dark_background style provides this (Figure 2.52):
with plt.style.context('dark_background'):
    hist_and_lines()

Figure 2.52 - The dark_background style

Grayscale
 Sometimes figures must be prepared for a print publication that does not
accept color figures.
 For this, the grayscale style, shown in Figure 2.53, can be used:
with plt.style.context('grayscale'):
    hist_and_lines()


Figure 2.53 - The grayscale style

Seaborn style - Figure 2.54


import seaborn
hist_and_lines()

Figure 2.54 - The Seaborn plotting style

12. Explain in detail about three-dimensional plotting with suitable python code and
output.(NOV/DEC 2024)
Three-Dimensional Plotting in Matplotlib
 Enables three-dimensional plots by importing the mplot3d toolkit, included with
the main Matplotlib installation (Figure 2.55):
from mpl_toolkits import mplot3d


 Once this submodule is imported, can create a three-dimensional axes by passing


the keyword projection='3d' to any of the normal axes creation routines:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')

Figure 2.55 - An empty three-dimensional axes

Three-Dimensional Points and Lines


 The most basic three-dimensional plot is a line or scatter plot created from
sets of (x, y, z) triples created using the ax.plot3D and ax.scatter3D
functions.
 Plot a trigonometric spiral, along with some points drawn randomly near
the line (Figure 2.56):
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');


Figure 2.56 - Points and lines in three dimensions

Three-Dimensional Contour Plots


 ax.contour3D requires all the input data to be in the form of two-
dimensional regular grids, with the Z data evaluated at each point.
 Example - A three-dimensional contour diagram of a three dimensional
sinusoidal function (Figure 2.57):

def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');


Figure 2.57 - A three-dimensional contour plot

 The view_init method is used to set the elevation and azimuthal angles.
 Shown in Figure 2.58, uses an elevation of 60 degrees (that is, 60 degrees
above the x-y plane) and an azimuth of 35 degrees (that is, rotated 35
degrees counter-clockwise about the z-axis):
ax.view_init(60, 35)
fig

Figure 2.58 - Adjusting the view angle for a three-dimensional plot

Wireframes and Surface Plots


 Two other types of three-dimensional plots that work on gridded data are
wireframes and surface plots.
 These take a grid of values and project it onto the specified three
dimensional surface, and make the resulting three-dimensional forms
quite easy to visualize.
 An example using a wireframe (Figure 2.59):


fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');

Figure 2.59 - A wireframe plot

 A surface plot is like a wireframe plot, but each face of the wireframe is a
filled polygon.
 Adding a colormap to the filled polygons can aid perception of the topology
of the surface being visualized (Figure 2.60):

ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none')
ax.set_title('surface');

Figure 2.60 - A three-dimensional surface plot


Surface Triangulations
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5);

Output – Refer Figure 2.81

Figure 2.81 - A three-dimensional sampled surface
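
 The scattered points above only sample the surface; a triangulated surface can
be drawn from the same (x, y, z) points with ax.plot_trisurf (a minimal sketch
continuing the variables defined above, not part of the original notes):

ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z, cmap='viridis', edgecolor='none');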

13. Write a python code to display Geographic Data with Basemap.
(Apr/May 2025) (Nov/Dec 2024)
Basemap
 Basemap is a matplotlib extension used to visualize and create
geographical maps in python.
Geographic Data with Basemap
 Installation of Basemap
$ conda install basemap

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))


m = Basemap(projection='ortho', resolution=None, lat_0=50,


lon_0=-100)
m.bluemarble(scale=0.5);

 Bluemarble Image - Refer Figure 2.82

Figure 2.82 - A “bluemarble” projection of the Earth

 An Etopo Image - shows topographical features both on land and under


the ocean as the map background, Refer Figure 2.83.

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6, lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting


x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);


Figure 2.83 - Plotting data and labels on the map

Map Projections
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

Cylindrical projections


 The simplest of map projections are cylindrical projections, in which lines


of constant latitude and longitude are mapped to horizontal and vertical
lines, respectively.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)

 Other cylindrical projections are the Mercator (projection='merc') and the


cylindrical equal-area (projection='cea') projections. Refer Figure 2.84.

Figure 2.84 - Cylindrical equal-area projection

 The additional arguments to Basemap for this view specify the latitude
(lat) and longitude (lon) of the lower-left corner (llcrnr) and upper-right
corner (urcrnr) for the desired map, in units of degrees.

Pseudo-cylindrical projections
 Pseudo-cylindrical projections can give better properties near the poles of
the projection.
 The Mollweide projection (projection='moll') is one common example of
this, in which all meridians are elliptical arcs (Figure 2.85).
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)


Figure 2.85 - The Mollweide projection


 Other pseudo-cylindrical projections are the sinusoidal (projection='sinu')
and Robinson (projection='robin') projections.
 The extra arguments to Basemap here refer to the central latitude (lat_0)
and longitude (lon_0) for the desired map.

Perspective projections
 Perspective projections are constructed using a particular choice of
perspective point.
 One common example is the orthographic projection (projection='ortho'),
which shows one side of the globe as seen from a viewer at a very long
distance. Thus, it can show only half the globe at a time.
 Other perspective-based projections include the gnomonic projection
(projection='gnom') and stereographic projection (projection='stere').
 Here is an example of the orthographic projection (Figure 2.86):

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);


Figure 2.86 - The orthographic projection

Conic projections
 A conic projection projects the map onto a single cone, which is then
unrolled.
 One example of this is the Lambert conformal conic projection
(projection='lcc').
 Other useful conic projections are the equidistant conic (projection='eqdc')
and the Albers equal-area (projection='aea') projection (Figure 2.87).
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)


Figure 2.87 - The Lambert conformal conic projection

Drawing a Map Background


 Physical boundaries and bodies of water
o drawcoastlines() - Draw continental coast lines
o drawlsmask() - Draw a mask between the land and sea, for use
with projecting images on one or the other
o drawmapboundary() - Draw the map boundary, including the fill color
for oceans
o drawrivers() - Draw rivers on the map
o fillcontinents() - Fill the continents with a given color; optionally
fill lakes with another color
• Political boundaries
o drawcountries() - Draw country boundaries
o drawstates() - Draw US state boundaries
o drawcounties() - Draw US county boundaries
• Map features
o drawgreatcircle() - Draw a great circle between two points
o drawparallels() - Draw lines of constant latitude
o drawmeridians() - Draw lines of constant longitude
o drawmapscale() - Draw a linear scale on the map


• Whole-globe images
o bluemarble() - Project NASA’s blue marble image onto the map
o shadedrelief() - Project a shaded relief image onto the map
o etopo() - Draw an etopo relief image onto the map
o warpimage() - Project a user-provided image onto the map

 The resolution argument of the Basemap class sets the level of detail
in boundaries, either 'c' (crude), 'l' (low), 'i' (intermediate), 'h' (high), 'f'
(full), or None if no boundaries will be used.
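
 A minimal sketch combining a few of these background methods (the colors and
resolution chosen here are illustrative, not from the original notes):

fig = plt.figure(figsize=(8, 6))
m = Basemap(projection='cyl', resolution='l')
m.drawmapboundary(fill_color='lightblue')                  # ocean background
m.fillcontinents(color='lightgray', lake_color='lightblue')
m.drawcoastlines()
m.drawcountries()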

Plotting Data on Maps


 Some of these map-specific methods are:
o contour()/contourf() - Draw contour lines or filled contours
o imshow() - Draw an image
o pcolor()/pcolormesh() - Draw a pseudocolor plot for
irregular/regular meshes
o plot() - Draw lines and/or markers
o scatter() - Draw points with markers
o quiver() - Draw vectors
o barbs() - Draw wind barbs
o drawgreatcircle() - Draw a great circle
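
 For example, observations given as (longitude, latitude) pairs can be drawn with
the map's scatter() method; latlon=True tells Basemap to convert degrees to
projection coordinates (a hedged sketch with made-up sample points):

lons = [-122.3, -74.0, 77.2]   # hypothetical sample longitudes (degrees)
lats = [47.6, 40.7, 28.6]      # hypothetical sample latitudes (degrees)
m = Basemap(projection='cyl', resolution='l')
m.shadedrelief(scale=0.2)
m.scatter(lons, lats, latlon=True, c='red', s=40)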

14. Write a python code to display Visualization with Seaborn. Or Describe
the various distributions module of Seaborn for visualization. Consider
a sample application to illustrate. (NOV/DEC 2023), (NOV/DEC 2024)
Seaborn
 Seaborn is a library for making statistical graphics in Python. It builds on
top of matplotlib and integrates closely with pandas data structures.
 Seaborn is a library that uses Matplotlib underneath to plot graphs. It will
be used to visualize random distributions.
 Seaborn is an amazing visualization library for statistical graphics plotting
in Python.
 It provides beautiful default styles and color palettes to make statistical
plots more attractive.


Different categories of plot in Seaborn


 Relational plots: This plot is used to understand the relation between two
variables.
 Categorical plots: This plot deals with categorical variables and how they
can be visualized.
 Distribution plots: This plot is used for examining univariate and
bivariate distributions
 Regression plots: The regression plots in Seaborn are primarily intended
to add a visual guide that helps to emphasize patterns in a dataset during
exploratory data analyses.
 Matrix plots: A matrix plot is an array of scatterplots.
 Multi-plot grids: It is a useful approach to draw multiple instances of the
same plot on different subsets of the dataset.

Histplot
 Seaborn Histplot is used to visualize the univariate set of
distributions(single variable).
 It plots a histogram, with some other variations like kdeplot and rugplot.
 Refer Figure 2.88.

import numpy as np
import seaborn as sns

sns.set(style="white")

# Generate a random univariate dataset


rs = np.random.RandomState(10)
d = rs.normal(size=100)

# Plot a simple histogram and kde


sns.histplot(d, kde=True, color="m")


Figure 2.88 - Histogram with seaborn

Distplot:
 Seaborn distplot is used to visualize the univariate set of distributions
(Single features) and plot the histogram with some other variations like
kdeplot and rugplot.
 Refer Figure 2.89

import numpy as np
import seaborn as sns
sns.set(style="white")

# Generate a random univariate dataset


rs = np.random.RandomState(10)
d = rs.normal(size=100)

# Define the colors to use


colors = ["r", "g", "b"]

# Plot a histogram with multiple colors


sns.distplot(d, kde=True, hist=True, bins=10,
rug=True,hist_kws={"alpha": 0.3,"color": colors[0]},
kde_kws={"color": colors[1], "lw": 2},
rug_kws={"color": colors[2]})


Figure 2.89 - Distplot using seaborn

Lineplot:
 The line plot is one of the most basic plots in the seaborn library.
 This plot is mainly used to visualize the data in the form of some time
series, i.e. in a continuous manner.
 Refer Figure 2.90

import seaborn as sns


sns.set(style="dark")
fmri = sns.load_dataset("fmri")

# Plot the responses for different events and regions


sns.lineplot(x="timepoint",
y="signal",
hue="region",
style="event",
data=fmri)


Figure 2.90 - Lineplot using seaborn

Lmplot:
 The lmplot is another most basic plot.
 It shows a line representing a linear regression model along with data
points on the 2D space and x and y can be set as the horizontal and vertical
labels respectively.
 Refer Figure 2.91

import seaborn as sns

sns.set(style="ticks")

# Loading the dataset


df = sns.load_dataset("anscombe")

# Show the results of a linear regression


sns.lmplot(x="x", y="y", data=df)


Figure 2.91 - Lmplot using seaborn

Factor plots
 Factor plots allow viewing the distribution of a parameter within bins
defined by any other parameter (Figure 2.92):
tips = sns.load_dataset("tips")   # example dataset used below
with sns.axes_style(style='ticks'):
    g = sns.factorplot("day", "total_bill", "sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");

Figure 2.92 - An example of a factor plot

Joint distributions
 To show the joint distribution between different datasets, along with the
associated marginal distributions (Figure 2.93):
with sns.axes_style('white'):
    sns.jointplot("total_bill", "tip", data=tips, kind='hex')


Figure 2.93 - A joint distribution plot

Bar plots
 Time series can be plotted with sns.factorplot (Figure 2.94).
planets = sns.load_dataset('planets')
planets.head()
 Output
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
with sns.axes_style('white'):
    g = sns.factorplot("year", data=planets, aspect=2,
                       kind="count", color='steelblue')
    g.set_xticklabels(step=5)

Figure 2.94 - A histogram as a special case of a factor plot


IMPORTANT QUESTIONS
PART – A

1. What is a Matplotlib? Or What is Matplotlib used for? (NOV/DEC 2023)

PART – B
1. How to represent Density and Contour plots in Python using Matplotlib?
Or How do you visualize a Three-Dimensional Function in python?
Illustrate with a code. (NOV/DEC 2023)

PART – C
1. Write a python code to display Visualization with Seaborn. Or Describe
the various distributions module of Seaborn for visualization. Consider a
sample application to illustrate. (NOV/DEC 2023)


(Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai, Accredited by National
Board of Accreditation (NBA), Accredited by NAAC with “A” Grade & Accredited by TATA
Consultancy Services (TCS), Chennai)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


II YEAR / III SEM
AD3301 DATA EXPLORATION AND VISUALIZATION
SYLLABUS
UNIT III
UNIVARIATE ANALYSIS

SYLLABUS: Introduction to Single variable: Distributions and Variables –


Numerical Summaries of Level and Spread – Scaling and Standardizing –
Inequality – Smoothing Time Series.

PART A
1. Define Variable and Single Variable.
 A data set is usually a rectangular array of data, with variables in columns and
observations in rows.
 A variable is a characteristic or property that can take on different values.
 A variable (or field or attribute) is a characteristic of members of a population,
such as height, gender, or salary.
 Single variable data is usually called univariate data.
 Single variable data is used to describe a type of data that consists of
observations on only a single characteristic or attribute.

2. List the types of variable.


 Quantitative variable - Quantitative data is numbers-based, countable, or
measurable. Quantitative data tells us how many, how much, or how often in
calculations.
 Qualitative Variable - Qualitative data is interpretation-based, descriptive,
and relating to language. Qualitative data can help us to understand why, how,
or what happened behind certain behaviors.

3. Define Discrete and Continuous Variables.


 Quantitative variables can be divided into
o Discrete


o Continuous
 A discrete variable consists of isolated numbers separated by gaps.
Example
o counts, such as the number of children in a family;
o The number of foreign countries you have visited;
 A continuous variable consists of numbers whose values have no
restrictions.
Example
o amounts, such as weights of male statistics students;
o durations, such as the reaction times of grade school children to a fire
alarm;

4. List the Descriptive Measures for Univariate Data Analysis.


 Univariate data analysis involves using statistical measures such as
Measures of Central Tendency.
 The various numerical summary measures can be categorized into several
groups:
 measures of central tendency;
o mean
o median
o mode
 minimum, maximum, percentiles, and quartiles;
 measures of variability;
o range
o IQR
o Variance
o Standard deviation
 measures of shape

5. Define Mean, Median and Mode.


Mean
 The mean is the average of all values. The sample mean is denoted by X̄
(pronounced “X-bar”); the population mean is denoted by μ (the Greek
letter mu).
The mean is found by adding all scores and then dividing by the number
of scores.
Mean = sum of all scores / number of scores


Types of mean
 population mean
 sample mean
Median
 The median is the middle observation when the data are sorted from
smallest to largest. If the number of observations is odd, the median is
literally the middle observation. If the number of observations is even, the
median is usually defined as the average of the two middle observations
Mode
 The mode reflects the value of the most frequently occurring score.
Distributions with two obvious peaks, even though they are not exactly the
same height, are referred to as bimodal.

6. What is Minimum, maximum, percentiles, and quartiles?


 Cumulative percentages are referred to as percentile ranks.
 The percentile rank of a score indicates the percentage of scores in the
entire distribution with similar or smaller values than that score.
 For any percentage p, the pth percentile is the value such that a
percentage p of all values are less than it.

7. What are the different Measures of variability?


 The range is a crude measure of variability. It is defined as the maximum
value minus the minimum value.
 A less sensitive measure is the interquartile range (abbreviated IQR). It is
defined as the third quartile minus the first quartile.

8. What are the Charts suitable for numerical variables?


 Histograms and Box-plots for cross-sectional variables-
 Box plots and histograms are complementary ways of displaying the
distribution of a numerical variable.
 Side-by-side box plots are very useful for comparing two or more
populations.
 The whole purpose of time series graphs is to detect historical patterns in
the data.
 An outlier is literally a value or an entire observation (row) that lies well
outside of the norm.


9. List the representations for Distribution of Data and Variables.


 Histograms.
 Frequency distribution.
 Box plots.
 Pie charts.

10. Define Histograms.


 Histograms are one of the most commonly used graphs to show frequency
distribution.
 It is a graphical display of data using bars of different heights.
 The histogram groups numbers into ranges.
 It is an appropriate way to display single-variable data.

11. Define frequency distribution and mention its types.


 A frequency distribution is a collection of observations produced by sorting
observations into classes and showing their frequency (f) of occurrence in
each class.
Types of Frequency distribution
1. Relative Frequency Distributions
• Relative frequency distributions shows the frequency of each class as a
part or fraction of the total frequency for the entire distribution.
2. Cumulative Frequency Distributions
 Cumulative frequency distributions show the total number of
observations in each class and in all lower-ranked classes.

12. Define Numerical Summaries.


 A numerical summary is a number used to describe a specific
characteristic about a data set.

13. What are the Measures of Location and measures of spread?


Measures of Location
 These are also referred to as measures of centrality or averages.
 There are three measures: the mean, the median and the mode.
Measures of Spread
 Defines variation in the data.
 There are three basic measures of spread:
 the range,
 the inter–quartile range
 the sample variance.
 Boxplots


14. Define Boxplots.


Box plots
 Presenting data using the box plot gives a good graphical image of the
concentration of the data.
 It displays the five-number summary of a dataset; the minimum, first
quartile, median, third quartile, and maximum.

15. What is Univariate Analysis? Or List the three main types of univariate
analyses. (NOV/DEC 2023)
 The term univariate analysis refers to the analysis of one variable.
 There are three common ways to perform univariate analysis on one
variable:
o Summary statistics – Measures the center and spread of values.
o Frequency table – Describes how often different values occur.
o Charts – Used to visualize the distribution of values.

16. What are the basis for data analysis?


 The basics of data analysis involve retrieving and gathering large volumes
of data, organizing it, and turning it into insights businesses can use to
make better decisions and reach conclusions.

17. Distinguish bar chart and pie charts.


 Pie charts represent data in a circle, with “slices” corresponding to
percentages of the whole, whereas bar graphs use bars of different lengths
to represent data in a more flexible way.

18. Define level and spread in data exploration.


Spread
 How far the data values are from the mean or the median of the data set is
called the spread of the data.
 There are different four measures of the spread of data.
 Range: the difference between the maximum and minimum data values
 Interquartile range (IQR): the difference between the upper quartile and
lower quartile
 Mean deviation: mean of the deviations from the mean of the data set
 Standard deviation: the amount of variation or dispersion from the mean
of the data.


19. What is a mid spread?


 Spread of data (also known as variation, fluctuation, dispersion, etc.) is
the measure of how far the data ranges from the center of data (mean or
the median).
 Range, interquartile range, mean deviation, and standard
deviation are the measures of the spread of the data.

20. Differentiate scaling and standardizing.


 Feature scaling is a data preprocessing technique used to transform the
values of features or variables in a dataset to a similar scale. The purpose
is to ensure that all features contribute equally to the model and to avoid
the domination of features with larger values.
 Standardization is another scaling method where the values are centered
around the mean with a unit standard deviation. This means that the mean
of the attribute becomes zero, and the resultant distribution has a unit
standard deviation.

21. Define a Gaussian Distribution.


 In statistics, a normal distribution or Gaussian distribution is a type of
continuous probability distribution for a real-valued random variable. The
general form of its probability density function is
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)),
where μ is the mean and σ is the standard deviation.

22. Define Smoothing time series. Or What is the purpose of smoothing a


time series data? (NOV/DEC 2023)
 The process of reducing the noise from time-series data by averaging the
data points with their neighbors is called smoothing. There are many
techniques to reduce the noise like simple moving average, weighted
moving average, kernel smoother, etc.

23. What is Gini Coefficient?


 The Gini coefficient is an index for the degree of inequality in the
distribution of income/wealth, used to estimate how far a country's wealth
or income distribution deviates from an equal distribution.


24. Why there is a need to reduce inequality?


 Reducing inequality requires transformative change. This matters
because elevated levels of inequality are harmful for the pace and
sustainability of growth.
25. Difference between normalized scaling and standardized scaling.
(APR/MAY 2024)
 Normalization and standardization are both feature scaling techniques
used in data preprocessing, but they differ in their approach and the types
of data they are best suited for.
 Normalization scales data to a specific range, typically between 0 and 1,
while standardization transforms data to have a mean of 0 and a standard
deviation of 1.
26. Illustrate important steps to be followed in preparing a base map.
(APR/MAY 2024)

 Define the Purpose and Area of Interest


 Collect Source Data
 Choose an Appropriate Scale
 Select a Coordinate System and Projection
 Digitize and Compile Map Features
 Clean and Edit the Data
 Add Cartographic Elements
 Verify and Validate
 Final Output and Export
 Documentation and Archiving
27. How does a continuous variable differ from a discrete variable?
(Apr/May 2025)

Continuous Variable:

 Takes infinite values within a given range.


 Can be measured, not counted.
 Examples:
o Height (e.g., 170.5 cm, 171.234 cm)
o Temperature (e.g., 36.6°C, 98.4°F)
o Time (e.g., 2.1 seconds, 3.1415 seconds)
 Can have fractions or decimals.


Discrete Variable:

 Takes a finite or countable number of values.


 Can be counted, not measured.
 Examples:
o Number of students in a class (e.g., 25, 30)
o Number of cars in a parking lot (e.g., 10, 20)
o Number of goals in a football match (e.g., 0, 1, 2)
 Cannot have fractional values (you can’t have 2.5 cars).

28. Which two categories of variable are the most common? Give a specific
illustration of each. (Apr/May 2025)

The two most common categories of variables are:

1. Quantitative Variables

 These represent numerical values and can be either discrete or continuous.


 They allow for mathematical operations like addition, subtraction, averaging,
etc.

Example:

 Age of a person:
o A person might be 25 years old (quantitative continuous).
 Number of children in a family:
o A family might have 3 children (quantitative discrete).

2. Qualitative Variables (Categorical Variables)

 These represent categories or labels and cannot be meaningfully counted or


measured.
 They describe qualities or characteristics.

Example:

 Gender: Male or Female.


 Type of vehicle: Car, Bike, Bus, Truck.
 Blood group: A, B, AB, O.


PART B

1. Give a brief introduction about single variable Distributions and Variables.


 Single Variable
 Types of variable
 Discrete and Continuous Variables
 Descriptive Measures:
o measures of central tendency;
o minimum, maximum, percentiles, and quartiles;
o measures of variability;
o measures of shape
 Charts for numerical variables:

 Single Variable
 Single variable data is usually called univariate data.
 Single variable data is used to describe a type of data that consists of
observations on only a single characteristic or attribute.
 A population includes all of the entities of interest. it is virtually impossible
to obtain information about all members of the population.
 Therefore, gain insights into the characteristics of a population by
examining a sample, or subset, of the population
 A data set is usually a rectangular array of data, with variables in columns
and observations in rows.
 A variable is a characteristic or property that can take on different values.
 A variable (or field or attribute) is a characteristic of members of a
population, such as height, gender, or salary.

 Types of variable
 Quantitative variable - Quantitative data is numbers-based, countable, or
measurable. Quantitative data tells us how many, how much, or how often
in calculations.
 Qualitative Variable - Qualitative data is interpretation-based, descriptive,
and relating to language. Qualitative data can help us to understand why,
how, or what happened behind certain behaviors.
 An observation (or case or record) is a list of all variable values for a single
member of a population.


 A variable is numerical if meaningful arithmetic can be performed on it.


 Otherwise, the variable is categorical.
 Third is a date variable.
 A categorical variable is ordinal if there is a natural ordering of its possible
categories. If there is no natural ordering, the variable is nominal.
 A dummy variable is a 0−1 coded variable for a specific category.
 It is coded as 1 for all observations in that category and 0 for all observations
not in that category.
 A binned (or discretized) variable corresponds to a numerical variable
that has been categorized into discrete categories. These categories are
usually called bins.

 Discrete and Continuous Variables


 Quantitative variables can be divided into
o Discrete
o Continuous
 A discrete variable consists of isolated numbers separated by gaps.
Example
o counts, such as the number of children in a family;
o The number of foreign countries you have visited;
 A continuous variable consists of numbers whose values have no
restrictions.
Example
o amounts, such as weights of male statistics students;
o durations, such as the reaction times of grade school children to a fire
alarm;
 Cross-sectional datasets are data on a cross section of a population at a
distinct point in time. Time series data are data collected over time.

 Descriptive Measures:
 Univariate data analysis involves using statistical measures such as
Measures of Central Tendency.
 The various numerical summary measures can be categorized into several
groups:
 measures of central tendency;
 minimum, maximum, percentiles, and quartiles;
 measures of variability;


 measures of shape

Measures of central tendency:


Mean
 The mean is the average of all values. The sample mean is denoted by X̄
(pronounced “X-bar”); the population mean is denoted by μ (the Greek
letter mu).
The mean is found by adding all scores and then dividing by the number
of scores.
Mean = sum of all scores / number of scores
Example:
Find the mean for the following retirement ages: 60, 63, 45, 63, 65, 70,
55, 63, 60, 65, 63.
Answer - mean = 672/11 ≈ 61.09

Types of mean
 population mean
 sample mean
 Population is a complete set of scores.
o Sample is a subset of scores.
Sample Mean
 The mean is found by dividing the sum for the values of all scores in the
sample by the number of scores in the sample.
 Sample Size (n): The total number of scores in the sample.

X̄ = ΣX / n
 “X-bar equals the sum of the variable X divided by the sample size n.”

Population Mean (μ)


 The balance point for a population, found by dividing the sum for all scores
in the population by the number of scores in the population.
 Population Size (N) The total number of scores in the population.
 The population mean is represented by μ (pronounced “mu”):
μ = ΣX / N
where the uppercase letter N refers to the population size.


Median
 The median is the middle observation when the data are sorted from
smallest to largest. If the number of observations is odd, the median is
literally the middle observation. If the number of observations is even, the
median is usually defined as the average of the two middle observations

Mode
 The mode reflects the value of the most frequently occurring score.
 Distributions with two obvious peaks, even though they are not exactly the
same height, are referred to as bimodal.
 Distributions with more than two peaks are referred to as multimodal. Refer
Figure 3.1.

Figure 3.1 - Modes


Example:
Determine the mode for the following retirement ages: 60, 63, 45, 63,
65, 70, 55, 63, 60, 65, 63.
Answer - mode = 63

Minimum, maximum, percentiles, and quartiles;


 Cumulative percentages are referred to as percentile ranks.
 The percentile rank of a score indicates the percentage of scores in the
entire distribution with similar or smaller values than that score.
 For any percentage p, the pth percentile is the value such that a
percentage p of all values are less than it.
 Similarly, the first, second, and third quartiles are the percentiles
corresponding to p=25% ,p=50% ,p=75%.
Measures of variability:


 The range is a crude measure of variability. It is defined as the maximum


value minus the minimum value.
 A less sensitive measure is the interquartile range (abbreviated IQR). It is
defined as the third quartile minus the first quartile, so it is really the range
of the middle 50% of the data

 The variance is essentially the average of the squared deviations from the
mean.
 If all observations are close to the mean, their squared deviations from the
mean will be relatively small, and the variance will be relatively small.
 If at least a few of the observations are far from the mean, their squared
deviations from the mean will be large, and this will cause the variance to
be large.
 Standard deviation - A measure is the square root of variance.
 There are two versions of standard deviation.
o The sample standard deviation, denoted by s
o The population standard deviation, denoted by sigma.

Measures of shape:
Positively Skewed Distribution
 A distribution that includes a few extreme observations in the positive
direction (to the right of the majority of observations).
 Figure 3.2 represents positively skewed distribution curve.


 If the skewness were due to really large values then skewed to the
right (or positively skewed).

Figure 3.2 - Positively Skewed Distribution

Negatively Skewed Distribution


 A distribution that includes a few extreme observations in the negative
direction (to the left of the majority of observations)
 Figure 3.3 represents positively skewed distribution curve.
 If the skewness were due to really small values then called as
skewness to the left (or negatively skewed).

Figure 3.3 Negatively Skewed Distribution

Charts for numerical variables:


 Histograms and Box-plots for cross-sectional variables-
 Box plots and histograms are complementary ways of displaying the
distribution of a numerical variable.
 Side-by-side box plots are very useful for comparing two or more
populations.
 The whole purpose of time series graphs is to detect historical patterns in
the data.
 An outlier is literally a value or an entire observation (row) that lies well
outside of the norm.


2. Discuss in detail about Distribution of Data and Variables or Single
variable data analysis. (APR/MAY 2024)

 Distribution of Data and Variables


o Histograms.
o Frequency distribution.
o Box plots.
o Pie charts.

Single variable data analysis


 The common way to display single-variable data is in a table, other common
ways are:
o Histograms.
o Frequency distribution.
o Box plots.
o Pie charts.
 Histograms
 Histograms are one of the most commonly used graphs to show
frequency distribution.
 It is a graphical display of data using bars of different heights.
 The histogram groups numbers into ranges.
 It is an appropriate way to display single-variable data.

Figure 3.4 Histogram

 Frequency distribution
 A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class.

Frequency Distributions For Quantitative Data


 When observations are sorted into classes of single values, result is


referred to as a frequency distribution for ungrouped data, refer
Table 3.1
 Frequency distributions for ungrouped data are much more informative
when the number of possible values is less than about 20.
 Otherwise, if there are 20 or more possible values, use a frequency
distribution for grouped data.
Example
 The numbers of newspapers sold at a shop over the last 10 days are;
20, 20, 25, 23, 20, 18, 22, 20, 18, 22.
 This can be represented by frequency distribution.

Table 3.1

Papers sold Frequency


18 2
19 0
20 4
21 0
22 2
23 1
24 0
25 1
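
 A frequency table like Table 3.1 can also be produced directly in Python
(a minimal sketch using collections.Counter; not part of the original example):

from collections import Counter

papers_sold = [20, 20, 25, 23, 20, 18, 22, 20, 18, 22]
freq = Counter(papers_sold)
for value in range(min(papers_sold), max(papers_sold) + 1):
    print(value, freq.get(value, 0))   # papers sold, frequency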

Grouped Data
 Data are grouped into class intervals with 10 possible values each in Table
3.2.
 The bottom class includes the smallest observation (133), and the top class
includes the largest observation (245).
 The distance between bottom and top is occupied by an orderly series of
classes.
 The frequency ( f ) column shows the frequency of observations in each
class and, at the bottom, the total number of observations in all classes.
 Example – Refer Table 3.2


Table 3.2

Guidelines For Frequency Distributions


Essential
1. Each observation should be included in one, and only one, class.
2. List all classes, even those with zero frequencies.
3. All classes should have equal intervals.
Optional
4. All classes should have both an upper boundary and a lower boundary.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . .
10, particularly 5 and 10 or multiples of 5 and 10.
6. The lower boundary of each class interval should be a multiple of the
class interval.
7. Aim for a total of approximately 10 classes.

Example

Constructing Frequency Distributions – Table 3.3


1. Find the range, that is, the difference between the largest and smallest
observations.
Example- From Table 3.3:
The range of weights in the above table is 123 – 69 = 54.
2. Find the class interval required to span the range by dividing the
range by the desired number of classes (ordinarily 10).

Example: Class interval = 54/10=5.4


3. Round off to the nearest convenient interval (such as 1, 2, 3, . . .
10, particularly 5 or 10 or multiples of 5 or 10).
Example: nearest convenient interval = 5
4. Determine where the lowest class should begin. (Ordinarily, this
number should be a multiple of the class interval.)
Example: the smallest score is 69, and therefore the lowest class should
begin at 65, since 65 is a multiple of 5 (the class interval).
5. Determine where the lowest class should end by adding the class
interval to the lower boundary and then subtracting one unit of
measurement.
Example: Add 5 to 65 and then subtract 1, the unit of measurement,
to obtain 69—the number at which the lowest class should end.
6. Working upward, list as many equivalent classes as are required to
include the largest observation.
Example: 65-69, 70 – 74, . . . 120 - 124, so that the last class includes
123, the largest score.
7. Indicate with a tally the class in which each observation falls.
Example - the first score 160, produces a tally next to 160–169; the
next score, 193, produces a tally next to 190–199; and so on.
8. Replace the tally count for each class with a number—the
frequency (f)—and show the total of all frequencies. (Tally marks are
not usually shown in the final frequency distribution.)


Table 3.4

9. Supply headings for both columns and a title for the Table 3.4.
Types of Frequency distribution
1. Relative Frequency Distributions
• Relative frequency distributions shows the frequency of each class as a
part or fraction of the total frequency for the entire distribution.
Example – Table 3.5

2. Cumulative Frequency Distributions


 Cumulative frequency distributions show the total number of
observations in each class and in all lower-ranked classes.


Example – Table 3.6

Frequency Distributions For Qualitative (Nominal) Data


 When, among a set of observations, any single observation is a word, letter,
or numerical code, the data are qualitative.
Example:
Movie ratings reflect ordinal measurement because they can be ordered
from most to least restrictive: NC-17, R, PG-13, PG, and G.
The ratings of some films shown recently in San Francisco are as follows:

(a) Construct a frequency distribution.


(b) Convert to relative frequencies, expressed as percentages.
(c) Construct a cumulative frequency distribution.
(d) Find the approximate percentile rank for those films with a PG rating


 Pie charts
• Pie charts are types of graphs that display data as circular graphs in
Figure 3.5.
 They are represented in slices where each slice of the pie is relative to
the size of that category in the group as a whole.
 This means that the entire pie is 100%, and each slice is its proportional
value.
 Example
Assuming the data for pets ownership in Lincoln were collected as
follows, how would it be represented on a pie chart?
Dogs - 1110 people
Cats - 987 people
Rodents - 312 people
Reptiles - 97 people
Fish - 398 people

Figure 3.5 Pie chart representing data of pets in Lincoln

 Box plots
 Presenting data using the box plot gives a good graphical image of the
concentration of the data.
 It displays the five-number summary of a dataset; the minimum, first
quartile, median, third quartile, and maximum.
Example - - Refer Figure 3.6
The ages of 10 students in grade 12 were collected and they are as
follows.
15, 21, 19, 19, 17, 16, 17, 18, 19, 18.


First, arrange this from lowest to highest so the median can be


determined.
15, 16, 17, 17, 18, 18, 19, 19, 19, 21
Median = 18

In finding the quartiles, the first quartile is the median of the values to the
left of the overall median.
The median for 15, 16, 17, 17, 18 is 17

The third quartile will be the median to the right of the overall median.
Median for 18, 19, 19, 19, 21, will make 19.

The minimum number is 15, and also the maximum is 21.

Figure 3.6 Box plot representing students ages
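
 The five-number summary behind this box plot can be computed and drawn in
Python (a short sketch using numpy and matplotlib; numpy's default percentile
interpolation may differ slightly from the hand method for other data sets):

import numpy as np
import matplotlib.pyplot as plt

ages = [15, 21, 19, 19, 17, 16, 17, 18, 19, 18]
q1, median, q3 = np.percentile(ages, [25, 50, 75])
print(min(ages), q1, median, q3, max(ages))   # 15 17.0 18.0 19.0 21
plt.boxplot(ages, vert=False);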

3. Explain in detail about Numerical Summaries of Level and Spread.


 Numerical Summaries
 Measures of Location
o mean
o median
o mode
 Measures of Spread
o the range,
o the inter–quartile range
o the sample variance.
 Boxplots

 Numerical Summaries


 A numerical summary is a number used to describe a specific


characteristic about a data set.
 Measures of Location
 These are also referred to as measures of centrality or averages.
 There are three measures: the mean, the median and the mode.
Mean
 The mean is the average of all values. The sample mean is denoted by X̄
(pronounced “X-bar”); the population mean is denoted by μ (the Greek
letter mu).
The mean is found by adding all scores and then dividing by the number
of scores.
Mean = sum of all scores / number of scores

Example:
Find the mean for the following retirement ages: 60, 63, 45, 63, 65, 70,
55, 63, 60, 65, 63.
Answer - mean = 672/11 ≈ 61.09

Types of mean
 population mean
 sample mean
o Population is a complete set of scores.
o Sample is a subset of scores.
Sample Mean
 The mean is found by dividing the sum for the values of all scores in
the sample by the number of scores in the sample.
 Sample Size (n): The total number of scores in the sample.

X̄ = ΣX / n
 “X-bar equals the sum of the variable X divided by the sample size n.”
Population Mean (μ)
 The balance point for a population, found by dividing the sum for all


scores in the population by the number of scores in the population.


 Population Size (N) The total number of scores in the population.
 The population mean is represented by μ (pronounced “mu”):
μ = ΣX / N
where the uppercase letter N refers to the population size.

Median
 The median is the middle observation when the data are sorted from
smallest to largest. If the number of observations is odd, the median is
literally the middle observation. If the number of observations is even, the
median is usually defined as the average of the two middle observations

Mode
 The mode reflects the value of the most frequently occurring score.
Distributions with two obvious peaks, even though they are not exactly the
same height, are referred to as bimodal.
Distributions with more than two peaks are referred to as multimodal. Refer
Figure 3.7.

Figure 3.7 - Modes


Example:
Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
Answer - mode = 63

 Measures of Spread
 Defines variation in the data.


 There are three basic measures of spread:


 the range,
 the inter–quartile range
 the sample variance.
 Boxplots
The Range
 It is the difference between the largest and smallest observations.
The Inter–Quartile Range
o The inter–quartile range describes the range of the middle half of the
data and so is less prone to the influence of the extreme values.
o To calculate the inter–quartile range (IQR) simply divide the ordered
data into four quarters.
o The three values that split the data into these quarters are called the
quartiles.
o The first quartile (lower quartile, Q1) has 25% of the data below it; the
second quartile (median, Q2) has 50% of the data below it; and the third
quartile (upper quartile, Q3) has 75% of the data below it.

 The inter–quartile range is simply the difference between the upper and
lower quartiles, that is
IQR = Q3 − Q1
 The inter–quartile range is useful as it allows us to make comparisons
between the ranges of two data sets, without the problems caused by
outliers or uneven sample sizes.

The Sample Variance and Standard Deviation


 The variance is essentially the average of the squared deviations from the
mean.
 If all observations are close to the mean, their squared deviations from the
mean will be relatively small, and the variance will be relatively small.
 If at least a few of the observations are far from the mean, their squared
deviations from the mean will be large, and this will cause the variance to
be large.
 Standard deviation - A measure is the square root of variance.
 There are two versions of standard deviation.
o The sample standard deviation, denoted by s
o The population standard deviation, denoted by sigma.
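
 A short numpy sketch contrasting the population and sample versions (reusing
the retirement-age data from the earlier example; ddof=1 gives the sample
statistics):

import numpy as np

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
print(np.var(ages), np.std(ages))                   # population variance and standard deviation
print(np.var(ages, ddof=1), np.std(ages, ddof=1))   # sample variance and standard deviation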

 Box plots
 Presenting data using the box plot gives a good graphical image of the
concentration of the data.
 It displays the five-number summary of a dataset; the minimum, first
quartile, median, third quartile, and maximum.
Example – Refer 3.8
The ages of 10 students in grade 12 were collected and they are as follows.
15, 21, 19, 19, 17, 16, 17, 18, 19, 18.

First, arrange this from lowest to highest so the median can be


determined.


15, 16, 17, 17, 18, 18, 19, 19, 19, 21


Median = 18

In finding the quartiles, the first quartile is the median of the values to the
left of the overall median.
The median for 15, 16, 17, 17, 18 is 17

The third quartile will be the median to the right of the overall median.
Median for 18, 19, 19, 19, 21, will make 19.

The minimum number is 15, and also the maximum is 21.

Figure 3.8 Box plot representing students ages

4. Explain in detail about Scaling/Normalization and Standardizing. Or
What is scaling and standardization? When and why to standardize a
variable? Illustrate with suitable example. (NOV/DEC 2023) (APR/MAY 2024)

 Scaling Or Normalization
 Scaling is changing the range of the distribution of the data without
changing the data itself.
 It transforms the data so that it fits within a specific scale, like 0-100 or
0-1, by adding or subtracting a constant, or by multiplying or dividing by a
constant.
 Scaling is generally performed during the data pre-processing step and also
helps in speeding up the calculations in an algorithm.

 Some Common Types of Scaling:


1. Simple Feature Scaling:


 This method simply divides each value by the maximum value for that
feature: x_new = x / x_max.
 The resultant values are in the range between zero (0) and one (1).

2. Min-Max Scaling:

 Min-max scaling, also known as min-max normalization or Rescaling,


is the simplest method of scaling.
 It transforms the data so that all the values lie between 0 and 1.
 This is done by subtracting the minimum value from all the data points
and then dividing by the range (maximum value minus minimum
value).
 The resulting data will have a minimum value of 0 and a maximum
value of 1.
 This method is quick and easy to implement, but it can be distorted if
there are outliers in the data.

3. Mean Normalization:
 The point of normalization is to change the observations so that they can
be described as a normal distribution.
 Normal distribution (Gaussian distribution), also known as the bell
curve, is a specific statistical distribution where a roughly equal
observations fall above and below the mean.
 It varies between -1 to 1 with mean = 0.


4. Z-Score
 Z- Score transforms the data so that the mean is 0 and the standard
deviation is 1.
 This is done by subtracting the mean from all the data points and then
dividing it by the standard deviation.
 The resulting data will have a mean of 0 and a standard deviation of 1.
 This method is less affected by outliers than min-max scaling, but it can
still be distorted if there are a lot of outliers.

5. Decimal scaling,
 Also known as Unit Norm, is a less common method of scaling.
 It transforms the data so that all the values have a maximum absolute
value of 1.
 This is done by dividing all the data points by the absolute value of the
maximum value.
 The resulting data will have a maximum absolute value of 1.
 This method is not affected by outliers, but it can be distorted if there are
a lot of small values.
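
 The scaling formulas described above can be written directly with numpy (a
minimal sketch, not from the original notes; x is an arbitrary numeric array):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

simple   = x / x.max()                            # simple feature scaling: x / max
min_max  = (x - x.min()) / (x.max() - x.min())    # min-max scaling: values in [0, 1]
mean_nrm = (x - x.mean()) / (x.max() - x.min())   # mean normalization: mean 0, roughly [-1, 1]
z_score  = (x - x.mean()) / x.std()               # z-score: mean 0, standard deviation 1
decimal  = x / np.abs(x).max()                    # decimal scaling variant described above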

 Importance of data scaling


 Data scaling is a critical step in data preparation and data wrangling.
 The goal is to transform the data so that it can be more easily analyzed and
compared.
 This can be done to make patterns more visible, to reduce the impact of
outliers, or to transform the data into a more suitable form for a particular
analysis or machine learning algorithm.
 Data scaling improves the accuracy of machine learning algorithms, make
patterns more visible, and make it easier to compare data sets.
 It is a critical step in data preparation and data wrangling.


 Scaling in Python
 Scaling can be done using sklearn.preprocessing module in python.
 Scaling data:
from sklearn.preprocessing import MinMaxScaler

# data is an existing array or DataFrame of numeric features
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

 Ways to Choose Scaling method


 Depending on the type of data
o If the data is categorical, then min-max scaling or z-score normalization
are not applicable.
o If the data is continuous but skewed, min-max scaling could compress
the data too much.
 Depending on the distribution of the data
o If the data is normally distributed, then z-score normalization is typically
the best option.
o If the data is skewed, then min-max scaling or logarithmic
transformation may be more effective.
 Depending on the goal of scaling
o If the goal is to equalize the range, then min-max scaling is the best
option.
o If the goal is to standardize the variance, then z-score normalization is
usually the best choice.
o If the goal is to make the data more interpretable, then logarithmic
transformation is often the best option.
 Standardization
Data Standardization Definition
 Data standardization is an important technique that is mostly
performed as a pre-processing step before inputting data into many
machine learning models, to standardize the range of features of an
input data set.
 Data standardization is applied when features of the input data set have
large differences between their ranges, or simply when they are
measured in different units (e.g., pounds, meters, miles, etc.).


How to Standardize Data


Z-SCORE
 Z-score is the most popular method to standardize data, and can be
done by subtracting the mean and dividing by the standard deviation
for each value of each feature: z = (x − μ) / σ.

 Once the standardization is done, all the features will have a mean of
zero and a standard deviation of one, and thus, the same scale.

Python code for Standardization.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative features measured in different units
df = pd.DataFrame({"height_m": [1.5, 1.7, 1.8, 1.6],
                   "weight_kg": [55.0, 70.0, 90.0, 62.0]})

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())

When to standardize data


1. Before Principal Component Analysis (PCA)
 In principal component analysis, features with high variances or
wide ranges get more weight than those with low variances, and
consequently, they end up illegitimately dominating the first
principal components (components with maximum variance).
 Standardization can prevent this, by giving the same weightage to
all features.
2. Before Clustering
 Clustering models are distance-based algorithms.
 In order to measure similarities between observations and form
clusters a distance metric is used.
 So, features with high ranges will have a bigger influence on the
clustering.
 Therefore, standardization is required before building a clustering


model.
3. Before K-Nearest Neighbors (KNN)
 k-nearest neighbors is a distance-based classifier that classifies
new observations based on similar measures (e.g., distance
metrics) with labeled observations of the training set.
 Standardization makes all variables contribute equally to the
similarity measures.
4. Before Support Vector Machine (SVM)
 Support vector machine tries to maximize the distance between the
separating plane and the support vectors.
 If one feature has very large values, it will dominate over other
features when calculating the distance.
 Standardization gives all features the same influence on the
distance metric.
5. Before measuring variable importance in regression models
 Variable importance in regression analysis can be measured by fitting
a regression model using the standardized independent variables
and comparing the absolute values of their standardized coefficients.
 But, if the independent variables are not standardized, comparing
their coefficients becomes meaningless.
6. Before Lasso And Ridge Regressions
 Lasso and ridge regressions place a penalty on the magnitude of the
coefficients associated with each variable, and the scale of variables
will affect how much of a penalty will be applied on their coefficients.
 Coefficients of variables with a large variance are small and thus
less penalized.
 Therefore, standardization is required before fitting both
regressions.

Cases when standardization is not needed


 Logistic regressions and tree-based models
Logistic regressions and tree-based algorithms such as decision
trees, random forests and gradient boosting are not sensitive to the
magnitude of variables. So standardization is not needed before fitting
these kinds of models.


Benefits of Data Standardization


 Data standardization transforms data into a consistent, standard format,
making it easier to understand and use — especially for machine learning
models and across different computer systems and databases.
 Standardized data can facilitate data processing and storage tasks, as well
as improve accuracy during data analysis.
 It can reduce costs and save time for businesses.

Scaling/Normalization Vs. Standardization


 Normalization scales the data using the minimum and maximum values, whereas
standardization scales the data using the mean and standard deviation.
 Normalization is useful when features are on different scales, whereas
standardization is useful when a variable needs to be centred at mean 0 with a
standard deviation of 1.
 Normalized values fall within [0, 1] or [-1, 1], whereas standardized values are
not constrained to a particular range.
 Normalization is also known as min-max scaling, whereas standardization is also
called Z-score normalization.
 Normalization is helpful when the feature distribution is unknown, whereas
standardization is helpful when the feature distribution is Gaussian.

5. Explain in detail about Smoothing Techniques for time series data
with suitable examples. (NOV/DEC 2023), (Apr/May 2025)
 Basic Terminologies
 Time-series: It’s a sequence of data points taken at successive equally
spaced points in time.
Examples of time-series data:
o Number of weekly infections of a disease
o Rainfall per day
o Population growth per year
 Level: The average value in the series.
 Trend: The increasing or decreasing value in the series.
 Seasonality: The repeating short-term cycle in the series.

 Symbols used:


 Smoothing Techniques
 Smoothing techniques are kinds of data preprocessing techniques to
remove noise from a data set.
 Data smoothing refers to a statistical approach of eliminating outliers
from datasets to make the patterns more noticeable.
 It can identify simplified changes to help predict different trends and
patterns.
 Data smoothing can help in identifying trends in businesses, financial
securities, and the economy.

 Major smoothing techniques


1. Moving average smoothing
2. Exponential smoothing
3. Double exponential smoothing
4. Triple exponential smoothing
5. Random walk smoothing

1. Moving average smoothing


 It is a simple and common type of smoothing used in time series analysis
and forecasting.


 The time series is derived from the average of last kth elements of the
series.

Figure 3.9 Moving average smoothing

 In the figure 3.9, smaller values of k lead to more variation in the


result, and a larger value of k leads to more smoothness.

Sample Python Code


import numpy as np

def moving_average_smoothing(X, k):
    # Smooth each point with the average of the last k observations
    S = np.zeros(X.shape[0])
    for t in range(X.shape[0]):
        if t < k:
            S[t] = np.mean(X[:t + 1])      # not enough history yet: average everything so far
        else:
            S[t] = np.sum(X[t - k:t]) / k
    return S
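
As a rough alternative sketch, the same backward moving average, and the centered variant discussed below, can also be computed with pandas' rolling windows; the series values here are purely illustrative.

import pandas as pd

y = pd.Series([3, 5, 4, 6, 8, 7, 9, 12, 10, 11])      # illustrative time series

backward = y.rolling(window=3).mean()                  # simple (backward) moving average
centered = y.rolling(window=3, center=True).mean()     # centered moving average

print(pd.DataFrame({"y": y, "backward": backward, "centered": centered}))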

Types of Moving Average – Refer Figure 3.10


 Backward moving average
o The Backward moving average method (also called simple moving
average) is a widely used and simple smoothing method that
smooths each value by taking the average of the value and all
previous values within the time window.


o An advantage of this method is that it can be immediately performed


on streaming data; as a new value is recorded, it can be immediately
smoothed using previous data in the time series.
o However, this method has the drawback that the value being
smoothed is not in the center of the time window, so all information
comes from only one side of the value.
o This can lead to unexpected results if the trends of the data are not
the same on each side of the value being smoothed.
 Forward moving average
o The Forward moving average method is analogous to backward
moving average, but the smoothed value is instead the average of
the value and all subsequent values within the time window.
o It has the analogous drawback that all information used for
smoothing comes from one side of the value.
 Centered moving average
o The Centered moving average method smooths each value by
averaging within the time window, where the value being smoothed
is at the center of the window.
o For this method, the time window is split so that half of the window
is used before the time of the value being smoothed, and half of the
window is used after.
o This method has the advantage of using information before and after
the time of the value being smoothed, so it is usually more stable
and has smaller bias.
 Adaptive bandwidth local linear regression method
o The Adaptive bandwidth local linear regression method (also called
Friedman’s super smoother) smooths values using a centered time
window and fitting linear regression (straight line) models to the data
in multiple time windows.
o The length of the time windows can change for each value, so some
sections of the time series will use wider windows to include more
information in the model.
o This method has the advantage that the time window does not need
to be provided and can be estimated by the tool.
o It is also the method best suited to model data with complex trends.


Figure 3.10 Types of Moving average smoothing

2. Single Exponential smoothing


 Exponential smoothing is a weighted moving average technique.
 In moving average smoothing the past observations are weighted equally,
whereas here smoothing is done by assigning exponentially decreasing
weights to the past observations. Refer Figure 3.11.

Figure 3.11 Exponential smoothing

Sample Python Code


import numpy as np

def exponential_smoothing(X, α):
    # S[t] = α*X[t] + (1 - α)*S[t-1]: recent observations receive exponentially larger weights
    S = np.zeros(X.shape[0])
    S[0] = X[0]
    for t in range(1, X.shape[0]):
        S[t] = α * X[t] + (1 - α) * S[t - 1]
    return S

3. Double exponential smoothing


 Single Smoothing does not excel in the data when there is a trend.
 This situation can be improved by the introduction of a second equation
with a second constant β.
 It is suitable to model the time series with the trend but without
seasonality.

 Here it is seen that α is used for smoothing the level and β is used for
smoothing the trend. Refer Figure 3.12.

Figure 3.12 Double exponential smoothing

Sample Python Code


def double_exponential_smoothing(X, α, β):
    # A holds the smoothed level, B the smoothed trend, S the combined output
    S, A, B = (np.zeros(X.shape[0]) for i in range(3))
    S[0] = A[0] = X[0]
    B[0] = X[1] - X[0]
    for t in range(1, X.shape[0]):
        A[t] = α * X[t] + (1 - α) * S[t - 1]
        B[t] = β * (A[t] - A[t - 1]) + (1 - β) * B[t - 1]
        S[t] = A[t] + B[t]
    return S

4. Triple exponential smoothing


 It is also called Holt-Winters exponential smoothing. It is used to
handle time series data containing a seasonal component.
 Double exponential smoothing will not work when the data contain
seasonality, so a third equation is introduced to smooth the seasonal
component.

 In the above ϕ is the damping constant. α, β, and γ must be estimated


in such a way that the MSE(Mean Square Error) of the error is
minimized. Refer Figure 3.13.

Sample Python Code


def triple_exponential_smoothing(X, L, α, β, γ, ϕ):
    # L is the length of the seasonal cycle and ϕ is the damping constant
    def sig_phi(ϕ, m):
        return np.sum(np.array([np.power(ϕ, i) for i in range(m + 1)]))

    C, S, B, F = (np.zeros(X.shape[0]) for i in range(4))
    S[0], F[0] = X[0], X[0]
    B[0] = np.mean(X[L:2 * L] - X[:L]) / L      # initial trend estimated from the first two seasons
    m = 12
    phi_m = sig_phi(ϕ, m)
    for t in range(1, X.shape[0]):
        S[t] = α * (X[t] - C[t % L]) + (1 - α) * (S[t - 1] + ϕ * B[t - 1])
        B[t] = β * (S[t] - S[t - 1]) + (1 - β) * ϕ * B[t - 1]
        C[t % L] = γ * (X[t] - S[t]) + (1 - γ) * C[t % L]   # seasonal component update
        F[t] = S[t] + phi_m * B[t]                           # damped forecast m steps ahead
    return S

Figure 3.13 Triple exponential smoothing

5. Random Walk
 The random walk data smoothing method is commonly used for
describing the patterns in financial instruments.
 Some investors think that the past movement in the price of a
security and the future movements cannot be related.
 They use the random walk method, which assumes that a random
variable will give the potential data points when added to the last
accessible data point.

 Advantages and Disadvantages of Data Smoothing


Pros
 Helps identify real trends by eliminating noise from the data
 Allows for seasonal adjustments of economic data
 Easily achieved through several techniques including moving averages
Cons
 Removing data always comes with less information to analyze,
increasing the risk of errors in analysis
 Smoothing may ignore outliers that may be meaningful.


6. When it comes to the process of calculating numerical summaries, discuss
the role that statistical software plays. What are the ways in which
applications such as R, Python, or Excel make this procedure easier?
(Apr/May 2025)

Statistical software plays a crucial role in the process of calculating numerical


summaries by automating, simplifying, and enhancing the accuracy of computations.
Here's a detailed discussion on its role and how tools like R, Python, and Excel help
make this process easier:

Role of Statistical Software in Calculating Numerical Summaries

1. Automation and Speed


o Statistical software can quickly perform large and complex calculations
(e.g., mean, median, standard deviation, quartiles) that would be time-
consuming and error-prone if done manually.
2. Accuracy
o Software eliminates human error by using precise algorithms for
numerical computations.
3. Consistency
o Once the code or formula is written, it can be reused to generate
consistent summaries for different datasets.
4. Data Handling
o These tools can handle large datasets easily, performing operations that
are not feasible manually or with basic calculators.
5. Visualization Integration
o They often integrate numerical summaries with graphical tools (e.g.,
boxplots, histograms) for better data interpretation.

How R, Python, and Excel Simplify Numerical Summary Calculation

Excel

 Built-in Functions:
Functions like AVERAGE(), MEDIAN(), STDEV.P(), MODE(), and QUARTILE()
simplify the computation.


 Pivot Tables:
Useful for summarizing data by category, showing count, sum, average, etc.
 Data Analysis ToolPak:
Add-in that provides advanced statistical analysis like descriptive statistics
in a few clicks.

Python (with libraries like Pandas and NumPy)

 Code Example:

python
import pandas as pd
data = pd.Series([10, 20, 30, 40, 50])
print(data.mean(), data.median(), data.std(), data.min(), data.max())

 Advantages:
o Easily handles large datasets.
o Reproducibility: code can be reused and shared.
o Integration with visualization tools (e.g., Matplotlib, Seaborn).

R (Designed for Statistical Analysis)

 Built-in functions:

R
data <- c(10, 20, 30, 40, 50)
mean(data); median(data); sd(data); summary(data)

 Statistical Packages:
Rich ecosystem of packages like dplyr, ggplot2, and psych for advanced
summaries and visualizations.
 Data Frames:
Easily manipulate and summarize tabular data.


Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai,


Accredited by National Board of Accreditation (NBA), Accredited by NAAC with “A” Grade &
Accredited by TATA Consultancy Services (TCS), Chennai)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


AD3301 DATA EXPLORATION AND VISUALIZATION

SYLLABUS

UNIT IV

BIVARIATE ANALYSIS

SYLLABUS:
Relationships between Two Variables - Percentage Tables - Analyzing
Contingency Tables - Handling Several Batches - Scatterplots and Resistant
Lines – Transformations.

PART A
1. What is bivariate analysis?
 Bivariate analysis is one of the statistical analysis where two variables are
observed.
 One variable here is dependent while the other is independent.
 These variables are usually denoted by X and Y.
2. What is causal path model?
 Causal reasoning is done by the construction of a schematic model of the
hypothesized causes and effects: a causal path model.


 The variables are represented inside boxes or circles and labeled;


 Arrows run from the variables that are the causes to those that are the effects;
3. What are Proportions, Percentages and Probabilities?
 Proportion – the number in each category is divided by the total number of
cases N.
 Percentages – Proportions multiplied by 100.
 Probabilities – Represent the relative size of different subgroups in a
population.

4. Define a Contingency Table.


 A contingency table is a tabular representation of categorical data.
 Contingency table is similar to the three-dimensional bar chart.
 Contingent defines as 'true only under existing or specified conditions'.
 A contingency table displays frequencies for combinations of two categorical
variables.
 Analysts also refer to contingency tables as cross tabulation and two-way
tables.
 A contingency table shows the distribution of each variable conditional upon
each category of the other.

5. What is a percentage table?


 The commonest way to make contingency tables readable is to cast them in
percentage form.
 There are three different ways in which this can be done.
o Total percentages - The table was constructed by dividing each cell
frequency by the grand total.
o Row percentages - The table was constructed by dividing each cell
frequency by its appropriate row total. Tables that are constructed by
percentaging the rows are usually read down the columns. This is called
an 'outflow' table.
o Column percentages - The table was constructed by dividing each cell
frequency by its appropriate column total. This is called an 'inflow' table.

6. What are the guidelines for a well designed table?


• Labeling
• Sources
• Sample Data
• Missing Data
• Opinion Data
• Layout

7. What is a Chi-square test?


 Contingency tables can be used to perform a Chi-square test to determine
whether there is a significant association between the two variables.


 Calculate the Chi-square statistic using the formula:

Χ2 = Σ Σ (Oij − Eij)2 / Eij, summed over all rows i = 1..n and columns j = 1..m

Where:
• Oij is the observed frequency for the combination of categories i and j
• Eij is the expected frequency for the combination of categories i and j
• n is the number of rows • m is the number of columns

8. Define degree of freedom.


 Degrees of freedom are the number of independent values that a statistical
analysis can estimate.

9. What is a T- test?
 A T-test is the final statistical measure for determining differences between
two means that may or may not be related.
 It is a statistical method in which samples are chosen randomly, and there is
no perfect normal distribution.

10. What is a resistant line?


Resistant Line
 To explore paired data and to suspect a relationship between X and Y, the
focus is on how to fit a line to data in a “resistant” fashion, so the fit is
relatively insensitive to extreme points.


Fitting a Resistant Line - Line fitting involves joining two typical points.
 The X-axis is roughly divided into three parts
 X values are ordered along with its corresponding Y values.
 Divide X axis to three approximately equal length
 The left and right should be balanced with equal number of data
points
 Conditional Summary points for X and Y are found.
 A Line is drawn connecting a Left and Right Summary points

11. What is meant by log transformation?


 One method for transforming data or re-expressing the scale of
measurement is to take the logarithm of each data point.
 This keeps all the data points in the same order but stretches or shrinks
the scale by varying amounts at different points.

12. What are the goals of transformation?


 Data batches can be made more symmetrical.
 The shape of data batches can be made more Gaussian.
 Outliers that arise simply from the skewness of the distribution can be
removed, and previously hidden outliers may be forced into view'.
 Multiple batches can be made to have more similar spreads.
 Linear, additive models may be fitted to the data.

13. Define and list the ladder of powers.


The ladder of powers
 It is a family of power transformations that can help promote symmetry and
sometimes Gaussian shape in many different data batches.


Going up the ladder of powers corrects downward straggle, whereas going


down corrects upward straggle.
14. Name the two main types of statistical testing in bivariate analysis.
(NOV/DEC 2023)

 Chi-square test

 T-test

15. Is bivariate qualitative or quantitative. (NOV/DEC 2023)

 Bivariate analysis is one type of quantitative analysis. It determines where


two variables are related.

16. How do you find the correlation of a scatter plot?

 Visual Inspection (Quick Estimate)

 Use Correlation Coefficient (Numerical Method)

 Interpret the Correlation Coefficient (r)

 Use Software Tools (Optional for Accuracy)


17. In what scenarios would you prefer using a percentage table over a bar
chart? (APR/MAY 2025)

• Precise Numerical Values Needed

• Comparing Multiple Categories Across Several Groups

• Small Differences Between Values

• Detailed Breakdown and Multiple Metrics

• Formal Reports or Documents

• Limited Visual Emphasis Needed.

18. In order to evaluate the connection between two variables, what are some
ways that scatter plots might be utilized? (APR/MAY 2025)

• Visualizing Correlation

• Identifying the Strength of the Relationship

• Detecting Outliers

• Understanding the Form of the Relationship

• Comparing Groups

• Checking for Clusters or Subgroups


PART B

1. Explain in detail about Relationships between Two Variables (bivariate


relationships)
 Bivariate Analysis is used to explore the relationship between 2 different
variables.
 In bivariate relationships, one variable can be considered a cause and the
other an effect.
 The variable that is presumed to be the cause is called the explanatory variable
(denoted X) and the one that is presumed to be the effect is called the response
variable (denoted Y); they are also termed the independent and dependent
variables respectively.
 Causal reasoning is often assisted by the construction of a schematic model
of the hypothesized causes and effects: a causal path model.

 The variables are represented inside boxes or circles and labeled;


 Arrows run from the variables that are the causes to those that are the effects;
 Positive effects are drawn as unbroken lines and negative effects are drawn as
dashed lines.
 A number is placed on the arrow to denote how strong the effect of the
explanatory variable is.
 An extra arrow is included as an effect on the response variable, often
unlabeled, representing all the other causes of the response that are not
included in the model.

2. Explain in detail about Percentage Tables and Contingency Table.


(APR /MAY 2024) (APR/MAY 2025)
CONTINGENCY TABLE
 A contingency table is a tabular representation of categorical data.
 Contingency table is similar to the three-dimensional bar chart.
 Contingent defines as 'true only under existing or specified conditions'.
 A contingency table displays frequencies for combinations of two categorical
variables.


 Analysts also refer to contingency tables as cross tabulation and two-way


tables.
 A contingency table shows the distribution of each variable conditional upon
each category of the other.
 Contingency tables classify outcomes for one variable in rows and the other in
columns.
 The values at the row and column intersections are frequencies for each
unique combination of the two variables.
 Each individual case is then tallied in the appropriate cell depending on its
value on both variables, and the number of cases in each cell is called the cell
frequency.
 Each row and column can have a total presented at the right-hand end and at
the bottom respectively; these are called the marginals, and the univariate
distributions can be obtained from the marginal distributions.
 Figure 4.1 shows a schematic contingency table with four rows and four
columns (a four-by-four table).

Figure 4.1 - Anatomy of a contingency table.

Finding Relationships in a Contingency Table


 In the contingency table below, the two categorical variables are gender and
ice cream flavor preference.
 Below table 4.1 is a two-way table (2 X 3) where each cell represents the
number of times males and females prefer a particular ice cream flavor.
Table 4.1 - A two-way table (2 X 3)


 The contingency table 4.2 below uses the same raw data as the previous
table and displays both row and column percentages.

Table 4.2 – Contingency Table

Graph a Contingency Table


 Use bar charts to display a contingency table.
 The following figure 4.2 shows the row percentages for the previous two-
way table.

Figure 4.2 - Clustered bar chart


PERCENTAGE TABLES
 The commonest way to make contingency tables readable is to cast them in
percentage form.
 There are three different ways in which this can be done.
 Total percentages - The table was constructed by dividing each cell frequency
by the grand total.
 Row percentages - The table was constructed by dividing each cell frequency
by its appropriate row total. Tables that are constructed by percentaging the
rows are usually read down the columns. This is called an 'outflow' table.
 Column percentages - The table was constructed by dividing each cell
frequency by its appropriate column total. This is called an 'inflow' table.
 The inflow and outflow tables focus attention on the data in different ways,
and the researcher have to be clear about what questions were being
addressed in the analysis to inform which way the percentages were
calculated.

GOOD TABLE MANNERS


 Labeling
 A clear title should summarize the contents.
 It should be as short as possible, while at the same time making clear when
the data were collected, the geographical unit covered, and the unit of
analysis.
 Other parts of a table also need clear, informative labels.
 The variables included in the rows and columns must be clearly identified.
 Sources
 The reader needs to be told the source of the data.
 It is not good enough to say that it was from Social Trends. The volume and
year, and either the table or page, and sometimes even the column in a
complex table must be included.
 When the data are first collected from a published source, all these things
should be recorded, or a return trip to the library will be needed.
 Sample data
 If data are based on a sample drawn from a wider population, it always
needs special referencing.
 The reader must be given enough information to assess the adequacy of
the sample.
 The following details should be available somewhere:


o The method of sampling


o the achieved sample size,
o the response rate or refusal rate,
o the geographical area which the sample covers
o the frame from which it was drawn.
 Missing data
 Don't exclude cases from analysis, miss out particular categories of a
variable or ignore particular attitudinal items in a set without good
reason and without telling the reader what you are doing and why.
 Definitions
 There can be no hard and fast rule about how much definitional
information to include in the tables.
 They could become unreadable if too much were included.
 Opinion data
 When presenting opinion data, always give the exact wording of the
question put to respondents, including the response categories if these
were read out.
 Ensuring frequencies can be reconstructed
 It should always be possible to convert a percentage table back into the
raw cell frequencies.
 To retain the clarity of a percentage table, present the minimum number
of base Ns needed for the entire frequency table to be reconstructed.
 Showing which way the percentages run
 Proportions add up to 1 and percentages add up to 100.
 Layout
 The effective use of space and grid lines can make the difference between
a table that is easy to read and one which is not.
 Avoid underlining words or numbers.
 Clarity is often increased by reordering either the rows or the columns.
1. Closer figures are easier to compare;
2. Comparisons are more easily made down a column;
3. A variable with more than three categories is best put in the rows so
that there is plenty of room for category labels

3. Explain in detail about the Analysis of Contingency Tables with example.


Or How do you analyze contingency tables? Give examples. (NOV/DEC
2023)

Contingency Tables Analysis:


 Contingency tables analysis is a central branch of categorical data analysis
and is focused on the analysis of data represented as contingency tables.
 Contingency tables analysis is widely used in marketing research, in
biomedical research, including drug trials, and in social sciences.


 A contingency table is a tool used to summarize and analyze the relationship


between two categorical variables.
 It is a type of cross-tabulation that displays the frequencies or counts of the
combinations of categories for the two variables.
 To create a contingency table, the following steps are typically followed:
 Identify the two categorical variables to be analyzed.
 Collect and summarize the data. Count the number of observations in
each combination of categories for the two variables.
 Organize the data in a table with the categories of the first variable listed
along the rows and the categories of the second variable listed along the
columns.
 Enter the counts or frequencies in the cells of the table.
Chi-square test of independence hypotheses
 Contingency tables can be used to perform a Chi-square test to determine
whether there is a significant association between the two variables.
 To do this, the following steps are typically followed:
 Calculate the expected frequencies for each combination of categories
using the formula:

Eij = (Ri × Cj) / n

 Where:
• Eij is the expected frequency for the combination of categories i and j
• Ri is the row total for category i
• Cj is the column total for category j
• n is the total sample size
 Calculate the Chi-square statistic using the formula:

Χ2 = Σ Σ (Oij − Eij)2 / Eij

 Where:
• Oij is the observed frequency for the combination of categories i and j
• Eij is the expected frequency for the combination of categories i and j
• n is the number of rows • m is the number of columns
 Determine the critical value of the Chi-square statistic based on the
significance level (alpha) of the test and the degrees of freedom.
 The degrees of freedom (df): For a chi-square test of independence, the
df is (number of variable 1 groups − 1) * (number of variable 2 groups −
1).


 Significance level (α): By convention, the significance level is usually


.05.
 Compare the calculated Chi-square statistic to the critical value to
determine whether to reject or fail to reject the null hypothesis.
o Null hypothesis (H0): Variable 1 and variable 2 are not related in the
population; The proportions of variable 1 are the same for different
values of variable 2.
o Alternative hypothesis (Ha): Variable 1 and variable 2 are related in the
population; The proportions of variable 1 are not the same for different
values of variable 2.
 If the Χ2 value is greater than the critical value, then the difference
between the observed and expected distributions is statistically
significant (p < α).
The data allows to reject the null hypothesis that the variables are
unrelated and provides support for the alternative hypothesis that the
variables are related.
 If the Χ2 value is less than the critical value, then the difference between
the observed and expected distributions is not statistically significant
(p > α).
The data doesn’t allow to reject the null hypothesis that the variables
are unrelated and doesn’t provide support for the alternative hypothesis
that the variables are related.

 Contingency tables are also used to analyze the relationship between a


categorical variable and a continuous variable.
 In this case, the continuous variable is typically grouped into intervals or
categories, and a contingency table is created to summarize the frequencies or
counts for each combination of categories.

Example
Six months after the intervention, the city looks at the outcomes for the 300
households (only four households are shown here):
Observed Frequency

Intervention Recycles Does not recycle Row totals


Flyer (pamphlet) 89 9 98
Phone call 84 8 92
Control 86 24 110
Column totals 259 41 N = 300


Expected Frequency (calculated as E = row total × column total / N)

Intervention Recycles Does not recycle Row totals
Flyer (pamphlet) 84.61 13.39 98
Phone call 79.43 12.57 92
Control 94.97 15.03 110
Column totals 259 41 N = 300


Follow these five steps to calculate the test statistic:


Step 1: Create a table
Create a table with the observed and expected frequencies in two
columns.

Step 2: Calculate O − E
In a new column called “O − E”, subtract the expected frequencies
from the observed frequencies.


Step 3: Calculate (O – E)2


In a new column called “(O − E)2”, square the values in the previous
column.

Step 4: Calculate (O − E)2 / E


In a final column called “(O − E)2 / E”, divide the previous column by
the expected frequencies.


Step 5: Calculate Χ2
Finally, add up the values of the previous column to calculate the chi-
square test statistic (Χ2).

Example: Finding the critical chi-square value


 Since there are three intervention groups (flyer, phone call, and control)
and two outcome groups (recycle and does not recycle) there are (3 − 1) *
(2 − 1) = 2 degrees of freedom.
 For a test of significance at α = .05 and df = 2, the Χ2 critical value is
5.99.

Compare the chi-square value to the critical value


 Example: Comparing the chi-square value to the critical value
 Χ2 = 9.79 Critical value = 5.99
 The Χ2 value is greater than the critical value.

Decide whether to reject the null hypothesis


 The Χ2 value is greater than the critical value.
 Therefore, the city rejects the null hypothesis that whether a household
recycles and the type of intervention they receive are unrelated.
 The city concludes that their interventions have an effect on whether
households choose to recycle.
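
The whole test can be reproduced in a few lines, assuming the SciPy library is available; the observed counts are the ones from the recycling table above.

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[89,  9],     # flyer
                     [84,  8],     # phone call
                     [86, 24]])    # control

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)        # chi-square statistic (about 9.79), p-value and degrees of freedom (2)
print(expected.round(2))   # expected frequencies under the null hypothesis of independence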

4. Explain the methods used in Handling Several Batches or communicating


to a group of people.
Boxplot
 It is important to display data well when communicating it to others.
 The boxplot is a device for conveying the information in the five number
summaries economically and effectively.
 Refer Figure 4.3 for the Anatomy of the data.


Figure 4.3 – Anatomy of Box plot

 The middle 50 per cent of the distribution is represented by a box.


 The median is shown as a line dividing that box.
 Whiskers are drawn connecting the box to the end of the main body of the
data.
 They extend to the adjacent values, the data points which come nearest to the
inner fence while still being inside or on them.
 Outliers are points that are unusually distant from the rest of the data.
 Then the points beyond which the outliers fall (the inner fences) and the points
beyond which the far outliers fall (the outer fences) are identified; inner fences
lie one step beyond the quartiles and outer fences lie two steps beyond the
quartiles.

Outlier
 A data set may contain points which are a lot higher or lower than the main
body of the data. These are called outliers.
 Reasons for Outliers


o They may just result from a fluke of the particular sample that was drawn.
o They may arise through measurement or transcription errors, which can
occur in official statistics as well as anywhere else.
o They may occur because the whole distribution is strongly skewed.
o These particular data points do not really belong substantively to the same
data batch.

Multiple boxplots
 Boxplots, laid out side by side, permit comparisons to be made with ease.
 The standard four features of each region's distribution can now be compared:
o The level
o The spread
o The Shape
o Outliers
Example: Refer Figure 4.4

Figure 4.4 – Example of Multiple Box plot
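
A minimal matplotlib sketch of side-by-side boxplots is shown below; the three regional batches are randomly generated purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
batches = [rng.normal(50, 5, 100),     # three illustrative regional batches
           rng.normal(55, 8, 100),
           rng.normal(48, 3, 100)]

plt.boxplot(batches)                                   # one box-and-whisker per batch
plt.xticks([1, 2, 3], ["North", "Midlands", "South"])
plt.ylabel("Value")
plt.show()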


T-Test
o A T-test is the final statistical measure for determining differences between
two means that may or may not be related.
o It is a statistical method in which samples are chosen randomly, and there
is no perfect normal distribution.
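
A minimal sketch of a two-sample T-test, assuming SciPy is available; the two groups are randomly generated for illustration.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, 40)          # two illustrative samples
group_b = rng.normal(53, 5, 40)

t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)                   # a small p-value suggests the two means differ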

5. Discuss in detail about Scatterplots and Linear Relationships. Or Discuss


the best Practices for Designing Scatter Plots. (NOV/DEC 2023)
(APR/MAY 2025)
Scatterplots
 To depict information about the values of two interval-level variables at
once, each case is plotted on a graph known as a scatterplot.
 A scatterplot has two axes: a vertical axis Y and a horizontal axis X.
 The cause variable (explanatory variable) is placed on the X axis and the effect
variable (response variable) is placed on the Y axis.
 Scatterplots depict bivariate relationships.
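
A minimal matplotlib sketch of a scatterplot, with the explanatory variable on X and the response on Y; the data points are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([5.3, 6.1, 7.4, 8.0, 9.2, 10.5])         # explanatory variable (illustrative)
y = np.array([19.8, 21.0, 24.5, 26.1, 30.2, 33.0])    # response variable (illustrative)

plt.scatter(x, y)
plt.xlabel("Explanatory variable (X)")
plt.ylabel("Response variable (Y)")
plt.show()

print(np.corrcoef(x, y)[0, 1])    # Pearson correlation coefficient for the plotted points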


Figure 4.5 – Scatterplot – Monotonic and Positive Relationship


Example
 The data in Table 4.1 relate to the percentage of households that are
headed by a lone parent and contain dependent children, and the
percentage of households that have no van or car.
Table 4.1 Lone parent households and households with no car or van,
% by region.

Figure 4.6 –


Linear Relationships
 Straight lines are easy to visualize and to draw on a graph and can be
expressed algebraically
Y = a + bX
 X and Y are variables and a and b are coefficients that quantify any
particular line.
 Figure 4.7 shows the Anatomy of Straight Line

Figure 4.7 – Anatomy of Straight line


 The degree of slope or gradient of the line is given by the coefficient b; the
steeper the slope, the bigger the value of b.
 The intercept a is the value of Y when X is zero.
 This value is also sometimes described as the constant.
 The slope of a line can be derived from any two points on it.
 If we choose two points on the line, one on the left-hand side with a low X
value (called XL, YL), and one on the right with a high X value (called XR,
YR), then the slope is

b = (YR − YL) / (XR − XL)
 If the line slopes from top left to bottom right, YR - YL will be negative and
thus the slope will be negative.

Limitations to draw a line


 Make half the points lie above the line and half below along the full length
of the line.


 Make each point as near to the line as possible


 Make each point as near to the line in Y direction as possible.
 Make the squared distance between each point and the line in Y direction
as small as possible. This technique is known as Linear Regression.
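
A minimal sketch of fitting Y = a + bX by least squares with numpy.polyfit; the x and y arrays are the same illustrative values used in the scatterplot sketch above.

import numpy as np

x = np.array([5.3, 6.1, 7.4, 8.0, 9.2, 10.5])         # illustrative X values
y = np.array([19.8, 21.0, 24.5, 26.1, 30.2, 33.0])    # illustrative Y values

b, a = np.polyfit(x, y, deg=1)    # slope b and intercept a of the least-squares line
y_hat = a + b * x                 # fitted values
residuals = y - y_hat             # residuals in the Y direction
print(a, b, residuals)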
6. Define Resistant Line and explain about fitting a Resistant Line.
Resistant Line
 To explore paired data and to suspect a relationship between X and Y, the
focus is on how to fit a line to data in a “resistant” fashion, so the fit is
relatively insensitive to extreme points.
Fitting a Resistant Line - Line fitting involves joining two typical points.
 The X-axis is roughly divided into three parts
 X values are ordered along with its corresponding Y values.
 Divide X axis to three approximately equal length
 The left and right should be balanced with equal number of data
points
 Conditional Summary points for X and Y are found.
 A Line is drawn connecting a Left and Right Summary points

Example
Table 4.2 Worksheet for calculating a resistant line.

Obtain the Summary Points


 The summary X value is the median X in each third; in the first third of the
data, the summary X value is 5.29, the value for the Eastern region.
 The median Y in each third becomes the summary Y value, here 19.8.
 The summary X and Y values for each of batches is


 The slope is calculated from the left and right summary points:
b = (YR − YL) / (XR − XL)

 The intercept is the average of below three that is -18.1

 The full prediction equation is

 The full set of predicted values is shown in column 3 of Table 4.2; the
column is headed Y (pronounced 'Y-hat'), a common notation for fitted
values.
 In the South East the percentage of households with no car is 1.04 higher
than the predicted 18.39, namely 19.43.
 Residuals from the fitted values can also be calculated for each region, and
these are shown in column 4 of Table 4.2.
 All the data values can be recast in the traditional DFR form:

 The simple technique for fitting the 'best' straight line through the
scatterplot (least-squares regression) minimizes the total sum of the
squared residual values.
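
A rough NumPy sketch of the resistant-line recipe described above (order by X, split into thirds, take median summary points, join the left and right points); the data values are invented for illustration.

import numpy as np

x = np.array([4.8, 5.2, 5.29, 5.6, 6.1, 6.5, 7.0, 7.8, 8.3])          # illustrative, ordered by X
y = np.array([18.2, 19.1, 19.8, 20.5, 22.0, 23.4, 24.9, 27.0, 28.8])

third = len(x) // 3
xL, yL = np.median(x[:third]), np.median(y[:third])                    # left summary point
xM, yM = np.median(x[third:-third]), np.median(y[third:-third])        # middle summary point
xR, yR = np.median(x[-third:]), np.median(y[-third:])                  # right summary point

b = (yR - yL) / (xR - xL)                               # resistant slope
a = np.mean([yL - b * xL, yM - b * xM, yR - b * xR])    # intercept averaged over the three thirds
print(a, b)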

7. Define Transformation and explain in detail about transformations.


The log transformation
 One method for transforming data or re-expressing the scale of
measurement is to take the logarithm of each data point.
 This keeps all the data points in the same order but stretches or shrinks
the scale by varying amounts at different points.
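
A minimal sketch of how taking logs pulls in an upward-straggling batch; the batch is randomly generated from a lognormal distribution purely for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
batch = pd.Series(rng.lognormal(mean=3, sigma=0.8, size=200))   # strongly right-skewed batch

print(batch.skew())             # large positive skewness before transformation
print(np.log(batch).skew())     # much closer to 0 after the log transformation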
The ladder of powers
 It is a family of power transformations that can help promote symmetry and
sometimes Gaussian shape in many different data batches.


Figure 4.8 - The ladder of powers

 Going up the ladder of powers corrects downward straggle, whereas going


down corrects upward straggle.
The goals of transformation
 Data batches can be made more symmetrical.
 The shape of data batches can be made more Gaussian.
 Outliers that arise simply from the skewness of the distribution can be
removed, and previously hidden outliers may be forced into view'.
 Multiple batches can be made to have more similar spreads.
 Linear, additive models may be fitted to the data.
Determining the best power for transformation
 When investigating a transformation to promote symmetry in a single batch,
first examine the midpoint summaries - the median, the mid quartile, and
the mid extreme – to check if they tend to increase or decrease.
 If they systematically increase in value, a transformation lower down the
ladder should be tried.
 If the midpoint summaries trend downwards, the transformation was
too powerful, and you must move back up the ladder.
 Continue experimenting with different transformations from the summary
values until the best to promote symmetry and Gaussian shape.
 Curves which are monotonic and contain only one bend can be thought of
as one of the four quadrants of a circle.
 To straighten out any such curves, first draw a tangent.
 Then imagine pulling the curve towards the tangent (as shown in figure 4.9).
 Notice the direction in which you have to pull the curve on each axis, and
move on the ladder of powers accordingly.
 To straighten the data in figure 4.9, for example, the curve has to be pulled
down in the Y-direction and up in the X-direction; linearity will therefore
probably be improved by raising the Y variable to a power lower down on
the ladder and/or by raising the X variable to a power higher up on the
ladder.

Figure 4.9 Guide to linearizing transformations for curves.


Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai,


Accredited by National Board of Accreditation (NBA), Accredited by NAAC with “A” Grade &
Accredited by TATA Consultancy Services (TCS), Chennai)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


AD3301 DATA EXPLORATION AND VISUALIZATION
SYLLABUS
UNIT V
MULTIVARIATE AND TIME SERIES ANALYSIS
SYLLABUS: Introducing a Third Variable – Causal Explanations – Three-Variable
Contingency Tables and Beyond – Longitudinal Data – Fundamentals of TSA –
Characteristics of time series data – Data Cleaning – Time-based indexing – Visualizing
– Grouping – Resampling.

PART A
1. Define Multivariate Analysis. Or What is Multivariate Analysis? (NOV/DEC
2023)
Multivariate analysis is a set of statistical models that examine patterns
in multidimensional data by considering several data variables at once.

2. What is Simpson’s paradox?


Simpsons Paradox is a statistical phenomenon that occurs when the
subgroups are combined into one group. The process of aggregating data can
cause the apparent direction and strength of the relationship between two
variables to change.

3. Define regression analysis.


Regression analysis is a method for predicting the values of a
continuously distributed dependent variable from an independent, or
explanatory, variable.

4. What is the principle of logistic regression model?


The principles behind logistic regression are very similar and the
approach to building models and interpreting the models is virtually identical.


5. Distinguish longitudinal data from time series data.


It is important to distinguish longitudinal data from the time series
data. Although time series data can provide a picture of aggregate change, it
is only longitudinal data that can provide evidence of change at the level of
the individual.

6. Define longitudinal data.


Longitudinal data is data that is collected sequentially from the same
respondents over time. This type of data can be very important in tracking
trends and changes over time by asking the same respondents questions in
several waves carried out over time.

7. What is time series data?


Time series data is a collection of observations obtained through
repeated measurements over time. Plot the points on a graph, and one of your
axes would always be time.

8. Define time series analysis.


 Time series analysis is the collection of data at specific intervals over a
period to identify trends, seasonality, and residuals to aid in forecasting a
future event.
 Time series analysis involves inferring what has happened to a series of
data points in the past and attempting to predict future values.

9. What are the fundamentals of TSA.


 Generate a normalized dataset randomly:
 Generation of the dataset using the numpy library.
 Plotting the time series data using the seaborn library.
 By plotting the list using the time series plot, an interesting graph that
shows the change in values over time is obtained.

10. What is data cleaning?


Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset.


11. What are steps for data cleaning for outliers?


 Checking the shape of the dataset
 Few entries can also be checked inside the data frame.
 The data types of each column are reviewed in the df_power.

12. What is Time-based Indexing?


Time-based indexing is a very powerful method of the
pandas library when it comes to time series data. Having time-based
indexing allows using a formatted string to select data.

13. What is grouping time series data?


Grouped time series data involves more general aggregation structures
than hierarchical time series. The structure of grouped time series does not
naturally disaggregate in a unique hierarchical manner.

14. Explain briefly about resampling time series data.


Resampling is used in time series data. This is a convenience method
for frequency conversion and resampling of time series data. Although it
works on the condition that objects must have a datetime-like index for
example, DatetimeIndex, PeriodIndex, or TimedeltaIndex.

15. What are the two common techniques used to perform dimension
reduction? (NOV/DEC 2023)
Two common techniques used to perform dimension reduction are:
1. Principal Component Analysis (PCA):
o PCA transforms the original features into a new set of
uncorrelated variables called principal components.
o These components capture the maximum variance in the
data using fewer dimensions.
o It is a linear technique.
2. t-Distributed Stochastic Neighbor Embedding (t-SNE):
o t-SNE is a non-linear technique mainly used for
visualization of high-dimensional data in 2 or 3
dimensions.
o It emphasizes preserving the local structure of the data
(i.e., clusters and neighbors).


16. Define least square method in time series. (APR / MAY 2024)

o The least squares method in time series is a technique used to fit a


model to time series data.

o It minimizes the sum of the squared differences between the observed


values and the values predicted by the model.

o This method is used to estimate the parameters of the model.

17. List the techniques used in smoothing time series. (APR / MAY 2024)

 Moving Average (MA)

 Weighted Moving Average (WMA)

 Exponential Smoothing

 Regression-Based Smoothing

 Lowess / LOESS (Locally Weighted Scatterplot Smoothing)

 Kalman Filtering

 Spline Smoothing

 Gaussian Smoothing

 Savitzky–Golay Filter

18. What do experimental investigations have to offer in terms of establishing
causality? (APR/MAY 2025)

• Control Over Variables

• Random Assignment

• Manipulation and Observation

• Replication

• Temporal Order Established

19. To what extent does a contingency table with three variables
differ from a table with only two variables? (APR/MAY 2025)

 Two-variable table: A simple matrix showing frequencies for combinations


of two categorical variables (e.g., gender vs. preference).


 Rows = categories of variable A


 Columns = categories of variable B
 Cells = frequency counts

Three-variable table: Adds a third dimension—usually shown by:

 Creating multiple two-variable tables, one for each level of the third
variable (called a “layered” or “nested” approach).
 Example: A set of separate gender vs. preference tables, one for each
age group.


PART B
1. Explain third variable and describe Causal Explanations in detail with
suitable example. (NOV/DEC 2024)
Explain in detail about Introducing a third variable.
Ways of holding a third variable constant while assessing the relationship
between two others.
 Third Variable
o X1 denotes a predictor variable, Y denotes an outcome variable, and X2
denotes a third variable that may be involved in the X1 , Y relationship.
o For example, age (X1) is predictive of systolic blood pressure (SBP) (Y)
when body weight (X2) is statistically controlled.
o A third-variable effect (TVE) refers to the effect conveyed by a third-
variable to an observed relationship between an exposure and a
response variable of interest.
o Depending on whether there is a causal relationship from the exposure
variable to the third variable and then to the response, the third variable
(denoted as M) is often called a mediator (when there are causal
relationships) or a confounder (no causal relationship is involved).
o In third-variable analysis, besides the pathway that directly connect the
exposure variable with the outcome, explore the exposure → third-
variable → response or X → M → Y pathways.

 Causal explanation
o It explains how and why an effect occurs, provides information
regarding when and where the relationship can be replicated.
o Causality refers to the idea that one event, behavior, or belief will result
in the occurrence of another subsequent event, behavior, or belief, it is
about cause and effect.
o X causes Y is, if X changes, it will produce a change in Y.
o Independent variables - may cause direct changes in another variable.
o Control variables - remain unchanged during the experiment.
o Causation - describes the cause-and-effect relationship.
o Correlation - Any relationship between two variables in the experiment.
o Dependent variables - may change or are influenced by the
independent variable.


Direct and indirect effects


o Causal relations are commonly modeled as a Directed Acyclic Graph
(DAG), where a node represents a data dimension and a link represents
the dependency between two connected dimensions.
o The arrows of the links indicate the direction of the cause-effect
relationship.
o A path is a sequence of arrows connecting two nodes regardless of the
direction.
o There are three types of paths:
o The chain graph: T → X → Y
o The Fork: T ← X → Y
o The Immorality: T → X ← Y
o Causal Diagrams - depict the causal relationships between variables.

Figure 5.2 – Causal Example


o The above figure illustrates the independence of A and B, the direct
dependence of C on A and B, the direct dependence of E on A and C, the
direct dependence of F on C, and the direct dependence of D on B.

Simpson's paradox
o In some cases the relationship between two variables is not simply
reduced when a third, prior, variable is taken into account but indeed
the direction of the relationship is completely reversed.
o This is often known as Simpson's paradox (named after Edward
Simpson).


o Simpson's paradox can be succinctly summarized as follows: every


statistical relationship between two variables may be reversed by
including additional factors in the analysis.
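
A small made-up admissions example in pandas shows the reversal: within each department women have the higher admission rate, yet the aggregated rate favours men.

import pandas as pd

df = pd.DataFrame({
    "dept":     ["A", "A", "B", "B"],
    "gender":   ["Men", "Women", "Men", "Women"],
    "applied":  [100, 20, 20, 100],
    "admitted": [80, 18, 4, 25],
})

print(df.assign(rate=df["admitted"] / df["applied"]))     # Women lead within each department

overall = df.groupby("gender")[["applied", "admitted"]].sum()
print(overall["admitted"] / overall["applied"])           # aggregated rates: the direction reverses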

2. Explain in detail about the fundamentals of TSA – Time Series Analysis.


Or What is TSA analysis? Explain ARIMA, Smooth-based and moving
average. (NOV/DEC 2023)
 Time Series Analysis (TSA) involves the study of data collected over time to
understand patterns, trends, and behaviors.
 The key fundamentals of TSA:
1. Time Series Definition:
o A time series is a series of data points indexed in time order.
o It is a sequence of observations or measurements taken at
successive, evenly spaced intervals.
2. Components of Time Series:
o Trend: The long-term movement or general direction in the data. It
indicates whether the data is increasing, decreasing, or stable over
time.
o Seasonality: Patterns that repeat at fixed intervals, often related to
calendar time, like daily, weekly, or yearly cycles.
o Cyclic Patterns: Repeating patterns that are not strictly periodic but
occur at irregular intervals.
o Irregular/Random Fluctuations (Noise): Unpredictable variations
that are not part of the trend, seasonality, or cyclic patterns.
3. Importance of TSA:
o Prediction and Forecasting: TSA is used to predict future values based
on historical patterns.
o Anomaly Detection: Identify unusual or unexpected patterns that
deviate from the norm.
o Pattern Recognition: Understand and characterize underlying trends
and cycles in the data.
4. Common TSA Techniques:
o Moving Averages: Smooth out short-term fluctuations to highlight
trends.
o Exponential Smoothing: Assign different weights to different
observations, emphasizing recent data.


o ARIMA (Autoregressive Integrated Moving Average): A popular


model combining autoregression, differencing, and moving averages for
forecasting.
5. Data Exploration and Visualization:
o Time Plots: Visualize the time series data to observe trends and
patterns.
o Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF) Plots: Examine correlation between current and past
observations.
6. Stationarity:
o Stationary Series: A time series is considered stationary when
statistical properties like mean and variance remain constant over time.
Many time series models assume stationarity.
7. Modeling and Forecast Evaluation:
o Model Building: Develop models based on identified components and
patterns.
o Model Evaluation: Assess model accuracy using metrics like Mean
Absolute Error (MAE), Mean Squared Error (MSE), or others.
8. Software and Tools:
o Statistical Packages: Use tools like R or Python with libraries such as
Pandas, NumPy, and Statsmodels.
o Visualization Tools: Employ plotting libraries like Matplotlib or
Seaborn for graphical representation.
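
A rough sketch of the techniques named above (moving average, exponential smoothing, ARIMA) on a synthetic series, assuming pandas and statsmodels are installed; the order=(1, 1, 1) ARIMA specification is only an illustrative choice.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
series = pd.Series(np.linspace(10, 30, 120) + rng.normal(0, 2, 120), index=idx)  # trend plus noise

ma  = series.rolling(window=7).mean()          # moving average smoothing
ema = series.ewm(alpha=0.3).mean()             # exponential smoothing (recent points weighted more)

result = ARIMA(series, order=(1, 1, 1)).fit()  # autoregressive integrated moving average model
print(result.forecast(steps=5))                # forecast the next five observations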

3. Explain in detail about the TSD – Time Series Data.


1. Characteristics of Time Series Data:
 Temporal Order: Time series data is ordered chronologically, with
observations recorded over successive time intervals.
 Trend: Long-term movement or directionality in the data.
 Seasonality: Regular patterns that repeat at fixed intervals.
 Noise: Random fluctuations that are not part of the trend or
seasonality.
2. Data Cleaning for Time Series:
 Missing Values: Address and impute missing data points, which is
common in time series datasets.
 Outliers: Identify and handle outliers that may distort the analysis.


 Consistent Frequency: Ensure a consistent time interval between


observations.
3. Time-based Indexing:
 Datetime Index: Assign a datetime index to the time series data,
allowing for easy temporal slicing and manipulation.
 Pandas Library: Utilize libraries like Pandas in Python to work
efficiently with time-based indexing.
4. Visualizing Time Series Data:
 Line Plots: Display the trend and fluctuations over time.
 Seasonal Decomposition: Separate the time series into components
(trend, seasonality, and residual) for clearer analysis.
 Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) Plots: Examine autocorrelation between observations at different time lags (see the sketch after this list).
5. Grouping in Time Series Analysis:
 Aggregation: Group data based on time periods (e.g., daily, weekly) and
aggregate values for analysis.
 Rolling Windows: Analyze trends over moving time windows, providing
a smoothed view of the data.
6. Resampling in Time Series Analysis:
 Upsampling: Increase the frequency of data (e.g., from daily to hourly)
by interpolation.
 Downsampling: Decrease the frequency of data (e.g., from hourly to
daily) by aggregation.
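
A minimal sketch of the ACF/PACF diagnostics mentioned above, using statsmodels on a hypothetical autocorrelated series (the AR coefficient of 0.7 and the 30-lag window are assumptions for illustration):
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical AR(1)-style daily series used only for illustration
rng = np.random.default_rng(0)
values = [0.0]
for _ in range(199):
    values.append(0.7 * values[-1] + rng.normal())
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=200, freq="D"))

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=30, ax=axes[0])    # correlation between y(t) and y(t-k)
plot_pacf(series, lags=30, ax=axes[1])   # correlation at lag k after removing shorter lags
plt.tight_layout()
plt.show()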

Time Series Data Visualization using Python


Importing the Libraries
 Numpy – A Python library that is used for numerical mathematical
computation and handling multidimensional ndarray, it also has a very
large collection of mathematical functions to operate on this array.
 Pandas – A Python library built on top of NumPy for effective matrix
multiplication and dataframe manipulation, it is also used for data
cleaning, data merging, data reshaping, and data aggregation.
 Matplotlib – It is used for plotting 2D and 3D visualization plots, it also
supports a variety of output formats including graphs for data.


Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading The Dataset


 To load the dataset into a dataframe, use the pandas read_csv() function.
 The head() function prints the first five rows of the dataset.
 The parse_dates parameter of read_csv converts the 'Date' column to a DatetimeIndex.
Python
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")

# displaying the first five rows of dataset


df.head()
Output:

Unnamed: 0 Open High Low Close Volume Name


Date
2006-01-03 NaN 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 NaN 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 NaN 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 NaN 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 NaN 43.10 43.66 42.82 43.42 16268338 AABA

Dropping Unwanted Columns


 Drop columns from the dataset that are not important for our
visualization.
Python
# deleting the unwanted column and keeping the result
df = df.drop(columns='Unnamed: 0')
df.head()

Output:


Open High Low Close Volume Name


Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA

Plotting Line plot for Time Series data.


Python
df['Volume'].plot()
Output:

Plot all other columns using a subplot.

Python
df.plot(subplots=True, figsize=(4, 4))

Output:


The line plots used above are good for showing seasonality.

Seasonality: In time-series data, seasonality is the presence of variations


that occur at specific regular time intervals less than a year, such as weekly,
monthly, or quarterly.

Resampling:
 Resampling is a methodology of economically using a data sample to
improve the accuracy and quantify the uncertainty of a population
parameter.
 Resampling for months or weeks and making bar plots is another very
simple and widely used method of finding seasonality.

Resample and Plot The Data


Python
# Resampling the time series data based on monthly 'M' frequency
df_month = df.resample("M").mean()

# using subplot
fig, ax = plt.subplots(figsize=(6, 6))

# plotting bar graph


ax.bar(df_month['2016':].index,
df_month.loc['2016':, "Volume"],
width=25, align='center')
Output:


There are 24 bars in the graph and each bar represents a month.

Differencing: Differencing computes the difference between values separated by a specified interval (lag). By default the lag is one, but other values can be specified. It is one of the most popular methods for removing trends from the data.

Python

df.Low.diff(2).plot(figsize=(6, 6))

Output:

4. Explain in detail about data cleaning. (NOV/DEC 2024)


Clean the dataset for outliers:
1. Checking the shape of the dataset:


Code:
df_power.shape
Output:
(4383, 5)
The dataframe contains 4,383 rows and 5 columns.
2. A few entries inside the dataframe can also be checked.
The last 10 entries can be examined by using the following
Code:
df_power.tail(10)
Output:

3. The data types of each column are reviewed in the df_power dataframe by:
df_power.dtypes
Output:
Date            object
Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object

 The Date column has a data type of object. This is not correct.
So, the next step is to correct the Date column, as shown here:


# convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
 This converts the Date column to datetime format.
 This can be verified using the following code:
df_power.dtypes

Output:
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object

The Date column has been changed to the correct data type.

 The index of the dataframe can be changed to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)


Output:

 In the preceding screenshot, the Date column has been set as


DatetimeIndex.
 This can be simply verified by using the code snippet given here:
df_power.index
Output:
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03',
'2006-01-04', '2006-01-05', '2006-01-06', '2006-01-07',
'2006-01-08', '2006-01-09', '2006-01-10', ... '2017-12-22',
'2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26',
'2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30',
'2017-12-31'],
dtype='datetime64[ns]',name='Date', length=4383,freq=None)

Since the index is the DatetimeIndex object, it can be used to analyze the
dataframe. To make our lives easier, more columns need to be added to the
dataframe.
Adding Year, Month, and Weekday Name:
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
# weekday_name was removed in newer pandas; day_name() is the replacement
df_power['Weekday Name'] = df_power.index.day_name()
Let's display five random rows from the dataframe:
# Display a random sampling of 5 rows
df_power.sample(5, random_state=0)


Output:

 Three more columns are —Year, Month, and Weekday Name.


 Adding these columns helps to make the analysis of data easier.
Time-based indexing (APR/MAY 2025)

 Time-based indexing is a very powerful method of the pandas library when it comes to time series data.
 Having time-based indexing allows using a formatted string to select data.


Code:
df_power.loc['2015-10-02']
Output:
Consumption 1391.05
Wind 81.229
Solar 160.641
Wind+Solar 241.87
Year 2015
Month 10
Weekday Name Friday
Name: 2015-10-02 00:00:00, dtype: object
 The pandas dataframe loc accessor is used.
 In the preceding example, the date is used as a string to select a row.
 All sorts of techniques can be used to access rows, just as we can do with a normal dataframe index.
Visualizing time series
Consider the df_power dataframe to visualize the time series dataset:
 The first step is to import the seaborn and matplotlib libraries:
import matplotlib.pyplot as plt
import seaborn as sns


sns.set(rc={'figure.figsize': (11, 4)})
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150

 Next, a line plot of the full time series of Germany's daily electricity consumption is generated:
df_power['Consumption'].plot(linewidth=0.5)
Output:

 In the above screenshot, the y-axis shows the electricity consumption and the x-axis shows the year.
However, the plot covers too many years to make out the details clearly.
Using the dots to plot the data for all the other columns:
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5,
                                   linestyle='None', figsize=(14, 6),
                                   subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')

Output:


 The output shows that electricity consumption can be broken down into two distinct patterns:
o One cluster roughly from 1,400 GWh and above
o Another cluster roughly below 1,400 GWh
 Moreover, solar production is higher in summer and lower in winter.
 Over the years, there seems to have been a strong increasing trend in the output of wind power.

 A single year can be investigated to have a closer look:
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)');
Output:

From the preceding screenshot, the consumption of electricity for 2016 can be seen clearly.
The graph shows a drastic decrease in the consumption of electricity at the end of the year (December) and during August.
The month of December 2016 can be examined with the following code block:
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');


Output:

 As shown in the preceding graph, electricity consumption is higher on weekdays and lowest at the weekends.
 The consumption can be observed for each day of the month.
 To see how consumption plays out in the last week of December, it can be zoomed in further.
 In order to indicate a particular week of December, a specific date range can be supplied as shown here:
ax = df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');

 As illustrated in the preceding code, the electricity


consumption between 2016-12-23 and 2016-12-30 can be
observed.
Output:

 As illustrated in the preceding screenshot, electricity


consumption was lowest on the day of Christmas, probably
because people were busy partying.

 After Christmas, consumption increased.


5. Explain about grouping time series data. (NOV/DEC 2024)


The data can be grouped by different time periods and box plots can be presented:
First, the data can be grouped by months, and then box plots can be used to visualize the data:

fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')
Output:

The preceding plot illustrates that electricity consumption is generally higher in the winter and lower in the summer.
Wind production is higher during the summer.
Moreover, there are many outliers associated with electricity consumption, wind production, and solar production.
Next, the consumption of electricity can be grouped by the day of


the week, and presented in a box plot:


sns.boxplot(data=df_power, x='Weekday Name', y='Consumption');

Output:

 The preceding screenshot shows that electricity consumption


is higher on weekdays than on weekends.

 Interestingly, there are more outliers on the weekdays.
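
A related smoothing idea is the rolling window mentioned earlier; a minimal sketch on the same df_power frame used above (the 7-day window is an assumption chosen for illustration):
Python
# 7-day centered rolling mean smooths out the day-of-week effect seen above
consumption_7d = df_power['Consumption'].rolling(window=7, center=True).mean()

ax = df_power['Consumption'].plot(alpha=0.3, label='Daily')
consumption_7d.plot(ax=ax, label='7-day rolling mean')
ax.set_ylabel('Daily Consumption (GWh)')
ax.legend();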

6. Explain about resampling time series data. (NOV/DEC 2024)


 It is often required to resample the dataset at lower or higher
frequencies.
 This resampling is done based on aggregation or grouping operations.
 For example, the data can be resampled based on the weekly mean time series as follows:
1. To resample the data, the following code can be used:
columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
power_weekly_mean = df_power[columns].resample('W').mean()
power_weekly_mean


Output:

 The above screenshot shows that the first row, labeled 2006-01-01,
includes the average of all the data.

 The daily and weekly time series can be plotted to compare the
dataset over the six-month period.
2. Consider the first six months of 2016. Let's start by initializing the variables:
start, end = '2016-01', '2016-06'
3. To plot the graph, the following code can be used:
fig, ax = plt.subplots()
ax.plot(df_power.loc[start:end, 'Solar'], marker='.', linestyle='-',
        linewidth=0.5, label='Daily')
ax.plot(power_weekly_mean.loc[start:end, 'Solar'], marker='o',
        markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production (GWh)')
ax.legend();
Output:


 The preceding screenshot shows that the weekly mean time series is
increasing over time and is much smoother than the daily time series.

7. Explain the Three-variable contingency tables and beyond. (NOV/DEC


2024)
Causal path models for three variables
 Each arrow linking two variables in a causal path diagram represents the
direct effect of one variable upon the other, controlling all other relevant
variables.
 When assessing the direct effect of one variable upon another, any third
variable which is likely to be causally connected to both variables and
prior to one of them should be controlled.
 Coefficient b in figure 5.8 shows the direct effect of being in a voluntary
association on the belief that most people can be trusted.
 To find its value, focus attention on the proportion who say that most
people can be trusted, controlling for level of qualifications.

Figure 5.8 Social trust by membership of voluntary association and level of qualifications: causal path diagram.

More complex models: going beyond three variables


Logistic regression models
 Regression analysis is a method for predicting the values of a continuously
distributed dependent variable from an independent, or explanatory,
variable.
 The principles behind logistic regression are very similar and the approach


to building models and interpreting the models is virtually identical.


 However, whereas regression (more properly termed Ordinary Least Squares
regression, or OLS regression) is used when the dependent variable is
continuous, a binary logistic regression model is used when the dependent
variable can only take two values.
 In many examples this dependent variable indicates whether an event
occurs or not and logistic regression is used to model the probability that
the event occurs.
In the example discussed above, therefore, logistic regression would be used
to model the probability that an individual believes that most people can be
trusted.
When using a single explanatory variable, such as volunteering, the logistic regression model can be written as:
log(p / (1 - p)) = b0 + b1x
where p is the probability that an individual believes that most people can be trusted, x indicates membership of a voluntary association, and b0 and b1 are coefficients estimated from the data.
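
A minimal sketch of fitting such a model with statsmodels on hypothetical survey data (the variable names, the sample size, and the coefficients used to simulate the responses are all assumptions):
Python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical survey data: 'member' = 1 if the respondent belongs to a
# voluntary association, 'trust' = 1 if they say most people can be trusted
rng = np.random.default_rng(1)
member = rng.integers(0, 2, size=500)
p = 1 / (1 + np.exp(-(-0.4 + 0.8 * member)))   # assumed "true" coefficients
trust = rng.binomial(1, p)

X = sm.add_constant(pd.DataFrame({"member": member}))
logit_model = sm.Logit(trust, X).fit()
print(logit_model.summary())          # b1 is the log-odds effect of membership
print(np.exp(logit_model.params))     # exponentiated coefficients = odds ratios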

8. Explain in detail about longitudinal data.


o It is important to distinguish longitudinal data from time series data.
Although time series data can provide a picture of aggregate change,
it is only longitudinal data that can provide evidence of change at the
level of the individual.
o Time series data could perhaps be understood as a series of snapshots
of society, whereas longitudinal research entails following the same
group of individuals over time and linking information about those
individuals from one time point to another.
 Collecting longitudinal data
Prospective and retrospective research designs
o Longitudinal data are frequently collected using a prospective
longitudinal research design, i.e. the participants in a research study are
contacted by researchers and asked to provide information about
themselves and their circumstances on a number of different occasions.
o This is often referred to as a panel study. However, it is not necessary to


use a longitudinal research design in order to collect longitudinal data


and there is therefore a conceptual distinction between longitudinal data
and longitudinal research.
o Indeed, the retrospective collection of longitudinal data is very common.
In particular, it has become an established method for obtaining basic
information about the dates of key life events such as marriages,
separations and divorces and the birth of any children (i.e. event history
data).
o This is clearly an efficient way of collecting longitudinal data and obviates
the need to re-contact the same group of individuals over a period of time.
o A potential problem is that people may not remember the past accurately
enough to provide good quality data. While some authors have argued
that recall is not a major problem for collecting information about dates
of significant life events, other research suggests that individuals may
have difficulty remembering dates accurately, or may prefer not to
remember unfavorable episodes or events in their lives.
o Large-scale quantitative surveys often combine a number of different
data collection strategies so they do not always fit neatly into the
classification of prospective or retrospective designs. In particular,
longitudinal event history data are frequently collected retrospectively as
part of an ongoing prospective longitudinal study.

9. Develop a case study where time series analysis is used to make a


significant business decision. including data cleaning and resampling
steps.

Here's a detailed case study showing how time series analysis was used to
make a significant business decision, including data cleaning and
resampling steps:

Case Study: Forecasting Sales for Inventory Optimization at a Retail


Chain

Business Context:

Company: FreshMart — a mid-sized grocery retail chain


Challenge: Frequent stockouts and overstocking of perishable items,
especially fruits and vegetables


Objective: Use time series forecasting to predict weekly sales of perishable


goods and optimize inventory levels

Step 1: Data Collection

FreshMart collected 2 years of daily sales data from 30 stores for 50


perishable items. The data included:

 Date

 Store ID

 Item ID

 Units Sold

 Promotion (Yes/No)

 Price

 Weather data (temperature, rainfall)

Step 2: Data Cleaning


2.1 Handling Missing Values

 Missing sales values: Imputed with a linear interpolation method.

 Missing weather data: Filled using nearest store’s data or average for
the region.

2.2 Outlier Detection

 Applied Z-score method to identify unusually high/low sales values.

 Verified outliers with promotion calendar. Retained valid promotional


spikes; removed true anomalies (e.g., data entry errors).
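
A minimal sketch of these two cleaning steps on a tiny hypothetical slice of the sales data (the values and the Z-score threshold of 3 are assumptions for illustration):

python

import numpy as np
import pandas as pd

# Tiny hypothetical daily sales frame standing in for the FreshMart data
df = pd.DataFrame({
    "Units_Sold": [120, np.nan, 135, 900, 128, 131],   # one gap, one spike
    "Promotion":  ["No", "No", "No", "Yes", "No", "No"],
}, index=pd.date_range("2023-01-01", periods=6, freq="D"))

# 2.1 Impute missing sales with linear interpolation
df["Units_Sold"] = df["Units_Sold"].interpolate(method="linear")

# 2.2 Flag outliers with a Z-score (threshold of 3 is an assumption)
z = (df["Units_Sold"] - df["Units_Sold"].mean()) / df["Units_Sold"].std()
outliers = df[np.abs(z) > 3]

# Keep valid promotional spikes; drop flagged rows that were not promotions
df = df.drop(outliers[outliers["Promotion"] == "No"].index)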

2.3 Date Time Formatting

 Converted Date column to datetime format and set it as the index.

python



import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

df.set_index('Date', inplace=True)

Step 3: Resampling

Original data was daily, but business decisions are made weekly.

Resampling:

python


# Resample to weekly frequency, summing units sold

weekly_sales = df.resample('W').sum()

This reduced noise and aligned forecasting with decision-making


cycles.

Step 4: Time Series Analysis & Forecasting

4.1 Trend & Seasonality Analysis

Used seasonal decomposition to analyze components:

python


from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(weekly_sales['Units_Sold'],
model='additive')

decomposition.plot()

4.2 Stationarity Check

Performed ADF (Augmented Dickey-Fuller) test to check for


stationarity.


python


from statsmodels.tsa.stattools import adfuller

adf_test = adfuller(weekly_sales['Units_Sold'])

4.3 Model Selection

Compared forecasting models:

 ARIMA

 SARIMA (Seasonal ARIMA)

 Prophet (by Facebook)

SARIMA performed best, with the lowest RMSE on the validation set.

python


from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(weekly_sales['Units_Sold'], order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 52))
results = model.fit()
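
A hedged sketch of how the hold-out comparison for the SARIMA candidate could look, building on the weekly_sales frame from Step 3 (the 12-week validation split is an assumption):

python

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hold out the last 12 weeks as a validation set (split size is an assumption)
train, test = weekly_sales.iloc[:-12], weekly_sales.iloc[-12:]

candidate = SARIMAX(train['Units_Sold'], order=(1, 1, 1),
                    seasonal_order=(1, 1, 1, 52)).fit(disp=False)
pred = candidate.forecast(steps=12)

rmse = np.sqrt(np.mean((test['Units_Sold'].values - pred.values) ** 2))
mape = np.mean(np.abs((test['Units_Sold'].values - pred.values)
                      / test['Units_Sold'].values)) * 100
print(f"RMSE: {rmse:.1f}, MAPE: {mape:.1f}%")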

Step 5: Forecasting & Business Decision

Forecasted next 12 weeks of sales:

python


forecast = results.get_forecast(steps=12)

Business Decision:

 Revised reorder quantities based on forecasted demand.

 Improved promotion planning by identifying seasonal peaks.


 Reduced inventory waste by 18% in the first quarter post-


implementation.

Outcome

Metric Before After

Inventory Waste $120K/month $98K/month

Stockout Events 210/month 95/month

Forecast Accuracy (MAPE) N/A 7.5%

Key Learnings

 Resampling data aligned the model with the decision cadence.

 Proper data cleaning prevented skewed forecasting.

 Forecasting enabled data-driven inventory planning, leading to real


savings and better customer satisfaction.

10.Discuss how documentation and metadata play a role in the data


cleaning process. Why is it essential to maintain clear records of
cleaning steps taken. (APR/MAY 2025)

 (Role of Metadata)

 Metadata describes the data’s attributes—such as variable names, data


types, sources, collection methods, and units.
 It helps analysts understand what each column means, what values are
valid, and how data should be interpreted.
 Without metadata, it’s difficult to identify errors, outliers, or inconsistencies.

Example: If a dataset includes a column labeled “temp,” metadata clarifies


whether it’s in

Celsius or Fahrenheit, or if it refers to body temperature or ambient


temperature


 Guiding Cleaning Decisions

 Documentation and metadata guide decisions like handling missing values,


correcting inconsistencies, or merging datasets.
 For instance, metadata might indicate that a column should never have
negative values; if such values are present, they can be flagged and
corrected.

 Tracking Cleaning Steps (Importance of Clear Records)

 Documenting cleaning steps—such as removing duplicates, imputing


missing values, or converting data types—ensures transparency.
 These records allow others (or your future self) to understand what
transformations the raw data has undergone.

Benefits of tracking steps:

 Reproducibility: Others can replicate your analysis.


 Auditability: Enables review and debugging of decisions.
 Version Control: If an error is found, you can revert or adjust earlier steps.
 Collaboration: Team members can understand and build upon each other's work.

 Supporting Automation and Pipelines

 Clear documentation and metadata support automated workflows or data


pipelines by clearly defining the expected structure and data quality rules.
 Helps in creating scripts or tools that can apply cleaning steps consistently
across datasets.
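
A minimal sketch of keeping such records alongside a pandas pipeline (every file name, column description, and cleaning step here is an illustrative assumption):
Python
import json
import pandas as pd

# Metadata describing the raw file and what each column means
metadata = {
    "source": "sensor_export.csv",
    "columns": {"temp": "ambient temperature in degrees Celsius",
                "reading_time": "UTC timestamp of the measurement"},
}

cleaning_log = []

def log_step(description, frame):
    """Record a human-readable cleaning step and the resulting shape."""
    cleaning_log.append({"step": description, "rows": len(frame), "cols": frame.shape[1]})
    return frame

df = pd.DataFrame({"temp": [21.5, 21.5, None, -999.0],
                   "reading_time": pd.date_range("2024-01-01", periods=4, freq="h")})

df = log_step("loaded raw data", df)
df = log_step("removed sentinel value -999", df[df["temp"] != -999.0])
df = log_step("imputed missing temp with median", df.fillna({"temp": df["temp"].median()}))

# Persist the metadata and the log next to the cleaned dataset
print(json.dumps({"metadata": metadata, "cleaning_log": cleaning_log}, indent=2))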

 Ensuring Data Integrity

 Cleaning steps sometimes introduce unintended biases or errors. Good


documentation provides a way to evaluate the impact of those decisions.


 It also helps data stewards or stakeholders assess the reliability of the


cleaned data for further analysis or reporting.

Documentation and metadata are not just administrative tasks—they are


foundational to effective data cleaning. They help ensure that data cleaning is
informed, transparent, reproducible, and trustworthy. By maintaining clear
records, you improve data governance, enable collaboration, and uphold the
integrity of your data-driven work

11.Evaluate the impact of multicollinearity in multivariate regression


analysis. How can it affect the interpretation of results, and what techniques
can be employed to detect and mitigate its effects? (APR/MAY 2025)

Impact of Multicollinearity

1. Unstable Coefficient Estimates:


When predictors are highly correlated, small changes in the data can lead to
large changes in the estimated coefficients. This makes the model unreliable.
2. Inflated Standard Errors:
The variance of the estimated coefficients increases, resulting in wider
confidence intervals and reduced statistical significance (i.e., larger p-
values), even when the variables may be truly associated with the dependent
variable.
3. Difficulty in Assessing Individual Predictor Importance:
Because multicollinear predictors share information, it's hard to determine
which variable is actually influencing the outcome.
4. Model Interpretability Decreases:
Coefficients can have signs or magnitudes that are counterintuitive or
inconsistent with theoretical expectations.
5. No Impact on Predictive Power (Necessarily):
Multicollinearity doesn't directly affect the model's ability to predict,
especially if the correlated predictors remain in future data. However, it can
impact generalizability if the correlation structure changes.


Detection Techniques

1. Correlation Matrix:
Examine pairwise Pearson correlation coefficients. Values close to ±1 indicate
potential multicollinearity.
2. Variance Inflation Factor (VIF):
Measures how much the variance of a regression coefficient is inflated due to
multicollinearity.
o VIF > 5 or 10 is often considered problematic.
3. Tolerance:
The reciprocal of VIF; values close to 0 suggest high multicollinearity.
4. Condition Index & Eigenvalues (from X'X matrix):
A condition index above 30 may indicate serious multicollinearity issues.
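
A minimal sketch of the VIF check using statsmodels on hypothetical predictors (the variable names and the strength of the x1/x2 correlation are assumptions for illustration):
Python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x2 is deliberately built to be highly correlated with x1
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each column (the constant's value is usually ignored)
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)   # x1 and x2 should show inflated values; x3 should stay near 1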

Mitigation Techniques

1. Remove Highly Correlated Predictors:


Drop one of the variables from each correlated pair or set, especially if they
are conceptually similar.
2. Combine Variables:
Use dimensionality reduction techniques like Principal Component
Analysis (PCA) to combine correlated variables into uncorrelated
components.
3. Regularization Techniques:
o Ridge Regression: Penalizes large coefficients and helps mitigate
multicollinearity by shrinking coefficients.
o Lasso Regression: Can shrink some coefficients to zero, effectively
performing variable selection.
4. Centering Variables (Mean Subtraction):
Helps reduce multicollinearity arising from interaction terms or polynomial
terms.
5. Increase Sample Size:
Sometimes multicollinearity is a result of small sample sizes. More data can
help stabilize coefficient estimates.


Multicollinearity does not bias regression estimates, but it inflates variances and
makes results harder to interpret. Proper detection and handling are essential for
building robust, interpretable models. Choosing the appropriate mitigation strategy
depends on the context, goal of the analysis (prediction vs. inference), and data
availability.

12.Examine the application of multivariate time series models, such as vector


auto regression (VAR). How do these models differ from univariate time series
models, and what advantages do they offer in analyzing interrelated time-
dependent variables. (APR/MAY 2025)

Multivariate time series models, such as Vector Autoregression (VAR), are


powerful tools for analyzing and forecasting multiple interrelated time-dependent
variables. These models extend the ideas of univariate time series models (like
ARIMA) to systems of equations, capturing the dynamic relationships between
several variables over time.

Differences Between VAR and Univariate Time Series Models

Feature        | Univariate Models (e.g., AR, ARIMA)  | Multivariate Models (e.g., VAR)
Variables      | One dependent variable               | Multiple interdependent variables
Focus          | Dynamics of a single time series     | Joint dynamics and interactions among multiple series
Equations      | Single equation                      | System of equations (one per variable)
Lag Structure  | Lags of the same variable            | Lags of all variables in the system


Advantages of Multivariate Time Series Models (e.g., VAR)

1. Captures Interdependence Among Variables:


VAR models account for simultaneous interactions and feedback loops,
which are common in economic, financial, and environmental systems.
2. Improved Forecasting Accuracy:
Including relevant variables improves the forecast of each individual variable
due to shared information.
3. Impulse Response Analysis:
You can trace the effect of a shock to one variable on itself and others over
time, offering insight into dynamic relationships.
4. Granger Causality Testing:
VAR enables statistical testing of whether one time series Granger-causes
another, helping to infer directional relationships.
5. Flexible and Relatively Easy to Estimate:
Compared to more complex structural models, VARs are relatively
straightforward and only require the variables to be stationary (or made
stationary via differencing).
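
A minimal sketch of fitting a VAR with statsmodels on two hypothetical, interrelated stationary series (the simulated coefficients and the maximum lag of 4 are assumptions for illustration):
Python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Two hypothetical, interrelated stationary series used only for illustration
rng = np.random.default_rng(7)
n = 300
y1, y2 = np.zeros(n), np.zeros(n)
for t in range(1, n):
    y1[t] = 0.5 * y1[t-1] + 0.2 * y2[t-1] + rng.normal()
    y2[t] = 0.3 * y1[t-1] + 0.4 * y2[t-1] + rng.normal()
data = pd.DataFrame({"y1": y1, "y2": y2},
                    index=pd.date_range("2000-01-31", periods=n, freq="M"))

results = VAR(data).fit(maxlags=4, ic="aic")   # lag order chosen by AIC
print(results.summary())

forecast = results.forecast(data.values[-results.k_ar:], steps=6)  # 6-step joint forecast
irf = results.irf(10)    # impulse responses over 10 periods
irf.plot()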

Use Cases and Applications

 Macroeconomics: To study relationships between GDP, inflation, interest


rates, and employment.
 Finance: Modeling joint dynamics of asset prices, exchange rates, or
portfolio risk.
 Policy Analysis: Assessing the impact of monetary or fiscal policy on
multiple economic indicators.
 Energy & Environment: Modeling demand, supply, and prices of energy
products or pollution levels over time.

Limitations

 Parameter Proliferation: With many variables and lags, VAR models can
quickly become overparameterized, especially with limited data.


 Stationarity Requirement: Standard VAR models require all variables to be


stationary, which may not always be feasible.
 No Structural Interpretation (without constraints): Basic VAR is
atheoretical; identifying true causal relationships often requires structural
VAR (SVAR) or other constraints.

Multivariate time series models like VAR are essential tools when analyzing
systems where variables influence each other dynamically over time. Unlike
univariate models, which view each variable in isolation,

VAR models unlock insights into interdependencies, feedback effects, and joint
forecasting, making them highly valuable in many fields where variables are
intrinsically linked.

13. Design a multivariate analysis that incorporates a third variable


into an existing bivariate dataset and explain the implication this may
have.

Multivariate Analysis Design:

Let’s consider an existing bivariate dataset involving:

 Variable X: Hours Studied


 Variable Y: Exam Score

This dataset shows a positive correlation: as Hours Studied increases, Exam Score
tends to increase.

Now, we introduce a third variable (Z):

 Variable Z: Sleep Duration (average sleep per night before the exam)

We now have a multivariate dataset with three variables:

 X (Hours Studied),
 Y (Exam Score), and
 Z (Sleep Duration)


Analysis Approach:

We can apply multiple linear regression:

Exam Score (Y) = β0 + β1(Hours Studied) + β2(Sleep Duration) + ε

This model allows us to assess the individual and combined effects of both
predictors (X and Z) on the outcome (Y).
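
A minimal sketch of this model with statsmodels on hypothetical data (the simulated coefficients, sample size, and noise level are assumptions for illustration):
Python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student data generated with assumed coefficients
rng = np.random.default_rng(3)
n = 150
hours = rng.uniform(0, 10, n)
sleep = rng.uniform(4, 9, n)
score = 40 + 4 * hours + 3 * sleep + rng.normal(0, 5, n)
df = pd.DataFrame({"score": score, "hours": hours, "sleep": sleep})

# Multiple regression: score ~ hours + sleep
model = smf.ols("score ~ hours + sleep", data=df).fit()
print(model.summary())

# Optional interaction term: does the effect of studying depend on sleep?
interaction = smf.ols("score ~ hours * sleep", data=df).fit()
print(interaction.params)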

Implications of Adding the Third Variable:

1. Refined Insights:
o The effect of Hours Studied on Exam Score might be overestimated in
the bivariate model if Sleep Duration is not considered.
o For example, students who study more but sleep less may not perform
as well, suggesting diminishing returns after a certain point.
2. Interaction Effects:
o We can test interaction: Does the effect of studying differ based on
sleep?
For instance, an interaction term X·Z can be added to
capture how study effectiveness varies with sleep quality.
3. Confounding Control:
o Sleep Duration may be a confounding variable that influences both
Hours Studied and Exam Score. Including it in the model provides
more accurate causal interpretation.
4. Model Fit Improvement:
o Multivariate analysis often improves predictive power (higher R²),
because it explains more variance in the dependent variable.
