DS Notes: Units I, II and III (up to Part 1)

The document provides an introduction to data science, covering its definition, advantages, and disadvantages, as well as the characteristics of big data. It outlines the data science process, including steps such as data retrieval, preparation, exploration, modeling, and presentation, and discusses types of data analytics: descriptive, predictive, and prescriptive. Additionally, it addresses data types, measurement scales, and the importance of data preprocessing for ensuring data quality.

UNIT-1: Contents

UNIT I – Introduction to Data Science:


• Data Science, Characteristics of Big Data, Different Steps in the Data Science Process, Types of Data Analytics.
• Descriptive Analysis: Data Types and Scales, Types of Data Measurement Scales, Measures of Central Tendency, Measures of Variation, Similarity and Dissimilarity Measures.
• Data Preprocessing: Data Cleaning, Data Integration, Data Transformation.

Data Science, Characteristics of Big Data, Different Steps in the Data Science Process, Types of Data Analytics
1.What is Data Science?

• A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Advantages and Disadvantages of Data
Science
Advantages

• It helps us draw insights from historical data with powerful tools.
• It helps optimize the business, hire the right people, and generate more revenue, because data science supports better decisions about the future of the business.
• Companies can develop and market their products better because they can better select their target customers.
• Data science also helps consumers find better goods, especially on e-commerce sites, through data-driven recommendation systems.
Disadvantages

• The main disadvantages arise when data science is used for customer profiling and infringes on customer privacy, since information such as transactions, purchases, and subscriptions is visible to the parent companies. The information obtained through data science can also be used against a particular group, individual, country, or community.
2.What is BIG DATA?

• Big Data is a collection of data that is huge in volume, yet growing exponentially with time.
• It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.
• In short, big data is still data, but of enormous size.
• Refer to the figure for various big data sources.
Big Data
Types of Big Data:
Traits/Characteristics Of Big Data
Volume:
• The name "Big Data" itself relates to enormous size. Big data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and more.
• For example, Facebook generates approximately a billion messages, records about 4.5 billion "Like" clicks, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
Variety:
• Big data can be structured, unstructured, or semi-structured, and it is collected from different sources.
• In the past, data was collected mainly from databases and spreadsheets.
• These days, data arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity:

• Veracity means how reliable the data is.
• Big data veracity refers to the biases, noise, and abnormalities in data.
• It asks whether the data being stored and mined is meaningful to the problem being analyzed.
• Veracity is often the biggest challenge in data analysis, compared with concerns such as volume and velocity.
Velocity and value:

• Velocity: the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential of the data.
• Big data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices.
• The flow of data is massive and continuous.
• Value sits at the top of the big data pyramid.
• It refers to the ability to turn a tsunami of data into business value.
• It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyze.
3. Different Steps in the Data Science Process
Steps in Data Science Process-
six steps
1. Setting research goal
2. Retrieving data
3. Data Preparation
4. Data Exploration
5. Data Modelling
6. Presentation and Automation
Step 1: Defining the research goal and creating a project charter
Step 2- Retrieving data
Data can be stored in many forms, ranging from simple text files to
tables in a database.
The objective now is acquiring all the data you need. This may be
difficult, and even if you succeed, data is often like a diamond in the
rough: it needs polishing to be of any use to you.
Step 3: Data Preparation: Cleansing, integrating, and
transforming data

The data received from the data retrieval phase is likely to be "a diamond in the rough." Your task now is to sanitize it and prepare it for use in the modelling and reporting phases.
Step 4: Data exploration
Step 5: Data modelling
Step 6: Presentation and automation
4. Types of Data Analytics

Types of data analytics:
1. Descriptive analytics: summarizes and describes what has already happened in the data, typically using measures such as central tendency and variation.
2. Predictive analytics
3. Prescriptive analytics
Types of Data Analytics: Predictive Analytics

1. It aims to predict the probability of occurrence of a future event, such as forecasting demand for products/services, customer churn, employee attrition, loan defaults, fraudulent transactions, insurance claims, and stock market fluctuations.
2. In short, predictive analytics is used to predict what is likely to happen in the future.
Types of Data Analytics: Prescriptive Analytics

1. Prescriptive analytics is the highest level of analytics capability; it is used for choosing optimal actions once an organization has gained insights through descriptive and predictive analytics.
2. In many cases, prescriptive analytics is solved as a separate optimization problem. Prescriptive analytics assists users in finding the optimal solution to a problem or in making the right choice/decision among several alternatives.
3. Operations Research (OR) techniques form the core of prescriptive analytics. Apart from operations research techniques, machine learning algorithms, metaheuristics, and advanced statistical models are used in prescriptive analytics.
Types of data Analytics-Tools
• The most frequently used predictive analytics techniques are regression,
logistic regression, classification trees, forecasting, K-nearest neighbours,
Markov chains, random forest, boosting, and neural networks.

• The frequently used tools in prescriptive analytics are linear programming, integer programming, multi-criteria decision-making models such as goal programming and the analytic hierarchy process, combinatorial optimization, non-linear programming, and metaheuristics.
Data types and scales
Data is classified into different categories based on data structure and scale of measurement of the
variables
Structured and Unstructured Data

Structured data means that the data is described in a matrix form with
labelled rows and columns.

Any data that is not originally in matrix form with rows and columns is unstructured data.
For example: e-mails, click streams, textual data, images (photos and images generated by medical devices), log data, and videos. Machine-generated data such as satellite images, magnetic resonance imaging (MRI), electrocardiogram (ECG), and thermography are a few examples of unstructured data.
Data types and scales
Cross-sectional, Time Series, and Panel Data

The data is grouped into the following three classes:

1. Cross-Sectional Data: Data collected on many variables of interest at the same time or over the same duration is called cross-sectional data. For example, consider data on movies, such as budget, box-office collection, actors, directors, and genre, during the year 2017.
2. Time Series Data: Data collected for a single variable, such as demand for smartphones, over several time intervals (weekly, monthly, etc.) is called time series data.
3. Panel Data: Data collected on several variables (multiple dimensions) over several time intervals is called panel data (also known as longitudinal data). An example of panel data is data collected on variables such as gross domestic product (GDP), Gini index, and unemployment rate for several countries over several years.
TYPES OF DATA MEASUREMENT SCALES

Structured data can be either numeric or alphanumeric and may follow different scales of measurement (levels of measurement).

It is important to understand the type of each variable in the data with respect to its measurement scale, since the model specification when building analytics models such as regression may depend on the scale of measurement.
Attributes
• Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
• E.g., customer _ID, name, address
• Types:
  • Nominal
  • Binary
  • Ordinal
  • Discrete and Continuous
  • Numeric (quantitative):
    • Interval-scaled
    • Ratio-scaled

Types of attributes

Nominal attributes: related to names (e.g., categories or codes such as customer_ID, occupation)
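A brief sketch (with made-up customer records) of how these attribute types might look in a pandas DataFrame; the column names and values are illustrative only:

import pandas as pd

# Hypothetical customer records illustrating the attribute types above
df = pd.DataFrame({
    'customer_ID': [101, 102, 103],               # nominal identifier
    'owns_car':    [True, False, True],           # binary
    'size':        ['small', 'medium', 'large'],  # ordinal
    'visits':      [3, 7, 2],                     # numeric, discrete
    'temperature': [36.6, 37.1, 36.9],            # numeric, interval-scaled
    'income':      [45000.0, 52000.0, 61000.0],   # numeric, ratio-scaled
})

# Mark 'size' as an ordered categorical so pandas knows the ordering
df['size'] = pd.Categorical(df['size'],
                            categories=['small', 'medium', 'large'],
                            ordered=True)
print(df.dtypes)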
POPULATION AND SAMPLE

Population is the set of all possible observations (often called cases, records, subjects or data points)
for a given context of the problem. The size of the population can be very large in many cases.

The sample is a logical subset of the population that mimics the population.
Selecting a relevant sample from the population is challenging, but it makes analysis faster, more precise, and more economical.
There are standard guidelines from statisticians for calculating the appropriate sample size, choosing a sampling methodology, and selecting tools to analyze the sampled data.
MEASURES OF CENTRAL TENDENCY

Measures of central tendency are the measures that are used for describing the data using a single value. Mean,
median and mode are the three measures of central tendency and are frequently used to compare different data
sets.
Measures of central tendency help users to summarize and comprehend the data.
MEASURES OF VARIATION

Measures of variation (dispersion) describe how spread out the data values are around a central value; commonly used measures include the range, variance, standard deviation, and interquartile range.
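A minimal sketch, using a made-up list of exam scores, that computes the common measures of central tendency and variation with pandas:

import pandas as pd

# Made-up exam scores used only for illustration
scores = pd.Series([56, 61, 61, 70, 72, 75, 80, 85, 85, 85, 92])

# Measures of central tendency
print("Mean  :", scores.mean())
print("Median:", scores.median())
print("Mode  :", scores.mode().tolist())

# Measures of variation
print("Range :", scores.max() - scores.min())
print("Variance (sample):", scores.var())
print("Std. deviation   :", scores.std())
print("IQR   :", scores.quantile(0.75) - scores.quantile(0.25))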
Data Science
UNIT-I
Similarity and Dissimilarity Measures
Similarity and Dissimilarity

Proximity refers to either a similarity or a dissimilarity measure.
Similarity and Dissimilarity
Applications:
• Web search
• Computer vision:
• Image Processing
• Natural language processing
• Clustering and outlier detection
Similarity / Proximity Measures

• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numeric attributes (interval-scaled and ratio-scaled)
• Mixed attributes
• Similarity quantifies how alike two objects (data points) are. The
higher the similarity, the more alike the objects.
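A minimal sketch (using two made-up numeric vectors) of common proximity measures: Euclidean distance for dissimilarity and cosine similarity for similarity:

import numpy as np
from scipy.spatial.distance import euclidean, cosine

# Two made-up data points with numeric attributes
x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 4.0, 6.0])

# Dissimilarity: Euclidean distance (0 means identical)
print("Euclidean distance:", euclidean(x, y))

# Similarity: cosine similarity (1 means same direction)
# scipy's `cosine` returns the cosine *distance*, so similarity = 1 - distance
print("Cosine similarity :", 1 - cosine(x, y))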
Data Science
UNIT-1

Data Preprocessing
I. Data Cleaning
II. Data Integration
III. Data Reduction
IV. Data Transformation and Data Discretization
Data Quality: Why Preprocess the Data?
• There are many factors/measures comprising data quality.
• Measures for data quality (a multidimensional view):
• Accuracy: correct or wrong, accurate or not; errors may come from instruments or data transmission.
• Completeness: values not recorded or unavailable; attributes of interest to the user may not be available, leaving data unfilled.
• Consistency: some records modified but others not, dangling references, etc.
• Timeliness: is the data updated in time? For example, several sales representatives fail to submit their sales records on time at the end of the month.
  • For a period of time following each month, the data stored in the database are incomplete.
  • However, once all of the data are received, it is correct.
• Believability: how much the data are trusted to be correct.
• Interpretability: how easily the data can be understood.
  • Suppose that a database, at one point, had several errors, all of which have since been corrected.
  • The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, users may still regard the data as being of low quality because of past errors and poor interpretability.
Data Mining as Knowledge Discovery
• Data cleaning: to remove noise and irrelevant data
• Data integration: where multiple data sources may be combined
• Data selection: where data relevant to the analysis task are retrieved from the database
• Data transformation: where data are transformed or consolidated into forms appropriate for mining, e.g., by performing summary or aggregation operations
• Data mining: an essential process where intelligent methods are applied in order to extract data patterns
• Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures
• Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to users
Major Tasks/Steps in Data
Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data,
• identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

I-Data Cleaning

I- Data Cleaning: Incomplete
(Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to:
  • equipment malfunction
  • data inconsistent with other recorded data and therefore deleted
  • data not entered due to misunderstanding
  • certain data not considered important at the time of entry
  • history or changes of the data not registered
• Missing data may need to be inferred
1. Equipment Malfunction Example
Scenario: Sensor readings missing during a malfunction

import pandas as pd
import numpy as np

# Sensor data with missing values during the malfunction
data = {'Hour': [1, 2, 3, 4, 5],
        'Temp': [72, 71, np.nan, np.nan, 73]}  # Missing at hours 3-4
df = pd.DataFrame(data)

# Forward fill to approximate the missing values
df['Temp_filled'] = df['Temp'].ffill()
print(df)
2. Inconsistent Data Deletion Example
Scenario: Negative age values removed

# Dataset with invalid age


data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, -30, 35]} # Bob's age is invalid

df = pd.DataFrame(data)
clean_df = df[df['Age'] > 0] # Remove inconsistent record
print(clean_df)

3. Data Entry Misunderstanding Example
Scenario: Mixed units (kg/lbs) causing missing values

# Weight data with unit confusion


data = {'ID': [1, 2, 3],
'Weight': [68, '150lbs', np.nan]} # Missing for ID 3

df = pd.DataFrame(data)
# Convert all to kg and fill missing with average
df['Weight_kg'] = df['Weight'].replace({'150lbs': 68.04}) # 150lbs = 68.04kg
df['Weight_kg'] = pd.to_numeric(df['Weight_kg'], errors='coerce').fillna(68)
print(df)

4. Unimportant Data Example
Scenario: Optional salary field left empty

# Job application data


data = {'Applicant': ['A', 'B', 'C'],
'Salary': [80000, np.nan, 75000]} # Applicant B didn't provide

df = pd.DataFrame(data)
# Fill missing with median salary
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
print(df)

5. Unregistered Changes Example
Scenario: Missed price updates

# Product price history


data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'Price': [10.99, np.nan, 12.99]} # Missing price change on 2023-01-02

df = pd.DataFrame(data)
# Linear interpolation for missing price
df['Price'] = df['Price'].interpolate()
print(df)

6. Class-Specific Imputation with a Missingness Indicator
Scenario: Customer income missing for one record

# Customer data with missing income
data = {'Name': ['Mike', 'Jenny'],
        'Age': [40, 20],
        'Sex': ['Male', 'Female'],
        'Income': [150000, np.nan],  # Jenny's income missing
        'Class': ['Big spender', 'Regular']}

df = pd.DataFrame(data)

# Solution 1: Add a missingness indicator before imputing
df['Income_Missing'] = df['Income'].isna().astype(int)

# Solution 2: Class-specific mean imputation
# (note: if a class has no observed income at all, its mean is NaN and the value stays missing)
df['Income'] = df.groupby('Class')['Income'].transform(
    lambda x: x.fillna(x.mean()))

print(df)
Data Cleaning: How to Handle
Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing classification)
—not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision
tree, or use regression to fill

1. Ignoring the Tuple (Complete Case Analysis)
When: Missing class label in classification

import pandas as pd
import numpy as np

data = {'Age': [25, 30, 35],


'Income': [50_000, 75_000, 110_000],
'Class': ['A', np.nan, 'B']} # Missing class for 2nd record

df = pd.DataFrame(data)
clean_df = df.dropna(subset=['Class']) # Remove row with missing class
print(clean_df)

2. Manual Filling
When: Small dataset with known values

data = {'Product': ['A', 'B', 'C'],


'Price': [10.99, np.nan, 19.99]} # Missing price for Product B

df = pd.DataFrame(data)
# Manually fill based on knowledge:
df.loc[1, 'Price'] = 15.99 # We know Product B should be $15.99
print(df)

3. Global Constant
When: Categorical data or placeholder needed

data = {'City': ['NY', 'LA', np.nan, 'Chicago']}

df = pd.DataFrame(data)
df['City'] = df['City'].fillna('Unknown')
print(df)

4. Attribute Mean
When: Numerical data with random missingness

data = {'Test1': [85, 76, np.nan, 92],


'Test2': [78, np.nan, 88, 90]}

df = pd.DataFrame(data)
df['Test1'] = df['Test1'].fillna(df['Test1'].mean())
df['Test2'] = df['Test2'].fillna(df['Test2'].median()) # Using median if outliers exist
print(df)

Group by
import pandas as pd
# Sample data
data = {
'Student': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
'Subject': ['Math', 'Math', 'Science', 'Science', 'Math'],
'Score': [85, 78, 90, 88, 82]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Use groupby() to find average
score per subject:
# Group by Subject and calculate average score
avg_score = df.groupby('Subject')['Score'].mean()
print("\nAverage Score by Subject:")
print(avg_score)

df.groupby(['Student', 'Subject'])['Score'].mean()

5. Class-Specific Mean (Smarter)
When: Data grouped by categories

data = {'Grade': ['A', 'B', 'A', 'B', 'A'],


'Score': [95, 80, np.nan, 75, np.nan]}

df = pd.DataFrame(data)
df['Score'] = df.groupby('Grade')['Score'].transform(
lambda x: x.fillna(x.mean()))
print(df)
# A's NaN → 95 (mean of 95), B's NaN → 77.5 (mean of 80,75)

6. Most Probable Value (Regression)
When: Strong correlations exist

from sklearn.linear_model import LinearRegression

data = {'X': [1, 2, 3, 4, 5],


'Y': [2, 4, np.nan, 8, 10]} # Y is roughly 2*X

df = pd.DataFrame(data)
# Train on complete cases
model = LinearRegression()
model.fit(df[['X']][df['Y'].notna()], df['Y'][df['Y'].notna()])
# Predict the missing value from the fitted line
df.loc[df['Y'].isna(), 'Y'] = model.predict(df.loc[df['Y'].isna(), ['X']])
print(df)  # Missing Y at X=3 becomes 6
7. Decision Tree Imputation
When: Complex relationships

from sklearn.tree import DecisionTreeRegressor

data = {'Age': [25, 30, 35, 40],


'Experience': [1, 3, np.nan, 8],
'Salary': [50_000, 60_000, 75_000, 90_000]}

df = pd.DataFrame(data)
# Train model
model = DecisionTreeRegressor()
train = df[df['Experience'].notna()]
model.fit(train[['Age', 'Salary']], train['Experience'])
# Predict missing
df.loc[df['Experience'].isna(), 'Experience'] = model.predict(
df[df['Experience'].isna()][['Age', 'Salary']])
print(df) # Experience for Age=35/Salary=75k predicted

Data Cleaning:
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data

1. Faulty Instrument Example (Sensor Noise)
import pandas as pd
import numpy as np

# Temperature sensor data (24 hourly readings with some spikes)


data = {'Time': range(24),
'Temp': [72.1, 72.0, 72.2, 500.0, 71.9, 72.1, 72.0, -999,
72.2, 72.3, 72.1, 72.0, 72.4, 72.2, 72.3, 72.1,
72.5, 72.4, 72.3, 72.2, 72.6, 72.5, 72.4, 72.3]}

df = pd.DataFrame(data)

# Clean by clipping values to reasonable range (70-80°F)


df['Clean_Temp'] = np.clip(df['Temp'], 70, 80)

# Alternative: Replace outliers with median


median_temp = df.loc[(df['Temp'] >= 70) & (df['Temp'] <= 80), 'Temp'].median()
df['Median_Fixed'] = np.where((df['Temp'] < 70) | (df['Temp'] > 80),
median_temp,
df['Temp'])

print(df[['Time', 'Temp', 'Clean_Temp', 'Median_Fixed']].head(10))  # Show first 10 rows
2. Data Entry Problem Example (Typos)

# Product prices with typos


data = {'Product': ['A', 'B', 'C', 'D'],
'Price': [15.99, 1.599, 25.99, 2.599]} # Decimals misplaced

df = pd.DataFrame(data)

# Fix decimal positions for prices < 10


df['Clean_Price'] = np.where(df['Price'] < 10,
df['Price'] * 10,
df['Price'])
print(df)

3. Transmission Error Example (Corrupted Data)

# Data with transmission errors


data = {'ID': [101, 102, 103, 104],
        'Value': ['25.5', '3#.1', '18.2', '2A.7']}  # Corrupted characters

df = pd.DataFrame(data)

# Clean non-numeric characters


df['Clean_Value'] = pd.to_numeric(df['Value'], errors='coerce')
print(df)

4. Technology Limitation Example (Precision Errors)

# Measurements with floating-point errors


data = {'Measurement': [0.1 + 0.1 + 0.1, 0.3, 0.2 + 0.1]}  # Should all be 0.3

df = pd.DataFrame(data)

# Round to handle floating-point precision


df['Clean_Measurement'] = np.round(df['Measurement'], 2)
print(df)

5. Naming Inconsistency Example

# Product categories with inconsistencies


data = {'Product': ['Shirt', 'shirt', 'SHIRT', 'Pants', 'pants'],
'Sales': [100, 120, 80, 150, 130]}

df = pd.DataFrame(data)

# Standardize categories
df['Clean_Category'] = df['Product'].str.lower().str.strip()
sales_summary = df.groupby('Clean_Category')['Sales'].sum()
print(sales_summary)

6. Duplicate Records Example

# Customer records with duplicates


data = {'ID': [1, 2, 2, 3, 4, 4, 4],
        'Name': ['A', 'B', 'B', 'C', 'D', 'D', 'D'],
        'Purchase': [50, 30, 30, 75, 100, 100, 120]}  # Last record for ID 4 conflicts

df = pd.DataFrame(data)

# Remove exact duplicates


deduped = df.drop_duplicates()

# For conflicting duplicates, keep last


resolved = df.drop_duplicates(subset=['ID', 'Name'],
keep='last')
print("Exact duplicates removed:")
print(deduped)
print("\nConflicts resolved:")
print(resolved)

7. Incomplete Data Example

# Survey data with missing responses


data = {'Respondent': [1, 2, 3, 4, 5],
'Age': [25, 30, None, 40, None],
'Satisfaction': [4, None, 5, None, 3]}

df = pd.DataFrame(data)

# Multiple imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer()
df[['Age', 'Satisfaction']] = imputer.fit_transform(df[['Age',
'Satisfaction']])
print(df)

8. Inconsistent Data Example

# Mixed date formats


data = {'Event': ['Meeting', 'Conference', 'Workshop'],
'Date': ['2023-05-15', '15/05/2023', 'May 15 2023']}

df = pd.DataFrame(data)

# Standardize dates
df['Clean_Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)

Noise Cleaning Pipeline

def clean_dataset(df):
# Handle numeric noise
for col in df.select_dtypes(include=np.number):
df[col] = np.clip(df[col],
df[col].quantile(0.01),
df[col].quantile(0.99))

# Standardize text
for col in df.select_dtypes(include='object'):
df[col] = df[col].str.lower().str.strip()

# Remove duplicates
df = df.drop_duplicates()

return df

Data Cleaning: How to Handle
Noisy Data?
• Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
• Regression: smooth by fitting the data to regression functions.
• Clustering/outlier analysis: detect and remove outliers.
• Combined computer and human inspection: detect suspicious values and have a human check them (e.g., deal with possible outliers).
1. Binning Method (Equal-Frequency)

import pandas as pd
import numpy as np

# Create noisy data


data = {'Values': [12, 15, 18, 120, 22, 25, 28, 31, 200, 35, 38,
41]}
df = pd.DataFrame(data)

# Equal-frequency binning (3 bins)


df['Bin'] = pd.qcut(df['Values'], q=3, labels=False)

# Smooth by bin mean


df['Bin_Mean'] = df.groupby('Bin')['Values'].transform('mean')

# Smooth by bin median
df['Bin_Median'] = df.groupby('Bin')['Values'].transform('median')

print(df.sort_values('Values'))
2. Regression Smoothing

from sklearn.linear_model import LinearRegression

# Prepare data
X = df.index.values.reshape(-1, 1)
y = df['Values'].values

# Fit linear regression


model = LinearRegression()
model.fit(X, y)

# Add smoothed values


df['Regression_Smoothed'] = model.predict(X)

# Visual comparison
import matplotlib.pyplot as plt
plt.scatter(df.index, df['Values'], label='Original')
plt.plot(df.index, df['Regression_Smoothed'], color='red',
label='Smoothed')
plt.legend()
plt.show()
3. Clustering/Outlier Detection

from sklearn.cluster import DBSCAN


from sklearn.preprocessing import StandardScaler

# Scale data
scaler = StandardScaler()
scaled_values = scaler.fit_transform(df[['Values']])

# Cluster with DBSCAN (eps=0.5 detects tight clusters)


clustering = DBSCAN(eps=0.5,
min_samples=2).fit(scaled_values)
df['Cluster'] = clustering.labels_

# Identify outliers (cluster = -1)


outliers = df[df['Cluster'] == -1]
print("Detected outliers:")
print(outliers[['Values']])

# Remove outliers
clean_df = df[df['Cluster'] != -1]
4. Human-in-the-Loop Inspection

def flag_suspicious(df, column, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    z_scores = (df[column] - df[column].mean()) / df[column].std()
    df['Suspicious'] = abs(z_scores) > threshold
    return df

df = flag_suspicious(df, 'Values')
print("Values needing manual review:")
print(df[df['Suspicious']][['Values']])

# After human review, you might set confirmed outliers to NaN, e.g.:
# df.loc[df['Suspicious'] & human_confirmed_outlier, 'Values'] = np.nan
# Then fill the NaNs using one of the other methods
Data Cleaning: Handling Noisy Data:
Binning
• Data smoothing is the technique used to handle noisy data: we "smooth" out the data to remove the noise.
• Data smoothing techniques:
  • Binning
  • Regression
  • Outlier analysis
• Binning:
  • This method works on sorted data.
  • The sorted data is divided into equal-frequency buckets/bins.
  • Binning is of three types:
    1. Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
    2. Smoothing by bin medians: each bin value is replaced by the bin median.
    3. Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
Binning Methods for Noisy Data Smoothing
1. Preparing Noisy Data

import pandas as pd
import numpy as np

# Create noisy dataset


np.random.seed(42)
data = {'Values': np.concatenate([
np.random.normal(50, 5, 30), # Main cluster
[120, 3, 130] # Outliers
])}
df = pd.DataFrame(data).sort_values('Values').reset_index(drop=True)
print("Original Data:")
print(df.head(10))
2. Equal-Frequency Binning

# Create 4 equal-frequency bins
df['Bin'] = pd.qcut(df['Values'], q=4, labels=False, duplicates='drop')
bin_stats = df.groupby('Bin')['Values'].agg(['min', 'max', 'mean', 'median'])
print("\nBin Statistics:")
print(bin_stats)
3. Smoothing by Bin Means

df['Bin_Mean'] = df.groupby('Bin')['Values'].transform('mean')
print("\nSmoothing by Bin Means:")
print(df[['Values', 'Bin', 'Bin_Mean']].head(10))

4. Smoothing by Bin Medians

df['Bin_Median'] = df.groupby('Bin')['Values'].transform('median')
print("\nSmoothing by Bin Medians:")
print(df[['Values', 'Bin', 'Bin_Median']].sample(10))
5. Smoothing by Bin Boundaries

# Calculate bin boundaries
bin_boundaries = df.groupby('Bin')['Values'].agg(['min', 'max'])

def replace_with_boundary(row):
    bin_id = row['Bin']
    val = row['Values']
    boundaries = bin_boundaries.loc[bin_id]
    # Replace the value with whichever boundary is closer
    if abs(val - boundaries['min']) < abs(val - boundaries['max']):
        return boundaries['min']
    return boundaries['max']

df['Bin_Boundary'] = df.apply(replace_with_boundary, axis=1)

print("\nSmoothing by Bin Boundaries:")
print(df[['Values', 'Bin', 'Bin_Boundary']].tail(10))
6. Visual Comparison

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.scatter(df.index, df['Values'], label='Original', alpha=0.5)
plt.scatter(df.index, df['Bin_Mean'], label='Bin Means', alpha=0.7)
plt.scatter(df.index, df['Bin_Median'], label='Bin Medians', alpha=0.7)
plt.scatter(df.index, df['Bin_Boundary'], label='Bin Boundaries', alpha=0.7)
plt.legend()
plt.title("Binning Smoothing Techniques Comparison")
plt.show()
Data Cleaning: Binning methods

• Equal-width (distance) partitioning


• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
1. Equal-Width (Distance) Binning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create sample data with outliers


data = {'Values': np.concatenate([
np.random.normal(50, 10, 1000), # Main distribution
np.random.normal(200, 10, 10) # Outliers
])}
df = pd.DataFrame(data)

# Equal-width binning function


def equal_width_binning(data, col, num_bins=5):
min_val = data[col].min()
max_val = data[col].max()
width = (max_val - min_val)/num_bins
bins = [min_val + i*width for i in range(num_bins+1)]
return pd.cut(data[col], bins=bins, include_lowest=True)

df['Equal_Width_Bin'] = equal_width_binning(df, 'Values', 5)


2. Equal-Depth (Frequency) Binning

# Equal-frequency binning function


def equal_frequency_binning(data, col, num_bins=5):
return pd.qcut(data[col], q=num_bins, duplicates='drop')

df['Equal_Freq_Bin'] = equal_frequency_binning(df, 'Values', 5)

# Visualize
plt.figure(figsize=(10,5))
plt.hist(df['Values'], bins=50, alpha=0.5, label='Original Data')
for boundary in df['Equal_Freq_Bin'].cat.categories.left:
plt.axvline(x=boundary, color='green', linestyle='--',
alpha=0.5)
plt.title('Equal-Frequency Binning (Balanced Points per Bin)')
plt.legend()
plt.show()

3. Comparison of Methods

# Create comparison table


comparison = pd.DataFrame({
'Method': ['Equal-Width', 'Equal-Frequency'],
'Bin Ranges': [
str(df['Equal_Width_Bin'].cat.categories),
str(df['Equal_Freq_Bin'].cat.categories)
],
'Counts per Bin': [
df['Equal_Width_Bin'].value_counts().to_dict(),
df['Equal_Freq_Bin'].value_counts().to_dict()
]
})

print("Binning Method Comparison:")


print(comparison)

4. Handling Skewed Data
# Create right-skewed data
skewed_data = np.concatenate([
np.random.exponential(scale=10, size=900),
np.random.normal(50, 5, 100)
])
df_skewed = pd.DataFrame({'Values': skewed_data})

# Apply both methods
df_skewed['Width_Bin'] = equal_width_binning(df_skewed, 'Values')
df_skewed['Freq_Bin'] = equal_frequency_binning(df_skewed, 'Values')

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
df_skewed['Width_Bin'].value_counts().sort_index().plot(kind='bar', ax=ax1)
ax1.set_title('Equal-Width Binning (Skewed Data)')
df_skewed['Freq_Bin'].value_counts().plot(kind='bar', ax=ax2)
ax2.set_title('Equal-Frequency Binning (Skewed Data)')
plt.show()
Data Cleaning: Binning Example
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into equal-frequency bins of length 3:
  • Bin 1: 4, 8, 15 (min = 4, max = 15)
  • Bin 2: 21, 21, 24 (min = 21, max = 24)
  • Bin 3: 25, 28, 34 (min = 25, max = 34)
• Smoothing by bin means:
  • Bin 1: 9, 9, 9 (mean of Bin 1 = (4 + 8 + 15)/3 = 9)
  • Bin 2: 22, 22, 22 (mean of Bin 2 = (21 + 21 + 24)/3 = 22)
  • Bin 3: 29, 29, 29
• Smoothing by bin boundaries (using |x2 - x1| as the closeness measure):
  • Bin 1: 4, 4, 15 (4 stays at 4; 8 is closer to min = 4, so it becomes 4; 15 is closer to max = 15, so it stays 15)
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 25, 34
Data Cleaning: Handling Noisy Data:
Regression
• Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
• Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to predict
the other.
• Y=mX+c
• Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to
a multidimensional surface.
• Y=m1X1+m2X2+c
• Example: given the data
    X: 0,  1,  2,  3,   4
    Y: 0,  2,  4,  5.6, 8     (5.6 is an erroneous value)
  Using the normal equations from statistics, the best-fitted line is Y = 2X.
  Use Y = 2X to predict the correct value at X = 3, which is 6, and replace (smooth) 5.6 with 6.
Data Cleaning: Handling Noisy Data: Outlier
analysis
Outlier analysis: Outliers may be detected by clustering,
•where similar values are organized into groups, or “clusters.”
•Intuitively, values that fall outside of the set of clusters may be considered outliers

  X:      0,  1,  2,    3,  4
  Y = X:  0,  1,  0.5,  3,  4
  Y = X²: 0,  1,  0.5,  9,  16     (outlier = 0.5)

• When Y = X and Y = X² are fitted, the points satisfying these two equations are taken as two clusters, and the erroneous value 0.5, which does not fall into either cluster, is identified as an outlier.
Data Integration:

Data integration:
Combines data from multiple sources into a coherent store.
Approaches in Data Integration
1. Entity identification problem:
2. Tuple Duplication:
3. Detecting and resolving data value conflicts
4. Redundancy and Correlation Analysis

1. Entity Identification Problem
• Schema integration: mismatching attribute names
  Identify real-world entities from multiple data sources,
  e.g., Bill Clinton = William Clinton
  e.g., A.cust-id ≡ B.cust-#
  Integrate metadata from different sources
• Object matching: mismatching structure of data
  e.g., discount handling, currency type
2. Tuple Duplication
• The use of denormalized tables is another source of data redundancy.
• Inconsistencies often arise between various duplicates due to inaccurate data entry, as in the duplicated/conflicting rows below:

  Name  DOB  Branch  Occupation  Address
  A     25   HYD     Govt        TPG
  B     30   TH      Govt        RJY
  A     25   HYD     Private     TPG
  D     30   IBP     Private     RJY
3. Detecting and resolving data
value conflicts

For the same real-world entity, attribute values from different sources may differ.
Possible reasons:
• Different representations, e.g., total sales for a month reported for a single store vs. all stores
• Different scales, e.g., metric vs. British units, or GPA scales in the US and China
4. Redundancy and Correlation Analysis

• Redundant data occur often when integration of multiple databases


• Object identification: The same attribute or object may have different
names in different databases
• Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis
Correlation analysis is a method of statistical evaluation used to study the strength of the relationship between two numerically measured, continuous variables.
• Computed as a correlation coefficient
• Its value ranges between -1 and +1
• Variables can be positively correlated, negatively correlated, or not correlated
• The strength of a correlation indicates how strong the relationship is between the two variables; the strength is determined by the numerical value of the correlation coefficient.
Example problem:
Business problem: The healthcare industry wants to develop a medication to control glucose levels. For this, it wants to study whether age has an impact on the rise in glucose levels.

  SUBJECT  AGE (X)  GLUCOSE LEVEL (Y)   XY      X²      Y²
  1        43       99                  4257    1849    9801
  2        21       65                  1365    441     4225
  3        25       79                  1975    625     6241
  4        42       75                  3150    1764    5625
  5        57       87                  4959    3249    7569
  6        59       81                  4779    3481    6561
  Σ        247      486                 20485   11409   40022

From the table: Σx = 247, Σy = 486, Σxy = 20,485, Σx² = 11,409, Σy² = 40,022, and the sample size n = 6.

The correlation coefficient is
  r = [n Σxy - (Σx)(Σy)] / sqrt{[n Σx² - (Σx)²] [n Σy² - (Σy)²]}
    = [6(20,485) - (247)(486)] / sqrt{[6(11,409) - 247²] [6(40,022) - 486²]}
    ≈ 0.5298   (giving both strength and direction)

The range of the correlation coefficient is from -1 to 1. Here the result is 0.5298, or 52.98%, which means the variables have a moderate positive correlation. With a correlation of only 52.98%, we cannot conclude that age has an impact on the rise in glucose levels; we need more data to analyze.
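The same coefficient can be checked programmatically; a minimal sketch using the age and glucose values from the table above:

import pandas as pd

# Age (x) and glucose level (y) values from the worked example above
df = pd.DataFrame({'age':     [43, 21, 25, 42, 57, 59],
                   'glucose': [99, 65, 79, 75, 87, 81]})

# Pearson correlation coefficient (should be about 0.53, matching the hand calculation)
r = df['age'].corr(df['glucose'])
print(f"Correlation coefficient: {r:.4f}")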
Redundancy and Correlation Analysis:
Chi-square Test
• A correlation relationship between two categorical (discrete) attributes, A and B, can be discovered by a χ² (chi-square) test.
III. Data Reduction
Data reduction strategies:
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
• Why data reduction? A database/data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies:
  1. Data cube aggregation
  2. Dimensionality reduction: remove unimportant attributes/variables
     • Eliminate redundant attributes, i.e., those that are only weakly important across the data
     • Wavelet transforms / data compression
     • Principal Component Analysis (PCA)
     • Feature subset selection, feature creation
  3. Numerosity reduction: replace the original data volume by smaller forms of data
     • Regression and log-linear models
     • Histograms, clustering, sampling
     • Data cube aggregation
1. Data Cube Aggregation
For example, the data consists of AllElectronics sales per quarter for the years 2014 to 2017. You are, however, interested in the annual sales rather than the totals per quarter. Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.

  Quarter / Year   2014   2015   2016   2017        Year   Sales
  Quarter 1        200    210    320    230         2014   1640
  Quarter 2        400    440    480    420         2015   1710
  Quarter 3        480    480    540    460         2016   2020
  Quarter 4        560    580    680    640         2017   1750
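A minimal pandas sketch of the same quarter-to-year aggregation, using the figures from the table above in long form:

import pandas as pd

# AllElectronics quarterly sales from the table above, in long form
sales = pd.DataFrame({
    'Year':    [2014]*4 + [2015]*4 + [2016]*4 + [2017]*4,
    'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'] * 4,
    'Sales':   [200, 400, 480, 560,
                210, 440, 480, 580,
                320, 480, 540, 680,
                230, 420, 460, 640],
})

# Aggregate to annual sales (1640, 1710, 2020, 1750)
annual = sales.groupby('Year')['Sales'].sum()
print(annual)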
2. Dimensionality Reduction
• Know about Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• What is Dimensionality reduction
• Method of eliminating irrelevant features so as to reduce noise
• Is proposed to avoid the curse of dimensionality
• Reduce time and space required in data mining
• Allow easier visualization of data (quite messy to visualize huge data)
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)

2. Normalization
• An attribute is normalized by scaling its values so that they fall within a small specified
range, such as 0.0 to 1.0.
• Normalization is particularly useful for classification algorithms involving
– neural networks
– distance measurements such as nearest-neighbor classification and clustering.
• If using the neural network backpropagation algorithm for classification mining,
normalizing the input values for each attribute measured in the training instances will
help speed up the learning phase.
• For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from out-weighing attributes with initially smaller ranges (e.g.,
binary attributes).
• Normalization methods
  I. Min-max normalization
  II. z-score normalization
  III. Normalization by decimal scaling
Min-max Normalization
• Min-max normalization performs a linear transformation on the original data.
• Suppose that minA and maxA are the minimum and maximum values of an attribute, A.
• Min-max normalization maps a value, v, of A to v' in the range [new_minA, new_maxA] by computing:

  v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Normalization
• Min-max normalization to [new_minA, new_maxA]:
  v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
  Example: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
  v' = (v - μ_A) / σ_A
  Example: Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to
  (73,600 - 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
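A small numpy sketch applying the three methods to the income example above (range $12,000 to $98,000, μ = 54,000, σ = 16,000); the array of incomes is illustrative:

import numpy as np

incomes = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
min_a, max_a = incomes.min(), incomes.max()
minmax = (incomes - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0

# Z-score normalization with the given mean and standard deviation
mu, sigma = 54000.0, 16000.0
zscore = (incomes - mu) / sigma

# Decimal scaling: divide by 10^j so that max(|v'|) < 1 (here j = 5)
j = int(np.ceil(np.log10(np.abs(incomes).max())))
decimal_scaled = incomes / 10**j

print(minmax)          # 73,600 maps to about 0.716
print(zscore)          # 73,600 maps to 1.225
print(decimal_scaled)  # all values below 1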
3. Data Aggregation
• On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the
annual sales
• Sales data for a given branch of AllElectronics for the years 2002 to 2004.

• Data cubes store multidimensional aggregated information.


• Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical
processing as well as data mining.
• A data cube for sales at AllElectronics.
4. Attribute Construction
• Attribute construction (feature construction)
– new attributes are constructed from the given attributes and added in order to help improve
the accuracy and understanding of structure in high-dimensional data.
• Example
– we may wish to add the attribute area based on the attributes height and width.
• By attribute construction can discover missing information.
• Why attribute subset selection
– Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to
the mining task or redundant.
• For example,
– if the task is to classify customers as to whether or not they are likely to purchase a popular
new CD at AllElectronics when notified of a sale, attributes such as the customer’s
telephone number are likely to be irrelevant, unlike attributes such as age or music_taste.
Attribute Subset Selection
• Using domain expert to pick out some of the useful attributes
– Sometimes this can be a difficult and time-consuming task, especially when the behavior of the
data is not well known.
Leaving out relevant attributes or keeping irrelevant attributes results in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or redundant attributes can slow down the
mining process.
• Attribute subset selection (feature selection):
– Reduce the data set size by removing irrelevant or redundant attributes
• Goal:
– select a minimum set of features (attributes) such that the probability distribution of different
classes given the values for those features is as close as possible to the original distribution
given the values of all features
– It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Attribute Subset Selection
• How can we find a 'good' subset of the original attributes?
  – For n attributes, there are 2^n possible subsets.
  – An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n increases.
  – Heuristic methods that explore a reduced search space are commonly used for attribute subset selection.
  – These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time.
  – Such greedy methods are effective in practice and may come close to estimating an optimal solution.
• Heuristic methods:
  1. Step-wise forward selection
  2. Step-wise backward elimination
  3. Combining forward selection and backward elimination
  4. Decision-tree induction
• The "best" (and "worst") attributes are typically determined using:
  – tests of statistical significance, which assume that the attributes are independent of one another.
  – the information gain measure used in building decision trees for classification.
Attribute Subset Selection
• Stepwise forward selection:
  – The procedure starts with an empty set of attributes as the reduced set.
  – First, the best single feature is picked.
  – At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
• Stepwise backward elimination:
  – The procedure starts with the full set of attributes.
  – At each step, it removes the worst attribute remaining in the set.
• Combining forward selection and backward elimination:
  – The stepwise forward selection and backward elimination methods can be combined.
  – At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes. (A sketch of forward selection with scikit-learn is shown below.)
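A hedged sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector on a toy dataset; the dataset and estimator choice are illustrative, not prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedy forward selection: start empty, add the best attribute at each step
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward')     # use direction='backward' for backward elimination
selector.fit(X, y)

print("Selected feature indices:", selector.get_support(indices=True))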
Attribute Subset Selection
• Decision tree induction:
– Decision tree algorithms, such as ID3, C4.5, and CART,
were originally intended for classification.
– Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a test
on an attribute, each branch corresponds to an outcome of
the test, and each external (leaf) node denotes a class
prediction.
– At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
– When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
– All attributes that do not appear in the tree are assumed to
be irrelevant.

5. Generalization
• Generalization is the generation of concept hierarchies for categorical data
• Categorical attributes have a finite (but possibly large) number of distinct
values, with no ordering among the values.
• Examples include:
  – geographic location,
  – job category, and
  – item type.
• A relational database or a dimension location of a data warehouse may contain the
following group of attributes: street, city, province or state, and country.
• A user or expert can easily define a concept hierarchy by specifying ordering of the
attributes at the schema level.
• A hierarchy can be defined by specifying the total ordering among these attributes at the
schema level, such as:
◆ street < city < province or state < country

6. Discretization
• Three types of attributes:
  • Nominal: values from an unordered set, e.g., color, profession
  • Ordinal: values from an ordered set, e.g., military or academic rank
  • Continuous: real numbers, e.g., measurements such as height or temperature

• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes
  • Discretization also reduces data size
UNIT II – Exploratory Data
Analysis:
Hypothesis testing: t-Test, z-Test, Chi-Square-Test. Analysis of Variance
(ANOVA): One-way, Two-way. Multivariate Analysis: Mean Vector,
Covariance, Correlation and Precision Matrices, Multivariate Data,
Parameter Estimation, Estimation of Missing Values, Multivariate
Normal Distribution. Dimensionality Reduction: Principal Component
Analysis and Multi-Dimensional Scaling.

Data Science
UNIT-II
Hypothesis Testing
(Introduction, z-Test, t-Test, Chi-Square Test)
Hypothesis Testing?
Hypothesis testing is a statistical method used to make decisions based
on data.
We always start with two competing statements:
•Null Hypothesis (H₀): There is no effect or no difference.
•Alternative Hypothesis (H₁ or Hₐ): There is an effect or a difference.

A vaccine company claims that their vaccine is at least 80% effective in
preventing COVID-19 infections. We want to test if this claim is statistically
valid using data from a sample of 1000 people who got the vaccine.

from statsmodels.stats.proportion import proportions_ztest

# Data
success = 770 # people not infected
n = 1000 # total vaccinated

# Test
stat, p_value = proportions_ztest(count=success, nobs=n, value=0.80, alternative='smaller')

print(f"Z-statistic: {stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Decision
if p_value < 0.05:
    print("Reject H₀: Evidence that vaccine effectiveness is less than 80%")
else:
    print("Fail to reject H₀: No evidence that effectiveness is below 80%")
t-Test
A t-Test is a hypothesis test used to compare means when the
population standard deviation is unknown and the sample size is small (n
< 30 usually)
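A minimal one-sample t-test sketch with scipy; the sample values and the hypothesized mean of 50 are made up for illustration:

from scipy import stats

# Small made-up sample (n < 30, population standard deviation unknown)
sample = [52, 48, 51, 55, 47, 53, 49, 50, 54, 46]

# H0: population mean = 50 vs H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value    : {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0")
else:
    print("Fail to reject H0")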

z-Test
A z-Test is a statistical test used to determine whether there is a
significant difference between sample and population means, or
between two sample means, when:
•The population standard deviation is known.
•The sample size is large (n ≥ 30) (Central Limit Theorem applies).
When to Use z-Test?
The data is normally distributed (or large enough sample size).
•The population standard deviation (σ) is known.
•You're testing:
•Sample mean vs Population mean
•Two sample means

Types of z-Tests
•One-sample z-Test: Compare sample mean with population mean.
•Two-sample z-Test: Compare means from two independent samples.
•z-Test for proportions (e.g., testing success rates).

The average batting score of a professional cricket player is believed to
be 40 runs per innings. A coach thinks that a particular player, Player
A, has improved and now scores more on average. To test this, he
collects data from the last 36 innings of Player A.

We want to compare Player A and Player B to check if Player A scores
more on average than Player B.

Player B (n₂ = 36 innings):

Chi-Square Test
The Chi-Square (χ²) Test is a non-parametric statistical test used to
compare observed data with expected data to determine if there’s a
significant association or difference.

• A sports analyst wants to know if there’s a relationship between a
cricket player's batting position and their performance (above
average vs below average).

DATA SCIENCE
UNIT II

ANOVA
From: Web resources
Contents

• Introduction
• Types of ANOVA
• ANOVA Procedure
• One way ANOVA
• Two way ANOVA
• Application
Introduction
• ANOVA is a statistical test for estimating how a quantitative dependent variable
changes according to the levels of one or more categorical independent variables.
• ANOVA tests whether there is a difference in means of the groups at each level of
the independent variable.
• Analysis of variance (abbreviated as ANOVA)
• An extremely useful technique concerning researches in the many fields of economics,
biology, education, psychology, sociology, business/industry and in researches of several
other disciplines.
• This technique is used when multiple sample cases are involved.
• The ANOVA technique enables us to examine the significance of the difference among more than two sample means at the same time.

• Using this technique, one can draw inferences about whether the samples have been
drawn from populations having the same mean.
What is ANOVA ?
• Variance is an important statistical measure, described as the mean of the squares of deviations taken from the mean of the given series of data. It is a frequently used measure of variation; the square of the standard deviation is called the variance.
• i.e., Variance = (standard deviation)².
• There may be variation between samples and also within sample items.
• In other words, ANOVA helps us figure out whether we need to reject the null hypothesis or accept the alternate hypothesis.
ANOVA
Examples:
A group of psychiatric patients are trying three different therapies: counseling,
medication and biofeedback. We want to see if one therapy is better than the
others.
A manufacturer has two different processes to make light bulbs. They want to know
if one process is better than the other.
Students from different colleges take the same exam. You want to see if one college
outperforms the other.
Types of
ANOVA
ANOVA is of two types:
• One-way ANOVA:
  • Only one factor is investigated.
  • One independent variable (with two or more levels).
  • The analysis of variance has one independent variable.
• Two-way ANOVA:
  • Investigates two factors at the same time.
  • Two independent variables (each can have multiple levels).
  • The analysis of variance has two independent variables.
  I. Two-way ANOVA without replication
  II. Two-way ANOVA with replication
Analysis of Variance (ANOVA)
•ANOVA is a statistical method used to compare means among different
groups to see if there is a significant difference between them. It helps
determine if the observed differences in sample means are likely to occur
by chance or if they reflect true differences in the population means.
•There are two main types of ANOVA:
One-Way ANOVA: Compares means among three or more independent
groups based on one factor.
Two-Way ANOVA: Compares means among groups that differ on two
factors, which can help understand the interaction between these factors.
• One-Way ANOVA Example
• Scenario: We want to determine if there is a significant difference in
the average number of COVID-19 cases across three regions.
• Hypothetical Data:
Region A: [120, 135, 150, 160, 145]
Region B: [200, 210, 205, 215, 190]
Region C: [300, 320, 310, 330, 290]
Step-by-Step Calculation:
•State the Hypotheses:
• Null Hypothesis (H0): The means of COVID-19 cases across the regions are equal.
• Alternative Hypothesis (H1): At least one region's mean is different.
•Calculate the Group Means:
• Mean of Region A = (120 + 135 + 150 + 160 + 145) / 5 = 142
• Mean of Region B = (200 + 210 + 205 + 215 + 190) / 5 = 204
• Mean of Region C = (300 + 320 + 310 + 330 + 290) / 5 = 310
•Calculate the Overall Mean:
• Overall Mean = (Sum of all data points) / Total number of data points
• Overall Mean = (120+135+150+160+145+200+210+205+215+190+300+320+310+330+290) /
15 = 218.67
•Calculate the Between-Group Variability (SSB):
•SSB = n * Σ(Mean of each group - Overall Mean)²
•SSB = 5 * ((142 - 218.67)² + (204 - 218.67)² + (310 - 218.67)²)
•SSB = 5 * (5877.78 + 215.11 + 8341.78) = 72173.33
Calculate the Within-Group Variability (SSW):
•SSW = Σ(Each value - Mean of its group)²
•For Region A: (120-142)² + (135-142)² + (150-142)² + (160-142)² + (145-142)² = 930
•For Region B: (200-204)² + (210-204)² + (205-204)² + (215-204)² + (190-204)² = 370
•For Region C: (300-310)² + (320-310)² + (310-310)² + (330-310)² + (290-310)² = 1000
•Total SSW = 930 + 370 + 1000 = 2300

•Calculate the F-Ratio:
•F = (SSB / df between groups) / (SSW / df within groups)
•F = (72173.33 / 2) / (2300 / 12) = 36086.67 / 191.67 ≈ 188.3
•Interpret the Result: Compare the F-ratio to the critical F-value from the
F-distribution table at a significance level of 0.05. Here the degrees of freedom are (2, 12),
so the critical value is about 3.89; since the calculated F-value is far greater, we reject the null hypothesis.
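The same result can be checked in Python with scipy's f_oneway (a minimal sketch, assuming scipy is installed); it reproduces the F-statistic computed by hand above.

from scipy.stats import f_oneway

# COVID-19 case counts for the three regions (same data as the worked example)
region_a = [120, 135, 150, 160, 145]
region_b = [200, 210, 205, 215, 190]
region_c = [300, 320, 310, 330, 290]

f_stat, p_value = f_oneway(region_a, region_b, region_c)
print("F =", round(f_stat, 2))   # about 188.3
print("p-value =", p_value)      # far below 0.05, so H0 is rejected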
Two-Way ANOVA
Scenario
•We have data from different regions and different lockdown levels.
Our goal is to see:
•If the region affects the number of COVID-19 cases.
•If the lockdown level affects the number of COVID-19 cases.
•If there's an interaction effect between the region and lockdown level.
In this example, the F-ratios tell us how much the variability in the
number of COVID-19 cases can be explained by each factor and their
interaction. You would then compare these F-ratios to critical values
from the F-distribution table to determine statistical significance.
Two-Way ANOVA Example
•Scenario: Now, we want to see the effect of both region and lockdown
level on the number of COVID-19 cases. We'll consider two regions and
two lockdown levels.
EXTRA EXAMPLES

Two-way ANOVA with replication
Example 2: Set up the ANOVA table for the following information relating to three drugs tested to judge their effectiveness in reducing blood pressure for three different groups of people.

| Group of people | Drug X | Drug Y | Drug Z |
| A | 14, 15 | 10, 9 | 11, 11 |
| B | 12, 11 | 7, 8 | 10, 11 |
| C | 10, 11 | 11, 11 | 8, 7 |
Computation for two-way ANOVA with repeated values

Step (i): T = 187, n = 18

Step (ii): Correction factor = T²/n = 187*187/18 = 1942.72

Step (iii): SS between columns (i.e., between drugs)
= (73*73/6 + 56*56/6 + 58*58/6) - (187*187/18)
= 888.16 + 522.66 + 560.67 - 1942.72
= 28.77

Step (iv): SS between rows (i.e., between groups of people)
= (70*70/6 + 59*59/6 + 58*58/6) - (187*187/18)
= 816.67 + 580.16 + 560.67 - 1942.72
= 14.78

Step (v): Total SS
= (14² + 15² + 12² + 11² + 10² + 11² + 10² + 9² + 7² + 8² + 11² + 11² + 11² + 11² + 10² + 11² + 8² + 7²) - (187*187/18)
= 2019 - 1942.72
= 76.28

Step (vi): SS within samples (error)
= (14-14.5)² + (15-14.5)² + (10-9.5)² + (9-9.5)² + (11-11)² + (11-11)² + (12-11.5)² + (11-11.5)² + (7-7.5)² + (8-7.5)² + (10-10.5)² + (11-10.5)² + (10-10.5)² + (11-10.5)² + (11-11)² + (11-11)² + (8-7.5)² + (7-7.5)²
= 3.50

Step (vii): SS for interaction
= Total SS - [SS between drugs + SS between people + SS within samples]
= 76.28 - [28.77 + 14.78 + 3.50] = 29.23
Table: The ANOVA Table

| Source of variation | SS | d.f. | MS | F-ratio | 5% F-limit |
| Between columns (i.e., between drugs) | 28.77 | (3-1) = 2 | 28.77/2 = 14.385 | 14.385/0.389 = 36.9 | F(2,9) = 4.26 |
| Between rows (i.e., between people) | 14.78 | (3-1) = 2 | 14.78/2 = 7.390 | 7.390/0.389 = 19.0 | F(2,9) = 4.26 |
| Interaction | 29.23 | (3-1)(3-1) = 4 | 29.23/4 = 7.308 | 7.308/0.389 = 18.786 | F(4,9) = 3.63 |
| Within samples (error) | 3.50 | (18-9) = 9 | 3.50/9 = 0.389 |  |  |
| Total | 76.28 | (18-1) = 17 |  |  |  |
Conclusion

The above table shows that all three F-ratios are significant at the 5% level, which means that:

- the drugs act differently,

- different groups of people are affected differently, and

- the interaction term is significant.
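This table can be reproduced in Python with statsmodels (a minimal sketch, assuming pandas and statsmodels are installed; the column names and formula below are illustrative choices, not part of the original example).

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Blood-pressure readings from the worked example: two observations per drug/group cell
readings = {
    ('A', 'X'): [14, 15], ('A', 'Y'): [10, 9],  ('A', 'Z'): [11, 11],
    ('B', 'X'): [12, 11], ('B', 'Y'): [7, 8],   ('B', 'Z'): [10, 11],
    ('C', 'X'): [10, 11], ('C', 'Y'): [11, 11], ('C', 'Z'): [8, 7],
}
rows = [{'group': g, 'drug': d, 'bp': bp}
        for (g, d), values in readings.items() for bp in values]
df = pd.DataFrame(rows)

# Two-way ANOVA with interaction: main effects for drug and group plus drug:group
model = ols('bp ~ C(drug) * C(group)', data=df).fit()
print(anova_lm(model, typ=2))   # SS, d.f. and F-ratios match the table above (up to rounding)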


Steps in Hypothesis Testing

 1. Specify the population value of interest
 2. Formulate the appropriate null and alternative hypotheses
 3. Specify the desired level of significance
 4. Determine the rejection region
 5. Obtain sample evidence and compute the test statistic
 6. Reach a decision and interpret the result
Steps in Hypothesis Testing

A statistical hypothesis is a conjecture about a population parameter. This conjecture may or may not be true.

● The null hypothesis, symbolized by H0, is a statistical hypothesis that states that there is no difference between a parameter and a specific value, or that there is no difference between two parameters.

● The alternative hypothesis, symbolized by H1, is a statistical hypothesis that states a specific difference between a parameter and a specific value, or states that there is a difference between two parameters.
Steps in Hypothesis Testing

● A statistical test uses the data obtained from a sample to make a decision about whether or not the null hypothesis should be rejected.

● The numerical value obtained from a statistical test is called the test value.

● In the hypothesis-testing situation, there are four possible outcomes.

● In reality, the null hypothesis may or may not be true, and a decision is made to reject or not to reject it on the basis of the data obtained from a sample.
Steps in Hypothesis Testing

|                  | H0 True          | H0 False         |
| Reject H0        | Type I error     | Correct decision |
| Do not reject H0 | Correct decision | Type II error    |
Steps in Hypothesis Testing

● A Type I error occurs if one rejects the null hypothesis when it is true.

● A Type II error occurs if one does not reject the null hypothesis when it is false.

● The level of significance is the maximum probability of committing a Type I error. This probability is symbolized by α (Greek letter alpha). That is, P(Type I error) = α.

● P(Type II error) = β (Greek letter beta).
Steps in Hypothesis Testing

● Typical significance levels are 0.10, 0.05, and 0.01.

● For example, when α = 0.10, there is a 10% chance of rejecting a true null hypothesis.

● The critical value(s) separates the critical region from the noncritical region.

● The symbol for critical value is C.V.

● The critical or rejection region is the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected.

● The noncritical or nonrejection region is the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis should not be rejected.
Critical Value Approach
to Testing
 Convert the sample statistic (e.g., x̄) to a test statistic (Z or t statistic).
 Determine the critical value(s) for a specified level of significance α from a table or computer.
 If the test statistic falls in the rejection region, reject H0; otherwise do not reject H0.
Lower Tail
Tests
H0: μ ≥ 3    HA: μ < 3

● The cutoff value, -zα or x̄α, is called a critical value.
● The rejection region lies in the lower tail: reject H0 if the test statistic falls below -zα; otherwise do not reject H0.

z = (x̄ - μ) / (σ / √n)
Hypothesis Testing - Example

A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average of the monthly heating bills is $78, her hypotheses about heating costs will be

● H0: μ ≥ $78    H1: μ < $78

This is a left-tailed test.
Upper Tail
Tests
H0: μ ≤ 3    HA: μ > 3

● The cutoff value, zα or x̄α, is called a critical value.
● The rejection region lies in the upper tail: reject H0 if the test statistic falls above zα; otherwise do not reject H0.

z = (x̄ - μ) / (σ / √n)
Hypothesis Testing -
Example
A chemist invents an additive to increase the life of an automobile battery. If the mean lifetime of the battery is 36 months, then his hypotheses are

H0: μ ≤ 36    H1: μ > 36

This is a right-tailed test.
Two Tailed
Tests
H0: μ = 3    HA: μ ≠ 3

● There are two cutoff values (critical values): ±zα/2, or x̄α/2 (lower) and x̄α/2 (upper), with α/2 in each tail.
● Reject H0 if the test statistic falls below -zα/2 or above +zα/2; otherwise do not reject H0.

z = (x̄ - μ) / (σ / √n)
Hypothesis Testing -
Example
A medical researcher is interested in finding out
whether a new medication will have any
undesirable side effects. The researcher is
particularly concerned with the pulse rate of the
patients who take the medication.
What are the hypotheses to test whether the
pulse rate will be different from the mean pulse
rate of 82 beats per minute?
●H0: μ = 82    H1: μ ≠ 82

This is a two-tailed test.
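A minimal sketch of the critical-value approach for this two-tailed test; the sample mean, population standard deviation, and sample size below are made-up values used only to show the mechanics.

import math
from scipy.stats import norm

mu0 = 82        # hypothesized mean pulse rate (H0)
x_bar = 84      # hypothetical sample mean
sigma = 5       # assumed known population standard deviation
n = 50          # hypothetical sample size
alpha = 0.05

# Test statistic: z = (x_bar - mu0) / (sigma / sqrt(n))
z = (x_bar - mu0) / (sigma / math.sqrt(n))
z_crit = norm.ppf(1 - alpha / 2)   # two-tailed critical value, about 1.96

print("z =", round(z, 3), " critical values = ±", round(z_crit, 3))
if abs(z) > z_crit:
    print("Reject H0: the mean pulse rate appears to differ from 82.")
else:
    print("Do not reject H0.")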

Data Science
Unit-II
Multivariate Analysis

Multivariate Data
Multivariate data refers to data that involves multiple variables or
measurements. Each observation in the dataset has values for multiple
attributes. Analyzing multivariate data is essential in many fields, including
statistics, data science, machine learning, and various scientific disciplines.
Characteristics of Multivariate Data
1.Multiple Variables: Multivariate data contains more than one variable or
attribute per observation.
2.Complexity: The relationships between variables can be complex,
requiring sophisticated statistical and computational methods for analysis.
3.Higher Dimensionality: With more variables, the data becomes high-
dimensional, which can lead to challenges in visualization and
interpretation.
Examples: healthcare, finance, and environmental data
Multivariate Analysis
• Multivariate analysis encompasses all statistical techniques
that are used to analyze more than two variables at
once.
• The aim is to find patterns and correlations between several
variables simultaneously—allowing for a much deeper, more
complex understanding of a given scenario than you’ll get
with bivariate analysis.
There are three categories of analysis to be aware of:
• Univariate analysis, which looks at just one variable. Ex: bar charts, pie charts, histograms
• Bivariate analysis, which analyzes two variables. Ex: scatter plots, regression and correlation analysis
• Multivariate analysis, which looks at more than two variables. Ex: clustering, etc.

import numpy as np

# Toy dataset: each row is one observation of [height (cm), weight (kg)]
data = np.array([[150, 50],
                 [160, 60],
                 [170, 65]])

# Mean Vector
mean_vector = np.mean(data, axis=0)
print("Mean Vector:\n", mean_vector)

# Correlation matrix of the variables (columns); data.T puts variables in rows as np.corrcoef expects
corr_matrix = np.corrcoef(data.T)
print("Correlation Matrix:\n", corr_matrix)

Height and Weight have a very strong positive correlation (~0.98), meaning as height increases, weight also
tends to increase significantly.

# Precision matrix = inverse of the covariance matrix computed above
precision_matrix = np.linalg.inv(cov_matrix)
print("Precision Matrix:\n", precision_matrix)

Data Science
Unit-II
Principal Component Analysis (PCA)
Dimensionality Reduction
From : Web resources

Dimensionality Reduction: Principal
Component Analysis (PCA) and
Multi-Dimensional Scaling (MDS)
Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of
random variables under consideration.
It can be divided into:
•Feature selection: Selecting a subset of the original variables.
•Feature extraction: Transforming data into a lower-dimensional
space.

Why do we need Dimensionality Reduction?
•Visualization of high-dimensional data (e.g., 2D or 3D).
•Noise reduction: Eliminate less informative features.
•Computation efficiency: Faster training and prediction.
•Avoiding the curse of dimensionality: In high dimensions, data
becomes sparse.
Principal Component Analysis (PCA)
•PCA is a linear transformation technique.
•It projects the data into a lower-dimensional space while preserving
maximum variance.

Mathematical Steps:
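The standard steps are: (1) center (or standardize) the data, (2) compute the covariance matrix, (3) find its eigenvalues and eigenvectors, (4) sort the eigenvectors by decreasing eigenvalue and keep the top k, and (5) project the data onto those k components. A minimal numpy sketch of these steps follows; the small dataset is made up for illustration.

import numpy as np

# Made-up dataset: rows are observations, columns are features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data (subtract each feature's mean)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered.T)

# 3. Eigen-decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by decreasing eigenvalue and keep the top k
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 1
W = eigenvectors[:, :k]

# 5. Project the centered data onto the top k principal components
X_reduced = X_centered @ W
print("Explained variance ratio:", eigenvalues[:k] / eigenvalues.sum())
print("Reduced data:\n", X_reduced)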

Dimensionality Reduction:
In machine learning classification problems, there are often too many factors on the basis of which the final
classification is done. These factors are basically variables called features. The higher the number of features,
the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are
correlated, and hence redundant. This is where dimensionality reduction algorithms come into play.

Dimensionality reduction is the process of reducing the number of random variables under consideration, by
obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

•Problems with high-dimensional data:
•It can mean high computational cost to perform learning.
•It often leads to over-fitting when learning a model, which means that the model will perform well on
the training data but poorly on test data.
•Data are rarely randomly distributed in high dimensions and are highly correlated, often with
spurious correlations.
•The distances between the nearest and farthest data points can become nearly equal in high dimensions,
which can hamper the accuracy of some distance-based analysis tools.

Importance of Dimensionality reduction?
•It reduces the time and storage space required.
•It helps remove multi-collinearity, which improves the interpretation of the parameters of the
machine learning model.
•It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
•It avoids the curse of dimensionality.
•It removes irrelevant features from the data, Because having irrelevant features in the data can
decrease the accuracy of the models and make your model learn based on irrelevant features.

Methods for Dimensionality Reduction (see the sketch below):

1. Principal Component Analysis (PCA)
2. Multi-Dimensional Scaling (MDS)
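Both methods are available in scikit-learn; a minimal sketch, assuming scikit-learn is installed (the built-in Iris data is used only as a convenient 4-dimensional example).

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

# Standardize the 4-dimensional Iris features, then reduce to 2 dimensions with each method
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X)
print("PCA output shape:", X_pca.shape, " MDS output shape:", X_mds.shape)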

UNIT III – Predictive Analysis-I
Simple linear Regression: Model Building, Estimation of Parameters,
Interpret coefficients, Validation of model, Outlier analysis. Bias,
variance, and trade-off, Gradient descent, over and under fitting
models.

Validation of Linear Regression Model
Bias, Variance, and the Trade-off
Overfitting and Underfitting
