DS Notes - Units I, II and III (up to Part 1)
• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Advantages and Disadvantages of Data
Science
Advantages
•It helps us gain insights from historical data with its powerful tools.
•It helps optimize business operations, hire the right people, and generate more revenue, since using data science supports better future decisions for the business.
•Companies can develop and market their products better, as they can better select their target customers.
•Data science also helps consumers search for better goods, especially on e-commerce sites, through data-driven recommendation systems.
Disadvantages
•Disadvantages arise mainly when data science is used for customer profiling and infringement of customer privacy, since customer information, such as transactions, purchases, and subscriptions, is visible to the companies that hold it. The information obtained using data science can also be used against a certain group, individual, country, or community.
2. What is BIG DATA?
Types of Data Analytics - PREDICTIVE ANALYTICS
1. It aims to predict the probability of occurrence of a future event, such as forecasting demand for products/services, customer churn, employee attrition, loan defaults, fraudulent transactions, insurance claims, and stock market fluctuations.
2. Predictive analytics is used for predicting what is likely to happen in the future.
Types of Data Analytics - PRESCRIPTIVE ANALYTICS
Types of Data: Structured vs. Unstructured
Structured data is data that can be described in matrix form with labelled rows and columns.
Any data that is not originally in matrix form with rows and columns is unstructured data.
For example, e-mails, click streams, textual data, images (photos and images generated by medical devices), log data, and videos. Machine-generated data such as images generated by satellites, magnetic resonance imaging (MRI), electrocardiography (ECG), and thermography are a few examples of unstructured data.
Data types and scales
Cross-sectional, Time Series, and Panel Data
Types of attributes
Nominal Attributes - related to names
POPULATION AND SAMPLE
Population is the set of all possible observations (often called cases, records, subjects or data points)
for a given context of the problem. The size of the population can be very large in many cases.
The sample is a logical subset of the population, which mimics the population.
Selecting a relevant sample out of the population is challenging, but it makes analysis faster, more precise, and more economical.
There are standard guidelines from statisticians for calculating an appropriate sample size, choosing a sampling methodology, and selecting tools to analyze the sampled data.
MEASURES OF CENTRAL TENDENCY
Measures of central tendency are the measures that are used for describing the data using a single value. Mean,
median and mode are the three measures of central tendency and are frequently used to compare different data
sets.
Measures of central tendency help users to summarize and comprehend the data.
MEASURES OF VARIATION
Data science
UNIT-I
Similarity and Dissimilarity Measures
Applications:
• Web search
• Computer vision
• Image processing
• Natural language processing
• Clustering and outlier detection
Similarity / Proximity Measures, by attribute type:
• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numeric attributes: 1. Interval-scaled 2. Ratio-scaled
• Mixed attributes
• Similarity quantifies how alike two objects (data points) are. The
higher the similarity, the more alike the objects.
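For numeric attributes, a minimal sketch of two common proximity measures (Euclidean distance as a dissimilarity, cosine similarity as a similarity); the vectors x and y are made-up illustrative data points:

import numpy as np

x = np.array([3.0, 4.0, 5.0])   # assumed example data point 1
y = np.array([1.0, 0.0, 5.0])   # assumed example data point 2

# Euclidean distance: a smaller value means the objects are more alike
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Cosine similarity: a value closer to 1 means the objects point in a similar direction
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print("Euclidean distance:", euclidean)
print("Cosine similarity:", cosine)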
Data Science
UNIT-1
Data Preprocessing
I-Data Cleaning,
II-Data Integration,
III-Data Reduction,
IV-Data Transformation and Data Discretization
Data Quality: Why Preprocess
the Data?
• There are many factors/measures comprising data quality
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not; errors may come from instruments or from data transmission
• Completeness: not recorded or unavailable; attributes the user is interested in may not be available, leaving data unfilled
• Consistency: some records modified but others not, dangling references, …
• Timeliness: timely update? Several sales representatives, however, fail to submit their
sales records on time at the end of the month.
• For a period of time following each month, the data stored in the database are
incomplete.
• However, once all of the data are received, it is correct.
• Believability: how much are the data trusted to be correct?
• Interpretability: how easily can the data be understood?
• Suppose that a database, at one point, had several errors, all of which have since been corrected. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now correct, users may still regard its data as having low believability and low interpretability.
Data Mining as Knowledge
Discovery
•Data cleaning - to remove noise or
irrelevant data
•Data integration - where multiple data
sources may be combined
•Data selection- where data relevant to the
analysis task are retrieved from the database
•Data transformation -where data are
transformed or consolidated into forms
appropriate for mining by
•performing summary or aggregation
operations
•Data mining - an essential process where
intelligent methods are applied in order to
extract data patterns
•Pattern evaluation - to identify the truly interesting patterns representing knowledge, based on interestingness measures
•Knowledge presentation - where visualization and knowledge representation techniques are used to present the mined knowledge to users
Major Tasks/Steps in Data
Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data,
• identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
I-Data Cleaning
I- Data Cleaning: Incomplete
(Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time
of entry
• the history or changes of the data were not registered
• Missing data may need to be inferred
1. Equipment Malfunction Example
import pandas as pd
import numpy as np
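A minimal sketch (with assumed sensor readings) of how an equipment malfunction can show up as missing values and be handled:

# Assumed sensor readings; the faulty sensor produced NaN at two time points
data = {'Timestamp': [1, 2, 3, 4, 5],
        'Temperature': [22.1, np.nan, 22.4, np.nan, 22.8]}
df = pd.DataFrame(data)
# One option: interpolate the gaps left by the malfunctioning sensor
df['Temperature'] = df['Temperature'].interpolate()
print(df)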
2. Inconsistent Data Deletion Example
Scenario: Negative age values removed
# Assumed sample data (the original slide showed the data as an image)
data = {'Name': ['Tom', 'Ana', 'Raj'], 'Age': [25, -3, 40]}
df = pd.DataFrame(data)
clean_df = df[df['Age'] > 0] # Remove inconsistent record
print(clean_df)
3. Data Entry Misunderstanding Example
Scenario: Mixed units (kg/lbs) causing missing values
df = pd.DataFrame(data)
# Convert all to kg and fill missing with average
df['Weight_kg'] = df['Weight'].replace({'150lbs': 68.04}) # 150lbs = 68.04kg
df['Weight_kg'] = pd.to_numeric(df['Weight_kg'], errors='coerce').fillna(68)
print(df)
4. Unimportant Data Example
Scenario: Optional salary field left empty
df = pd.DataFrame(data)
# Fill missing with median salary
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
print(df)
5. Unregistered Changes Example
Scenario: Missed price updates
df = pd.DataFrame(data)
# Linear interpolation for missing price
df['Price'] = df['Price'].interpolate()
print(df)
# Customer data with missing income
data = {'Name': ['Mike', 'Jenny'],
'Age': [40, 20],
'Sex': ['Male', 'Female'],
'Income': [150000, np.nan], # Jenny's income missing
'Class': ['Big spender', 'Regular']}
df = pd.DataFrame(data)
print(df)
Data Cleaning: How to Handle
Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing classification)
—not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision
tree, or use regression to fill
1. Ignoring the Tuple (Complete Case Analysis)
When: Missing class label in classification
import pandas as pd
import numpy as np
df = pd.DataFrame(data)
clean_df = df.dropna(subset=['Class']) # Remove row with missing class
print(clean_df)
2. Manual Filling
When: Small dataset with known values
df = pd.DataFrame(data)
# Manually fill based on knowledge:
df.loc[1, 'Price'] = 15.99 # We know Product B should be $15.99
print(df)
3. Global Constant
When: Categorical data or placeholder needed
df = pd.DataFrame(data)
df['City'] = df['City'].fillna('Unknown')
print(df)
4. Attribute Mean
When: Numerical data with random missingness
df = pd.DataFrame(data)
df['Test1'] = df['Test1'].fillna(df['Test1'].mean())
df['Test2'] = df['Test2'].fillna(df['Test2'].median()) # Using median if outliers exist
print(df)
Group by
import pandas as pd
# Sample data
data = {
'Student': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
'Subject': ['Math', 'Math', 'Science', 'Science', 'Math'],
'Score': [85, 78, 90, 88, 82]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Use groupby() to find average
score per subject:
# Group by Subject and calculate average score
avg_score = df.groupby('Subject')['Score'].mean()
print("\nAverage Score by Subject:")
print(avg_score)
df.groupby(['Student', 'Subject'])['Score'].mean()
5. Class-Specific Mean (Smarter)
When: Data grouped by categories
df = pd.DataFrame(data)
df['Score'] = df.groupby('Grade')['Score'].transform(
lambda x: x.fillna(x.mean()))
print(df)
# A's NaN → 95 (mean of 95), B's NaN → 77.5 (mean of 80,75)
6. Most Probable Value (Regression)
When: Strong correlations exist
from sklearn.linear_model import LinearRegression  # needed import
df = pd.DataFrame(data)
# Train on complete cases
model = LinearRegression()
model.fit(df[['X']][df['Y'].notna()], df['Y'][df['Y'].notna()])
# Predict missing
df.loc[df['Y'].isna(), 'Y'] = model.predict([[3]])
print(df) # Missing Y at X=3 becomes 6
7. Decision Tree Imputation
When: Complex relationships
from sklearn.tree import DecisionTreeRegressor  # needed import
df = pd.DataFrame(data)
# Train model
model = DecisionTreeRegressor()
train = df[df['Experience'].notna()]
model.fit(train[['Age', 'Salary']], train['Experience'])
# Predict missing
df.loc[df['Experience'].isna(), 'Experience'] = model.predict(
    df[df['Experience'].isna()][['Age', 'Salary']])
print(df) # Experience for Age=35/Salary=75k predicted
Data Cleaning:
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
1. Faulty Instrument Example (Sensor Noise)
import pandas as pd
import numpy as np
df = pd.DataFrame(data)

2. Data Entry Problem Example
df = pd.DataFrame(data)
3. Transmission Error Example (Corrupted Data)
df = pd.DataFrame(data)
4. Technology Limitation Example (Precision Errors)
df = pd.DataFrame(data)
5. Naming Inconsistency Example
df = pd.DataFrame(data)
# Standardize categories
df['Clean_Category'] = df['Product'].str.lower().str.strip()
sales_summary = df.groupby('Clean_Category')['Sales'].sum()
print(sales_summary)
6. Duplicate Records Example
df = pd.DataFrame(data)
7. Incomplete Data Example
df = pd.DataFrame(data)
# Multiple imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
df[['Age', 'Satisfaction']] = imputer.fit_transform(df[['Age', 'Satisfaction']])
print(df)
8. Inconsistent Data Example
df = pd.DataFrame(data)
# Standardize dates
df['Clean_Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)
Noise Cleaning Pipeline
def clean_dataset(df):
    # Handle numeric noise
    for col in df.select_dtypes(include=np.number):
        df[col] = np.clip(df[col],
                          df[col].quantile(0.01),
                          df[col].quantile(0.99))
    # Standardize text
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].str.lower().str.strip()
    # Remove duplicates
    df = df.drop_duplicates()
    return df
Data Cleaning: How to Handle
Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering/Outlier
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
1. Binning Method (Equal-Frequency)
import pandas as pd
import numpy as np
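# A minimal sketch (assumed noisy values) so the print() call below runs:
# equal-frequency binning with pd.qcut puts roughly the same number of
# points into each bin.
np.random.seed(0)
df = pd.DataFrame({'Values': np.random.normal(50, 10, 30)})
df['Bin'] = pd.qcut(df['Values'], q=3, labels=['Bin1', 'Bin2', 'Bin3'])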
print(df.sort_values('Values'))
2. Regression Smoothing
# df with a noisy 'Values' column is assumed from the previous step
# Prepare data
X = df.index.values.reshape(-1, 1)
y = df['Values'].values
# Fit a linear regression and use its predictions as the smoothed values
# (assumed step; the original slide showed it as an image)
from sklearn.linear_model import LinearRegression
df['Regression_Smoothed'] = LinearRegression().fit(X, y).predict(X)
# Visual comparison
import matplotlib.pyplot as plt
plt.scatter(df.index, df['Values'], label='Original')
plt.plot(df.index, df['Regression_Smoothed'], color='red', label='Smoothed')
plt.legend()
plt.show()
3. Clustering/Outlier Detection
from sklearn.preprocessing import StandardScaler  # needed imports
from sklearn.cluster import DBSCAN
# Scale data
scaler = StandardScaler()
scaled_values = scaler.fit_transform(df[['Values']])
# Cluster; DBSCAN labels outliers as -1 (eps/min_samples are assumed values)
df['Cluster'] = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled_values)
# Remove outliers
clean_df = df[df['Cluster'] != -1]
4. Human-in-the-Loop Inspection
df = flag_suspicious(df, 'Values')
print("Values needing manual review:")
print(df[df['Suspicious']][['Values']])
Data Cleaning: Handling Noisy Data:
Binning
• “Data smoothing is the technique to handle noisy data
• “smooth” out the data to remove the noise
• data smoothing techniques.
• Binning
• Regression
• Outlier analysis
• Binning:
• This method works on sorted data
• The sorted data is divided into equal frequency buckets/bins
• Binning is of three types:
1. smoothing by bin means
• each value in a bin is replaced by the mean value of the bin.
2. smoothing by bin medians
• each bin value is replaced by the bin median.
3. smoothing by bin boundaries
• the minimum and maximum values in a given bin are identified as the bin boundaries.
• Each bin value is then replaced by the closest boundary value.
Binning Methods for Noisy Data Smoothing
1. Preparing Noisy Data
import pandas as pd
import numpy as np
2. Equal-Frequency Binning
3. Smoothing by Bin Means
df['Bin_Mean'] = df.groupby('Bin')['Values'].transform('mean')
print("\nSmoothing by Bin Means:")
print(df[['Values', 'Bin', 'Bin_Mean']].head(10))

4. Smoothing by Bin Medians
df['Bin_Median'] = df.groupby('Bin')['Values'].transform('median')
print("\nSmoothing by Bin Medians:")
print(df[['Values', 'Bin', 'Bin_Median']].sample(10))
5. Smoothing by Bin Boundaries
# Bin boundaries (min and max per bin); assumed setup for the function below
bin_boundaries = df.groupby('Bin')['Values'].agg(['min', 'max'])

def replace_with_boundary(row):
    bin_id = row['Bin']
    val = row['Values']
    boundaries = bin_boundaries.loc[bin_id]
    return boundaries['min'] if abs(val - boundaries['min']) < abs(val - boundaries['max']) else boundaries['max']

df['Bin_Boundary'] = df.apply(replace_with_boundary, axis=1)
6. Visual Comparison
import matplotlib.pyplot as plt  # needed import
plt.figure(figsize=(12, 6))
plt.scatter(df.index, df['Values'], label='Original', alpha=0.5)
plt.scatter(df.index, df['Bin_Mean'], label='Bin Means', alpha=0.7)
plt.scatter(df.index, df['Bin_Median'], label='Bin Medians', alpha=0.7)
plt.scatter(df.index, df['Bin_Boundary'], label='Bin Boundaries', alpha=0.7)
plt.legend()
plt.title("Binning Smoothing Techniques Comparison")
plt.show()
Data Cleaning: Binning methods
Binning Methods for Data Smoothing
1. Equal-Width (Distance) Binning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Visualize
plt.figure(figsize=(10,5))
plt.hist(df['Values'], bins=50, alpha=0.5, label='Original Data')
for boundary in df['Equal_Freq_Bin'].cat.categories.left:
    plt.axvline(x=boundary, color='green', linestyle='--', alpha=0.5)
plt.title('Equal-Frequency Binning (Balanced Points per Bin)')
plt.legend()
plt.show()
3. Comparison of Methods
4. Handling Skewed Data
# Create right-skewed data
skewed_data = np.concatenate([
    np.random.exponential(scale=10, size=900),
    np.random.normal(50, 5, 100)
])
df_skewed = pd.DataFrame({'Values': skewed_data})
# Assumed binning step (shown as an image on the original slide)
df_skewed['Width_Bin'] = pd.cut(df_skewed['Values'], bins=5)    # equal-width
df_skewed['Freq_Bin'] = pd.qcut(df_skewed['Values'], q=5)       # equal-frequency
# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
df_skewed['Width_Bin'].value_counts().sort_index().plot(kind='bar', ax=ax1)
ax1.set_title('Equal-Width Binning (Skewed Data)')
df_skewed['Freq_Bin'].value_counts().plot(kind='bar', ax=ax2)
ax2.set_title('Equal-Frequency Binning (Skewed Data)')
plt.show()
Data Cleaning: Binning Example:
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into equal-frequency bins (3 values per bin):
• Bin 1: 4, 8, 15 (min = 4, max = 15)
• Bin 2: 21, 21, 24 (min = 21, max = 24)
• Bin 3: 25, 28, 34 (min = 25, max = 34)
• Smoothing by bin means:
• Bin 1: 9, 9, 9 (mean of Bin 1 = (4 + 8 + 15)/3 = 9)
• Bin 2: 22, 22, 22 (mean of Bin 2 = (21 + 21 + 24)/3 = 22)
• Bin 3: 29, 29, 29
• Smoothing by bin boundaries (use |x2 - x1| as the closeness measure):
• Bin 1: 4, 4, 15 (4 equals min = 4; 8 is closer to min = 4, so replace 8 by 4; 15 equals max = 15)
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
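A short sketch reproducing this worked example in pandas; the price values are taken directly from the example above:

import pandas as pd
import numpy as np

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.qcut(prices, q=3, labels=False)   # equal-frequency bins 0, 1, 2

# Smoothing by bin means
bin_means = prices.groupby(bins).transform('mean')

# Smoothing by bin boundaries: replace each value by the closer of (min, max) of its bin
bin_min = prices.groupby(bins).transform('min')
bin_max = prices.groupby(bins).transform('max')
bin_boundaries = np.where((prices - bin_min) <= (bin_max - prices), bin_min, bin_max)

print(pd.DataFrame({'Price': prices, 'Bin': bins,
                    'By mean': bin_means, 'By boundary': bin_boundaries}))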
Data Cleaning: Handling Noisy Data:
Regression
• Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
• Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to predict
the other.
• Y=mX+c
• Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to
a multidimensional surface.
• Y=m1X1+m2X2+c
• Example:
• Fit Y = mX + c to the data points X = 0, 1, 2, 3, 4 (with their observed Y values) using the normal (least-squares) equations.
Data Cleaning: Handling Noisy Data: Outlier
analysis
Outlier analysis: Outliers may be detected by clustering,
•where similar values are organized into groups, or “clusters.”
•Intuitively, values that fall outside of the set of clusters may be considered outliers
X:      0    1    2    3    4
Y = x:  0    1    0.5  3    4
Y = x²: 0    1    0.5  9    16
Outlier = 0.5 (the value that does not follow the fitted function)
Data Integration:
Data integration:
Combines data from multiple sources into a coherent store.
Approaches in Data Integration
1. Entity identification problem:
2. Tuple Duplication:
3. Detecting and resolving data value conflicts
4. Redundancy and Correlation Analysis
1. Entity identification problem
• Schema integration:
Mismatching attribute names
Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
• Object matching: Mismatching structure of data
Ex: Discount issues
Currency type
2. Tuple
Duplication:
3. Detecting and resolving data
value conflicts
For the same real world entity, attribute values from different sources are
different.
Possible reasons:
• different representations, e.g., total sales for the month reported for a single store vs. all stores
• different scales, e.g., metric vs. British units, or different grading scales (e.g., GPA in the US vs. China)
4. Redundancy and Correlation Analysis
Correlation
Correlation analysis is a method of statistical evaluation used to study the strength of a
relationship between two, numerically measured, continuous variables.
Computed as a correlation coefficient
Its value ranges between -1 and +1
Positively correlated, negatively correlated, or not correlated
The strength of a correlation indicates how strong the relationship is between the two variables. The strength is determined by the numerical value of the correlation coefficient.
Example problem:
Business problem: The healthcare industry wants to develop a medication to control glucose levels. For this, it wants to study whether age has an impact on the rise in glucose levels.
Subject   Age (X)   Glucose Level (Y)   XY      X²      Y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022

From the table: Σx = 247, Σy = 486, Σxy = 20,485, Σx² = 11,409, Σy² = 40,022, and the sample size n = 6.
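A sketch of the remaining step, plugging these sums into the standard Pearson correlation formula:

r = (nΣxy - Σx·Σy) / sqrt([nΣx² - (Σx)²] × [nΣy² - (Σy)²])
  = (6 × 20,485 - 247 × 486) / sqrt([6 × 11,409 - 247²] × [6 × 40,022 - 486²])
  = 2,868 / sqrt(7,445 × 3,936)
  ≈ 2,868 / 5,413.3
  ≈ 0.53

So age and glucose level show a moderate positive correlation in this sample.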
III- Data Reduction
Data Reduction
strategies:
• Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
1. Data cube aggregation
For example, suppose the data consist of AllElectronics sales per quarter for the years 2014 to 2017. You are, however, interested in the annual sales rather than the total per quarter. Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
2. Dimensionality Reduction
• Know about Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• What is Dimensionality reduction
• Method of eliminating irrelevant features so as to reduce noise
• Is proposed to avoid the curse of dimensionality
• Reduce time and space required in data mining
• Allow easier visualization of data (quite messy to visualize huge data)
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
2. Normalization
• An attribute is normalized by scaling its values so that they fall within a small specified
range, such as 0.0 to 1.0.
• Normalization is particularly useful for classification algorithms involving
– neural networks
– distance measurements such as nearest-neighbor classification and clustering.
• If using the neural network backpropagation algorithm for classification mining,
normalizing the input values for each attribute measured in the training instances will
help speed up the learning phase.
• For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from out-weighing attributes with initially smaller ranges (e.g.,
binary attributes).
• Normalization methods
I. Min-max normalization
II. z-score normalization
III. Normalization by decimal scaling
Min-max Normalization
• Min-max normalization
– performs a linear transformation on the original data.
• Suppose that:
– minA and maxA are the minimum and maximum values of
an attribute, A.
• Min-max normalization maps a value, v, of A to v′ in the range [new_minA, new_maxA] by computing:
  v′ = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
Normalization
• Min-max normalization to [new_minA, new_maxA]:
  v′ = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
• z-score normalization: v′ = (v - μA) / σA
• Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to (73,600 - 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
  v′ = v / 10^j, where j is the smallest integer such that Max(|v′|) < 1
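A minimal Python sketch of the three normalization methods applied to the income example above; the income array is an assumed sample around the values used in the example:

import numpy as np

income = np.array([12000, 54000, 73600, 98000], dtype=float)  # assumed sample values

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# z-score normalization (using the mean and std from the example: 54,000 and 16,000)
z_score = (income - 54000) / 16000

# Decimal scaling: divide by 10^j so that max(|v'|) < 1 (here j = 5)
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / (10 ** j)

print(min_max)         # 73,600 -> 0.716
print(z_score)         # 73,600 -> 1.225
print(decimal_scaled)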
3. Data Aggregation
• On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the
annual sales
• Sales data for a given branch of AllElectronics for the years 2002 to 2004.
5. Generalization
• Generalization is the generation of concept hierarchies for categorical data
• Categorical attributes have a finite (but possibly large) number of distinct
values, with no ordering among the values.
• Examples include
– geographic location,
– job category, and
– item type.
• A relational database or a dimension location of a data warehouse may contain the
following group of attributes: street, city, province or state, and country.
• A user or expert can easily define a concept hierarchy by specifying ordering of the
attributes at the schema level.
• A hierarchy can be defined by specifying the total ordering among these attributes at the
schema level, such as:
◆ street < city < province or state < country
6. Discretization
• Three types of attributes:
• Nominal — values from an unordered set, e.g., color, profession
• Ordinal — values from an ordered set, e.g., military or academic rank
• Continuous — numeric values (integers or real numbers)
• Discretization:
• Divide the range of a continuous attribute into intervals
• Some classification algorithms only accept categorical attributes.
• Reduce data size by discretization
UNIT II – Exploratory Data
Analysis:
Hypothesis testing: t-Test, z-Test, Chi-Square-Test. Analysis of Variance
(ANOVA): One-way, Two-way. Multivariate Analysis: Mean Vector,
Covariance, Correlation and Precision Matrices, Multivariate Data,
Parameter Estimation, Estimation of Missing Values, Multivariate
Normal Distribution. Dimensionality Reduction: Principal Component
Analysis and Multi-Dimensional Scaling.
Data Science
UNIT-I
Hypothesis Testing
(Introduction, Z-Test, P-Test, T-Test)
Hypothesis Testing?
Hypothesis testing is a statistical method used to make decisions based
on data.
We always start with two competing statements:
•Null Hypothesis (H₀): There is no effect or no difference.
•Alternative Hypothesis (H₁ or Hₐ): There is an effect or a difference.
A vaccine company claims that their vaccine is at least 80% effective in
preventing COVID-19 infections. We want to test if this claim is statistically
valid using data from a sample of 1000 people who got the vaccine.
from statsmodels.stats.proportion import proportions_ztest
# Data
success = 770 # people not infected
n = 1000 # total vaccinated
# Test
stat, p_value = proportions_ztest(count=success, nobs=n, value=0.80, alternative='smaller')
print(f"Z-statistic: {stat:.2f}")
print(f"P-value: {p_value:.4f}")
# Decision
if p_value < 0.05:
    print("Reject H₀: Evidence that vaccine effectiveness is less than 80%")
else:
    print("Fail to reject H₀: No evidence that effectiveness is below 80%")
t-Test
A t-Test is a hypothesis test used to compare means when the
population standard deviation is unknown and the sample size is small (n
< 30 usually)
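A minimal one-sample t-test sketch with SciPy; the sample values and the hypothesized mean of 50 are assumptions for illustration, not numbers from the original slides:

from scipy import stats
import numpy as np

sample = np.array([52, 49, 55, 51, 48, 53, 50, 54])  # assumed small sample (n < 30)

# H0: population mean = 50   vs   H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0")
else:
    print("Fail to reject H0")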
z-Test
A z-Test is a statistical test used to determine whether there is a
significant difference between sample and population means, or
between two sample means, when:
•The population standard deviation is known.
•The sample size is large (n ≥ 30) (Central Limit Theorem applies).
When to Use z-Test?
The data is normally distributed (or large enough sample size).
•The population standard deviation (σ) is known.
•You're testing:
•Sample mean vs Population mean
•Two sample means
Types of z-Tests
•One-sample z-Test: Compare sample mean with population mean.
•Two-sample z-Test: Compare means from two independent samples.
•z-Test for proportions (e.g., testing success rates).
The average batting score of a professional cricket player is believed to
be 40 runs per innings. A coach thinks that a particular player, Player
A, has improved and now scores more on average. To test this, he
collects data from the last 36 innings of Player A.
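A hedged sketch of the one-sample z-test for this scenario; the sample mean (43.5) and the known population standard deviation (σ = 9) are assumed numbers, not values from the original slides:

import numpy as np
from scipy import stats

mu0 = 40          # claimed average score per innings
x_bar = 43.5      # assumed sample mean over the last 36 innings
sigma = 9         # assumed known population standard deviation
n = 36

# Right-tailed test: H0: mu = 40  vs  H1: mu > 40
z = (x_bar - mu0) / (sigma / np.sqrt(n))
p_value = 1 - stats.norm.cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")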
We want to compare Player A and Player B to check if Player A scores
more on average than Player B.
Player B (n₂ = 36 innings):
Chi-Square Test
The Chi-Square (χ²) Test is a non-parametric statistical test used to
compare observed data with expected data to determine if there’s a
significant association or difference.
• A sports analyst wants to know if there’s a relationship between a
cricket player's batting position and their performance (above
average vs below average).
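A minimal sketch using scipy.stats.chi2_contingency for this scenario; the counts in the contingency table are made-up illustrative numbers, not data from the original slides:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: batting position (Top order, Middle order, Lower order)
# Columns: performance (Above average, Below average) -- assumed counts
observed = np.array([[30, 20],
                     [25, 25],
                     [15, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: batting position and performance appear related")
else:
    print("Fail to reject H0: no evidence of a relationship")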
DATA SCIENCE
UNIT II
ANOVA
From: Web resources
Contents
• Introduction
• Types of ANOVA
• ANOVA Procedure
• One way ANOVA
• Two way ANOVA
• Application
Introduction
• ANOVA is a statistical test for estimating how a quantitative dependent variable
changes according to the levels of one or more categorical independent variables
.
• ANOVA tests whether there is a difference in means of the groups at each level of
the independent variable.
• Analysis of variance (abbreviated as ANOVA)
• An extremely useful technique concerning researches in the many fields of economics,
biology, education, psychology, sociology, business/industry and in researches of several
other disciplines.
• This technique is used when multiple sample cases are involved.
• The ANOVA technique enables us to perform to examine the significance of the difference
amongst more than two sample means at the same time.
• Using this technique, one can draw inferences about whether the samples have been
drawn from populations having the same mean.
What is ANOVA ?
• Variance is an important statistical measure, described as the mean of the squared deviations taken from the mean of the given series of data. It is a frequently used measure of variation. The square of the standard deviation is called the variance, i.e., Variance = (standard deviation)².
• There may be variation between samples and also within sample items.
• In other words, ANOVA helps us figure out whether we need to reject the null hypothesis or accept the alternative hypothesis.
ANOVA
Examples:
A group of psychiatric patients are trying three different therapies: counseling,
medication and biofeedback. We want to see if one therapy is better than the
others.
A manufacturer has two different processes to make light bulbs. They want to know
if one process is better than the other.
Students from different colleges take the same exam. You want to see if one college
outperforms the other.
Types of ANOVA
ANOVA is of two types:
One-way ANOVA: only one factor is investigated; there is one independent variable (with two or more levels).
Two-way ANOVA: two factors (independent variables) are investigated at the same time.
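A minimal one-way ANOVA sketch with SciPy for the therapy example above; the improvement scores for the three therapy groups are assumed illustrative numbers:

from scipy import stats

counseling  = [72, 75, 78, 70, 74]   # assumed improvement scores
medication  = [80, 82, 79, 85, 81]
biofeedback = [68, 71, 65, 70, 69]

# H0: all three group means are equal
f_stat, p_value = stats.f_oneway(counseling, medication, biofeedback)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")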
Two-way ANOVA with replication
Example 2: Set up the ANOVA table for the following information relating to three drugs tested to judge their effectiveness in reducing blood pressure for three different groups of people (two observations per group/drug combination):

                       Drugs
Group of people    X        Y        Z
A                  14, 15   10, 9    11, 11
B                  12, 11   7, 8     10, 11
C                  10, 11   11, 11   8, 7

Computation for two-way ANOVA with repeated values
Step (i) T = 187, n = 18
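A sketch of how this two-way ANOVA with replication could be computed in Python; the data frame below re-creates the table above, and the functions used are from the statsmodels formula API:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'Group': ['A']*6 + ['B']*6 + ['C']*6,
    'Drug':  ['X', 'X', 'Y', 'Y', 'Z', 'Z'] * 3,
    'BP':    [14, 15, 10, 9, 11, 11,      # Group A
              12, 11, 7, 8, 10, 11,       # Group B
              10, 11, 11, 11, 8, 7]       # Group C
})

# Two-way ANOVA with interaction (replication allows the interaction term)
model = ols('BP ~ C(Group) + C(Drug) + C(Group):C(Drug)', data=data).fit()
print(sm.stats.anova_lm(model, typ=2))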
● The null hypothesis, symbolized by H0, is a statistical hypothesis that states that there is no difference between a parameter and a specific value, or that there is no difference between two parameters.
Steps in Hypothesis
Testing
● A statistical test uses the data obtained from a sample to make a decision about whether or not the null hypothesis should be rejected.
● The numerical value obtained from a statistical test is called the test value.
● In the hypothesis-testing situation, there are four possible outcomes.
● In reality, the null hypothesis may or may not be true, and a decision is made to reject or not to reject it on the basis of the data obtained from a sample.
Steps in Hypothesis Testing
                     H0 True              H0 False
Reject H0            Type I error         Correct decision
Do not reject H0     Correct decision     Type II error
Steps in Hypothesis Testing
● A type I error occurs if one rejects the null hypothesis when it is true.
● A type II error occurs if one does not reject the null hypothesis when it is false.
● The level of significance is the maximum probability of committing a type I error. This probability is symbolized by α (the Greek letter alpha). That is, P(type I error) = α.
● P(type II error) = β (the Greek letter beta).
Steps in Hypothesis Testing
● Typical significance levels are: 0.10, 0.05, and 0.01.
● For example, when α = 0.10, there is a 10% chance of rejecting a true null hypothesis.
● The critical value(s) separates the critical region from the noncritical region. The symbol for critical value is C.V.
● The critical or rejection region is the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected.
● The noncritical or nonrejection region is the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis should not be rejected.
Critical Value Approach to Testing
Convert the sample statistic (e.g., the sample mean x̄) to a test statistic (Z or t statistic):
  z = (x̄ - μ) / (σ / √n)
Determine the critical value(s) for a specified level of significance from a table or computer.
Hypothesis Testing -
Example
A chemist invents an additive to
increase the life of an automobile
battery. If the mean lifetime of the
battery is 36 months, then his
hypotheses are
H0: μ = 36    H1: μ > 36
This is a right-tailed test.
Two-Tailed Tests
H0: μ = μ0    HA: μ ≠ μ0
There are two cutoff (critical) values: ±z_α/2, or, on the original scale, the lower and upper values x̄_α/2.
The rejection region lies in both tails, each with area α/2: reject H0 if the test statistic falls below -z_α/2 or above +z_α/2; otherwise do not reject H0, where
  z = (x̄ - μ0) / (σ / √n)   and   x̄_α/2 = μ0 ± z_α/2 · σ / √n
Hypothesis Testing -
Example
A medical researcher is interested in finding out
whether a new medication will have any
undesirable side effects. The researcher is
particularly concerned with the pulse rate of the
patients who take the medication.
What are the hypotheses to test whether the
pulse rate will be different from the mean pulse
rate of 82 beats per minute?
● H0: μ = 82    H1: μ ≠ 82
● This is a two-tailed test.
Data Science
Unit-II
Multivariate Analysis
Multivariant Data
Multivariate data refers to data that involves multiple variables or
measurements. Each observation in the dataset has values for multiple
attributes. Analyzing multivariate data is essential in many fields, including
statistics, data science, machine learning, and various scientific disciplines.
Characteristics of Multivariate Data
1.Multiple Variables: Multivariate data contains more than one variable or
attribute per observation.
2.Complexity: The relationships between variables can be complex,
requiring sophisticated statistical and computational methods for analysis.
3.Higher Dimensionality: With more variables, the data becomes high-
dimensional, which can lead to challenges in visualization and
interpretation.
Example: Healthcare, Financial, Environment
Multivariant Analysis
• Multivariate analysis encompasses all statistical techniques
that are used to analyze more than two variables at
once.
• The aim is to find patterns and correlations between several
variables simultaneously—allowing for a much deeper, more
complex understanding of a given scenario than you’ll get
with bivariate analysis.
There are three categories of analysis to be aware of:
• Univariate analysis, which looks at just one variable, e.g., bar charts, pie charts, histograms
• Bivariate analysis, which analyzes two variables, e.g., scatter plots, regression and correlation analysis
• Multivariate analysis, which looks at more than two variables, e.g., clustering
import numpy as np
# Dataset
data = np.array([[150, 50],
[160, 60],
[170, 65]])
# Mean Vector
mean_vector = np.mean(data, axis=0)
print("Mean Vector:\n", mean_vector)
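Continuing with the same Height/Weight data array, a minimal sketch for the covariance matrix (np.cov expects variables in rows, hence the transpose); the cov_matrix computed here is the one inverted further below to obtain the precision matrix:

cov_matrix = np.cov(data.T)
print("Covariance Matrix:\n", cov_matrix)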
corr_matrix = np.corrcoef(data.T)
print("Correlation Matrix:\n", corr_matrix)
Height and Weight have a very strong positive correlation (~0.98), meaning as height increases, weight also
tends to increase significantly.
precision_matrix = np.linalg.inv(cov_matrix)
print("Precision Matrix:\n", precision_matrix)
Analysis of Variance (ANOVA)
Data Science
Unit-II
Principal Component Analysis (PCA)
Dimensionality Reduction
From : Web resources
Dimensionality Reduction: Principal
Component Analysis (PCA) and
Multi-Dimensional Scaling (MDS)
Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of
random variables under consideration.
It can be divided into:
•Feature selection: Selecting a subset of the original variables.
•Feature extraction: Transforming data into a lower-dimensional
space.
Why do we need Dimensionality Reduction?
•Visualization of high-dimensional data (e.g., 2D or 3D).
•Noise reduction: Eliminate less informative features.
•Computation efficiency: Faster training and prediction.
•Avoiding the curse of dimensionality: In high dimensions, data
becomes sparse.
Principal Component Analysis (PCA)
•PCA is a linear transformation technique.
•It projects the data into a lower-dimensional space while preserving
maximum variance.
Mathematical Steps:
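The original slides show these steps as figures; below is a minimal NumPy sketch of the usual sequence (center the data, compute the covariance matrix, eigendecompose it, project onto the top components) on assumed toy data:

import numpy as np

# Assumed toy data: rows = observations, columns = features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data (subtract the mean of each feature)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered.T)

# 3. Eigenvalues/eigenvectors of the covariance matrix, sorted by explained variance
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 4. Project onto the first principal component
X_pca = X_centered @ eig_vecs[:, :1]
print("Explained variance ratio:", eig_vals / eig_vals.sum())
print(X_pca)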
Dimensionality Reduction:
In machine learning classification problems, there are often too many factors on the basis of which the final
classification is done. These factors are basically variables called features. The higher the number of features,
the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are
correlated, and hence redundant. This is where dimensionality reduction algorithms come into play.
Dimensionality reduction is the process of reducing the number of random variables under consideration, by
obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Importance of Dimensionality reduction?
•It reduces the time and storage space required.
•It helps remove multicollinearity, which improves the interpretation of the parameters of the machine learning model.
•It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
•It avoids the curse of dimensionality.
•It removes irrelevant features from the data, because irrelevant features can decrease the accuracy of the models and cause the model to learn from irrelevant signals.
UNIT III – Predictive Analysis-I
Simple linear Regression: Model Building, Estimation of Parameters,
Interpret coefficients, Validation of model, Outlier analysis. Bias,
variance, and trade-off, Gradient descent, over and under fitting
models.
Validation of Linear Regression
Model
Bias, Variance, and the Trade-
off
Overfitting and Underfitting?