
MACHINE LEARNING

(UEC520)

SUBMITTED BY:
Rohit Singla 102306033 (L)
Ojasvi Vashisht 102306043
Jaiditya Kapoor 102306136
Abhay Tiwari 102356001

SUBGROUP: 3F12

SUBMITTED TO:
Dr. GITANJALI CHANDWANI MONACHA

EXPERIMENT 1
To apply various Data Preprocessing Techniques on a dataset to prepare it for ML Algorithms

Introduction to Data Preprocessing:


What is Data Preprocessing?

• Data preprocessing is a crucial and foundational step in the field of data science and
machine learning. In today's digital age, data is generated in massive amounts from
diverse sources like sensors, mobile devices, social media, and medical equipment.
While this abundance of data is valuable, it's rarely in a clean, usable format. Raw data
often contains errors, missing values, and inconsistencies, which can significantly hinder
the performance of any analytical model.
• The primary goal of data preprocessing is to transform this raw, "dirty" data into a clean,
organized, and meaningful format. This essential process makes the data ready for
analysis and ensures that machine learning algorithms can produce accurate and reliable
results. Think of it as preparing ingredients before you start cooking; a good chef knows
that the quality of the final dish depends on the preparation of the raw ingredients.
Similarly, the success of any data analysis or machine learning project hinges on the
quality of the data, which is largely determined by effective data preprocessing.
• Figure 1 illustrates the main components of data preprocessing: Data Cleaning, Data
Integration, and Data Transformation. Each of these components involves a set of
techniques designed to address specific issues in the data, ensuring it is in the best
possible state for subsequent analysis.

Why is Data Preprocessing Important?

In today's digital world, data is collected from many different sources, such as sensors, social
media, and financial systems. This raw data is often incomplete, inconsistent, or contains errors
and missing values. If this "dirty" data is used directly, it can significantly harm the performance
of machine learning models.

Data preprocessing helps to solve this problem by:

• Improving Accuracy: Cleaning the data of errors and inconsistencies leads to more
accurate and reliable results.
• Boosting Efficiency: Preprocessed data is easier and faster for algorithms to process.
• Ensuring Consistency: It standardizes data formats and values, so different data sources
can be used together effectively.

Key Techniques in Data Preprocessing

Data preprocessing involves several core activities; a short illustrative code sketch follows the list:

• Data Cleaning: This is the process of fixing or removing errors, inconsistencies, and
missing values in the data. For example, if a data entry is "45 years" instead of "45", data
cleaning would fix it.
• Data Transformation: This technique converts data into a more suitable format. This
can include normalization (scaling values to a specific range) or aggregation
(summarizing data into a single value).
• Data Integration: This combines data from multiple sources into a single, cohesive
dataset. This is essential when you need to analyze information from various databases
or files.
• Data Reduction: This technique aims to decrease the volume of data while maintaining
its integrity. It helps to reduce storage space and processing time. This can be done
through dimensionality reduction (reducing the number of attributes) or numerosity
reduction (reducing the number of data points).
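
To make these techniques concrete, the following short Python (pandas) sketch illustrates cleaning, normalization, and aggregation on a small made-up table. The column names and values are illustrative assumptions only and are not taken from the experiment's dataset.

import pandas as pd

# Hypothetical "dirty" data: a textual age entry and a missing salary
# (illustrative values only, not from the experiment's dataset)
raw = pd.DataFrame({
    "Age": ["45 years", "32", "28"],
    "Salary": [52000.0, None, 61000.0],
})

# Data Cleaning: strip the stray text from 'Age' and fill the missing salary with the mean
raw["Age"] = raw["Age"].str.extract(r"(\d+)", expand=False).astype(float)
raw["Salary"] = raw["Salary"].fillna(raw["Salary"].mean())

# Data Transformation: min-max normalization scales each column to [0, 1]
# using (value - min) / (max - min)
normalized = (raw - raw.min()) / (raw.max() - raw.min())

# Data Reduction (aggregation): summarize each column into a single value
summary = raw.mean()

print(normalized)
print(summary)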

Figure 1. Data Preprocessing Cycle[1]


Preprocessing Techniques Applied to the Dataset
The dataset provided contains two attributes, Years of Experience and Salary, with 30
observations. Both attributes are numerical and continuous in nature. Since this dataset is
comparatively small and already well-structured, heavy preprocessing steps are not required.
However, to ensure reliability and prepare the data for machine learning models such as
Linear Regression, the following preprocessing techniques were applied.

1. Handling Missing Values

One of the most common problems in real-world datasets is the presence of missing values.
These can occur due to errors in data collection, transmission issues, or manual entry mistakes.
In this dataset, on careful inspection, no missing values were found in either Years of Experience
or Salary. Therefore, no imputation technique such as mean substitution, median replacement,
or forward/backward filling was required. This makes the dataset ready for direct use without
additional cleaning in this aspect.
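
As a quick illustration, such an inspection can be carried out with pandas. The snippet below is a minimal sketch; the file name 'Salary_Data.csv' is an assumption for the experience/salary dataset and is not taken from the report.

import pandas as pd

# Load the experience/salary dataset (file name assumed for illustration)
df = pd.read_csv('Salary_Data.csv')

# Count missing values per column; zeros everywhere confirm that no imputation is needed
print(df.isnull().sum())

# If missing values were present, a simple mean substitution could be applied, e.g.:
# df['Salary'] = df['Salary'].fillna(df['Salary'].mean())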

2. Encoding Categorical Data

The 'Country' column is a categorical variable with values like 'France', 'Spain', and 'Germany'.
Machine learning models require numerical input, so these textual categories must be converted.
One-Hot Encoding is applied to this column using OneHotEncoder within a
ColumnTransformer. This process creates a new binary column for each unique country, where
a 1.0 indicates the presence of that country and a 0.0 indicates its absence. The remaining
columns ('Age' and 'Salary') are passed through unchanged using remainder='passthrough'.
Pseudo Code:

// START of the Data Preprocessing Experiment

// 1. Load the initial dataset
// The dataset contains columns: 'Country', 'Age', 'Salary', 'Purchased'.
LOAD data from 'Data.csv' INTO a pandas DataFrame.
SEPARATE the features (X) from the target variable (y).
  - X = all columns EXCEPT 'Purchased'
  - y = the 'Purchased' column
PRINT the initial dataset to show the raw data, including missing values (NaN).

// 2. Handle Missing Values
// Missing values are in the 'Age' and 'Salary' columns.
//  - Use a SimpleImputer from sklearn.impute.
//  - Set the strategy to 'mean' to replace NaN values with the column's mean.
CREATE a SimpleImputer instance with 'mean' strategy.
FIT the imputer to the 'Age' and 'Salary' columns of X.
TRANSFORM the 'Age' and 'Salary' columns of X using the fitted imputer.
PRINT the updated X to show the dataset with missing values replaced.

// 3. Encode Categorical Data
// The 'Country' column is a categorical variable that needs to be converted to numerical format.
//  - Use One-Hot Encoding via a ColumnTransformer.
//  - The OneHotEncoder will be applied to the first column (Country).
//  - The remaining columns ('Age', 'Salary') will be passed through unchanged.
CREATE a ColumnTransformer instance.
  - Set transformer to 'encoder' with OneHotEncoder applied to column [0].
  - Set remainder to 'passthrough'.
APPLY the ColumnTransformer to X to perform One-Hot Encoding.
  - The 'Country' column is replaced by three new binary columns (e.g., [1, 0, 0] for 'France').
  - The 'Age' and 'Salary' columns are kept as they are.
CONVERT the result to a NumPy array.

// 4. Feature Scaling
//  - Apply MinMaxScaler to the numerical columns ('Age' and 'Salary') to normalize their values.
CREATE a MinMaxScaler instance.
FIT and TRANSFORM the 'Age' and 'Salary' columns of X using the scaler.
ROUND the scaled values to two decimal places.
PRINT the final preprocessed feature set.

// END of the Data Preprocessing Experiment
Results of Preprocessing:

The preprocessing steps transform the original raw dataset into a clean, numerical
matrix ready for machine learning algorithms.

Initial Dataset:

The original dataset, before any preprocessing, looks like this, showing NaN values
in the 'Age' and 'Salary' columns.
Code:

import numpy as np
import pandas as pd

# Load the raw dataset and display it, including any NaN entries
df = pd.read_csv('/Data.csv')
print(df)

# Split into features (all columns except 'Purchased') and target ('Purchased')
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
Output:
Code:

from sklearn.impute import SimpleImputer

# Replace NaN values in the 'Age' and 'Salary' columns (indices 1 and 2)
# with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)

Output :

Final Pre-processed Dataset :

After handling missing values and one-hot encoding, the x variable (features) is
transformed into a numerical array.

• The first three columns contain the one-hot encoded countries: [1.0 0.0 0.0]
represents France, [0.0 1.0 0.0] Germany, and [0.0 0.0 1.0] Spain.
• The subsequent columns contain the numerical 'Age' and 'Salary' data, with the
NaN values replaced by their respective column means.
Code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# One-hot encode the 'Country' column (index 0); pass 'Age' and 'Salary' through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
x = np.array(ct.fit_transform(x))

# Scale 'Age' and 'Salary' (now at columns 3 onward) to [0, 1] and round to two decimals
scaler = MinMaxScaler()
x[:, 3:] = np.round(scaler.fit_transform(x[:, 3:]), 2)
print(x)
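
For reference, MinMaxScaler rescales each value with (value - min) / (max - min), mapping every column to the [0, 1] range. The tiny sketch below uses made-up ages, not values from Data.csv, purely to illustrate the effect:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages used only to illustrate min-max scaling
ages = np.array([[27.0], [38.0], [50.0]])

scaler = MinMaxScaler()
print(scaler.fit_transform(ages))  # [[0.0], [0.478...], [1.0]] since (38 - 27) / (50 - 27) ≈ 0.478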

The final preprocessed feature set looks like this:


GROUP MEMBER SUGGESTIONS
(1) Jaiditya Kapoor (102306136):
The document is well-organized and serves as a strong foundation for an
experiment report. By addressing the minor inconsistencies regarding the dataset's
attributes and the presence of missing values, the report will be even more
accurate and effective. The inclusion of the group members' suggestions at the
end is a nice touch and shows that their feedback was incorporated. The document
is a great example of a practical application of data preprocessing.

(2) Ojasvi Vashisht (102306043):


This document provides a good overview of data preprocessing, effectively linking theory
to a practical example. The report's structure is logical, and the inclusion of code and outputs
is a major strength. However, there are a few inconsistencies and areas for improvement to
enhance its accuracy and clarity.

(3) Abhay Tiwari (102356001):


This report is a solid start to documenting a machine learning experiment. It
effectively covers the key aspects of data preprocessing and provides practical
examples. However, a few areas can be refined to improve its accuracy,
consistency, and overall clarity.

REFERENCES

1. https://medium.com/@tiami.abiola/data-preprocessing-essential-rolein-machinelearning-258b8d9bd7e4
