
MACHINE LEARNING

(UEC520)

SUBMITTED BY:
Rohit Singla 102306033 (L)
Ojasvi Vashisht 102306043
Jaiditya Kapoor 102306136
Abhay Tiwari 102356001

SUBGROUP: 3F12

SUBMITTED TO:
Dr. GITANJALI CHANDWANI MONACHA

EXPERIMENT 1
To apply various Data Preprocessing Techniques on a dataset to prepare it for ML Algorithms

Introduction to Data Preprocessing:


What is Data Preprocessing?

• Data preprocessing is a crucial and foundational step in the field of data science and
machine learning. In today's digital age, data is generated in massive amounts from
diverse sources like sensors, mobile devices, social media, and medical equipment.
While this abundance of data is valuable, it's rarely in a clean, usable format. Raw data
often contains errors, missing values, and inconsistencies, which can significantly hinder
the performance of any analytical model.
• The primary goal of data preprocessing is to transform this raw, "dirty" data into a clean,
organized, and meaningful format. This essential process makes the data ready for
analysis and ensures that machine learning algorithms can produce accurate and reliable
results. Think of it as preparing ingredients before you start cooking; a good chef knows
that the quality of the final dish depends on the preparation of the raw ingredients.
Similarly, the success of any data analysis or machine learning project hinges on the
quality of the data, which is largely determined by effective data preprocessing.
• Figure 1 illustrates the main components of data preprocessing: Data Cleaning, Data
Integration, and Data Transformation. Each of these components involves a set of
techniques designed to address specific issues in the data, ensuring it is in the best
possible state for subsequent analysis.

Why is Data Preprocessing Important?

In today's digital world, data is collected from many different sources, such as sensors, social
media, and financial systems. This raw data is often incomplete, inconsistent, or contains errors
and missing values. If this "dirty" data is used directly, it can significantly harm the performance
of machine learning models.

Data preprocessing helps to solve this problem by:

• Improving Accuracy: Cleaning the data of errors and inconsistencies leads to more
accurate and reliable results.
• Boosting Efficiency: Preprocessed data is easier and faster for algorithms to process.
• Ensuring Consistency: It standardizes data formats and values, so different data sources
can be used together effectively.

Key Techniques in Data Preprocessing

Data preprocessing involves several core activities; a short illustrative code sketch follows the list:

• Data Cleaning: This is the process of fixing or removing errors, inconsistencies, and
missing values in the data. For example, if a data entry is "45 years" instead of "45", data
cleaning would fix it.
• Data Transformation: This technique converts data into a more suitable format. This
can include normalization (scaling values to a specific range) or aggregation
(summarizing data into a single value).
• Data Integration: This combines data from multiple sources into a single, cohesive
dataset. This is essential when you need to analyze information from various databases
or files.
• Data Reduction: This technique aims to decrease the volume of data while maintaining
its integrity. It helps to reduce storage space and processing time. This can be done
through dimensionality reduction (reducing the number of attributes) or numerosity
reduction (reducing the number of data points).
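
To make these techniques concrete, the following short Python (pandas) sketch illustrates cleaning, normalization, and aggregation on a small made-up table. The column names and values are illustrative assumptions only and are not taken from the experiment's dataset.

import pandas as pd

# Hypothetical "dirty" data: a textual age entry and a missing salary
# (illustrative values only, not from the experiment's dataset)
raw = pd.DataFrame({
    "Age": ["45 years", "32", "28"],
    "Salary": [52000.0, None, 61000.0],
})

# Data Cleaning: strip the stray text from 'Age' and fill the missing salary with the mean
raw["Age"] = raw["Age"].str.extract(r"(\d+)", expand=False).astype(float)
raw["Salary"] = raw["Salary"].fillna(raw["Salary"].mean())

# Data Transformation: min-max normalization scales each column to [0, 1]
# using (value - min) / (max - min)
normalized = (raw - raw.min()) / (raw.max() - raw.min())

# Data Reduction (aggregation): summarize each column into a single value
summary = raw.mean()

print(normalized)
print(summary)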

Figure 1. Data Preprocessing Cycle[1]


Preprocessing Techniques Applied to the Dataset
The dataset provided contains two attributes, Years of Experience and Salary, with 30
observations. Both attributes are numerical and continuous in nature. Since this dataset is
comparatively small and already well-structured, heavy preprocessing steps are not required.
However, to ensure reliability and prepare the data for machine learning models such as
Linear Regression, the following preprocessing techniques were applied.

1. Handling Missing Values

One of the most common problems in real-world datasets is the presence of missing values.
These can occur due to errors in data collection, transmission issues, or manual entry mistakes.
In this dataset, on careful inspection, no missing values were found in either Years of Experience
or Salary. Therefore, no imputation technique such as mean substitution, median replacement,
or forward/backward filling was required. This makes the dataset ready for direct use without
additional cleaning in this aspect.
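
As a quick illustration, such an inspection can be carried out with pandas. The snippet below is a minimal sketch; the file name 'Salary_Data.csv' is an assumption for the experience/salary dataset and is not taken from the report.

import pandas as pd

# Load the experience/salary dataset (file name assumed for illustration)
df = pd.read_csv('Salary_Data.csv')

# Count missing values per column; zeros everywhere confirm that no imputation is needed
print(df.isnull().sum())

# If missing values were present, a simple mean substitution could be applied, e.g.:
# df['Salary'] = df['Salary'].fillna(df['Salary'].mean())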

2. Encoding Categorical Data

The 'Country' column is a categorical variable with values like 'France', 'Spain', and 'Germany'.
Machine learning models require numerical input, so these textual categories must be converted.
One-Hot Encoding is applied to this column using OneHotEncoder within a
ColumnTransformer. This process creates a new binary column for each unique country, where
a 1.0 indicates the presence of that country and a 0.0 indicates its absence. The remaining
columns ('Age' and 'Salary') are passed through unchanged using remainder='passthrough'.
Pseudo Code:

// START of the Data Preprocessing Experiment

// 1. Load the initial dataset
// The dataset contains columns: 'Country', 'Age', 'Salary', 'Purchased'.
LOAD data from 'Data.csv' INTO a pandas DataFrame.
SEPARATE the features (X) from the target variable (y).
  - X = all columns EXCEPT 'Purchased'
  - y = the 'Purchased' column
PRINT the initial dataset to show the raw data, including missing values (NaN).

// 2. Handle Missing Values
// Missing values are in the 'Age' and 'Salary' columns.
//  - Use a SimpleImputer from sklearn.impute.
//  - Set the strategy to 'mean' to replace NaN values with the column's mean.
CREATE a SimpleImputer instance with 'mean' strategy.
FIT the imputer to the 'Age' and 'Salary' columns of X.
TRANSFORM the 'Age' and 'Salary' columns of X using the fitted imputer.
PRINT the updated X to show the dataset with missing values replaced.

// 3. Encode Categorical Data
// The 'Country' column is a categorical variable that needs to be converted to numerical format.
//  - Use One-Hot Encoding via a ColumnTransformer.
//  - The OneHotEncoder will be applied to the first column (Country).
//  - The remaining columns ('Age', 'Salary') will be passed through unchanged.
CREATE a ColumnTransformer instance.
  - Set transformer to 'encoder' with OneHotEncoder applied to column [0].
  - Set remainder to 'passthrough'.
APPLY the ColumnTransformer to X to perform One-Hot Encoding.
  - The 'Country' column is replaced by three new binary columns (e.g., [1, 0, 0] for 'France').
  - The 'Age' and 'Salary' columns are kept as they are.
CONVERT the result to a NumPy array.

// 4. Feature Scaling
//  - Apply MinMaxScaler to the numerical columns ('Age' and 'Salary') to normalize their values.
CREATE a MinMaxScaler instance.
FIT and TRANSFORM the 'Age' and 'Salary' columns of X using the scaler.
ROUND the scaled values to two decimal places.
PRINT the final preprocessed feature set.

// END of the Data Preprocessing Experiment
Results of Preprocessing:

The preprocessing steps transform the original raw dataset into a clean, numerical
matrix ready for machine learning algorithms.

Initial Dataset:

The original dataset, before any preprocessing, looks like this, showing NaN values
in the 'Age' and 'Salary' columns.
Code:

import numpy as np
import pandas as pd

# Load the raw dataset and display it, including any NaN entries
df = pd.read_csv('/Data.csv')
print(df)

# Split into features (all columns except 'Purchased') and target ('Purchased')
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
Output:
Code:

from sklearn.impute import SimpleImputer

# Replace NaN values in the 'Age' and 'Salary' columns (indices 1 and 2)
# with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)

Output :

Final Pre-processed Dataset :

After handling missing values and one-hot encoding, the x variable (features) is
transformed into a numerical array.

• The first three columns contain the one-hot encoded countries: [1.0 0.0 0.0]
represents France, [0.0 1.0 0.0] Germany, and [0.0 0.0 1.0] Spain.
• The subsequent columns contain the numerical 'Age' and 'Salary' data, with the
NaN values replaced by their respective column means.
Code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# One-hot encode the 'Country' column (index 0); pass 'Age' and 'Salary' through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
x = np.array(ct.fit_transform(x))

# Scale 'Age' and 'Salary' (now at columns 3 onward) to [0, 1] and round to two decimals
scaler = MinMaxScaler()
x[:, 3:] = np.round(scaler.fit_transform(x[:, 3:]), 2)
print(x)
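
For reference, MinMaxScaler rescales each value with (value - min) / (max - min), mapping every column to the [0, 1] range. The tiny sketch below uses made-up ages, not values from Data.csv, purely to illustrate the effect:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages used only to illustrate min-max scaling
ages = np.array([[27.0], [38.0], [50.0]])

scaler = MinMaxScaler()
print(scaler.fit_transform(ages))  # [[0.0], [0.478...], [1.0]] since (38 - 27) / (50 - 27) ≈ 0.478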

The final preprocessed feature set looks like this:


GROUP MEMBER SUGGESTIONS
(1) Jaiditya Kapoor (102306136):
The document is well-organized and serves as a strong foundation for an
experiment report. By addressing the minor inconsistencies regarding the dataset's
attributes and the presence of missing values, the report will be even more
accurate and effective. The inclusion of the group members' suggestions at the
end is a nice touch and shows that their feedback was incorporated. The document
is a great example of a practical application of data preprocessing.

(2) Ojasvi Vashisht (102306043):


This document provides a good overview of data preprocessing, effectively linking theory
to a practical example. The report's structure is logical, and the inclusion of code and outputs
is a major strength. However, there are a few inconsistencies and areas for improvement to
enhance its accuracy and clarity.

(3) Abhay Tiwari (102356001):


This report is a solid start to documenting a machine learning experiment. It
effectively covers the key aspects of data preprocessing and provides practical
examples. However, a few areas can be refined to improve its accuracy,
consistency, and overall clarity.

REFERENCES

1. https://medium.com/@tiami.abiola/data-preprocessing-essential-rolein-machinelearning-258b8d9bd7e4
