Data Preprocessing:
Cleaning the data: handle missing values (e.g., by imputing them or dropping rows),
remove or cap outliers, and drop duplicate records.
Feature scaling: Standardization or normalization (especially important for models
like KNN, SVM, and neural networks).
Encoding categorical variables: Converting categorical data to numerical format
using techniques like one-hot encoding or label encoding.
Feature engineering: Creating new features or selecting the most relevant ones to
improve model performance.
Popular Python libraries for this (a short preprocessing sketch follows this list):
pandas for data manipulation
sklearn.preprocessing for scaling and encoding
numpy for numerical operations
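To make these steps concrete, here is a minimal sketch on a tiny made-up DataFrame (the column names age, income, and city are invented for illustration):
# Hedged preprocessing sketch: imputation, one-hot encoding, and scaling
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: numeric columns (one with a missing value) plus a categorical
df = pd.DataFrame({
    'age': [22, 35, None, 58],
    'income': [40000, 85000, 62000, 120000],
    'city': ['Oslo', 'Bergen', 'Oslo', 'Trondheim'],
})

# Cleaning: impute the missing age with the median
df['age'] = df['age'].fillna(df['age'].median())

# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['city'])

# Scaling: standardize the numeric columns to zero mean and unit variance
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
print(df)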
Working with Visual Studio Code:
Install the Python, Jupyter, and Pylance extensions in VS Code for IntelliSense,
debugging, and notebook support.
Make sure to set up a virtual environment to manage dependencies. You can use venv
or conda for this.
Use Jupyter notebooks within VS Code for interactive data exploration and model
testing; a small sketch of VS Code's cell markers follows below.
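As a small aside, the Python extension in VS Code treats # %% markers in a plain .py file as runnable cells, which gives a notebook-like workflow without a .ipynb file:
# %% Load and inspect the data (each "# %%" line starts a runnable cell in VS Code)
import pandas as pd
data = pd.read_csv('titanic.csv')
print(data.head())

# %% Count missing values per column
print(data.isnull().sum())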
Kaggle Dataset Exercises:
Kaggle is a goldmine for learning: explore competitions, public notebooks (formerly
called kernels), and datasets for practice.
Download the datasets and load them into your Python environment. After
preprocessing the data, you can experiment with different models (e.g., Decision Trees,
Random Forest, XGBoost, or even neural networks if you’re feeling adventurous); a
quick comparison sketch follows below.
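Once a dataset is preprocessed, cross-validation makes it easy to compare candidates. The hedged sketch below uses a synthetic dataset as a stand-in for your own features and target; swap in your preprocessed X and y:
# Hedged sketch: comparing models with 5-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a preprocessed Kaggle dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f} mean accuracy')
(XGBoost works the same way once installed with pip install xgboost; it is left out here to keep the sketch dependency-free.)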
Getting Started with a Kaggle Exercise:
Download a dataset from Kaggle, say the Titanic dataset (for classification) or House
Prices (for regression).
Start by exploring the data (using pandas and matplotlib/seaborn for visualization;
a short exploration sketch follows this list).
Preprocess the data: handle missing values, encode categories, and scale the features.
Train a basic model (Logistic Regression for Titanic, Linear Regression for House
Prices) using sklearn and evaluate it.
Gradually improve your model by experimenting with different algorithms,
hyperparameters, and feature engineering.
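Before the full pipeline below, here is a short exploration sketch, assuming the Titanic CSV with its standard columns such as Pclass, Age, and Survived:
# Quick exploratory look at the Titanic data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('titanic.csv')

# Shape, dtypes, and missing-value counts
data.info()
print(data.isnull().sum())

# Survival counts by passenger class
sns.countplot(data=data, x='Pclass', hue='Survived')
plt.show()

# Age distribution
sns.histplot(data=data, x='Age', bins=30, kde=True)
plt.show()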
Example Pipeline in Python (Titanic Dataset):
# 1. Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# 2. Load Data
data = pd.read_csv('titanic.csv')
# 3. Data Preprocessing
# Fill missing values (assign back instead of using inplace= on a column slice,
# which is deprecated in recent pandas)
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
# Encode categorical columns
data = pd.get_dummies(data, columns=['Sex', 'Embarked'])
# Select features and target
X = data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q',
'Embarked_S']]
y = data['Survived']
# 4. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 5. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 6. Train Model
model = LogisticRegression(max_iter=1000)  # generous cap to avoid convergence warnings
model.fit(X_train_scaled, y_train)
# 7. Evaluate Model
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
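To act on the "gradually improve" advice above, here is a hedged sketch of a hyperparameter search with a stronger model, continuing from the X_train_scaled, y_train, X_test_scaled, and y_test variables defined in the pipeline (the grid values are illustrative, not tuned recommendations):
# Hedged sketch: grid search over a Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

param_grid = {
    'n_estimators': [100, 300],   # illustrative values, not tuned recommendations
    'max_depth': [None, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train_scaled, y_train)
print('Best params:', search.best_params_)

# Evaluate the tuned model on the held-out test set
y_pred_rf = search.best_estimator_.predict(X_test_scaled)
print(classification_report(y_test, y_pred_rf))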