Session 4
Machine Learning Process
Learning Outcomes
• By the end of this lecture, you will be able to:
• Understand the process of developing a machine
learning model.
• Identify and explain each step in the machine learning
life cycle.
• Apply the machine learning life cycle to real-world
examples.
• Recognize common challenges and best practices in
each phase of the cycle.
Machine Learning Overview
• Machine learning is a subset of artificial intelligence (AI).
• It trains computers to mimic human thinking.
• It uses real-world data for training.
• Training follows a predefined sequence of steps.
• This sequence is known as the machine learning life cycle.
Steps in the Machine Learning Process
• The life cycle guides the development and deployment of
machine learning models.
• It is a structured process with several steps.
• Understanding the life cycle:
• ensures systematic development and deployment,
• improves efficiency, and
• enhances model performance.
Steps in the Machine Learning Process
• Problem Definition: Before starting the process, clearly
define the problem you aim to solve.
• Example: Predicting customer churn for a telecom
company.
• Key Considerations: Business objectives, success metrics,
feasibility.
Step 1: Gathering Data
• Identify Data Sources
• Recognize where data can be collected from.
• Examples: Files, databases, internet, mobile devices.
• Collect Data
• Gather data from identified sources.
• Ensure data is relevant and comprehensive.
• Integrate Data
• Combine data from different sources.
• Create a coherent and unified dataset.
• Outcome
• Ready-to-use dataset for further processing.
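The integration step can be sketched with pandas. This is a minimal illustration, assuming two hypothetical sources (a customer table and a usage table) that share a `customer_id` key; all names and values are made up for the example.

```python
import pandas as pd

# Hypothetical source 1: customer records (e.g. exported from a database)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "plan": ["basic", "premium", "basic"],
})

# Hypothetical source 2: usage records (e.g. collected from mobile devices)
usage = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "monthly_minutes": [120, 640, 75],
})

# Combine both sources on the shared key; a left join keeps every customer,
# leaving a missing value where no usage record exists
dataset = customers.merge(usage, on="customer_id", how="left")
print(dataset)
```

Customer 3 has no usage record, so the unified dataset contains a missing value that the data preparation step has to deal with.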
Step 2: Data Preparation
• Raw data is often messy and unstructured.
• Data cleaning involves addressing issues such as missing
values, outliers, and inconsistencies that could compromise the
accuracy and reliability of the machine learning model.
Objective
• Refine raw data for meaningful analysis.
• Lay the foundation for robust model development.
• The basic features of Data Cleaning and Preprocessing are
discussed next:
Step 2: Data Preparation
Data Cleaning
• Address missing values.
• Handle outliers.
• Resolve inconsistencies.
Data Preprocessing
• Standardize formats.
• Scale values.
• Encode categorical variables.
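The cleaning and preprocessing operations listed above might look like this in pandas. It is a sketch only: the columns, the values, and the outlier cap of 100 are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative data with one missing value, one outlier, inconsistent text
df = pd.DataFrame({
    "age": [25, 32, None, 41, 120],
    "city": ["North", "north ", "South", "East", "South"],
})

# Data cleaning
df["age"] = df["age"].fillna(df["age"].median())   # address missing values
df["age"] = df["age"].clip(upper=100)              # cap outliers (assumed limit)
df["city"] = df["city"].str.strip().str.title()    # resolve inconsistencies

# Data preprocessing
age_min, age_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - age_min) / (age_max - age_min)  # scale to [0, 1]
df = pd.get_dummies(df, columns=["city"])          # encode the categorical variable

print(df)
```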
Step 2: Data Preparation
Data Quality
• Ensure well-organized data.
• Prepare for meaningful analysis.
Data Integrity
• Maintain dataset integrity.
• Clean and preprocess the data effectively.
Step 3: Data Wrangling
• Data wrangling is the process of cleaning raw data and
converting it into a usable format.
• It involves cleaning the data, selecting the variables to
use, and transforming the data into a format suitable for
analysis in the next step.
• Cleaning is required to address data quality issues.
Step 3: Data Wrangling
• In real-world applications, collected data may have
various issues, including:
• Missing values
• Duplicate data
• Invalid data
• Noise (irrelevant or meaningless data)
• Various filtering techniques are used to clean the data.
• These issues must be detected and removed because
they can negatively affect the quality of the outcome.
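The four issues above can each be filtered out with a pandas operation. This is a minimal sketch on made-up data; treating negative charges as invalid and the `notes` column as noise are assumptions for the example.

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "monthly_charge": [29.9, 59.0, 59.0, -5.0, None],
    "notes": ["ok", "n/a", "n/a", "??", "ok"],
})

clean = (
    raw.drop_duplicates()                   # remove duplicate rows
       .dropna(subset=["monthly_charge"])   # remove rows with missing values
       .query("monthly_charge >= 0")        # remove invalid (negative) charges
       .drop(columns=["notes"])             # drop a column assumed to be noise
       .reset_index(drop=True)
)
print(clean)
```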
Step 4: Analyze Data
• Also called “Exploratory Data Analysis (EDA)”.
• Understanding the underlying patterns and characteristics
of collected data.
• Leveraging statistical and visual tools to gain insights into
the dataset’s structure.
• Visualizations, summary statistics, and correlation
analyses play a crucial role.
• Example of data visualization (e.g., histogram, scatter
plot).
Step 4: Analyze Data
• Exploration: Use statistical and visual tools to explore the
structure and patterns in the data.
• Patterns and Trends: Identify underlying patterns, trends,
and potential challenges within the dataset.
• Insights: Gain valuable insights to inform decisions in later
stages of the machine learning process.
• Decision Making: Use exploratory data analysis to make
informed decisions about feature engineering and model
selection.
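Summary statistics and a correlation analysis can be produced in a few lines of pandas. The sketch below uses made-up customer data; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "tenure_months": [1, 6, 12, 24, 36],
    "total_spend": [30, 170, 350, 700, 1100],
})

# Summary statistics: count, mean, std, min, quartiles, max per column
summary = df.describe()

# Pearson correlation between the two columns
corr = df["tenure_months"].corr(df["total_spend"])

print(summary)
print(f"Correlation between tenure and spend: {corr:.2f}")
```

A correlation close to 1 would suggest that one of the two columns carries most of the information in the other, which is useful input for feature selection.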
Step 5: Feature Engineering and
Selection
• Feature Selection: Identify the subset of features that most
significantly impact the model’s performance.
• Feature Engineering: Create new features or transform
existing ones to better capture patterns and relationships.
• Feature engineering requires domain expertise and a deep
understanding of the problem.
• The aim is to engineer features that contribute
meaningfully to predictive power.
• Optimization: Balance feature set for predictive accuracy
while minimizing computational complexity.
Step 5: Feature Engineering and
Selection - Example using Python
Problem: Predict the `price` of houses using the available
features.
Dataset: Assume we have a dataset `house_data.csv` with the
following columns:
• house_id
• size_in_sqft
• num_bedrooms
• num_bathrooms
• location
• year_built
• price
Step 5: Feature Engineering and
Selection – Example using Python
Loading the Data:
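A minimal sketch of the loading step, assuming pandas is used. So that the example runs without the file, an in-memory CSV with made-up rows stands in for `house_data.csv`.

```python
import io
import pandas as pd

# Made-up rows standing in for house_data.csv (values are illustrative only)
csv_text = """house_id,size_in_sqft,num_bedrooms,num_bathrooms,location,year_built,price
1,1500,3,2,suburb,1995,250000
2,2100,4,3,city,2010,420000
3,900,2,1,rural,1980,120000
"""

# With the real file this would be: df = pd.read_csv("house_data.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```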
Step 5: Feature Engineering and
Selection – Example using Python
Exploring the Data:
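A first look at the data might use the calls below. This is a sketch on a small made-up sample of the house columns, not the real dataset.

```python
import pandas as pd

# Made-up sample with one missing size value
df = pd.DataFrame({
    "size_in_sqft": [1500, None, 900],
    "num_bedrooms": [3, 4, 2],
    "location": ["suburb", "city", "rural"],
    "price": [250000, 420000, 120000],
})

df.info()                                # column types and non-null counts
print(df.describe())                     # summary statistics for numeric columns
missing_per_column = df.isnull().sum()   # number of missing values per column
print(missing_per_column)
```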
Step 5: Feature Engineering and
Selection – Example using Python
Handling Missing Values:
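One common way to handle missing values is sketched below: drop rows that lack the target and fill the remaining numeric gaps with the column median. The choice of median is an assumption; the mean or a model-based imputation are alternatives.

```python
import pandas as pd

# Made-up sample with gaps in two features and in the target
df = pd.DataFrame({
    "size_in_sqft": [1500, None, 900, 2100],
    "num_bedrooms": [3, 3, None, 4],
    "price": [250000, 420000, 120000, None],
})

# Rows without the target cannot be used for supervised training
df = df.dropna(subset=["price"]).copy()

# Fill remaining numeric gaps with each column's median
medians = df[["size_in_sqft", "num_bedrooms"]].median()
df = df.fillna(medians)

print(df)
```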
Step 5: Feature Engineering and
Selection – Example using Python
Feature Creation
• Total Rooms: Create a new feature by adding the number
of bedrooms and bathrooms:
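A minimal sketch of this feature, using made-up values; the column name `total_rooms` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "num_bedrooms": [3, 4, 2],
    "num_bathrooms": [2, 3, 1],
})

# New feature: bedrooms plus bathrooms
df["total_rooms"] = df["num_bedrooms"] + df["num_bathrooms"]
print(df)
```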
Step 5: Feature Engineering and
Selection – Example using Python
Feature Creation
• Age of House: Create a new feature representing the age
of the house:
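A minimal sketch of this feature. The reference year 2025 and the column name `house_age` are assumptions; in practice the reference would usually be the current year or the year of sale.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # assumed reference year for this example

df = pd.DataFrame({"year_built": [1995, 2010, 1980]})

# New feature: age of the house in years
df["house_age"] = REFERENCE_YEAR - df["year_built"]
print(df)
```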
Step 5: Feature Engineering and
Selection – Example using Python
Feature Creation
• Location Encoding: Convert categorical data into
numerical data:
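One way to encode `location` is one-hot encoding with `pd.get_dummies`, sketched below on made-up values. Other encodings (for example, label encoding) are possible; one-hot is an assumption here.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["suburb", "city", "rural", "city"],
    "price": [250000, 420000, 120000, 390000],
})

# One 0/1 column per location category replaces the original text column
encoded = pd.get_dummies(df, columns=["location"], dtype=int)
print(encoded)
```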
Step 5: Feature Engineering and
Selection – Example using Python
Feature Selection
• Drop less relevant or redundant features:
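A sketch of the selection step on made-up values. Which columns count as redundant is a judgment call: here `house_id` is assumed to carry no predictive information, and the bedroom, bathroom, and year columns are assumed to be covered by the engineered `total_rooms` and `house_age` features.

```python
import pandas as pd

df = pd.DataFrame({
    "house_id": [1, 2, 3],
    "size_in_sqft": [1500, 2100, 900],
    "num_bedrooms": [3, 4, 2],
    "num_bathrooms": [2, 3, 1],
    "total_rooms": [5, 7, 3],
    "year_built": [1995, 2010, 1980],
    "house_age": [30, 15, 45],
    "price": [250000, 420000, 120000],
})

# Drop the identifier and the columns already captured by engineered features
selected = df.drop(columns=["house_id", "num_bedrooms", "num_bathrooms", "year_built"])
print(selected.columns.tolist())
```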
Step 6: Train Model
• Split the dataset into training and testing sets.
• Training Set: Used to train the model.
• Testing Set: Used to evaluate the model.
• Select an appropriate machine learning algorithm.
• Regression: Linear Regression, Ridge, Lasso, etc.
• Classification: Logistic Regression, Decision Trees, Random
Forest, SVM, etc.
• Clustering: K-Means, Hierarchical Clustering, etc.
• Train the model on the training set.
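The three steps above can be sketched with scikit-learn. The data is synthetic (price generated as a noisy linear function of size), and the 80/20 split and the choice of Linear Regression are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price depends linearly on size, plus random noise
rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, size=100)
price = 150 * size + 20000 + rng.normal(0, 10000, size=100)

X = size.reshape(-1, 1)   # feature matrix with one column
y = price                 # target

# Split: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Select a regression algorithm and train it on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

print("Learned slope:", model.coef_[0])
print("R^2 on the testing set:", model.score(X_test, y_test))
```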
Step 7: Model Evaluation
• Test the model to determine its percentage accuracy.
• This involves rigorous testing against validation datasets.
• Evaluation metrics such as accuracy, precision, recall, and
F1 score are computed to gauge its effectiveness.
• Provides insights into the model’s strengths and
weaknesses.
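The four metrics named above can be computed directly from the true and predicted labels of a binary classifier. The labels below are made up for illustration.

```python
# 1 = positive class (e.g. customer churned), 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)                    # share of correct predictions
precision = tp / (tp + fp)                           # predicted positives that are right
recall = tp / (tp + fn)                              # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Here accuracy is 0.625 while recall is only 0.5, which shows why a single metric rarely tells the whole story.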
Step 8: Model Deployment
• We deploy the model in the real-world system.
• The deployment phase is similar to making the final report
for a project.
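One simple way to move a trained model into another system is to save its learned parameters and reload them where predictions are served. The sketch below assumes a linear model with hypothetical parameter values and a hypothetical file name; it is an illustration, not a prescribed deployment method.

```python
import json
import os
import tempfile

# Hypothetical parameters of an already-trained linear model
params = {"slope": 150.0, "intercept": 20000.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "price_model.json")

    # Training side: save the learned parameters
    with open(path, "w") as f:
        json.dump(params, f)

    # Serving side: load the parameters back
    with open(path) as f:
        loaded = json.load(f)


def predict_price(size_in_sqft):
    """Predict a price from the loaded linear-model parameters."""
    return loaded["slope"] * size_in_sqft + loaded["intercept"]


print(predict_price(1000))  # 170000.0
```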
Next Steps
1. Install a Python-compatible IDE (Integrated Development
Environment).
2. Install the Weka Machine Learning Environment.
Assignment:
1. Describe the following machine learning processes:
a. CRISP-DM
b. SEMMA
c. KDD
(6 marks)
2. Identify the key differences and similarities among the
data mining (KDD) and machine learning (CRISP-DM,
SEMMA) processes. (4 marks)
Submit by: 19/05/2025 (hard copy)