UNIT 3 DATA SCIENCE
🔹 1. Business Understanding
✅ Definition:
This is the first and most critical phase of a data science project. It involves gaining a clear
understanding of the business context, goals, objectives, and problems to be solved.
✅ Key Objectives:
Understand the domain and industry.
Translate business problems into data science problems.
Define success criteria from a business perspective.
✅ Tasks Involved:
Meeting stakeholders (e.g., marketing, finance, sales)
Identifying pain points (e.g., churn, low sales)
Formulating problem statements (e.g., “Predict customer churn”)
Understanding constraints (budget, time, data availability)
✅ Example:
If a retail company has declining sales, the business problem might be:
“Identify key reasons for sales drop and predict future sales to improve stock management.”
🔹 2. Analytics Approach
✅ Definition:
This phase outlines the analytical techniques and methodologies suitable for solving the
defined business problem.
✅ Key Objectives:
Decide whether the problem is classification, regression, clustering,
recommendation, etc.
Select modeling techniques: statistical models, machine learning, deep learning, etc.
Define performance metrics: accuracy, precision, recall, RMSE, etc.
✅ Tasks Involved:
Mapping problem to algorithms
Selecting evaluation strategies (cross-validation, A/B testing)
Considering data types (structured/unstructured)
✅ Example:
If the task is to predict customer churn (a binary classification problem), logistic
regression or decision trees could be appropriate approaches.
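The choice of metric follows from the problem type. As a minimal sketch (the labels and sales figures are made-up illustration data, and the function names are not from any library), accuracy suits a classification task such as churn, while RMSE suits a regression task such as sales forecasting:

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (classification)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error (regression); lower is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up churn labels: 1 = churned, 0 = stayed
churn_true = [1, 0, 0, 1, 0]
churn_pred = [1, 0, 1, 1, 0]
churn_accuracy = accuracy(churn_true, churn_pred)  # 4 of 5 predictions match

# Made-up weekly sales figures (units sold)
sales_true = [100.0, 120.0, 90.0]
sales_pred = [110.0, 115.0, 95.0]
sales_rmse = rmse(sales_true, sales_pred)

print(churn_accuracy, sales_rmse)
```

Accuracy is only meaningful for discrete labels, and RMSE only for numeric targets, which is why the problem type has to be fixed before the metric is chosen.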
🔹 3. Data Requirements
✅ Definition:
In this step, the data scientist defines what kind of data is needed to perform the analysis or
modeling.
✅ Key Objectives:
Identify data sources and required attributes.
Determine volume, velocity, and variety of data.
Understand data granularity and frequency.
✅ Tasks Involved:
Listing required data columns (e.g., age, purchase history)
Identifying data sampling or aggregation needs
Noting privacy/security constraints
✅ Example:
For customer behavior analysis, data might be required on demographics, browsing history,
purchase history, and support tickets.
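One way to make the requirements concrete is to write them down as a checklist and compare it against what a source actually offers. The sketch below is an assumption for illustration only: the column names and the helper `missing_columns` are hypothetical.

```python
# Hypothetical list of attributes needed for a customer behavior analysis
REQUIRED_COLUMNS = {"customer_id", "age", "browsing_history",
                    "purchase_history", "support_tickets"}

def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required columns that a data source does not provide."""
    return sorted(required - set(available))

# Columns offered by an imaginary CRM export
crm_export = ["customer_id", "age", "purchase_history"]
gaps = missing_columns(crm_export)
print(gaps)
```

Any column reported as missing points to an additional source that has to be found in the data collection step.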
🔹 4. Data Collection
✅ Definition:
This is the process of gathering the required data from internal or external sources.
✅ Key Objectives:
Acquire data from reliable sources.
Ensure data accessibility and availability.
Maintain data integrity during collection.
✅ Tasks Involved:
Extracting data from databases, APIs, and files, or through web scraping
Working with data engineers to access data lakes/warehouses
Storing data securely
✅ Example:
Collect customer transaction data from SQL databases and social media engagement data via
APIs.
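Extraction from a SQL database can be sketched with Python's built-in `sqlite3` module. The in-memory database, the `transactions` table, and its rows are all made up here so that the example runs on its own; a real project would connect to the company's actual database instead.

```python
import sqlite3

# In-memory database standing in for a real transaction store (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, amount REAL, purchased_on TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 25.0, "2024-01-05"),
        (1, 40.0, "2024-02-11"),
        (2, 15.5, "2024-01-20"),
    ],
)
conn.commit()

def total_spend_per_customer(connection):
    """Collect one row per customer with the sum of their purchases."""
    query = (
        "SELECT customer_id, SUM(amount) FROM transactions "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    return connection.execute(query).fetchall()

spend = total_spend_per_customer(conn)
print(spend)
```

Aggregating inside the query keeps the amount of data pulled out of the database small, which matters once tables grow large.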
🔹 5. Data Understanding
✅ Definition:
This phase involves exploratory data analysis (EDA) to understand data quality, patterns,
anomalies, and relationships.
✅ Key Objectives:
Understand distributions, missing values, and data types.
Identify trends, outliers, or inconsistencies.
Assess whether the data is suitable for analysis.
✅ Tasks Involved:
Statistical summaries (mean, median, std dev)
Visualization (histograms, box plots, scatter plots)
Correlation analysis
✅ Example:
Plotting customer age distribution, checking if older users buy more, spotting missing income
values.
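The checks in this example can be sketched with the standard `statistics` module. The five customer records are invented, and `pearson` is a hand-written helper, not a library function.

```python
import statistics

# Made-up customer records; None marks a missing income value
customers = [
    {"age": 23, "income": 32000, "purchases": 2},
    {"age": 35, "income": None, "purchases": 5},
    {"age": 47, "income": 58000, "purchases": 7},
    {"age": 52, "income": 61000, "purchases": 9},
    {"age": 61, "income": None, "purchases": 8},
]

ages = [c["age"] for c in customers]
purchases = [c["purchases"] for c in customers]

# Statistical summary of the age column
summary = {
    "mean_age": statistics.mean(ages),
    "median_age": statistics.median(ages),
    "std_age": statistics.stdev(ages),  # sample standard deviation
}

# Count of missing income values
missing_income = sum(1 for c in customers if c["income"] is None)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

# Do older users buy more? A value near +1 suggests yes in this toy data.
age_purchase_corr = pearson(ages, purchases)
print(summary, missing_income, age_purchase_corr)
```

A strong correlation in a sample this small proves nothing by itself; it only flags a relationship worth checking on the full data.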
🔹 6. Data Preparation
✅ Definition:
Also known as data wrangling or data preprocessing, this phase prepares data for analysis
by cleaning and transforming it.
✅ Key Objectives:
Improve data quality.
Format data correctly for algorithms.
Create new features or variables.
✅ Tasks Involved:
Handling missing values (imputation or removal)
Removing duplicates
Feature scaling (normalization, standardization)
Encoding categorical variables
Creating derived variables
✅ Example:
Convert gender column to 0/1, scale income values, and fill missing ages using median age.
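The three steps in this example can be sketched in plain Python. The records are made up, the 0/1 coding of gender is an arbitrary choice, and min-max normalization is used as one possible form of scaling.

```python
import statistics

# Made-up raw records; None marks a missing age
raw = [
    {"gender": "F", "age": 30, "income": 40000},
    {"gender": "M", "age": None, "income": 60000},
    {"gender": "F", "age": 50, "income": 100000},
    {"gender": "M", "age": 40, "income": 80000},
]

GENDER_CODES = {"F": 0, "M": 1}  # arbitrary mapping chosen for illustration

# Median of the ages that are present, used to fill the gaps
known_ages = [r["age"] for r in raw if r["age"] is not None]
median_age = statistics.median(known_ages)

# Min-max scaling maps income onto the range [0, 1]
incomes = [r["income"] for r in raw]
low, high = min(incomes), max(incomes)

prepared = [
    {
        "gender": GENDER_CODES[r["gender"]],
        "age": r["age"] if r["age"] is not None else median_age,
        "income_scaled": (r["income"] - low) / (high - low),
    }
    for r in raw
]

for row in prepared:
    print(row)
```

In a real project the median and the min/max should be computed on the training data only and then reused on new data, so that no information leaks from the test set.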
🔹 7. Modeling
✅ Definition:
This is the phase where machine learning or statistical models are trained using the
prepared data.
✅ Key Objectives:
Select suitable algorithms.
Train models using training data.
Tune model parameters for best performance.
✅ Tasks Involved:
Model training and testing
Cross-validation
Hyperparameter tuning (e.g., using Grid Search)
Comparing different models
✅ Example:
Train a decision tree and a random forest to predict customer churn and compare their
accuracy.
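Comparing models with cross-validation can be sketched without any external library. Note the substitution: instead of a decision tree and a random forest, this sketch compares two much simpler stand-ins, a majority-class baseline and a one-feature nearest-centroid classifier. The data (monthly logins and churn labels) is invented, and all function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def fit_majority(xs, ys):
    """Baseline model: always predict the most frequent training label."""
    majority = max(sorted(set(ys)), key=ys.count)
    return lambda x: majority

def fit_nearest_centroid(xs, ys):
    """Predict the label whose training mean is closest to x."""
    centroids = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(values) / len(values)
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_val_accuracy(fit, xs, ys, k=5):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(1 for i in test_idx if model(xs[i]) == ys[i])
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Made-up data: monthly logins per customer, 1 = churned, 0 = stayed
logins = [20, 2, 18, 25, 22, 3, 19, 21, 24, 1]
churned = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1]

baseline_acc = cross_val_accuracy(fit_majority, logins, churned)
centroid_acc = cross_val_accuracy(fit_nearest_centroid, logins, churned)
print(baseline_acc, centroid_acc)
```

Including a trivial baseline in the comparison shows how much of a model's accuracy comes from class imbalance alone rather than from anything it has learned.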
🔹 8. Evaluation
✅ Definition:
This phase evaluates how well the model performs against the defined metrics and business
expectations.
✅ Key Objectives:
Check if the model solves the business problem effectively.
Measure performance on unseen/test data.
Verify with stakeholders if results are actionable.
✅ Tasks Involved:
Calculate metrics (accuracy, precision, recall, AUC, RMSE, etc.)
Confusion matrix analysis
Business validation: “Is this useful?”
✅ Example:
Your churn model achieves 85% accuracy, but the business asks, “Can it detect high-value
customers who might leave?”
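A confusion matrix shows why accuracy alone may not answer that question. In the sketch below the labels are made up so that accuracy comes out at 85% while most churners are still missed; the helper `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == positive and p == positive),
        "fp": sum(1 for t, p in pairs if t != positive and p == positive),
        "fn": sum(1 for t, p in pairs if t == positive and p != positive),
        "tn": sum(1 for t, p in pairs if t != positive and p != positive),
    }

# Made-up test set: 4 churners (1) and 16 loyal customers (0)
y_true = [1, 1, 1, 1] + [0] * 16
# The model flags only one of the four churners
y_pred = [1, 0, 0, 0] + [0] * 16

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
precision = cm["tp"] / (cm["tp"] + cm["fp"])  # share of flagged customers who churned
recall = cm["tp"] / (cm["tp"] + cm["fn"])     # share of churners that were flagged

print(cm, accuracy, precision, recall)
```

Here accuracy is 85%, yet recall is only 25%: three of the four churners go undetected. For a churn use case, recall on the customers the business cares about is usually the more relevant number.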
🔹 9. Deployment
✅ Definition:
The process of putting the model into production so it can be used in real-world scenarios.
✅ Key Objectives:
Make the model accessible (as an app, API, or embedded in software).
Ensure system integration.
Plan for scalability and monitoring.
✅ Tasks Involved:
Model exporting and hosting (Flask, FastAPI, AWS, Azure)
Creating dashboards or user interfaces
Scheduling model retraining if needed
✅ Example:
Deploy the churn prediction model in a CRM tool so sales teams get churn alerts for
follow-up.
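The core of a prediction service can be sketched as a plain function that takes a JSON request and returns a JSON response; a framework such as Flask or FastAPI would then expose it over HTTP. Everything here is an assumption for illustration: `churn_score` is a hypothetical stand-in for a trained model, and the field name `monthly_logins` and the alert threshold are invented.

```python
import json

def churn_score(features):
    """Hypothetical stand-in for a trained model; returns a score in [0, 1]."""
    # Assumed rule for illustration only: fewer logins means higher churn risk.
    logins = features["monthly_logins"]
    return max(0.0, min(1.0, 1.0 - logins / 20.0))

def handle_predict(request_body, alert_threshold=0.7):
    """Turn a JSON request string into a JSON response string."""
    try:
        features = json.loads(request_body)
        score = churn_score(features)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed input gets an error message instead of crashing the service
        return json.dumps({"error": "expected JSON with a numeric 'monthly_logins' field"})
    return json.dumps({"churn_score": round(score, 2), "alert": score >= alert_threshold})

# Example call, as a CRM tool might make it
response = handle_predict('{"monthly_logins": 2}')
print(response)
```

Keeping the prediction logic separate from the web framework makes it easy to test and to move between hosting options.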
🔹 10. Feedback
✅ Definition:
Collecting and analyzing feedback to improve the model or system continuously.
✅ Key Objectives:
Measure real-world effectiveness.
Track changes in data (data drift).
Adapt model to new business conditions.
✅ Tasks Involved:
Monitor predictions and accuracy over time.
Collect user/stakeholder feedback.
Plan versioning and retraining.
✅ Example:
If the churn model's accuracy drops after 3 months due to new customer behavior, retrain it
with recent data.
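Monitoring of this kind can be sketched as a simple check on accuracy measured month by month. The accuracy history, the baseline, and the tolerance of 5 percentage points are all made-up values, and `first_month_needing_retraining` is a hypothetical helper.

```python
def first_month_needing_retraining(monthly_accuracy, baseline, tolerance=0.05):
    """Return the first month (1-based) in which accuracy falls more than
    `tolerance` below the baseline, or None if it never does."""
    for month, acc in enumerate(monthly_accuracy, start=1):
        if baseline - acc > tolerance:
            return month
    return None

# Made-up accuracy measured on fresh labeled data after deployment
history = [0.85, 0.84, 0.83, 0.76]
retrain_month = first_month_needing_retraining(history, baseline=0.85)
print(retrain_month)
```

In this toy history the model stays within tolerance for three months and triggers retraining in the fourth. The right tolerance depends on how costly wrong predictions are for the business.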