Data Analytics - Unit 4 Full Notes
1. Supervised vs Unsupervised Learning
| Feature | Supervised Learning | Unsupervised Learning |
|---------|---------------------|------------------------|
| Definition | Learning with labeled data | Learning with unlabeled data |
| Input Data | Input has output labels | Input has no output labels |
| Goal | Predict output | Discover hidden patterns |
| Output Type | Predictive (classification/regression) | Descriptive (clusters/associations) |
| Example Tasks | Classification, Regression | Clustering, Association |
| Evaluation | Accuracy, RMSE, etc. | Silhouette score, manual interpretation |
| Algorithms | Decision Trees, SVM, Linear Regression | K-Means, DBSCAN, PCA |
| Use Cases | Email spam detection, loan approval | Customer segmentation, anomaly detection |
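The contrast is easy to see in code. Below is a minimal sketch (assuming scikit-learn is available; the blob dataset and model choices are illustrative, not part of the notes): the supervised model is handed labels to predict, while the unsupervised model must discover the groups on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: 100 two-dimensional points drawn from 3 groups.
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

# Supervised: the labels y are given; the model learns to predict them.
clf = LogisticRegression().fit(X, y)
print("Supervised predictions:", clf.predict(X[:5]))

# Unsupervised: only X is given; the model discovers groupings itself.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster labels:", km.labels_[:5])
```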
2. Segmentation
Segmentation is the process of dividing a dataset into smaller, meaningful subgroups based on similarities in attributes or behavior.
Types of Segmentation:
- Demographic: Age, income, gender
- Geographic: Region, city, country
- Behavioral: Purchase habits, product usage
- Psychographic: Lifestyle, interests
Segmentation Techniques:
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Self-Organizing Maps (SOM)
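As a sketch of how one of these techniques is applied in practice, the snippet below segments a handful of hypothetical customers with K-Means (the spend/visit numbers are made up for illustration; scikit-learn is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical behavioral attributes: [annual spend, visits per month].
customers = np.array([
    [200, 1], [250, 2], [5000, 12], [4800, 10], [1200, 5], [1100, 4],
])

# Scale first so one attribute does not dominate the distance calculation.
X = StandardScaler().fit_transform(customers)

# Ask for three segments, e.g. low / medium / high engagement.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(segments)  # one segment label per customer
```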
Applications:
- Marketing: Targeting specific customer groups
- Healthcare: Grouping patients by conditions
- Retail: Personalizing product recommendations
Goal: Improve analysis, decision-making, and forecasting by understanding group-specific behavior.
3. Decision Trees
Decision Trees are flowchart-like structures used for classification and regression tasks.
Types:
- Classification Tree: Output is categorical
- Regression Tree: Output is numerical
Structure:
- Internal nodes: Tests on attribute values
- Branches: Outcomes of tests
- Leaves: Final decisions or class labels
Splitting Criteria:
- Gini Index, Entropy/Information Gain for classification
- Variance reduction for regression
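Both classification criteria reduce to simple formulas over the class proportions p_i at a node: Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)). A minimal sketch of computing them (NumPy assumed):

```python
import numpy as np

def gini(labels):
    # Gini index: 1 - sum(p_i^2); 0 for a pure node, 0.5 for a 50/50 binary node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)); 0 for a pure node, 1 for a 50/50 binary node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]))     # 0.5
print(entropy([0, 0, 1, 1]))  # 1.0
print(gini([0, 0, 0, 0]))     # 0.0
```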
Process:
1. Choose the best splitting attribute
2. Partition the data accordingly
3. Recursively build subtrees
4. Stop when nodes are pure or a stopping condition (e.g., maximum depth) is reached
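In practice a library handles this recursion. A minimal sketch with scikit-learn's DecisionTreeClassifier on the built-in iris dataset (parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion selects the splitting measure ("gini" or "entropy");
# max_depth is a stopping condition from step 4 above.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```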
Challenges and Remedies:
- Overfitting: Very deep trees memorize noise in the training data
- Pruning: Simplifies the tree by removing branches that add little predictive value (see Section 4)
Ensembles (Multiple Trees):
- Random Forests: Majority voting among many trees, each trained on a random sample of the data
- Boosting: Combines weak learners into a strong model
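A quick sketch comparing the two ensemble styles with scikit-learn (the dataset and settings are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Random Forest: many independent trees vote on the final class.
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: weak trees are added one at a time, each correcting the last.
gb = GradientBoostingClassifier(random_state=0)

print("Random Forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("Boosting CV accuracy:     ", cross_val_score(gb, X, y, cv=5).mean())
```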
Applications: Credit scoring, medical diagnosis, churn prediction
4. Overfitting and Pruning
Overfitting occurs when a model learns the training data too closely, including noise and anomalies, leading to poor
generalization.
Symptoms:
- High training accuracy but low test accuracy
- Complex and deep tree structure
Causes:
- Too many attributes
- Lack of pruning
- Small datasets
Pruning is used to reduce tree size and improve generalization.
Types of Pruning:
- Pre-Pruning: Stops tree growth early (e.g., max depth, min samples)
- Post-Pruning: Removes unnecessary branches after full tree is built
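Both styles map directly onto scikit-learn parameters. A minimal sketch (the iris dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: cap depth and minimum samples per split before training.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
pre.fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then apply cost-complexity pruning
# (a larger ccp_alpha removes more branches).
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_tr, y_tr)

print("Pre-pruned accuracy: ", pre.score(X_te, y_te))
print("Post-pruned accuracy:", post.score(X_te, y_te))
```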
Benefits:
- Reduces overfitting
- Improves prediction on unseen data
- Enhances interpretability
Goal: Build a model that balances complexity and accuracy.
5. Measures of Forecast Accuracy
Forecast accuracy metrics evaluate how close predictions are to actual values.
Common Metrics:
- MAE (Mean Absolute Error): Average of absolute errors
- MSE (Mean Squared Error): Average of squared errors
- RMSE (Root Mean Squared Error): Square root of MSE
- MAPE (Mean Absolute Percentage Error): Error as a percentage
- sMAPE (Symmetric MAPE): Like MAPE, but divides by the average of actual and forecast magnitudes, keeping the result bounded
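All five metrics are one-liners in NumPy. A minimal sketch with made-up actual/forecast values:

```python
import numpy as np

actual = np.array([100.0, 120.0, 130.0, 90.0])     # hypothetical actuals
forecast = np.array([110.0, 115.0, 125.0, 100.0])  # hypothetical forecasts
err = actual - forecast

mae = np.mean(np.abs(err))                  # MAE: average absolute error
mse = np.mean(err ** 2)                     # MSE: average squared error
rmse = np.sqrt(mse)                         # RMSE: same units as the data
mape = np.mean(np.abs(err / actual)) * 100  # MAPE: fails if any actual is 0
smape = np.mean(2 * np.abs(err) / (np.abs(actual) + np.abs(forecast))) * 100

print(mae, mse, rmse, mape, smape)
```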
Choosing the Right Metric:
- Use MAE for simple average error
- Use RMSE when large errors matter more
- Use MAPE for relative accuracy (undefined when actual values include zero)
Applications:
- Retail: Sales forecasting
- Finance: Stock price prediction
- Healthcare: Patient count prediction
Lower metric values indicate higher accuracy.
6. STL Decomposition
STL (Seasonal and Trend decomposition using Loess) breaks a time series into three components:
1. Trend: Long-term progression
2. Seasonality: Repeating short-term cycles
3. Residual: Random noise
STL uses LOESS (Local regression) for smoothing and is highly flexible.
Advantages:
- Handles any seasonal period and allows the seasonal component to change over time
- Robust to outliers
- Allows component-wise analysis
Steps:
1. Input time series
2. Apply smoothing to extract trend and seasonality
3. Subtract from original to get residual
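The statsmodels library implements these steps directly. A minimal sketch on a synthetic monthly series (the series itself is fabricated for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic 4-year monthly series: trend + annual seasonality + noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(
    np.linspace(100, 160, 48)                      # upward trend
    + 10 * np.sin(2 * np.pi * np.arange(48) / 12)  # yearly seasonality
    + rng.normal(0, 2, 48),                        # residual noise
    index=idx,
)

# period=12 for monthly data; robust=True downweights outliers.
result = STL(series, period=12, robust=True).fit()
trend, seasonal, resid = result.trend, result.seasonal, result.resid
print(resid.head())
```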
Applications:
- Retail: Understand sales trends
- Finance: Analyze stock patterns
- Weather: Seasonal forecasting
STL is ideal for preprocessing time series before applying models like ARIMA.