Data Mining Regression and Classification
Overview
In this assignment, you will revisit the pre-processed “Restaurant Orders” dataset from
Homework #1 and apply regression techniques to uncover relationships between different
variables (features) in the dataset. The assignment is divided into five main parts:
1. Single-Feature Linear Regression
2. Multiple-Feature Linear Regression
3. Polynomial Regression
4. Binary Classification
5. Multi-Class Classification
The dataset contains the following columns: TableNumber, WaiterID, OrderDateTime, ItemsOrdered, NumberOfGuests, BillAmount, PaymentMethod, DiscountUsed, WaitTime, Tip, and CustomerSatisfaction.
1. Single-Feature Linear Regression
Objective
• Select one dependent variable (output) and one independent variable (feature) from the
restaurant dataset.
• Train a simple linear regression model to predict the output from the single chosen feature.
Deliverables
• A short write-up explaining your training procedure, final parameters, final loss, and
observations.
Homework: Description
1. Single-Feature Linear Regression
Tasks/Steps
Data Selection:
Justify which single feature and which output you chose.
Implementation in PyTorch:
Initialize model parameters.
Forward pass and loss function (MSE).
Optimization algorithm (gradient descent).
Visualization:
Plot the data points (scatter plot).
Plot the best-fit line learned by your model on the same figure.
Analysis:
Summarize the training process.
Discuss any difficulties or anomalies you observed when fitting the line.
Interpret how well the linear model fits the data visually and numerically (final loss, etc.).
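The steps above can be sketched as follows in plain PyTorch. The data here is synthetic, standing in for one chosen feature and output (e.g., NumberOfGuests predicting BillAmount, purely as an assumed pairing); in your submission you would load the real columns from the CSV instead.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in data: one feature, one noisy linear output.
x = torch.rand(100, 1) * 10               # feature values in [0, 10)
y = 3.0 * x + 5.0 + torch.randn(100, 1)   # noisy linear target

# Initialize model parameters with gradients enabled.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.01
for epoch in range(500):
    y_pred = x * w + b                    # forward pass
    loss = ((y_pred - y) ** 2).mean()     # MSE loss
    loss.backward()                       # compute gradients
    with torch.no_grad():                 # manual gradient-descent step
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(f"w={w.item():.2f}, b={b.item():.2f}, final loss={loss.item():.4f}")
```

For the required scatter plot, you can pass `x`, `y`, and the learned line `x * w + b` (detached) to matplotlib on the same figure.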
2. Multiple-Feature Linear Regression
Objective
• Select 3 features and 1 output from the restaurant dataset (may be the same output
variable as in Part 1 or a different one).
• Train a simple linear regression model to predict the output from the chosen features.
Deliverables
• If it is an achievable task, the deliverables should be similar to those for the Single-Feature
Linear Regression part; follow the tasks on the next page.
• If you think it is not an achievable task, provide an analysis explaining why, and ignore
the tasks on the next page.
2. Multiple-Feature Linear Regression
Tasks/Steps
Data Selection:
Justify which three features and which output you chose, and provide a brief rationale.
Implementation in PyTorch:
Build a multi-feature linear model.
Train and optimize (MSE; gradient descent).
Results:
Show the final loss (training error; MSE).
If the model successfully converges, report your final set of learned weights and bias; if the
model fails to converge or you encounter difficulties, analyze and explain potential reasons.
Interpretation:
Discuss whether the multi-feature regression model appears to be a better fit than a single-
feature model.
Reflect on any new challenges that arose when using multiple features.
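A minimal sketch of the multi-feature case, again on synthetic stand-in data (the choice of three features is an assumption; substitute your real columns). It uses PyTorch's own `nn.Linear` and `SGD`, which are allowed since the forward pass and gradient updates still run through PyTorch; you could equally keep the fully manual update style from Part 1.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for 3 features predicting one output.
X = torch.rand(200, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ true_w + 0.3 + 0.05 * torch.randn(200, 1)

model = torch.nn.Linear(3, 1)             # multi-feature linear model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(1000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)           # forward pass + MSE
    loss.backward()                       # gradients via autograd
    opt.step()                            # gradient-descent update

print("weights:", model.weight.data, "bias:", model.bias.data)
print(f"final MSE: {loss.item():.4f}")
```

Note that if your real features live on very different scales (e.g., a wait time in minutes vs. a bill in dollars), standardizing them first usually makes gradient descent converge far more reliably.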
3. Polynomial Regression
Objective
• Using the same 3 features and 1 output from Part 2, implement polynomial regression of at
least three different polynomial degrees (e.g., degree=2, degree=4, degree=6).
• Train a polynomial regression model to predict the output from the chosen features.
Deliverables
• A summary table or short discussion comparing performance for each chosen polynomial
degree.
• Plots or numeric results illustrating how well each polynomial model fits.
• A reflection on potential risks of higher-degree polynomials (e.g., overfitting).
3. Polynomial Regression
Tasks/Steps
Feature Transformation:
Explain how you generated polynomial terms (e.g., by manually expanding each feature or
using a PyTorch mechanism for polynomial features).
Decide how you handle interactions (only single-feature powers vs. cross-terms).
Training & Model Comparison:
Train a polynomial regression model for each degree (≥ 3 degrees).
Compare the training losses across different degrees.
Analysis:
Discuss any overfitting or underfitting you observe.
Identify which polynomial degree produced the most favorable result based on loss or
other metrics.
Provide any insights into runtime or complexity differences.
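One way to sketch the feature transformation and the degree comparison, assuming the simpler choice of single-feature powers without cross-terms (handling interactions is left as your design decision). The data and target are synthetic stand-ins:

```python
import torch

def poly_expand(X, degree):
    """Expand each feature column to powers 1..degree (no cross-terms)."""
    return torch.cat([X ** d for d in range(1, degree + 1)], dim=1)

torch.manual_seed(0)
X = torch.rand(50, 3)                      # 3 synthetic features in [0, 1)
y = X[:, :1] ** 2 + 0.1 * torch.randn(50, 1)

for degree in (2, 4, 6):
    Xp = poly_expand(X, degree)            # shape: (50, 3 * degree)
    model = torch.nn.Linear(Xp.shape[1], 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(2000):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(Xp), y)
        loss.backward()
        opt.step()
    print(f"degree={degree}: final training MSE = {loss.item():.4f}")
```

With cross-terms the feature count grows much faster, which is one concrete source of the runtime and overfitting differences the analysis asks about.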
4. Binary Classification
Tasks/Steps
Choose a Binary Label:
Construct a binary classification label from the dataset, e.g., “Satisfied” vs. “Unsatisfied,”
or any other appropriate yes/no outcome such as “Satisfied” vs. “Not Satisfied
(Unsatisfied + Neutral).”
Implement Logistic Regression in PyTorch:
Loss function: Binary Cross-Entropy (BCE).
Data Splitting & Preprocessing:
Clearly split your data into training and testing (or validation) sets.
Model Training & Evaluation:
Train on the training set for a certain number of epochs or until convergence.
Report & Visualization
Summarize final training loss, test performance metrics, and any interesting findings.
(Optional) Provide a decision boundary plot if feasible (for a single or two-feature
scenario), or a confusion matrix heatmap to illustrate predictions vs. ground truth.
5. Multi-Class Classification
Tasks/Steps
Label Selection:
Identify a multi-class label from your dataset (≥ 3 classes); if your data does not inherently have three
or more distinct classes, you can derive one.
Strategy:
Implement one-vs-all (OvA) and one-vs-one (OvO) logistic regression.
OvA: Train a separate logistic regression classifier for each class vs. “all others.”
OvO: Train pairwise classifiers for each possible pair of classes.
Implementation Details:
In both cases, the core idea is to handle the multi-class scenario manually within PyTorch, rather
than relying on built-in high-level methods.
Training & Evaluation
Train each classifier on the training data.
On the test set, produce predictions by combining the outputs of your sub-classifiers (OvA and OvO
logic).
Analysis & Discussion
How does OvA compare to OvO (in terms of code complexity, training time, or performance)?
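The OvA strategy can be sketched as below on synthetic stand-in data with an assumed 3-class label (e.g., a binned CustomerSatisfaction): one binary logistic classifier per class, combined at prediction time by taking the most confident one. OvO is analogous, but trains one classifier per class pair and combines them by voting.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in: 300 points in 3 Gaussian clusters (3 classes).
centers = torch.tensor([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.5]])
y = torch.randint(0, 3, (300,))
X = torch.randn(300, 2) + centers[y]

# One-vs-all: one binary logistic classifier per class.
classifiers = []
for c in range(3):
    clf = torch.nn.Linear(2, 1)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1)
    target = (y == c).float().unsqueeze(1)    # class c vs. "all others"
    for _ in range(300):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            clf(X), target)
        loss.backward()
        opt.step()
    classifiers.append(clf)

# Combine sub-classifiers: pick the class with the highest logit.
with torch.no_grad():
    scores = torch.cat([clf(X) for clf in classifiers], dim=1)
    preds = scores.argmax(dim=1)
print("training accuracy:", (preds == y).float().mean().item())
```

For k classes, OvA needs k classifiers while OvO needs k(k-1)/2, which is one concrete axis for the code-complexity and training-time comparison the analysis asks for.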
Implementation Requirements
1. PyTorch Only
• You must implement the regression logic (forward pass, gradient updates, etc.) in
PyTorch.
• Do not use scikit-learn or other high-level ML libraries to handle the model training.
2. Data Handling
• You may use pandas or plain Python to load the dataset from CSV or other formats.
• Feel free to do any necessary feature engineering or transformations to handle missing
values, scaling, etc.
3. Plots & Visualization
• matplotlib or seaborn is recommended for plotting.
• Clearly label axes, legend, and titles for each figure.
4. Written Report
• Provide your observations, interpretations, and analysis for each part.
• Discuss any difficulties or additional experiments you performed.
Report & Grading
1. Organization (20%)
• Is your submission clearly structured? Are code, plots, and analysis sections logically
presented?
2. Correctness & Implementation (40%)
• Proper usage of PyTorch for linear and polynomial regression.
• Evidence of correct gradient-based training for each part.
3. Analysis & Interpretation (40%)
• Clarity in explaining results, including final losses, potential reasons for success or
failure.
• Depth of insight into overfitting, data distribution, or hyperparameter choices.
4. Extra Credit / Deep Thinking (up to +10%)
• If your report is well-organized, provides deeper insights or additional experiments
(e.g., trying different regularization, comparing different subsets of features,
exploring other polynomial expansions), you may receive extra points.