Viva
🎯 Project Aims and What We Discovered
The aim of this project was to analyse how various factors — such as academic behaviour,
lifestyle habits, and socio-demographic backgrounds — influence student academic
performance. I wanted to build a predictive and interpretable dashboard that could:
● Identify key factors that impact performance
● Provide early intervention tools for educators
● Highlight disparities linked to EDI (Equality, Diversity, and Inclusion)
● Present the data in a way that's useful for non-technical users
Key Discoveries:
● Study Hours and Attendance were strong predictors of academic performance
● Parental Support and Motivation also had a strong influence on outcomes
● First-generation and low-income students showed disparities, highlighting systemic
barriers
● Lifestyle habits like Sleep and Extracurriculars also played a part
● Visual, explainable tools made insights more actionable for educators
[1. Data Cleaning and Preprocessing – 5 minutes]
"To begin the project, I started with the data cleaning phase, which was a critical foundation
for everything that followed.
I worked with three different CSV datasets: StudentPerformance.csv,
BehaviouralFactors.csv, and EDI_Dummy_Data.csv. I loaded each dataset into
Google Colab using the Pandas library.
The first thing I did was check the structure and shape of each dataframe using df.info()
and df.head() to get an overview of the data types and general layout.
I then checked for missing values using df.isnull().sum(). To visualise missingness
across the datasets, I used Seaborn’s heatmap along with Missingno’s matrix() and
bar() visualisations.
This helped identify patterns and relationships in the missing data — for example, columns
with high null counts or students who skipped multiple questions.
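For reference, a minimal sketch of those checks looks like this, assuming the Pandas, Seaborn, Matplotlib, and Missingno libraries and the file name mentioned above (msno is the conventional alias for Missingno):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Load one of the raw CSVs and inspect its structure
students = pd.read_csv('StudentPerformance.csv')
students.info()
print(students.head())

# Count missing values per column
print(students.isnull().sum())

# Visualise missingness patterns
sns.heatmap(students.isnull(), cbar=False)  # null-value heatmap
plt.show()
msno.matrix(students)   # row-wise missingness matrix
msno.bar(students)      # non-null counts per column
plt.show()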
For numerical columns, I used mean imputation. For example, in the ‘StudyHoursPerWeek’
and ‘SleepHours’ columns, I filled missing values using
SimpleImputer(strategy='mean').
I applied the imputer column by column, and reassigned the filled values back to the
dataset.
For categorical variables like Gender, Language Proficiency, and Socioeconomic
Background, I used mode imputation via
SimpleImputer(strategy='most_frequent').
This replaced nulls with the most common value in each column to reduce noise and
preserve category distribution.
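A short sketch of both imputation steps, continuing from the dataframe above (the categorical column names such as 'LanguageProficiency' and 'SocioeconomicBackground' are my shorthand for the actual columns):

from sklearn.impute import SimpleImputer

# Mean imputation for numeric columns, applied column by column
mean_imputer = SimpleImputer(strategy='mean')
for col in ['StudyHoursPerWeek', 'SleepHours']:
    students[[col]] = mean_imputer.fit_transform(students[[col]])

# Mode imputation for categorical columns
mode_imputer = SimpleImputer(strategy='most_frequent')
for col in ['Gender', 'LanguageProficiency', 'SocioeconomicBackground']:
    students[[col]] = mode_imputer.fit_transform(students[[col]])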
Once imputation was complete, I used boxplots to identify outliers. For example, I used
sns.boxplot() to visualise the distribution of study hours and sleep duration.
Values outside the plausible range (like someone studying over 100 hours a week or reporting only 1 hour of sleep) were either capped or removed, depending on how extreme or likely the error was.
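The exact cut-off rule isn't specified above, so the sketch below uses a standard 1.5×IQR cap as one plausible implementation of the capping described:

# Visual check for outliers
sns.boxplot(x=students['StudyHoursPerWeek'])
plt.show()

# Cap values outside 1.5x the interquartile range (assumed rule)
q1, q3 = students['StudyHoursPerWeek'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
students['StudyHoursPerWeek'] = students['StudyHoursPerWeek'].clip(lower, upper)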
Then I moved on to encoding categorical data:
● For Yes/No columns like ‘Parental Support’, I used binary encoding: 'Yes' was
mapped to 1 and 'No' to 0 using .map().
● For ranked fields like Language Proficiency (Beginner, Intermediate, Advanced), I
created a manual dictionary to apply ordinal encoding.
● For non-ranked fields like Ethnicity or Parental Education, I used one-hot encoding
with pd.get_dummies() to create separate binary columns.
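A compact sketch of the three encoding approaches just listed (the column names are assumptions based on the description):

# Binary encoding for Yes/No columns
students['ParentalSupport'] = students['ParentalSupport'].map({'Yes': 1, 'No': 0})

# Ordinal encoding for ranked fields via a manual dictionary
proficiency_order = {'Beginner': 0, 'Intermediate': 1, 'Advanced': 2}
students['LanguageProficiency'] = students['LanguageProficiency'].map(proficiency_order)

# One-hot encoding for non-ranked fields
students = pd.get_dummies(students, columns=['Ethnicity', 'ParentalEducation'])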
After encoding, I scaled the numeric features using StandardScaler from Scikit-learn to
ensure that variables like Income, Study Hours, and Sleep Hours all operated on a
comparable scale.
Finally, I merged the datasets using StudentID as a key with pd.merge(), and saved the
clean combined dataset using .to_csv() so I could reuse it later without repeating the
process."
📊 [2. Graph Creation and EDA Visualisations – 4 minutes]
"Once I cleaned and merged the data, I performed a deep exploratory data analysis to
understand relationships between variables and guide my modelling decisions.
I used Seaborn and Matplotlib to create all the graphs in my dashboard:
✅ Correlation Heatmap:
I used sns.heatmap(df.corr(), annot=True, cmap='coolwarm') to visualise
how strongly each variable was associated with the performance index.
This revealed that Study Hours, Sleep Hours, and Attendance were most positively
correlated with academic success.
✅ Boxplots:
To compare numerical data against binary outcomes, I used boxplots — like this one:
sns.boxplot(x='Pass/Fail', y='StudyHours', data=df).
This clearly showed that students who passed were consistently studying more hours on
average.
✅ Histograms:
I used df['StudyHours'].hist(bins=20) to understand the distribution of study
habits.
The plot showed a slightly right-skewed curve, meaning a few students were studying
significantly more than the rest — potential outliers I had to look into.
✅ Bar Charts:
I used bar plots to show relationships between categorical variables and performance.
For example, using sns.barplot(x='ParentalSupport',
y='PerformanceIndex', data=df) revealed that students with parental support
consistently scored higher.
✅ Count Plots:
To explore demographic patterns, I used count plots like:
sns.countplot(x='Gender', hue='Pass/Fail', data=df) and repeated this for
Ethnicity, Disability, and First-Generation status.
These plots made it easy to visualise underrepresentation and performance gaps tied to
identity factors.
Every graph was polished with custom colour palettes, axis labels, and titles using
plt.title() and plt.xlabel() to ensure readability.
I saved each plot using plt.savefig() and used them later in the dashboard’s
Performance, EDI, and Student Factors sections."
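Each dashboard figure followed the same plot, label, and save pattern; a representative sketch (the figure size, palette, and file name are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('viridis')  # custom colour palette

# df is the cleaned, merged dataset
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x='ParentalSupport', y='PerformanceIndex', data=df, ax=ax)
ax.set_title('Performance Index by Parental Support')
ax.set_xlabel('Parental Support')
ax.set_ylabel('Performance Index')
fig.savefig('parental_support_performance.png', dpi=150, bbox_inches='tight')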
🤖 [3. Model Training and Evaluation – 4 minutes]
"To develop a predictive model for student performance, I employed several machine
learning techniques using the Scikit-learn library in Python. My objective was to identify a
model that not only provided high accuracy but also offered interpretability for stakeholders.
Data Splitting: I began by partitioning the cleaned dataset into training and testing subsets
using an 80/20 split via the train_test_split() function. This approach ensured that
the model was trained on a substantial portion of the data while retaining a separate set for
unbiased evaluation.
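A sketch of the split, where 'PerformanceIndex' is assumed to be the target column and the random seed is illustrative:

from sklearn.model_selection import train_test_split

X = merged.drop(columns=['StudentID', 'PerformanceIndex'])  # features
y = merged['PerformanceIndex']                              # target

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)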
Model Selection: I experimented with multiple algorithms:
● Linear Regression: Offered simplicity and ease of interpretation but failed to capture
complex, non-linear relationships in the data.
● Decision Tree Regressor: Handled non-linear patterns effectively but was prone to
overfitting, especially with deep trees.
● Random Forest Regressor: An ensemble method that mitigated overfitting by
averaging multiple decision trees, providing robust performance and better
generalization.
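The comparison can be sketched as fitting all three candidates on the same training split (the hyperparameters shown are defaults and assumptions, not the exact values used):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)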
Evaluation Metrics: To assess model performance, I utilized the following metrics:
● R² Score (Coefficient of Determination): Measured the proportion of variance in
the dependent variable predictable from the independent variables.
● Mean Absolute Error (MAE): Calculated the average magnitude of errors in
predictions, providing interpretability in the same units as the target variable.
● Root Mean Squared Error (RMSE): Penalized larger errors more than MAE,
offering insight into the model's prediction accuracy.
The Random Forest model outperformed the others, achieving an R² score exceeding 0.85
and the lowest RMSE, indicating its superior predictive capability.
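Continuing the sketch above, the three metrics can be computed on the held-out test set like this:

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

for name, model in models.items():
    preds = model.predict(X_test)
    r2 = r2_score(y_test, preds)
    mae = mean_absolute_error(y_test, preds)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f'{name}: R2={r2:.3f}, MAE={mae:.2f}, RMSE={rmse:.2f}')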
Model Interpretation: Understanding the model's decision-making process was crucial. I
employed:
● Feature Importance Analysis: Identified key predictors such as study hours,
attendance, and sleep duration.
● SHAP (SHapley Additive exPlanations) Values: Provided a unified measure of
feature importance and allowed for the visualization of each feature's impact on
individual predictions, enhancing transparency.
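Both interpretation steps can be sketched with the fitted Random Forest; the SHAP calls follow the library's standard tree-explainer usage:

import pandas as pd
import shap

best_model = models['Random Forest']

# Impurity-based feature importances from the Random Forest
importances = pd.Series(best_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# SHAP values for per-prediction explanations
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)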
Model Deployment: For integration into the Streamlit dashboard, I serialized the trained
Random Forest model using joblib.dump(). This enabled efficient loading and real-time
predictions within the application without the need for retraining, ensuring a seamless user
experience."
🧠 [4. Reflections and Lessons Learned]
"This project helped me grow in both technical and practical ways.
Firstly, I realised that simple habits — like sleep, motivation, or attendance — can be more
predictive than previous grades. The data showed that students who consistently studied
and slept well were far more likely to succeed.
Secondly, I learned the importance of transparency. Educators don’t just want predictions
— they want to understand why.
Using SHAP helped me build that bridge between machine learning and human
understanding.
Third, the EDI findings really stood out. Even with similar behaviours, students from
underrepresented backgrounds — especially first-gen and lower-income — often had slightly
lower predicted scores.
This opened my eyes to how external factors can influence outcomes in ways that data
alone doesn’t always explain.
On a technical level, I became confident managing a full pipeline — from loading raw data,
to cleaning, to building models, and deploying an interface.
And personally, I learned how to explain technical results clearly, in ways that make sense
to people who aren’t from a data background.
I genuinely feel this dashboard could be used in real institutions — not just to flag at-risk
students, but also to support them with context, transparency, and care."