Viva
🎯 Project Aims and What We Discovered
The aim of this project was to analyse how various factors — such as academic behaviour,
lifestyle habits, and socio-demographic backgrounds — influence student academic
performance. I wanted to build a predictive and interpretable dashboard that could:
● Identify key factors that impact performance
● Provide early intervention tools for educators
● Highlight disparities linked to EDI (Equality, Diversity, and Inclusion)
● Present the data in a way that's useful for non-technical users
Key Discoveries:
● Study Hours and Attendance were strong predictors of academic performance
● Parental Support and Motivation also had a strong influence on outcomes
● First-generation and low-income students showed disparities, highlighting systemic
barriers
● Lifestyle habits like Sleep and Extracurriculars also played a part
● Visual, explainable tools made insights more actionable for educators
[1. Data Cleaning and Preprocessing – 5 minutes]
"To begin the project, I started with the data cleaning phase, which was a critical foundation
for everything that followed.
I worked with three different CSV datasets: StudentPerformance.csv,
BehaviouralFactors.csv, and EDI_Dummy_Data.csv. I loaded each dataset into
Google Colab using the Pandas library.
The first thing I did was check the structure and shape of each dataframe using df.info()
and df.head() to get an overview of the data types and general layout.
I then checked for missing values using df.isnull().sum(). To visualise missingness
across the datasets, I used Seaborn’s heatmap along with Missingno’s matrix() and
bar() visualisations.
This helped identify patterns and relationships in the missing data — for example, columns
with high null counts or students who skipped multiple questions.
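For reference, a minimal sketch of those checks looks like this, assuming the Pandas, Seaborn, Matplotlib, and Missingno libraries and the file name mentioned above (msno is the conventional alias for Missingno):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Load one of the raw CSVs and inspect its structure
students = pd.read_csv('StudentPerformance.csv')
students.info()
print(students.head())

# Count missing values per column
print(students.isnull().sum())

# Visualise missingness patterns
sns.heatmap(students.isnull(), cbar=False)  # null-value heatmap
plt.show()
msno.matrix(students)   # row-wise missingness matrix
msno.bar(students)      # non-null counts per column
plt.show()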
For numerical columns, I used mean imputation. For example, in the ‘StudyHoursPerWeek’
and ‘SleepHours’ columns, I filled missing values using
SimpleImputer(strategy='mean').
I applied the imputer column by column, and reassigned the filled values back to the
dataset.
For categorical variables like Gender, Language Proficiency, and Socioeconomic
Background, I used mode imputation via
SimpleImputer(strategy='most_frequent').
This replaced nulls with the most common value in each column to reduce noise and
preserve category distribution.
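A short sketch of both imputation steps, continuing from the dataframe above (the categorical column names such as 'LanguageProficiency' and 'SocioeconomicBackground' are my shorthand for the actual columns):

from sklearn.impute import SimpleImputer

# Mean imputation for numeric columns, applied column by column
mean_imputer = SimpleImputer(strategy='mean')
for col in ['StudyHoursPerWeek', 'SleepHours']:
    students[[col]] = mean_imputer.fit_transform(students[[col]])

# Mode imputation for categorical columns
mode_imputer = SimpleImputer(strategy='most_frequent')
for col in ['Gender', 'LanguageProficiency', 'SocioeconomicBackground']:
    students[[col]] = mode_imputer.fit_transform(students[[col]])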
Once imputation was complete, I used boxplots to identify outliers. For example, I used
sns.boxplot() to visualise the distribution of study hours and sleep duration.
Values outside the plausible range (like someone studying over 100 hours a week or reporting only 1 hour of sleep) were either capped or removed, depending on how extreme or likely the error was.
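The exact cut-off rule isn't specified above, so the sketch below uses a standard 1.5×IQR cap as one plausible implementation of the capping described:

# Visual check for outliers
sns.boxplot(x=students['StudyHoursPerWeek'])
plt.show()

# Cap values outside 1.5x the interquartile range (assumed rule)
q1, q3 = students['StudyHoursPerWeek'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
students['StudyHoursPerWeek'] = students['StudyHoursPerWeek'].clip(lower, upper)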
Then I moved on to encoding categorical data:
● For Yes/No columns like ‘Parental Support’, I used binary encoding: 'Yes' was
mapped to 1 and 'No' to 0 using .map().
● For ranked fields like Language Proficiency (Beginner, Intermediate, Advanced), I
created a manual dictionary to apply ordinal encoding.
● For non-ranked fields like Ethnicity or Parental Education, I used one-hot encoding
with pd.get_dummies() to create separate binary columns.
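A compact sketch of the three encoding approaches just listed (the column names are assumptions based on the description):

# Binary encoding for Yes/No columns
students['ParentalSupport'] = students['ParentalSupport'].map({'Yes': 1, 'No': 0})

# Ordinal encoding for ranked fields via a manual dictionary
proficiency_order = {'Beginner': 0, 'Intermediate': 1, 'Advanced': 2}
students['LanguageProficiency'] = students['LanguageProficiency'].map(proficiency_order)

# One-hot encoding for non-ranked fields
students = pd.get_dummies(students, columns=['Ethnicity', 'ParentalEducation'])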
After encoding, I scaled the numeric features using StandardScaler from Scikit-learn to
ensure that variables like Income, Study Hours, and Sleep Hours all operated on a
comparable scale.
Finally, I merged the datasets using StudentID as a key with pd.merge(), and saved the
clean combined dataset using .to_csv() so I could reuse it later without repeating the
process."
📊 [2. Graph Creation and EDA Visualisations – 4 minutes]
"Once I cleaned and merged the data, I performed a deep exploratory data analysis to
understand relationships between variables and guide my modelling decisions.
I used Seaborn and Matplotlib to create all the graphs in my dashboard:
✅ Correlation Heatmap:
I used sns.heatmap(df.corr(), annot=True, cmap='coolwarm') to visualise
how strongly each variable was associated with the performance index.
This revealed that Study Hours, Sleep Hours, and Attendance were most positively
correlated with academic success.
✅ Boxplots:
To compare numerical data against binary outcomes, I used boxplots — like this one:
sns.boxplot(x='Pass/Fail', y='StudyHours', data=df).
This clearly showed that students who passed were consistently studying more hours on
average.
✅ Histograms:
I used df['StudyHours'].hist(bins=20) to understand the distribution of study
habits.
The plot showed a slightly right-skewed curve, meaning a few students were studying
significantly more than the rest — potential outliers I had to look into.
✅ Bar Charts:
I used bar plots to show relationships between categorical variables and performance.
For example, using sns.barplot(x='ParentalSupport',
y='PerformanceIndex', data=df) revealed that students with parental support
consistently scored higher.
✅ Count Plots:
To explore demographic patterns, I used count plots like:
sns.countplot(x='Gender', hue='Pass/Fail', data=df) and repeated this for
Ethnicity, Disability, and First-Generation status.
These plots made it easy to visualise underrepresentation and performance gaps tied to
identity factors.
Every graph was polished with custom colour palettes, axis labels, and titles using
plt.title() and plt.xlabel() to ensure readability.
I saved each plot using plt.savefig() and used them later in the dashboard’s
Performance, EDI, and Student Factors sections."
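Each dashboard figure followed the same plot, label, and save pattern; a representative sketch (the figure size, palette, and file name are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('viridis')  # custom colour palette

# df is the cleaned, merged dataset
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x='ParentalSupport', y='PerformanceIndex', data=df, ax=ax)
ax.set_title('Performance Index by Parental Support')
ax.set_xlabel('Parental Support')
ax.set_ylabel('Performance Index')
fig.savefig('parental_support_performance.png', dpi=150, bbox_inches='tight')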
🤖 [3. Model Training and Evaluation – 4 minutes]
"To develop a predictive model for student performance, I employed several machine
learning techniques using the Scikit-learn library in Python. My objective was to identify a
model that not only provided high accuracy but also offered interpretability for stakeholders.
Data Splitting: I began by partitioning the cleaned dataset into training and testing subsets
using an 80/20 split via the train_test_split() function. This approach ensured that
the model was trained on a substantial portion of the data while retaining a separate set for
unbiased evaluation.
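A sketch of the split, where 'PerformanceIndex' is assumed to be the target column and the random seed is illustrative:

from sklearn.model_selection import train_test_split

X = merged.drop(columns=['StudentID', 'PerformanceIndex'])  # features
y = merged['PerformanceIndex']                              # target

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)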
Model Selection: I experimented with multiple algorithms:
● Linear Regression: Offered simplicity and ease of interpretation but failed to capture
complex, non-linear relationships in the data.
● Decision Tree Regressor: Handled non-linear patterns effectively but was prone to
overfitting, especially with deep trees.
● Random Forest Regressor: An ensemble method that mitigated overfitting by
averaging multiple decision trees, providing robust performance and better
generalization.
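The comparison can be sketched as fitting all three candidates on the same training split (the hyperparameters shown are defaults and assumptions, not the exact values used):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)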
Evaluation Metrics: To assess model performance, I utilized the following metrics:
● R² Score (Coefficient of Determination): Measured the proportion of variance in
the dependent variable predictable from the independent variables.
● Mean Absolute Error (MAE): Calculated the average magnitude of errors in
predictions, providing interpretability in the same units as the target variable.
● Root Mean Squared Error (RMSE): Penalized larger errors more than MAE,
offering insight into the model's prediction accuracy.
The Random Forest model outperformed the others, achieving an R² score exceeding 0.85
and the lowest RMSE, indicating its superior predictive capability.
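Continuing the sketch above, the three metrics can be computed on the held-out test set like this:

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

for name, model in models.items():
    preds = model.predict(X_test)
    r2 = r2_score(y_test, preds)
    mae = mean_absolute_error(y_test, preds)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f'{name}: R2={r2:.3f}, MAE={mae:.2f}, RMSE={rmse:.2f}')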
Model Interpretation: Understanding the model's decision-making process was crucial. I
employed:
● Feature Importance Analysis: Identified key predictors such as study hours,
attendance, and sleep duration.
● SHAP (SHapley Additive exPlanations) Values: Provided a unified measure of
feature importance and allowed for the visualization of each feature's impact on
individual predictions, enhancing transparency.
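Both interpretation steps can be sketched with the fitted Random Forest; the SHAP calls follow the library's standard tree-explainer usage:

import pandas as pd
import shap

best_model = models['Random Forest']

# Impurity-based feature importances from the Random Forest
importances = pd.Series(best_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# SHAP values for per-prediction explanations
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)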
Model Deployment: For integration into the Streamlit dashboard, I serialized the trained
Random Forest model using joblib.dump(). This enabled efficient loading and real-time
predictions within the application without the need for retraining, ensuring a seamless user
experience."
🧠 [4. Reflections and Lessons Learned]
"This project helped me grow in both technical and practical ways.
Firstly, I realised that simple habits — like sleep, motivation, or attendance — can be more
predictive than previous grades. The data showed that students who consistently studied
and slept well were far more likely to succeed.
Secondly, I learned the importance of transparency. Educators don’t just want predictions
— they want to understand why.
Using SHAP helped me build that bridge between machine learning and human
understanding.
Third, the EDI findings really stood out. Even with similar behaviours, students from
underrepresented backgrounds — especially first-gen and lower-income — often had slightly
lower predicted scores.
This opened my eyes to how external factors can influence outcomes in ways that data
alone doesn’t always explain.
On a technical level, I became confident managing a full pipeline — from loading raw data,
to cleaning, to building models, and deploying an interface.
And personally, I learned how to explain technical results clearly, in ways that make sense
to people who aren’t from a data background.
I genuinely feel this dashboard could be used in real institutions — not just to flag at-risk
students, but also to support them with context, transparency, and care."