1. Import Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
pandas: loads and manipulates the tabular data.
sklearn: provides preprocessing, model training, hyperparameter tuning, and evaluation metrics.
matplotlib: plots the feature-importance chart.
2. Upload CSV File
from google.colab import files
uploaded = files.upload()
Opens a file-upload widget in Colab so student_profiles.csv can be uploaded from your machine. The call returns a dict mapping each uploaded filename to its raw bytes.
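Because of that, the file can also be read straight from memory instead of from the working directory (a minimal alternative sketch; step 3 below uses the filename instead):
import io
# 'uploaded' maps filenames to bytes, so wrap the bytes for read_csv
df = pd.read_csv(io.BytesIO(uploaded['student_profiles.csv']))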
3. Load and Preprocess Data
df = pd.read_csv('student_profiles.csv')
df.ffill(inplace=True)
Loads the dataset into a DataFrame.
ffill(): forward-fills each missing value with the most recent non-missing value above it in the same column, as illustrated below.
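A quick illustration of the forward fill on made-up values (a standalone sketch; the column name is just for show):
demo = pd.DataFrame({'Math Grade': ['80-89', None, '70-79', None]})
print(demo.ffill())
# Rows 1 and 3 take the value of the row above them ('80-89', '70-79');
# a gap in the very first row would remain NaN, since nothing precedes it.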
4. Convert Grades to Numerical Values
grade_mapping = {
    "Below 60": 55,
    "60-69": 64.5,
    "70-79": 74.5,
    "80-89": 84.5,
    "90-100": 95
}
for col in ['Math Grade', 'Science Grade', 'English Grade']:
    df[col] = df[col].map(grade_mapping)
Converts the categorical grade ranges into numeric scores (roughly each range's midpoint) using a custom dictionary; a sanity check for unmatched labels is sketched below.
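One caveat: Series.map returns NaN for any value absent from the dictionary, so a typo such as "80 - 89" in the data would silently become NaN. A minimal check:
for col in ['Math Grade', 'Science Grade', 'English Grade']:
    unmapped = df[col].isna().sum()  # values map() could not match
    if unmapped:
        print(f"{col}: {unmapped} values did not match grade_mapping")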
5. Encode Categorical Columns
label_encoders = {}
categorical_cols = ['Career Interest', 'Learning Style', 'Work Preference', 'Intended Program']
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
Converts categorical strings to integer codes (label encoding) for model compatibility. Each fitted encoder is stored in label_encoders so the codes can later be mapped back to the original labels, as sketched below.
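This is what keeping the encoders is for: inverse_transform recovers the original strings from the integer codes, and works the same way on model predictions later (a short sketch):
le = label_encoders['Intended Program']
print(le.inverse_transform(df['Intended Program'][:5]))  # back to program names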
6. Split Data into Features and Target
X = df.drop('Intended Program', axis=1)
y = df['Intended Program']
X: Features (independent variables)
y: Target (dependent variable – the program students intend to take)
7. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Splits the data into 80% training and 20% testing subsets.
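If the programs are unevenly represented, a stratified split (a variant of the call above, not part of the original code) keeps the class proportions similar in both subsets:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)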
8. Random Forest with Grid Search
rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
Initializes a Random Forest.
Defines a grid of hyperparameters.
Runs GridSearchCV to find the best combination using 3-fold cross-validation: 2 × 3 × 2 × 2 = 24 parameter combinations, so 72 fits in total.
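After fitting, the winning configuration and its mean cross-validated accuracy are exposed as attributes:
print("Best parameters:", grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")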
9. Evaluate Best Model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Extracts the best model.
Makes predictions on test data.
Prints the accuracy.
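For a per-class breakdown alongside the overall accuracy, classification_report pairs well with the class names stored by the encoder in step 5 (a sketch, not part of the original code):
from sklearn.metrics import classification_report
names = label_encoders['Intended Program'].classes_
print(classification_report(y_test, y_pred,
                            labels=list(range(len(names))),
                            target_names=names))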
10. Feature Importance Visualization
importances = best_rf.feature_importances_
features = X.columns
importance_df = pd.DataFrame(
    {'Feature': features, 'Importance': importances}
).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Extracts feature importance from the model.
Visualizes which features had the most influence on predictions.
11. Additional Metrics
print(f"Tuned Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print(f"Tuned Precision: {precision_score(y_test, y_pred,
average='weighted') * 100:.2f}%")
print(f"Tuned Recall: {recall_score(y_test, y_pred,
average='weighted') * 100:.2f}%")
print(f"Tuned F1 Score: {f1_score(y_test, y_pred, average='weighted')
* 100:.2f}%")
Evaluates the model further using precision, recall, and F1-score. average='weighted' averages the per-class scores weighted by each class's support, which suits this multi-class problem; a macro-averaged comparison is sketched below.
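For comparison (an optional sketch, not in the original pipeline), macro averaging weights every class equally, so it drops noticeably when rarely chosen programs are predicted poorly:
print(f"Macro F1 Score: {f1_score(y_test, y_pred, average='macro') * 100:.2f}%")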