
Code Explanation

This document outlines a machine learning workflow using a Random Forest Classifier to predict students' intended programs based on their profiles. It includes steps for data import, preprocessing, encoding categorical variables, model training with hyperparameter tuning, evaluation of model accuracy, and visualization of feature importance. Additionally, it provides metrics such as precision, recall, and F1 score for a comprehensive model assessment.

1. Import Required Libraries


import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

pandas: For loading and handling the tabular data.

sklearn: For model training, preprocessing, evaluation, and tuning.

matplotlib: For visualizing feature importance.

2. Upload CSV File


from google.colab import files
uploaded = files.upload()

Opens a file-upload widget in Colab so that student_profiles.csv can be uploaded into the notebook's working directory.
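
files.upload() returns a dict mapping each uploaded filename to its raw bytes; a quick optional sanity check (not part of the original script) confirms the expected file arrived:

print(list(uploaded.keys()))  # should contain 'student_profiles.csv'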

3. Load and Preprocess Data


df = pd.read_csv('student_profiles.csv')
df.ffill(inplace=True)

Loads the dataset into a DataFrame.

ffill(): forward-fills missing values, replacing each NaN with the most recent non-missing value above it in the same column.
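
A tiny toy example (not from the original script) shows the behavior:

import pandas as pd

toy = pd.DataFrame({'score': [80, None, 90]})
print(toy.ffill())
# score: 80.0, 80.0, 90.0 - the NaN takes the value of the row above it
# note: a NaN in the very first row would stay NaN, since there is nothing above to copy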

4. Convert Grades to Numerical Values


grade_mapping = {
    "Below 60": 55,
    "60-69": 64.5,
    "70-79": 74.5,
    "80-89": 84.5,
    "90-100": 95
}
for col in ['Math Grade', 'Science Grade', 'English Grade']:
    df[col] = df[col].map(grade_mapping)

Converts the categorical grade ranges into representative numeric scores (roughly the midpoint of each range) using a custom dictionary.
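
One caveat worth checking: Series.map() returns NaN for any value missing from the dictionary, so a typo in the CSV (e.g. '60 - 69' with extra spaces) would silently become a missing value. A small optional sanity check, assuming the column names above:

for col in ['Math Grade', 'Science Grade', 'English Grade']:
    n_unmapped = df[col].isna().sum()
    if n_unmapped:
        print(f"Warning: {n_unmapped} unmapped grade values in {col}")
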
5. Encode Categorical Columns


label_encoders = {}
categorical_cols = ['Career Interest', 'Learning Style', 'Work Preference', 'Intended Program']
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

Converts the categorical strings into numeric codes (label encoding) so the model can work with them; each fitted encoder is kept in label_encoders so the codes can be decoded later.
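
Storing the fitted encoders pays off later: each one can translate encoded integers back into the original strings. A minimal sketch:

le_program = label_encoders['Intended Program']
print(le_program.classes_)                   # original program names; index = encoded value
print(le_program.inverse_transform([0, 1]))  # decode example codes (assumes at least two classes)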

6. Split Data into Features and Target


X = df.drop('Intended Program', axis=1)
y = df['Intended Program']

X: Features (independent variables)

y: Target (dependent variable – the program students intend to take)
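
Before splitting, it can help to inspect the class distribution, since heavily imbalanced programs affect both the split and the weighted metrics computed later; a quick optional check:

print(y.value_counts())  # number of students per (encoded) program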

7. Train-Test Split


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Splits the data into 80% training and 20% testing subsets.
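
For a multi-class target it is often worth stratifying the split so that each program appears in both subsets in roughly its original proportion. This is an optional variant, not part of the original code:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y)  # preserves class proportions; requires at least 2 samples per class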

8. Random Forest with Grid Search


rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

Initializes a Random Forest.

Defines a grid of hyperparameters.

Runs GridSearchCV to find the best combination using 3-fold cross-validation.
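
After fitting, GridSearchCV also exposes the winning configuration and its mean cross-validated score, which are worth printing as a sanity check:

print(grid_search.best_params_)  # best hyperparameter combination found
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")  # mean score over the 3 folds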

9. Evaluate Best Model


best_rf = grid_search.best_estimator_

y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Extracts the best model.

Makes predictions on test data.

Prints the accuracy.
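
Because 'Intended Program' was label-encoded in step 5, y_pred contains integers; the stored encoder can decode them back into program names (a small addition building on the label_encoders dict):

decoded = label_encoders['Intended Program'].inverse_transform(y_pred)
print(decoded[:5])  # first few predicted program names, as readable strings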

10. Feature Importance Visualization


importances = best_rf.feature_importances_
features = X.columns
importance_df = pd.DataFrame(
    {'Feature': features, 'Importance': importances}
).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Extracts feature importance from the model.

Visualizes which features had the most influence on predictions.
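
Impurity-based importances can overstate features with many distinct values. Permutation importance on the test set is a common cross-check; a sketch using scikit-learn's inspection module (an addition, not in the original code):

from sklearn.inspection import permutation_importance

result = permutation_importance(best_rf, X_test, y_test,
                                n_repeats=10, random_state=42)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.4f}")  # drop in accuracy when this feature is shuffled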

11. Additional Metrics


print(f"Tuned Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print(f"Tuned Precision: {precision_score(y_test, y_pred, average='weighted') * 100:.2f}%")
print(f"Tuned Recall: {recall_score(y_test, y_pred, average='weighted') * 100:.2f}%")
print(f"Tuned F1 Score: {f1_score(y_test, y_pred, average='weighted') * 100:.2f}%")

Evaluates the model further using precision, recall, and F1 score with weighted averaging, which weights each class by its number of test samples in this multi-class setting.
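
Weighted averages can hide weak classes. classification_report prints per-class precision, recall, and F1, with the stored encoder supplying readable program names; a sketch building on the objects above:

from sklearn.metrics import classification_report

target_names = label_encoders['Intended Program'].classes_
# passing labels=... keeps the report aligned with target_names even if a class is absent from y_test
print(classification_report(y_test, y_pred,
                            labels=list(range(len(target_names))),
                            target_names=target_names))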
