PROJECT REPORT
DR. AKHILESH DAS GUPTA INSTITUTE OF PROFESSIONAL STUDIES
MOVIE GENRE CLASSIFICATION
SUBMITTED BY:
Lakshay Kumar
CSE [0541602723]
Khushi
CSE [13115602723]
Abhinav Sharma
CSE [12815602723]
Abstract
The objective of this project was to develop a machine learning model capable of predicting the
genre of a movie based on its metadata and description. Using a dataset of 50,000 movies with 17
features—including title, year, director, cast, production details, and a brief description—the
project aimed to automate genre classification, which is valuable for recommendation systems
and content organization.
The methodology involved extensive data preprocessing, including handling missing values,
encoding categorical variables with one-hot encoding, and extracting text features from movie
descriptions using TF-IDF vectorization. Numerical features were scaled appropriately. Three
classification algorithms were implemented and evaluated: Logistic Regression, Decision Tree, and
Random Forest. The dataset was split into training and testing sets to assess model performance.
Key findings revealed that both Logistic Regression and Decision Tree models achieved low
accuracy (around 14%), indicating the challenge of the task and possible feature limitations. The
Random Forest model, however, showed perfect accuracy (100%), which suggests overfitting or
data leakage rather than genuine predictive power.
In conclusion, while the project successfully demonstrated the end-to-end process of genre
classification using machine learning, it also underscored the need for further improvements.
Addressing class imbalance, refining feature engineering, and preventing overfitting are essential
next steps to enhance model reliability and real-world applicability.
Introduction
Background of the Topic:
With the exponential growth of digital media, organizing and recommending movies efficiently has become
a significant challenge for streaming platforms and content providers. Movie genre classification,
traditionally performed manually, is essential for improving user experience, enabling effective search, and
powering recommendation engines. Leveraging machine learning to automate this process can greatly
enhance scalability and accuracy.
Objectives of the Project:
The primary objective of this project is to develop a machine learning model capable of accurately
predicting the genre of a movie based on its metadata and description. This involves preprocessing a large
dataset, engineering relevant features, and evaluating multiple classification algorithms to identify the most
effective approach.
Scope and Limitations:
The project focuses on supervised classification using a dataset of 50,000 movies with 17 features, including
both numerical and categorical data. The scope includes data cleaning, feature engineering, model training,
and evaluation. Limitations include potential class imbalance, limited feature diversity (e.g., reliance on
available metadata and descriptions), and the challenge of capturing nuanced genre distinctions.
Relevance/Importance of the Project:
Automating genre classification is highly relevant for the entertainment industry, especially for streaming
services and digital libraries. Accurate genre prediction improves content discovery and personalised
recommendations, and it streamlines content management. This project demonstrates the practical
application of machine learning in addressing real-world challenges in media organization and user
engagement.
DATASET
50,000 movies with 17 features
FEATURES:
Title – Name of the movie
Year – Release year
Director – Name of the director
Duration – Duration of the movie (in minutes)
Rating – IMDb or similar rating (float)
Votes – Number of user votes
Description – Short summary or plot description
Language – Primary language of the movie
Country – Country of origin
Budget_USD – Budget in US dollars
BoxOffice_USD – Box office revenue in US dollars
Genre – Target variable (movie genre)
Production_Company – Name of the production company
Content_Rating – Content rating (e.g., PG, R)
Lead_Actor – Name of the lead actor
Num_Awards – Number of awards won
Critic_Reviews – Number of critic reviews
Key Metrics:
Accuracy
Measures the proportion of correct predictions out of all predictions
made.
Example: Logistic Regression Test Accuracy = 14.8%
Precision
Indicates the proportion of positive identifications that were actually
correct (for each genre, averaged as “weighted”).
Example: Logistic Regression Test Precision = 0.046
Recall
Measures the proportion of actual positives that were correctly identified.
Example: Logistic Regression Test Recall = 0.148
F1 Score
Harmonic mean of precision and recall, providing a balance between the
two.
Example: Logistic Regression Test F1 Score = 0.057
Confusion Matrix
A table showing the number of correct and incorrect predictions for each
genre, helping to visualize model performance and errors.
These metrics were calculated for each model (Logistic Regression, Decision Tree, Random
Forest) on both training and test datasets to assess their effectiveness and identify issues
like overfitting or class imbalance.
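As a minimal sketch of how these metrics can be computed, the example below uses scikit-learn on a small made-up label set. The genre names and predictions are illustrative assumptions, not the project's data, and the "weighted" averaging mirrors the description above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Illustrative labels only; not taken from the project's dataset.
y_true = ["Action", "Comedy", "Drama", "Drama", "Comedy", "Action"]
y_pred = ["Action", "Drama", "Drama", "Drama", "Comedy", "Comedy"]

accuracy = accuracy_score(y_true, y_pred)

# "weighted" averages the per-genre scores using each genre's support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

# Rows are true genres, columns are predicted genres.
labels = ["Action", "Comedy", "Drama"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(cm)
```

With weighted averaging, recall always equals accuracy in a single-label multi-class task, which is why the two values match in the results reported for each model.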
DATA CLEANING & PREPROCESSING
The dataset contains the columns described in the DATASET section above.
LIBRARIES:
Checking for null values & duplicate rows
Null values are counted with isnull() and duplicate rows with duplicated(). Chaining sum() onto each call
gives the number of null values per column and the total number of duplicate rows.
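The check described above can be sketched as follows, assuming the data is held in a pandas DataFrame. The tiny frame and the cleanup step at the end are illustrative assumptions, not necessarily what the project did.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the real dataset has 50,000 rows and 17 columns.
df = pd.DataFrame(
    {
        "Title": ["A", "B", "B", "C"],
        "Duration": [120.0, 95.0, 95.0, np.nan],
        "Genre": ["Drama", "Comedy", "Comedy", "Action"],
    }
)

null_counts = df.isnull().sum()          # missing values per column
duplicate_count = df.duplicated().sum()  # rows that repeat an earlier row

print(null_counts)
print("duplicate rows:", duplicate_count)

# One possible cleanup (an assumption, not necessarily the project's choice):
# drop repeated rows, then fill missing durations with the median.
clean = df.drop_duplicates().copy()
clean["Duration"] = clean["Duration"].fillna(clean["Duration"].median())
```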
Data Visualization
➔ Histograms
◆ Purpose: To visualize the distribution of key numerical features.
◆ Features Visualized:
● Duration
● Votes
● Budget_USD
◆ Insight: Shows the spread, skewness, and outliers in these features.
➔ Boxplots
◆ Purpose: To identify the presence of outliers and understand the spread of all numeric
columns.
◆ Features Visualized:
● Year
● Duration
● Rating
● Votes
● Budget_USD
● BoxOffice_USD
● Num_Awards
● Critic_Reviews
◆ Insight: Highlights the median, quartiles, and outliers for each numeric feature.
➔ Pie Charts
◆ Purpose: To show the proportion of categories within categorical features.
◆ Features Visualized:
● Language
● Production_Company
◆ Insight: Visualizes the distribution of movies by language and production company.
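The outliers that a boxplot marks can also be counted numerically. The sketch below applies the common 1.5 × IQR convention to a made-up Duration column; the values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Illustrative values; 900 is an obvious outlier among the durations.
durations = pd.Series([90, 95, 100, 105, 110, 115, 120, 900], name="Duration")

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = durations[(durations < lower) | (durations > upper)]

print(outliers.tolist())
```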
Encoding Categorical Columns & Defining Numerical Columns
One-hot encoding was used to convert categorical features like Director,
Production_Company, Lead_Actor, Content_Rating, Language, and Country
into a numerical format. This allowed machine learning models to process
these variables effectively, ensuring that no ordinal relationships were
assumed between different categories during genre prediction.
The numerical columns were also collected separately in X_num.
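A minimal sketch of this step, assuming scikit-learn's ColumnTransformer is used to combine one-hot encoded categories, TF-IDF features from the description, and the numerical columns. The column subset, the sample rows, and the names categorical_cols and numeric_cols are illustrative assumptions; the project's notebook may assemble the features differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using a subset of the dataset's columns.
df = pd.DataFrame(
    {
        "Director": ["Ann Lee", "Raj Rao", "Ann Lee"],
        "Language": ["English", "Hindi", "English"],
        "Description": [
            "a detective hunts a killer",
            "two friends plan a wedding",
            "a detective falls in love",
        ],
        "Duration": [110, 140, 95],
        "Votes": [12000, 800, 4300],
    }
)

categorical_cols = ["Director", "Language"]
numeric_cols = ["Duration", "Votes"]

preprocess = ColumnTransformer(
    [
        # handle_unknown="ignore" avoids errors on categories unseen in training.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # TfidfVectorizer expects a 1-D column of strings, so the column
        # is given as a plain string rather than a list.
        ("text", TfidfVectorizer(), "Description"),
        # Numerical columns are passed through here and scaled separately.
        ("num", "passthrough", numeric_cols),
    ]
)

X = preprocess.fit_transform(df)
print(X.shape)
```

One-hot encoding high-cardinality columns such as Director or Lead_Actor produces one column per distinct name, so the resulting matrix can become very wide and sparse.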
MODEL BUILDING AND EVALUATION
X: Represents the features or independent variables.
y: Represents the target variable or dependent variable.
X_train: Represents the training data for features.
X_test: Represents the testing data for features.
y_train: Represents the training data for the target variable.
y_test: Represents the testing data for the target variable.
Scaling:
Numerical features (like Year, Duration, Rating, Votes, Budget, BoxOffice, Num_Awards,
Critic_Reviews) were scaled using MaxAbsScaler to ensure all values were on a similar scale. This
helps machine learning models converge faster and prevents features with larger values from
dominating the learning process.
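As a small illustration of the scaling step, the sketch below applies MaxAbsScaler to made-up numerical values. The numbers and the variable names X_train_num and X_test_num are assumptions for the example only.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Columns: Duration (minutes), Votes, Budget_USD. Values are illustrative.
X_train_num = np.array(
    [
        [90.0, 1_000.0, 5_000_000.0],
        [120.0, 50_000.0, 20_000_000.0],
        [180.0, 200_000.0, 100_000_000.0],
    ]
)
X_test_num = np.array([[150.0, 10_000.0, 50_000_000.0]])

scaler = MaxAbsScaler()
# Fit on the training rows only so no test statistics leak into the scaler.
X_train_scaled = scaler.fit_transform(X_train_num)
X_test_scaled = scaler.transform(X_test_num)

print(X_train_scaled.max(axis=0))
print(X_test_scaled)
```

MaxAbsScaler divides each feature by its maximum absolute value and does not center the data, so it preserves the sparsity of one-hot and TF-IDF matrices.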
Training:
The processed dataset was split into training and testing sets. Multiple classification
models—Logistic Regression, Decision Tree, and Random Forest—were trained on the scaled
training data to learn patterns for predicting movie genres. Model performance was then
evaluated on the test set using accuracy, precision, recall, and F1 score.
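The training procedure above can be sketched as follows. This is not the project's notebook: the data is synthetic, and the 80/20 split, random seeds, and model settings are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed feature matrix and genre labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Comparing train and test accuracy helps reveal overfitting.
    results[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, results[name])
```

A large gap between the train and test values for a model is the usual sign of overfitting; a fully grown decision tree, for instance, typically reaches perfect training accuracy.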
Logistic Regression Model
Accuracy (14.4%):
Percentage of total predictions that were correct; only 14.4% of movie
genres were classified correctly by the model.
Precision (0.039):
Weighted average, across genres, of the share of predictions for each
genre that were correct; the value was only 3.9%.
Recall (0.144):
Weighted average, across genres, of the share of each genre’s movies that
the model identified; under weighted averaging this equals accuracy, 14.4%.
F1 Score (0.053):
Harmonic mean of precision and recall; the model’s overall balance
between precision and recall was just 5.3%.
Decision Tree Model
Accuracy (14.4%):
Percentage of total predictions that were correct; only 14.4% of movie
genres were classified correctly by the model.
Precision (0.020):
Weighted average, across genres, of the share of predictions for each
genre that were correct; the value was only 2.0%.
Recall (0.144):
Weighted average, across genres, of the share of each genre’s movies that
the model identified; under weighted averaging this equals accuracy, 14.4%.
F1 Score (0.036):
Harmonic mean of precision and recall; the model’s overall balance
between precision and recall was just 3.6%.
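The Decision Tree figures are consistent with a model that predicts one genre for every movie: under weighted averaging such a model scores recall p and precision p², where p is that genre's share of the test set, and 0.144² ≈ 0.021. The sketch below reproduces this arithmetic on made-up labels. The seven-genre setup and the class share are assumptions chosen to match the reported accuracy, not values read from the project's data, so this is a possible explanation rather than a confirmed one.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Assumed setup: 1,000 test movies, 144 of them in genre 0, the rest
# spread over six other genres (labels 1..6).
y_true = np.concatenate([np.zeros(144, dtype=int),
                         np.arange(856) % 6 + 1])

# A degenerate model that always predicts genre 0.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.3f} precision={precision:.4f} recall={recall:.3f}")
```

Checking the confusion matrix for a single populated column is a quick way to confirm or rule out this behaviour.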
Random Forest Model
Accuracy (100%):
Percentage of total predictions that were correct; the model classified all movie
genres correctly on the test set.
Precision (1.0):
Proportion of predicted genres that were actually correct; every positive
prediction made by the model was accurate.
Recall (1.0):
Proportion of actual genres correctly identified by the model; the model found all
true genre cases.
F1 Score (1.0):
Harmonic mean of precision and recall; the model achieved perfect balance
between precision and recall.
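One way a perfect test score can arise is when a column derived from the target remains in the feature matrix. Whether that happened in this project cannot be determined from the report; the sketch below only demonstrates the effect on synthetic data, where the features are pure noise and the label is deliberately appended as an extra column. The helper name holdout_accuracy is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_genres = 2000, 7

# Pure-noise features: they carry no information about the genre.
X_noise = rng.normal(size=(n_samples, 5))
y = rng.integers(0, n_genres, size=n_samples)

# Simulated leak: the label itself is appended as an extra feature column.
X_leaky = np.column_stack([X_noise, y])


def holdout_accuracy(X, y):
    """Train a Random Forest and return its accuracy on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return model.fit(X_tr, y_tr).score(X_te, y_te)


acc_clean = holdout_accuracy(X_noise, y)
acc_leaky = holdout_accuracy(X_leaky, y)
print(f"noise only: {acc_clean:.3f}  with leaked label: {acc_leaky:.3f}")
```

If removing a suspicious column drops the score from near-perfect to near chance, that column was most likely leaking the target.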
Predicted Genre:
Here the trained model predicts the genre for a given movie's data.
CONCLUSION
In this project, we explored the application of machine learning techniques
to predict movie genres based on structured data and textual descriptions.
Using a dataset of 50,000 movies, we performed comprehensive
preprocessing that included handling categorical data through one-hot
encoding, transforming textual data using TF-IDF vectorization, and scaling
numerical features with MaxAbsScaler.
Three classification models—Logistic Regression, Decision Tree Classifier,
and Random Forest Classifier—were implemented and evaluated. Logistic
Regression and Decision Tree models showed limited predictive capability
with accuracy scores around 14–15%, precision below 5%, and low F1-
scores. These results indicate that genre prediction is a complex multi-class
classification problem, possibly affected by overlapping features and class
imbalance.
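On the other hand, the Random Forest Classifier achieved perfect scores on
all metrics. Because these scores were reported on the test set, where
overfitting alone would normally lower performance, they more likely point
to data leakage or an evaluation error than to true generalization. This
highlights the importance of checking features for leakage, proper model
tuning, and cross-validation to avoid misleading performance.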
Overall, the project demonstrates the potential and challenges of genre
classification using machine learning. Future improvements could involve
balancing genre classes, tuning hyperparameters, applying deep learning
on text data, or integrating metadata with user engagement signals for
improved accuracy and generalizability.
Future Scope
1. Handle Class Imbalance
● Some genres are more frequent than others, which can bias the
model.
● Use techniques like SMOTE, undersampling, or class-weight
adjustment to balance the dataset.
2. Advanced Text Processing
● Replace TF-IDF with pre-trained language models like BERT,
RoBERTa, or DistilBERT to better capture semantic meaning in
movie descriptions.
3. Hyperparameter Tuning
● Use Grid Search or Random Search to optimize model
parameters for better generalization and reduced overfitting
(especially for Random Forest).
4. Use Deep Learning Models
● Implement models such as LSTMs, CNNs for text, or
transformer-based architectures for improved performance,
especially on textual data.
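Two of the items above, class-weight adjustment and grid search, can be combined in a short sketch. The data is synthetic and imbalanced by construction, and the parameter grid, fold count, and scoring choice are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the genre data (three classes).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# class_weight="balanced" reweights classes inversely to their frequency.
forest = RandomForestClassifier(class_weight="balanced", random_state=0)

# Small illustrative grid; limiting depth and leaf size regularizes the trees.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    forest,
    param_grid,
    scoring="f1_weighted",
    # Stratified folds keep the genre proportions similar in every fold.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(f"cross-validated weighted F1: {search.best_score_:.3f}")
```

Because every candidate is scored on held-out folds, the cross-validated score is a more trustworthy estimate than a single train/test split.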
Machine Learning Models Used – Explanation & Justification
In this project, we used three different classification models: Logistic Regression, Decision Tree, and
Random Forest. Each model was selected to bring a different perspective to the genre prediction task
and to compare their strengths and weaknesses.
1. Logistic Regression (LR)
● Purpose in this project: Served as a baseline classifier to evaluate how well a linear model
performs with TF-IDF and one-hot encoded features.
● Use:
○ It was trained on scaled TF-IDF, categorical, and numerical data.
○ Helped assess whether genre labels could be separated linearly.
● Result: Very low accuracy (~14.85%), suggesting that the features carry little linearly separable
signal for genre; the relationship may be non-linear, or the features may be only weakly predictive.
2. Decision Tree Classifier (DT)
● Purpose in this project: To explore non-linear relationships and see how well a single-tree
structure could learn genre patterns.
● Use:
○ Allowed understanding of how specific features (e.g., lead actor, description keywords)
impact genre classification.
○ Showed which features had the highest importance.
● Result: Accuracy on par with LR (about 14%) but lower precision and F1 score, with issues like
overfitting and limited generalization.
3. Random Forest Classifier (RF)
● Purpose in this project: Acted as a powerful ensemble learner to improve prediction
accuracy by reducing variance.
● Use:
○ Aggregated predictions from multiple decision trees to produce stable and accurate
predictions.
○ Effectively captured complex interactions between features.
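● Result: Achieved 100% accuracy. A perfect score on the test set is unlikely to reflect genuine
predictive power; it more plausibly indicates data leakage or undetected memorization, and the result
should be re-checked with cross-validation, regularization, and tuning.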
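The feature-importance ranking mentioned for the Decision Tree can be read from a fitted scikit-learn tree through its feature_importances_ attribute. The sketch below uses synthetic data in which only one column determines the label; the feature names are an illustrative subset and the target rule is an assumption made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = ["Duration", "Votes", "Rating"]  # illustrative subset

X = rng.normal(size=(300, 3))
# Synthetic target that depends only on the first column.
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importances are normalized to sum to 1 across all features.
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

A feature with an importance close to 1.0 on real data is worth inspecting, since it may be a proxy for the target.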
Project Drive Link:
https://colab.research.google.com/drive/1x5HcMwPMMD_gfZYBCM5_kh0n8ebAuY9e?usp=sharing
—END OF REPORT—