
PROJECT REPORT

DR. AKHILESH DAS GUPTA INSTITUTE OF PROFESSIONAL STUDIES

MOVIE GENRE CLASSIFICATION

SUBMITTED BY:

Lakshay Kumar
CSE [0541602723]

Khushi
CSE [13115602723]

Abhinav Sharma
CSE [12815602723]
Abstract
The objective of this project was to develop a machine learning model capable of predicting the genre of a movie based on its metadata and description. Using a dataset of 50,000 movies with 17 features (including title, year, director, cast, production details, and a brief description), the project aimed to automate genre classification, which is valuable for recommendation systems and content organization.

The methodology involved extensive data preprocessing, including handling missing values, encoding categorical variables with one-hot encoding, and extracting text features from movie descriptions using TF-IDF vectorization. Numerical features were scaled appropriately. Three classification algorithms were implemented and evaluated: Logistic Regression, Decision Tree, and Random Forest. The dataset was split into training and testing sets to assess model performance.

Key findings revealed that both Logistic Regression and Decision Tree models achieved low accuracy (around 14%), indicating the challenge of the task and possible feature limitations. The Random Forest model, however, showed perfect accuracy (100%), which suggests overfitting or data leakage rather than genuine predictive power.

In conclusion, while the project successfully demonstrated the end-to-end process of genre classification using machine learning, it also underscored the need for further improvements. Addressing class imbalance, refining feature engineering, and preventing overfitting are essential next steps to enhance model reliability and real-world applicability.

Introduction
Background of the Topic:
With the exponential growth of digital media, organizing and recommending movies efficiently has become a significant challenge for streaming platforms and content providers. Movie genre classification, traditionally performed manually, is essential for improving user experience, enabling effective search, and powering recommendation engines. Leveraging machine learning to automate this process can greatly enhance scalability and accuracy.

Objectives of the Project:
The primary objective of this project is to develop a machine learning model capable of accurately predicting the genre of a movie based on its metadata and description. This involves preprocessing a large dataset, engineering relevant features, and evaluating multiple classification algorithms to identify the most effective approach.

Scope and Limitations:
The project focuses on supervised classification using a dataset of 50,000 movies with 17 features, including both numerical and categorical data. The scope includes data cleaning, feature engineering, model training, and evaluation. Limitations include potential class imbalance, limited feature diversity (e.g., reliance on available metadata and descriptions), and the challenge of capturing nuanced genre distinctions.

Relevance/Importance of the Project:
Automating genre classification is highly relevant for the entertainment industry, especially for streaming services and digital libraries. Accurate genre prediction enhances content discovery and personalised recommendations, and streamlines content management. This project demonstrates the practical application of machine learning in addressing real-world challenges in media organization and user engagement.

DATASET
50,000 movies with 17 features

FEATURES:

Title – Name of the movie
Year – Release year
Director – Name of the director
Duration – Duration of the movie (in minutes)
Rating – IMDb or similar rating (float)
Votes – Number of user votes
Description – Short summary or plot description
Language – Primary language of the movie
Country – Country of origin
Budget_USD – Budget in US dollars
BoxOffice_USD – Box office revenue in US dollars
Genre – Target variable (movie genre)
Production_Company – Name of the production company
Content_Rating – Content rating (e.g., PG, R)
Lead_Actor – Name of the lead actor
Num_Awards – Number of awards won
Critic_Reviews – Number of critic reviews

Key Metrics:
Accuracy
Measures the proportion of correct predictions out of all predictions made.
Example: Logistic Regression Test Accuracy = 14.8%

Precision
Indicates the proportion of positive identifications that were actually correct (for each genre, averaged as “weighted”).
Example: Logistic Regression Test Precision = 0.046

Recall
Measures the proportion of actual positives that were correctly identified.
Example: Logistic Regression Test Recall = 0.148

F1 Score
Harmonic mean of precision and recall, providing a balance between the two.
Example: Logistic Regression Test F1 Score = 0.057

Confusion Matrix
A table showing the number of correct and incorrect predictions for each genre, helping to visualize model performance and errors.

These metrics were calculated for each model (Logistic Regression, Decision Tree, Random
Forest) on both training and test datasets to assess their effectiveness and identify issues
like overfitting or class imbalance.
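A minimal sketch of how these metrics can be computed with scikit-learn is shown below; y_test and y_pred are assumed to hold the true and predicted genres for the test set.

```python
# Minimal sketch: computing the evaluation metrics described above.
# Assumes y_test (true genres) and y_pred (model predictions) already exist.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

accuracy = accuracy_score(y_test, y_pred)
# "weighted" averaging accounts for the different number of movies per genre
precision = precision_score(y_test, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_test, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
print("Confusion matrix:\n", cm)
```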

DATA CLEANING & PREPROCESSING


The dataset contains the columns described in the Dataset section above; the preprocessing applied to them is described below.

LIBRARIES:
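The main libraries used throughout the project can be imported as below; this is a minimal sketch covering only the tools named in this report (pandas, matplotlib, and scikit-learn components).

```python
# Minimal sketch of the project's imports: pandas for data handling,
# matplotlib for plots, scikit-learn for features, scaling, models, metrics.
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
```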
Checking for null values & duplicate rows

Here we check for null values using isnull() and for duplicate rows using duplicated(). The sum() call adds up the null counts per column and the total number of duplicate rows.
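A minimal sketch of these checks, assuming the dataset has been loaded into a pandas DataFrame named df (the file name movies.csv is a placeholder):

```python
# Minimal sketch: load the data and check for nulls and duplicates.
import pandas as pd

df = pd.read_csv("movies.csv")   # placeholder file name

print(df.isnull().sum())                          # null count per column
print("Duplicate rows:", df.duplicated().sum())   # total duplicate rows
```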
Data Visualization
Histograms
➔ Purpose: To visualize the distribution of key numerical features.
➔ Features Visualized: Duration, Votes, Budget_USD
➔ Insight: Shows the spread, skewness, and outliers in these features.

Boxplots
➔ Purpose: To identify the presence of outliers and understand the spread of all numeric columns.
➔ Features Visualized: Year, Duration, Rating, Votes, Budget_USD, BoxOffice_USD, Num_Awards, Critic_Reviews
➔ Insight: Highlights the median, quartiles, and outliers for each numeric feature.

Pie Charts
➔ Purpose: To show the proportion of categories within categorical features.
➔ Features Visualized: Language, Production_Company
➔ Insight: Visualizes the distribution of movies by language and production company.
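A minimal sketch of these plots, assuming df is the movie DataFrame and the column names follow the feature list in the Dataset section:

```python
# Minimal sketch of the visualizations described above.
import matplotlib.pyplot as plt

# Histograms of key numerical features
df[["Duration", "Votes", "Budget_USD"]].hist(bins=30, figsize=(12, 4))
plt.suptitle("Distribution of Duration, Votes and Budget_USD")
plt.show()

# Boxplots of all numeric columns
numeric_cols = ["Year", "Duration", "Rating", "Votes", "Budget_USD",
                "BoxOffice_USD", "Num_Awards", "Critic_Reviews"]
df[numeric_cols].plot(kind="box", subplots=True, layout=(2, 4), figsize=(14, 6))
plt.tight_layout()
plt.show()

# Pie chart of the most common languages
df["Language"].value_counts().head(10).plot(kind="pie", autopct="%1.1f%%")
plt.ylabel("")
plt.show()
```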


Encoding for categorical columns & defining numerical columns

One-hot encoding was used to convert categorical features like Director, Production_Company, Lead_Actor, Content_Rating, Language, and Country into a numerical format. This allowed machine learning models to process these variables effectively, ensuring that no ordinal relationships were assumed between different categories during genre prediction. Here we also defined the numerical columns in X_num.
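A minimal sketch of this feature-engineering step, assuming df is the movie DataFrame; the exact column selections and the 5,000-term TF-IDF limit are illustrative assumptions:

```python
# Minimal sketch: one-hot encode categoricals, TF-IDF the descriptions,
# keep numerical columns in X_num, and combine into one sparse matrix.
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

categorical_cols = ["Director", "Production_Company", "Lead_Actor",
                    "Content_Rating", "Language", "Country"]
numerical_cols = ["Year", "Duration", "Rating", "Votes", "Budget_USD",
                  "BoxOffice_USD", "Num_Awards", "Critic_Reviews"]

# One-hot encoding: no ordinal relationship is assumed between categories
X_cat = pd.get_dummies(df[categorical_cols])

# TF-IDF features from the plot descriptions
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_text = tfidf.fit_transform(df["Description"].fillna(""))

# Numerical columns kept together, as described above
X_num = df[numerical_cols].fillna(0)

# Combine everything into a single sparse feature matrix
X = hstack([csr_matrix(X_cat.to_numpy(dtype=float)), X_text,
            csr_matrix(X_num.to_numpy(dtype=float))]).tocsr()
y = df["Genre"]
```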
MODEL BUILDING AND EVALUATION

X: Represents the features or independent variables.
y: Represents the target variable or dependent variable.
X_train: Represents the training data for features.
X_test: Represents the testing data for features.
y_train: Represents the training data for the target variable.
y_test: Represents the testing data for the target variable.
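A minimal sketch of the split, assuming X and y were built as in the feature-engineering sketch above; the 80/20 ratio and stratification are assumptions, not figures taken from the report:

```python
# Minimal sketch: hold out 20% of the movies as a test set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y)  # stratify keeps genre proportions similar in both splits
```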

Scaling:

Numerical features (Year, Duration, Rating, Votes, Budget_USD, BoxOffice_USD, Num_Awards, Critic_Reviews) were scaled using MaxAbsScaler to ensure all values were on a similar scale. This helps machine learning models converge faster and prevents features with larger values from dominating the learning process.
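A minimal sketch of the scaling step; the report describes scaling the numerical features, and as a simplification the sketch applies the scaler column-wise to the whole combined matrix, which rescales every column by its maximum absolute value and leaves the 0/1 one-hot columns unchanged:

```python
# Minimal sketch: scale features with MaxAbsScaler (fit on training data only).
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()               # works directly on sparse matrices
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

MaxAbsScaler is a natural fit here because, unlike StandardScaler, it preserves the sparsity of the TF-IDF and one-hot columns.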

Training:
The processed dataset was split into training and testing sets. Multiple classification
models—Logistic Regression, Decision Tree, and Random Forest—were trained on the scaled
training data to learn patterns for predicting movie genres. Model performance was then
evaluated on the test set using accuracy, precision, recall, and F1 score.
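A minimal sketch of the training and evaluation loop, continuing from the sketches above; the default hyperparameters shown are assumptions, not values taken from the report:

```python
# Minimal sketch: train the three classifiers and score them on the test set.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)
    print(f"{name}: accuracy={acc:.3f}, weighted F1={f1:.3f}")
```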
Logistic Regression Model

Accuracy (14.4%): Percentage of total predictions that were correct; only 14.4% of movie genres were classified correctly by the model.
Precision (0.039): Proportion of predicted genres that were actually correct; only 3.9% of the model’s positive predictions were accurate.
Recall (0.144): Proportion of actual genres correctly identified by the model; the model found 14.4% of all true genre cases.
F1 Score (0.053): Harmonic mean of precision and recall; the model’s overall balance between precision and recall was just 5.3%.

DECISION TREE Model


Accuracy (14.4%): Percentage of total predictions that were correct; only 14.4% of movie genres were classified correctly by the model.
Precision (0.020): Proportion of predicted genres that were actually correct; only 2.0% of the model’s positive predictions were accurate.
Recall (0.144): Proportion of actual genres correctly identified by the model; the model found 14.4% of all true genre cases.
F1 Score (0.036): Harmonic mean of precision and recall; the model’s overall balance between precision and recall was just 3.6%.

RANDOM FOREST MODEL


Accuracy (100%): Percentage of total predictions that were correct; the model classified all movie genres correctly on the test set.
Precision (1.0): Proportion of predicted genres that were actually correct; every positive prediction made by the model was accurate.
Recall (1.0): Proportion of actual genres correctly identified by the model; the model found all true genre cases.
F1 Score (1.0): Harmonic mean of precision and recall; the model achieved perfect balance between precision and recall.

Predicted GENRE:
Here the trained model predicts the most likely genre for a given movie's data.
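A minimal sketch of this prediction step, continuing from the training sketch above; it uses the fitted Random Forest on a few held-out test rows rather than a newly entered movie:

```python
# Minimal sketch: predict genres for a handful of held-out movies.
rf = models["Random Forest"]
sample = X_test_scaled[:5]
predicted_genres = rf.predict(sample)

for true_genre, pred_genre in zip(y_test.iloc[:5], predicted_genres):
    print(f"actual: {true_genre:<12} predicted: {pred_genre}")
```

A brand-new movie would first have to pass through the same one-hot, TF-IDF, and scaling transformations before being handed to predict().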

CONCLUSION
In this project, we explored the application of machine learning techniques
to predict movie genres based on structured data and textual descriptions.
Using a dataset of 50,000 movies, we performed comprehensive
preprocessing that included handling categorical data through one-hot
encoding, transforming textual data using TF-IDF vectorization, and scaling
numerical features with MaxAbsScaler.
Three classification models—Logistic Regression, Decision Tree Classifier,
and Random Forest Classifier—were implemented and evaluated. Logistic
Regression and Decision Tree models showed limited predictive capability
with accuracy scores around 14–15%, precision below 5%, and low F1-
scores. These results indicate that genre prediction is a complex multi-class
classification problem, possibly affected by overlapping features and class
imbalance.

On the other hand, the Random Forest Classifier achieved perfect scores on
all metrics, suggesting overfitting to the training data rather than true
generalization. This highlights the importance of proper model tuning and
cross-validation to avoid misleading performance.

Overall, the project demonstrates the potential and challenges of genre classification using machine learning. Future improvements could involve balancing genre classes, tuning hyperparameters, applying deep learning on text data, or integrating metadata with user engagement signals for improved accuracy and generalizability.

Future Scope

1. Handle Class Imbalance

● Some genres are more frequent than others, which can bias the model.
● Use techniques like SMOTE, undersampling, or class-weight adjustment to balance the dataset (see the sketch after this list).

2. Advanced Text Processing

● Replace TF-IDF with pre-trained language models like BERT, RoBERTa, or DistilBERT to better capture semantic meaning in movie descriptions.

3. Hyperparameter Tuning

● Use Grid Search or Random Search to optimize model parameters for better generalization and reduced overfitting, especially for Random Forest (see the sketch after this list).

4. Use Deep Learning Models

● Implement models like LSTM, CNN for text, or transformer-based architectures for improved performance, especially on textual data.
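As referenced in points 1 and 3 above, a minimal sketch of class-weight adjustment combined with Grid Search is shown below; the parameter grid is an illustrative assumption rather than a tuned recommendation, and it reuses the scaled training split from the earlier sketches:

```python
# Minimal sketch: class-weighted Random Forest tuned with Grid Search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [10, 20, None],
    "min_samples_leaf": [1, 5],
}

rf = RandomForestClassifier(class_weight="balanced", random_state=42)
search = GridSearchCV(rf, param_grid, cv=5, scoring="f1_weighted", n_jobs=-1)
search.fit(X_train_scaled, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated weighted F1:", search.best_score_)
```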

Machine Learning Models Used – Explanation & Justification
In this project, we used three different classification models: Logistic Regression, Decision Tree, and
Random Forest. Each model was selected to bring a different perspective to the genre prediction task
and to compare their strengths and weaknesses.

1. Logistic Regression (LR)

● Purpose in this model: Served as a baseline classifier to evaluate how well a linear model
performs with TF-IDF and one-hot encoded features.
● Use:
○ It was trained on scaled TF-IDF, categorical, and numerical data.
○ Helped assess whether genre labels could be separated linearly.
● Result: Very low accuracy (~14.85%), indicating that the relationship between features and genre is likely non-linear.

2. Decision Tree Classifier (DT)

● Purpose in this model: To explore non-linear relationships and see how well a single-tree
structure could learn genre patterns.
● Use:
○ Allowed understanding of how specific features (e.g., lead actor, description keywords)
impact genre classification.
○ Showed which features had the highest importance.
● Result: Slightly lower performance than LR, with issues like overfitting and limited
generalization.

3. Random Forest Classifier (RF)

● Purpose in this model: Acted as a powerful ensemble learner to improve prediction accuracy by reducing variance.
● Use:
○ Aggregated predictions from multiple decision trees to produce stable and accurate
predictions.
○ Effectively captured complex interactions between features.
● Result: Achieved 100% accuracy, likely due to overfitting, suggesting that it memorized the
training data but didn’t generalize well without regularization or tuning.

Project Drive Link:
https://colab.research.google.com/drive/1x5HcMwPMMD_gfZYBCM5_kh0n8ebAuY9e?usp=sharing

—END OF REPORT—
