Project - Report (Movie Genre Classification)

The project aimed to develop a machine learning model for predicting movie genres using a dataset of 50,000 movies with various features. While Logistic Regression and Decision Tree models showed low accuracy (around 14%), the Random Forest model achieved 100% accuracy, indicating potential overfitting. The report highlights the importance of addressing class imbalance and refining feature engineering for improved model reliability in genre classification.

Uploaded by

bangerx826

PROJECT REPORT

DR. AKHILESH DAS GUPTA INSTITUTE OF PROFESSIONAL STUDIES

MOVIE GENRE CLASSIFICATION

SUBMITTED BY:

Lakshay Kumar
CSE [0541602723]

Khushi
CSE [13115602723]

Abhinav Sharma
CSE [12815602723]
Abstract
The objective of this project was to develop a machine learning model capable of predicting the genre of a movie based on its metadata and description. Using a dataset of 50,000 movies with 17 features (including title, year, director, cast, production details, and a brief description), the project aimed to automate genre classification, which is valuable for recommendation systems and content organization.

The methodology involved extensive data preprocessing, including handling missing values, encoding categorical variables with one-hot encoding, and extracting text features from movie descriptions using TF-IDF vectorization. Numerical features were scaled appropriately. Three classification algorithms were implemented and evaluated: Logistic Regression, Decision Tree, and Random Forest. The dataset was split into training and testing sets to assess model performance.

Key findings revealed that both Logistic Regression and Decision Tree models achieved low accuracy (around 14%), indicating the challenge of the task and possible feature limitations. The Random Forest model, however, showed perfect accuracy (100%), which suggests overfitting or data leakage rather than genuine predictive power.

In conclusion, while the project successfully demonstrated the end-to-end process of genre classification using machine learning, it also underscored the need for further improvements. Addressing class imbalance, refining feature engineering, and preventing overfitting are essential next steps to enhance model reliability and real-world applicability.

Introduction
Background of the Topic:

With the exponential growth of digital media, organizing and recommending movies efficiently has become a significant challenge for streaming platforms and content providers. Movie genre classification, traditionally performed manually, is essential for improving user experience, enabling effective search, and powering recommendation engines. Leveraging machine learning to automate this process can greatly enhance scalability and accuracy.

Objectives of the Project:

The primary objective of this project is to develop a machine learning model capable of accurately predicting the genre of a movie based on its metadata and description. This involves preprocessing a large dataset, engineering relevant features, and evaluating multiple classification algorithms to identify the most effective approach.

Scope and Limitations:

The project focuses on supervised classification using a dataset of 50,000 movies with 17 features, including both numerical and categorical data. The scope includes data cleaning, feature engineering, model training, and evaluation. Limitations include potential class imbalance, limited feature diversity (e.g., reliance on available metadata and descriptions), and the challenge of capturing nuanced genre distinctions.

Relevance/Importance of the Project:

Automating genre classification is highly relevant for the entertainment industry, especially for streaming services and digital libraries. Accurate genre prediction enhances content discovery and personalised recommendations, and streamlines content management. This project demonstrates the practical application of machine learning in addressing real-world challenges in media organization and user engagement.

DATASET
50,000 movies with 17 columns (16 input features plus the Genre target)

FEATURES:

Title – Name of the movie
Year – Release year
Director – Name of the director
Duration – Duration of the movie (in minutes)
Rating – IMDb or similar rating (float)
Votes – Number of user votes
Description – Short summary or plot description
Language – Primary language of the movie
Country – Country of origin
Budget_USD – Budget in US dollars
BoxOffice_USD – Box office revenue in US dollars
Genre – Target variable (movie genre)
Production_Company – Name of the production company
Content_Rating – Content rating (e.g., PG, R)
Lead_Actor – Name of the lead actor
Num_Awards – Number of awards won
Critic_Reviews – Number of critic reviews

Key Metrics:

Accuracy
Measures the proportion of correct predictions out of all predictions made.
Example: Logistic Regression Test Accuracy = 14.8%

Precision
For each genre, the proportion of movies predicted as that genre that actually belong to it. The per-genre values are combined as a “weighted” average, with each genre weighted by its number of true instances.
Example: Logistic Regression Test Precision = 0.046

Recall
For each genre, the proportion of movies of that genre that were correctly identified, combined with the same weighted average. With this weighting, recall is equal to accuracy, which is why the two values match.
Example: Logistic Regression Test Recall = 0.148

F1 Score
Harmonic mean of precision and recall, providing a balance between the two.
Example: Logistic Regression Test F1 Score = 0.057

Confusion Matrix
A table showing the number of correct and incorrect predictions for each genre, helping to visualize model performance and errors.

These metrics were calculated for each model (Logistic Regression, Decision Tree, Random Forest) on both training and test datasets to assess their effectiveness and identify issues like overfitting or class imbalance.
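
As a minimal sketch of how these metrics can be computed with scikit-learn: the labels below are invented toy values, not the project's data, and the variable names are illustrative. The example also shows that weighted recall and accuracy coincide.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy labels for illustration only; not taken from the project's dataset.
y_true = ["Action", "Action", "Comedy", "Drama", "Drama", "Drama"]
y_pred = ["Action", "Comedy", "Comedy", "Drama", "Drama", "Action"]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" combines per-genre scores using each genre's support.
# zero_division=0 scores a genre as 0 when it is never predicted.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

# Rows are true genres, columns are predicted genres, in the order of `labels`.
labels = ["Action", "Comedy", "Drama"]
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(matrix)
```

Passing an explicit `labels` list fixes the row and column order of the confusion matrix, which makes it easier to read when some genres are rare.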

DATA CLEANING & PREPROCESSING

The dataset contains the columns listed in the DATASET section.

LIBRARIES:
Checking for null values & duplicate rows

Null values are checked with isnull() and duplicate rows with duplicated(). Calling sum() on each result
counts the null values per column and the number of duplicate rows.
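
A minimal sketch of these checks with pandas, assuming a small stand-in DataFrame (the column names follow the report, the values are invented):

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; values are invented for illustration.
df = pd.DataFrame({
    "Title": ["A", "B", "B", "C"],
    "Rating": [7.1, 5.0, 5.0, np.nan],
    "Votes": [100.0, 250.0, 250.0, np.nan],
})

null_counts = df.isnull().sum()         # number of nulls in each column
duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row

# One possible follow-up: drop the repeated rows and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(null_counts)
print("duplicate rows:", duplicate_rows)
```

How missing values are then handled (dropping rows, imputing a median, and so on) is a separate choice that the report does not spell out.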
Data Visualization

➔ Histograms

◆ Purpose: To visualize the distribution of key numerical features.

◆ Features Visualized:

● Duration
● Votes
● Budget_USD

◆ Insight: Shows the spread, skewness, and outliers in these features.

➔ Boxplots

◆ Purpose: To identify the presence of outliers and understand the spread of all numeric columns.

◆ Features Visualized:

● Year
● Duration
● Rating
● Votes
● Budget_USD
● BoxOffice_USD
● Num_Awards
● Critic_Reviews

◆ Insight: Highlights the median, quartiles, and outliers for each numeric feature.

➔ Pie Charts

◆ Purpose: To show the proportion of categories within categorical features.

◆ Features Visualized:

● Language
● Production_Company

◆ Insight: Visualizes the distribution of movies by language and production company.
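
The quantities behind a boxplot and a pie chart can also be computed directly. The sketch below assumes the common 1.5×IQR outlier rule (the default whisker setting in matplotlib boxplots); `iqr_outlier_mask` is a hypothetical helper and the data is invented.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented sample; only the column names come from the report.
df = pd.DataFrame({
    "Duration": [90, 95, 100, 105, 110, 300],
    "Language": ["English", "English", "Hindi", "English", "French", "Hindi"],
})

# Points a boxplot would draw beyond the whiskers.
outliers = df.loc[iqr_outlier_mask(df["Duration"]), "Duration"]

# Slice sizes a pie chart would show, as fractions of all movies.
language_share = df["Language"].value_counts(normalize=True)

print(outliers.tolist())
print(language_share)
```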

Encoding for categorical columns & defining numerical columns

One-hot encoding was used to convert categorical features like Director,
Production_Company, Lead_Actor, Content_Rating, Language, and Country
into a numerical format. This allowed machine learning models to process
these variables effectively, ensuring that no ordinal relationships were
assumed between different categories during genre prediction.
The numerical columns were also collected in X_num.
MODEL BUILDING AND EVALUATION

X: Represents the features or independent variables.
y: Represents the target variable or dependent variable.
X_train: Represents the training data for features.
X_test: Represents the testing data for features.
y_train: Represents the training data for the target variable.
y_test: Represents the testing data for the target variable.

Scaling:

Numerical features (like Year, Duration, Rating, Votes, Budget, BoxOffice, Num_Awards,
Critic_Reviews) were scaled using MaxAbsScaler to ensure all values were on a similar scale. This
helps machine learning models converge faster and prevents features with larger values from
dominating the learning process.
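
As a small illustration of what MaxAbsScaler does: each column is divided by its largest absolute value, so every scaled value lies in [-1, 1]. The numbers below are invented.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Invented values for three numeric columns: Year, Duration, Budget_USD.
X_num = np.array([
    [1995.0, 100.0, 1_000_000.0],
    [2010.0, 150.0, 50_000_000.0],
    [2020.0, 90.0, 200_000_000.0],
])

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_num)

print(scaler.max_abs_)  # per-column maximum absolute value used as the divisor
print(X_scaled)
```

Because MaxAbsScaler does not shift the data, zeros stay zero, so it can be applied to sparse matrices such as TF-IDF or one-hot output. In a train/test workflow the scaler would normally be fitted on the training split only and then applied to the test split.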

Training:
The processed dataset was split into training and testing sets. Multiple classification
models—Logistic Regression, Decision Tree, and Random Forest—were trained on the scaled
training data to learn patterns for predicting movie genres. Model performance was then
evaluated on the test set using accuracy, precision, recall, and F1 score.
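
The split, preprocessing, and training steps can be sketched as a single scikit-learn pipeline. This is an illustration under assumptions, not the project's actual code: the rows are invented and repeated, so the toy scores say nothing about the real task, and only Logistic Regression is shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Invented toy rows; only the column names come from the report.
rows = [
    ("A detective hunts a killer", "English", 110, "Thriller"),
    ("Two friends on a funny road trip", "Hindi", 95, "Comedy"),
    ("A killer stalks the city at night", "English", 120, "Thriller"),
    ("A funny wedding goes wrong", "Hindi", 100, "Comedy"),
]
df = pd.DataFrame(rows * 10, columns=["Description", "Language", "Duration", "Genre"])

X = df.drop(columns=["Genre"])  # the target must not appear among the features
y = df["Genre"]

# stratify=y keeps the genre proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "Description"),  # a single column name gives 1-D text input
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Language"]),
    ("num", MaxAbsScaler(), ["Duration"]),
])

# Fitting inside a Pipeline means the transformers only ever see training data.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred, average="weighted")
print(f"accuracy={test_accuracy:.3f} weighted F1={test_f1:.3f}")
```

Swapping the `clf` step for a DecisionTreeClassifier or RandomForestClassifier would reuse the same preprocessing for all three models.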
LOGISTIC REGRESSION MODEL

Accuracy (14.4%):
Percentage of total predictions that were correct; only 14.4% of movie
genres were classified correctly by the model.
Precision (0.039):
Weighted average, over genres, of the proportion of predictions for a genre
that were actually correct; the resulting score was only 0.039.
Recall (0.144):
Proportion of actual genres correctly identified by the model; the model
found 14.4% of all true genre cases.
F1 Score (0.053):
Harmonic mean of precision and recall; the model’s overall balance
between precision and recall was just 5.3%.

DECISION TREE MODEL

Accuracy (14.4%):
Percentage of total predictions that were correct; only 14.4% of movie
genres were classified correctly by the model.
Precision (0.020):
Weighted average, over genres, of the proportion of predictions for a genre
that were actually correct; the resulting score was only 0.020.
Recall (0.144):
Proportion of actual genres correctly identified by the model; the model
found 14.4% of all true genre cases.
F1 Score (0.036):
Harmonic mean of precision and recall; the model’s overall balance
between precision and recall was just 3.6%.
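
The Decision Tree figures are close to what weighted metrics give for a model that predicts one genre for every movie. The arithmetic below is a check under that assumption, which is an inference from the reported numbers and not something the report states: if the predicted genre makes up 14.4% of the test set, weighted precision comes out at about 0.021 and weighted F1 at about 0.036.

```python
# Assumption (not stated in the report): the model predicts a single genre for
# every test movie, and that genre accounts for a fraction p of the test set.
p = 0.144

# Only the predicted genre has non-zero scores; every other genre scores 0.
class_precision = p   # share of all predictions that land on the right genre
class_recall = 1.0    # every movie of that genre is "found"
class_f1 = 2 * class_precision * class_recall / (class_precision + class_recall)

# Weighted averaging multiplies each genre's score by its support share (p here).
weighted_precision = p * class_precision
weighted_recall = p * class_recall
weighted_f1 = p * class_f1

print(weighted_precision, weighted_recall, weighted_f1)
```

If this reading is right, the accuracy reflects the share of the most common predicted genre rather than any learned pattern, which a confusion matrix would confirm or rule out.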

RANDOM FOREST MODEL

Accuracy (100%):
Percentage of total predictions that were correct; the model classified all movie
genres correctly on the test set.
Precision (1.0):
Proportion of predicted genres that were actually correct; every positive
prediction made by the model was accurate.
Recall (1.0):
Proportion of actual genres correctly identified by the model; the model found all
true genre cases.
F1 Score (1.0):
Harmonic mean of precision and recall; the model achieved perfect balance
between precision and recall.
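
Since a perfect test score often points to leakage, two quick checks are sketched below. `leakage_report` is a hypothetical helper, and the column-name convention `Genre_<value>` for one-hot target columns is an assumption; neither comes from the project's notebook.

```python
import pandas as pd


def leakage_report(X_train: pd.DataFrame, X_test: pd.DataFrame, target: str) -> dict:
    """Two quick leakage checks (hypothetical helper, not the project's code)."""
    # 1) The target column, or a one-hot column derived from it, among the features.
    target_like = [
        c for c in X_train.columns if c == target or c.startswith(target + "_")
    ]
    # 2) Test rows whose feature values also appear verbatim in the training set.
    #    merge() with no `on` joins on all shared columns.
    shared = X_test.merge(X_train.drop_duplicates(), how="inner")
    return {"target_like_columns": target_like, "shared_rows": len(shared)}


# Invented frames in which the encoded target has slipped into the features.
X_train = pd.DataFrame({"Duration": [100, 120, 90], "Genre_Drama": [1, 0, 0]})
X_test = pd.DataFrame({"Duration": [120, 200], "Genre_Drama": [0, 1]})

report = leakage_report(X_train, X_test, target="Genre")
print(report)
```

Either finding (a target-derived feature or many shared rows) would be enough to explain a 100% score without the model having learned anything about genres.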

Predicted GENRE:
Here the trained model predicts the genre for a given movie record.

CONCLUSION
In this project, we explored the application of machine learning techniques
to predict movie genres based on structured data and textual descriptions.
Using a dataset of 50,000 movies, we performed comprehensive
preprocessing that included handling categorical data through one-hot
encoding, transforming textual data using TF-IDF vectorization, and scaling
numerical features with MaxAbsScaler.
Three classification models—Logistic Regression, Decision Tree Classifier,
and Random Forest Classifier—were implemented and evaluated. Logistic
Regression and Decision Tree models showed limited predictive capability
with accuracy scores around 14–15%, precision below 5%, and low F1-
scores. These results indicate that genre prediction is a complex multi-class
classification problem, possibly affected by overlapping features and class
imbalance.

On the other hand, the Random Forest Classifier achieved perfect scores on
all metrics, including on the test set. Overfitting alone would normally lower
test scores, so a perfect test result more likely points to data leakage or an
evaluation error than to true generalization. This highlights the importance of
proper model tuning, leakage checks, and cross-validation to avoid misleading
performance.
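
A minimal sketch of cross-validation for a Random Forest, assuming synthetic data from `make_classification` in place of the movie dataset; the fold count and model settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the movie features; not the project's data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each score comes from data the model did not see while fitting that fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Accuracy on the data the model was trained on, for comparison.
train_accuracy = model.fit(X, y).score(X, y)

print("fold accuracies:", scores.round(3))
print(f"mean CV accuracy={scores.mean():.3f} train accuracy={train_accuracy:.3f}")
```

A large gap between training accuracy and the cross-validated mean is the usual sign of overfitting, whereas near-perfect scores on every held-out fold would be a reason to look for leakage.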

Overall, the project demonstrates the potential and challenges of genre
classification using machine learning. Future improvements could involve
balancing genre classes, tuning hyperparameters, applying deep learning
on text data, or integrating metadata with user engagement signals for
improved accuracy and generalizability.

Future Scope

1. Handle Class Imbalance

● Some genres are more frequent than others, which can bias the model.
● Use techniques like SMOTE, undersampling, or class-weight adjustment to balance the dataset.

2. Advanced Text Processing

● Replace TF-IDF with pre-trained language models like BERT, RoBERTa, or DistilBERT to better capture semantic meaning in movie descriptions.

3. Hyperparameter Tuning

● Use Grid Search or Random Search to optimize model parameters for better generalization and reduced overfitting (especially for Random Forest).

4. Use Deep Learning Models

● Implement models like LSTM, CNN for text, or transformer-based architectures for improved performance, especially on textual data.
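
Two of the items above, class-weight adjustment and grid search, can be sketched together with scikit-learn. The data is synthetic and deliberately imbalanced, and the parameter grid is an illustrative assumption, not a recommendation. SMOTE is not shown because it lives in the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced stand-in data (roughly 70% / 20% / 10%).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# "balanced" weights are n_samples / (n_classes * class_count):
# the rarer the class, the larger its weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# Small illustrative grid; limiting depth and leaf size restrains overfitting.
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1_macro",  # macro-F1 counts every class equally
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring with macro-F1 rather than accuracy keeps the search from favouring settings that only do well on the most common class.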

Machine Learning Models Used – Explanation & Justification
In this project, we used three different classification models: Logistic Regression, Decision Tree, and
Random Forest. Each model was selected to bring a different perspective to the genre prediction task
and to compare their strengths and weaknesses.

1. Logistic Regression (LR)

● Purpose in this model: Served as a baseline classifier to evaluate how well a linear model
performs with TF-IDF and one-hot encoded features.
● Use:
○ It was trained on scaled TF-IDF, categorical, and numerical data.
○ Helped assess whether genre labels could be separated linearly.
● Result: Very low accuracy (~14.85%), suggesting that these features do not separate the
genres linearly. Because the Decision Tree scored no better, the features themselves may
carry little genre signal.

2. Decision Tree Classifier (DT)

● Purpose in this model: To explore non-linear relationships and see how well a single-tree
structure could learn genre patterns.
● Use:
○ Allowed understanding of how specific features (e.g., lead actor, description keywords)
impact genre classification.
○ Showed which features had the highest importance.
● Result: Slightly lower performance than LR, with issues like overfitting and limited
generalization.

3. Random Forest Classifier (RF)

● Purpose in this model: Acted as a powerful ensemble learner to improve prediction
accuracy by reducing variance.
● Use:
○ Aggregated predictions from multiple decision trees to produce more stable
predictions.
○ Can capture complex interactions between features.
● Result: Achieved 100% accuracy on the test set. A perfect test score is unlikely to reflect
genuine predictive power and more likely points to data leakage or an evaluation error.
Overfitting alone would show up as a gap between training and test scores, so the result
should be re-checked before the model is relied on.
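
One way to put the three models' scores in context is to compare them with a trivial baseline. The sketch below uses scikit-learn's DummyClassifier on synthetic data; the dataset and model settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class stand-in data; not the project's dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Always predicts the most common training class; any real model should beat it.
    "Baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training split and record its test accuracy.
results = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

A model whose accuracy sits at the baseline level has effectively learned nothing from the features, which is a useful reference point when reading results of around 14%.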

Project Drive Link: -

https://colab.research.google.com/drive/1x5HcMwPMMD_gfZYBCM5_kh0n8ebAuY9e?usp=sharing

—END OF REPORT—
