VISVESVARAYA TECHNOLOGICAL UNIVERSITY,
Jnana Sangama, Belgaum-590018
A PROJECT REPORT ON
“ANALYZING SPOTIFY STREAMING DATA”
An Activity Report submitted in partial fulfillment of the requirements for the 6th
semester of
BACHELOR OF ENGINEERING (B.E)
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING ENGINEERING
SUBMITTED BY
Manjunath (3GN21AI016)
Amul (3GN21AI004)
Sai Kumar (3GN21AI045)
UNDER THE GUIDANCE OF
Prof. JASMINEET KAUR ARORA
DR. HARISH JOSHI
GURU NANAK DEV ENGINEERING COLLEGE, BIDAR
MAILOOR ROAD, BIDAR, KARNATAKA-585403
CHAPTER 1
PROBLEM STATEMENT
Analyze and predict restaurant ratings and popularity on Zomato using user reviews, restaurant attributes,
and location data, to provide insights for diners seeking quality dining experiences and to enhance service
and reputation management strategies.
STEPS TO BE FOLLOWED
Exploratory Data Analysis
Installing Libraries and Modules
Loading Data
Data Inspection
Understanding Variables
Data Wrangling
Feature Engineering
Data Visualization
Histograms
Scatter Plots
Pair Plots
Hypothesis Testing
Data Cleaning and Preparation
Exploratory Data Analysis (EDA)
Feature Selection
Model Training and Evaluation
Visualization
Machine Learning Models
Random Forests
Extra Trees Regression
Decision Tree
Linear Regression
IMPORT LIBRARIES AND MODULES
The code in this report performs data analysis and machine learning tasks in Python. The libraries and
modules it relies on are described below:
1. Importing Libraries:
import pandas as pd: Imports the Pandas library for data manipulation and analysis.
import numpy as np: Imports the NumPy library for numerical operations.
import matplotlib.pyplot as plt: Imports Matplotlib's pyplot module for creating visualizations.
import seaborn as sns: Imports the Seaborn library for statistical data visualization.
Sklearn: Scikit-learn is a library in Python that provides many unsupervised and supervised
learning algorithms.
2. Machine Learning Modules:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
Purpose:
The primary purpose of the code is to analyze and predict restaurant ratings based on various features
from a Zomato dataset. The process involves several steps, including data cleaning, exploration,
transformation, and the implementation of machine learning models to understand the relationships between
the features and the restaurant ratings.
LET'S BEGIN!!!
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
# Reading the dataset
zomato_orgnl = pd.read_csv("zomato.csv")
# Data Cleaning
def clean_data(df):
    # Drop columns that are not used in the analysis
    df = df.drop(['url', 'dish_liked', 'phone'], axis=1)
    df = df.drop_duplicates()
    df = df.dropna(how='any')
    df = df.rename(columns={'approx_cost(for two people)': 'cost',
                            'listed_in(type)': 'type',
                            'listed_in(city)': 'city'})
    # Note: replacing ',' with '.' turns '1,200' into 1.2, not 1200
    df['cost'] = df['cost'].astype(str).apply(lambda x: x.replace(',', '.')).astype(float)
    # Remove unrated restaurants ('NEW' or '-')
    df = df.loc[df.rate != 'NEW']
    df = df.loc[df.rate != '-'].reset_index(drop=True)
    # Strip the '/5' suffix and convert ratings to float
    df['rate'] = df['rate'].apply(lambda x: x.replace('/5', '') if type(x) == str else x).str.strip().astype('float')
    df['name'] = df['name'].apply(lambda x: x.title())
    # Convert the Yes/No flags to booleans
    df['online_order'] = df['online_order'].map({'Yes': True, 'No': False})
    df['book_table'] = df['book_table'].map({'Yes': True, 'No': False})
    return df

zomato = clean_data(zomato_orgnl.copy())
o/p-
<class 'pandas.core.frame.DataFrame'>
Index: 43499 entries, 0 to 51716
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype
 0   address                      43499 non-null  object
 1   name                         43499 non-null  object
 2   online_order                 43499 non-null  object
 3   book_table                   43499 non-null  object
 4   rate                         43499 non-null  object
 5   votes                        43499 non-null  int64
 6   location                     43499 non-null  object
 7   rest_type                    43499 non-null  object
 8   cuisines                     43499 non-null  object
 9   approx_cost(for two people)  43499 non-null  object
 10  reviews_list                 43499 non-null  object
 11  menu_item                    43499 non-null  object
 12  listed_in(type)              43499 non-null  object
 13  listed_in(city)              43499 non-null  object
dtypes: int64(1), object(13)
memory usage: 5.0+ MB
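The cleaning rules for the rate and cost columns are easier to follow on single values. The sketch below is a minimal pure-Python illustration, not the project's code: the helper names parse_rate and parse_cost are hypothetical, and parse_cost assumes the comma is a thousands separator, whereas the cleaning function above replaces the comma with a dot.

```python
from typing import Optional


def parse_rate(raw: object) -> Optional[float]:
    """Convert a raw rating such as '4.1/5' to a float; return None for 'NEW' or '-'."""
    if raw is None:
        return None
    if not isinstance(raw, str):
        return float(raw)
    text = raw.replace('/5', '').strip()
    if text in ('NEW', '-', ''):
        return None
    return float(text)


def parse_cost(raw: str) -> float:
    """Convert a cost string such as '1,200' to 1200.0 (assumption: ',' separates thousands)."""
    return float(raw.replace(',', ''))


print(parse_rate('4.1/5'))                 # rating with the '/5' suffix removed
print(parse_rate('NEW'))                   # unrated restaurants give None
print(parse_cost('1,200'))                 # thousands separator removed
print(float('1,200'.replace(',', '.')))    # comma-to-dot rule, for comparison
```

Comparing the last two printed values shows why the choice of replacement character matters for any cost above 999.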
# Encode categorical variables
def encode_data(df):
    # Replace every column except the numeric ones with integer codes
    for column in df.columns[~df.columns.isin(['rate', 'cost', 'votes'])]:
        df[column] = df[column].factorize()[0]
    return df

zomato_en = encode_data(zomato.copy())
o/p-
Index(['address', 'name', 'online_order', 'book_table', 'rate',
'votes', 'location', 'rest_type', 'cuisines', 'cost',
'reviews_list', 'menu_item', 'type', 'city'],
dtype='object')
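The encoding step relies on the pandas factorize method, which maps each distinct value to an integer in order of first appearance. The toy example below uses invented cuisine values, not rows from the dataset, to show what the codes look like.

```python
import pandas as pd

# Invented values for illustration only
cuisines = pd.Series(['North Indian', 'Chinese', 'North Indian', 'Cafe'])

# factorize returns the integer codes and the distinct values they refer to
codes, uniques = cuisines.factorize()

print(list(codes))
print(list(uniques))
```

The codes carry no ranking: a larger integer only means the value appeared later in the column. Tree-based models can usually split on such codes, while a linear model treats them as ordered quantities, which may be one reason for its weaker fit.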
# Correlation Heatmap
corr = zomato_en.corr(method='kendall')
plt.figure(figsize=(15, 8))
sns.heatmap(corr, annot=True)
plt.show()
o/p-
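The heatmap uses Kendall's rank correlation, which compares every pair of rows and checks whether two columns move in the same direction. The sketch below is a simplified hand-written version that assumes no tied values; library implementations also correct for ties. The function name kendall_tau and the sample numbers are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau for two equal-length sequences, assuming no tied values."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1      # both columns move the same way
        elif direction < 0:
            discordant += 1      # the columns move in opposite ways
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented values: votes and rating for four restaurants
votes = [10, 50, 120, 300]
rate = [3.1, 3.9, 3.6, 4.4]
tau = kendall_tau(votes, rate)
print(round(tau, 3))
```

Five of the six pairs agree in direction and one disagrees, so tau is (5 - 1) / 6, roughly 0.667.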
# Define features and target
x = zomato_en[['online_order', 'book_table', 'votes', 'location', 'rest_type', 'type', 'cost']]
y = zomato_en['rate']
# Train and evaluate models
def evaluate_model(model, x_train, x_test, y_train, y_test):
    # Fit on the training split and score on the held-out split
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return r2_score(y_test, y_pred)
o/p-
16950 3.9
767 3.7
6750 4.0
9471 3.8
25162 3.7
Name: rate, dtype: float64
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1,
random_state=353)
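With test_size=0.1, one tenth of the rows is held out for testing, and a fixed random_state makes the split repeatable. The short check below uses synthetic arrays (X_demo and y_demo are made-up names, not project data) to confirm both properties.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 feature columns
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=353)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same test rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=353)
print(np.array_equal(X_te, X_te2))
```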
# Linear Regression
reg=LinearRegression()
reg.fit(x_train,y_train)
y_pred=reg.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
o/p- 0.2736233722103949
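The R-squared score compares the model's squared error with the squared error of always predicting the mean rating. A score near 0.27 therefore means the linear model explains roughly 27% of the variance in the test ratings. The hand-written function below is a sketch of that definition with made-up numbers; the name r2 is hypothetical and the project itself uses r2_score from scikit-learn.

```python
def r2(y_true, y_pred):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


ratings = [3.0, 4.0, 5.0]

# Perfect predictions leave no residual error
print(r2(ratings, [3.0, 4.0, 5.0]))

# Always predicting the mean does no better than the baseline
print(r2(ratings, [4.0, 4.0, 4.0]))
```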
# Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.1,random_state=105)
DTree=DecisionTreeRegressor(min_samples_leaf=.0001)
DTree.fit(x_train,y_train)
y_predict=DTree.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_predict)
o/p-0.857606513716891
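When min_samples_leaf is given as a float, scikit-learn treats it as a fraction of the training rows and rounds up, so 0.0001 means each leaf must hold at least ceil(0.0001 * n_samples) rows. The arithmetic is sketched below; the helper name min_leaf_size and the row counts are assumptions for illustration, since the report does not state the training set size after cleaning.

```python
import math


def min_leaf_size(fraction: float, n_samples: int) -> int:
    """Smallest leaf size implied by a fractional min_samples_leaf."""
    return math.ceil(fraction * n_samples)


# Hypothetical training set sizes
print(min_leaf_size(0.0001, 35_000))
print(min_leaf_size(0.0001, 1_000))
```

On a training set of a few tens of thousands of rows this keeps leaves very small, so the tree can fit fine detail in the ratings.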
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
RForest=RandomForestRegressor(n_estimators=500,random_state=329,min_samples_leaf=.0001)
RForest.fit(x_train,y_train)
y_predict=RForest.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_predict)
o/p-0.8774282743423502
# Extra Trees Regression
from sklearn.ensemble import ExtraTreesRegressor
ETree=ExtraTreesRegressor(n_estimators = 100)
ETree.fit(x_train,y_train)
y_predict=ETree.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_predict)
o/p-0.9402632984892167
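In the listings above, Linear Regression is scored on a split made with random_state=353, while the tree models use a split made with random_state=105. One way to compare the four models like for like is to fit them all on a single shared split. The sketch below shows that pattern on synthetic data from make_regression; the data, the estimator settings, and the resulting scores are illustrative assumptions and say nothing about the Zomato results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded features and the rating target
X_syn, y_syn = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.1, random_state=105)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'Extra Trees': ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Fit every model on the same training rows and score on the same test rows
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because the synthetic target is linear by construction, the ranking here will differ from the one in this report; only the shared-split pattern is the point.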
# Visualization: Location
sns.countplot(x=zomato['city'])
# Rotate the city names so they do not overlap
plt.xticks(rotation=90, ha="right")
fig = plt.gcf()
fig.set_size_inches(13,13)
plt.title('Location')
o/p-
# Visualization: Location and Rating
loc_plt=pd.crosstab(zomato['rate'],zomato['city'])
loc_plt.plot(kind='bar',stacked=True);
plt.title('Location - Rating',fontsize=15,fontweight='bold')
plt.ylabel('Location',fontsize=10,fontweight='bold')
plt.xlabel('Rating',fontsize=10,fontweight='bold')
plt.xticks(fontsize=10,fontweight='bold')
plt.yticks(fontsize=10,fontweight='bold');
plt.legend().remove();
o/p-
# Visualization: Restaurant Type
sns.countplot(x=zomato['rest_type'])
# Rotate the restaurant type names so they do not overlap
plt.xticks(rotation=90, ha="right")
fig = plt.gcf()
fig.set_size_inches(15,15)
plt.title('Restaurant Type')
o/p-
# Visualization: Types of Services
sns.countplot(x=zomato['type'])
# Rotate the service type names so they do not overlap
plt.xticks(rotation=90, ha="right")
fig = plt.gcf()
fig.set_size_inches(15,15)
plt.title('Type of Service')
o/p-
# Visualization: Type and Rating
type_plt=pd.crosstab(zomato['rate'],zomato['type'])
type_plt.plot(kind='bar',stacked=True);
plt.title('Type - Rating',fontsize=15,fontweight='bold')
plt.ylabel('Type',fontsize=10,fontweight='bold')
plt.xlabel('Rating',fontsize=10,fontweight='bold')
plt.xticks(fontsize=10,fontweight='bold')
plt.yticks(fontsize=10,fontweight='bold');
o/p-
# Visualization: Most Famous Restaurant Chains
# Uses the raw data already loaded in zomato_orgnl, so the cleaned zomato frame is not overwritten
plt.figure(figsize=(15,7))
chains=zomato_orgnl['name'].value_counts()[:20]
sns.barplot(x=chains,y=chains.index,palette='Set1')
plt.title("Most famous restaurant chains in Bengaluru",size=20,pad=20)
plt.xlabel("Number of outlets",size=15)
o/p-
# Visualization: Table Booking vs Rate
plt.rcParams['figure.figsize'] = (13, 9)
Y = pd.crosstab(zomato['rate'], zomato['book_table'])
# Divide each row by its total so the bars show shares rather than counts
Y.div(Y.sum(axis=1).astype(float), axis = 0).plot(kind = 'bar', stacked = True,color=['red','yellow'])
plt.title('table booking vs rate', fontweight = 30, fontsize = 20)
plt.legend(loc="upper right")
plt.show()
o/p-
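The table booking plot divides each row of the crosstab by its row total, so every bar shows the share of restaurants at that rating that do or do not offer table booking. The toy frame below, with invented values, sketches that normalization step.

```python
import pandas as pd

# Invented ratings and booking flags for six restaurants
demo = pd.DataFrame({
    'rate': [3.5, 3.5, 4.0, 4.0, 4.0, 4.5],
    'book_table': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
})

# Count restaurants per (rating, booking flag) pair
counts = pd.crosstab(demo['rate'], demo['book_table'])

# Divide each row by its total so every row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

In this toy data, two of the three restaurants rated 4.0 accept bookings, so that row's 'Yes' share is two thirds.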
Conclusion
The code cleans and prepares the Zomato dataset by removing unnecessary columns, handling duplicates and
missing values, and converting data types for analysis. Through exploratory data analysis, it visualizes
correlations and the distributions of features such as location, restaurant type, and service type. The
selected features (online_order, book_table, votes, location, rest_type, type, and cost) are encoded and
used to train four regression models: Linear Regression, Decision Tree, Random Forest, and Extra Trees,
each evaluated using the R-squared score. On the test splits, the scores were about 0.27 for Linear
Regression, 0.86 for Decision Tree, 0.88 for Random Forest, and 0.94 for Extra Trees, so the tree-based
models fit the ratings far better than the linear model. The visualizations provide insight into the
distribution of restaurants by location and by type of service.