MLLABDSA

ml lab assignment

Uploaded by

mahesh bochare

VISVESVARAYA TECHNOLOGICAL UNIVERSITY,

Jnana Sangama, Belgaum-590018

A PROJECT REPORT ON

“ANALYZING SPOTIFY STREAMING DATA”

An Activity Report submitted in partial fulfillment of the requirements for the 6th semester of
BACHELOR OF ENGINEERING (B.E)
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING ENGINEERING
SUBMITTED BY
Manjunath (3GN21AI016)
Amul (3GN21AI004)
Sai Kumar(3GN21AI045)

UNDER THE GUIDANCE OF


Prof. JASMINEET KAUR ARORA
DR. HARISH JOSHI

GURU NANAK DEV ENGINEERING COLLEGE, BIDAR


MAILOOR ROAD, BIDAR, KARNATAKA-585403
CHAPTER 1

PROBLEM STATEMENT

Analyze and predict restaurant ratings and popularity on Zomato using user reviews, restaurant attributes,
and location data to provide insights for diners seeking quality dining experiences and to enhance service
and reputation management strategies.

STEPS TO BE FOLLOWED

Exploratory Data Analysis

• Installing Libraries and Modules
• Loading Data
• Data Inspection
• Understanding Variables
• Data Wrangling
• Feature Engineering

Data Visualization

• Histograms
• Scatter Plots
• Pair Plots

Hypothesis Testing

• Data Cleaning and Preparation
• Exploratory Data Analysis (EDA)
• Feature Selection
• Model Training and Evaluation
• Visualization

Machine Learning Models

• Random Forests
• Extra Trees Regression
• Decision Tree
• Linear Regression

IMPORT LIBRARIES AND MODULES

The code in this report relies on the following Python libraries for data analysis and machine learning.
Each part is described below:

1. Importing Libraries:

• import pandas as pd: Imports the Pandas library for data manipulation and analysis.
• import numpy as np: Imports the NumPy library for numerical operations.
• import matplotlib.pyplot as plt: Imports Matplotlib's pyplot module for creating visualizations.
• import seaborn as sns: Imports the Seaborn library for statistical data visualization.
• sklearn: Scikit-learn is a Python library that provides many supervised and unsupervised
learning algorithms.

2. Scikit-learn Modules:

• from sklearn.linear_model import LinearRegression, LogisticRegression: Linear and logistic regression models.
• from sklearn.model_selection import train_test_split: Splits the data into training and test sets.
• from sklearn.metrics import r2_score, classification_report, confusion_matrix: Evaluation metrics; the
models in this report are scored with r2_score.
• from sklearn.tree import DecisionTreeRegressor: Decision tree regression model.
• from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor: Ensemble regression models
built from many decision trees.


Purpose:

The primary purpose of the code is to analyze and predict restaurant ratings from the features of a Zomato
dataset. The process involves data cleaning, exploration, transformation, and the training of machine
learning models to understand the relationships between the features and the restaurant ratings.

LET'S BEGIN!!!

# Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Reading the dataset

zomato_orgnl = pd.read_csv("zomato.csv")

# Data Cleaning

def clean_data(df):
    # Drop columns that are not used in the analysis
    df = df.drop(['url', 'dish_liked', 'phone'], axis=1)
    df = df.drop_duplicates()
    df = df.dropna(how='any')
    df = df.rename(columns={'approx_cost(for two people)': 'cost',
                            'listed_in(type)': 'type',
                            'listed_in(city)': 'city'})
    # Replace ',' with '.' and convert to float ("1,200" becomes 1.2)
    df['cost'] = df['cost'].astype(str).apply(lambda x: x.replace(',', '.')).astype(float)
    # Remove rows that have no numeric rating
    df = df.loc[df.rate != 'NEW']
    df = df.loc[df.rate != '-'].reset_index(drop=True)
    # Strip the "/5" suffix and convert the rating to float
    df['rate'] = df['rate'].apply(lambda x: x.replace('/5', '') if type(x) == str else x).str.strip().astype('float')
    df['name'] = df['name'].apply(lambda x: x.title())
    # Convert the Yes/No flags to booleans
    df['online_order'] = df['online_order'].map({'Yes': True, 'No': False})
    df['book_table'] = df['book_table'].map({'Yes': True, 'No': False})
    return df

zomato = clean_data(zomato_orgnl.copy())

o/p-
<class 'pandas.core.frame.DataFrame'>
Index: 43499 entries, 0 to 51716
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype
 0   address                      43499 non-null   object
 1   name                         43499 non-null   object
 2   online_order                 43499 non-null   object
 3   book_table                   43499 non-null   object
 4   rate                         43499 non-null   object
 5   votes                        43499 non-null   int64
 6   location                     43499 non-null   object
 7   rest_type                    43499 non-null   object
 8   cuisines                     43499 non-null   object
 9   approx_cost(for two people)  43499 non-null   object
 10  reviews_list                 43499 non-null   object
 11  menu_item                    43499 non-null   object
 12  listed_in(type)              43499 non-null   object
 13  listed_in(city)              43499 non-null   object
dtypes: int64(1), object(13)
memory usage: 5.0+ MB
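
To make the effect of the cleaning steps on individual values easier to follow, here is a minimal sketch that applies the same string handling to a small hand-made DataFrame. The rows are invented for illustration and are not taken from zomato.csv; only the column names follow the dataset, and `raw` and `demo` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows that mimic the raw format of the dataset (not real data)
raw = pd.DataFrame({
    'rate': ['4.1/5', 'NEW', '3.8 /5', '-'],
    'approx_cost(for two people)': ['800', '300', '1,200', '500'],
    'online_order': ['Yes', 'No', 'No', 'Yes'],
})

demo = raw.rename(columns={'approx_cost(for two people)': 'cost'})
# Keep only rows that carry a numeric rating
demo = demo.loc[~demo['rate'].isin(['NEW', '-'])].reset_index(drop=True)
# "4.1/5" -> 4.1
demo['rate'] = demo['rate'].str.replace('/5', '', regex=False).str.strip().astype(float)
# Same comma handling as clean_data: "1,200" -> 1.2
demo['cost'] = demo['cost'].str.replace(',', '.', regex=False).astype(float)
# Yes/No -> True/False
demo['online_order'] = demo['online_order'].map({'Yes': True, 'No': False})

print(demo)
```

With this comma handling, a cost written as 1,200 ends up as 1.2 while 800 stays 800.0. Replacing ',' with an empty string instead would keep both values on the same scale.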

# Encode categorical variables

def encode_data(df):
    for column in df.columns[~df.columns.isin(['rate', 'cost', 'votes'])]:
        df[column] = df[column].factorize()[0]
    return df

zomato_en = encode_data(zomato.copy())

o/p-
Index(['address', 'name', 'online_order', 'book_table', 'rate',
'votes', 'location', 'rest_type', 'cuisines', 'cost',
'reviews_list', 'menu_item', 'type', 'city'],
dtype='object')
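
The encoding step relies on factorize(), which replaces each distinct value with an integer code. A small sketch of that behavior follows; the city names are made up for illustration and `cities` is a hypothetical variable.

```python
import pandas as pd

# Hypothetical categorical column (values invented for illustration)
cities = pd.Series(['BTM', 'Indiranagar', 'BTM', 'Whitefield', 'Indiranagar'])

# codes: one integer per row; uniques: the distinct values in order of first appearance
codes, uniques = cities.factorize()
print(codes)
print(list(uniques))
```

The codes follow the order in which values first appear, so they carry no natural ranking. Tree-based models may cope with such arbitrary codes better than a linear model does.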

# Correlation Heatmap

corr = zomato_en.corr(method='kendall')
plt.figure(figsize=(15, 8))
sns.heatmap(corr, annot=True)
plt.show()
o/p-
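
The heatmap uses Kendall's rank correlation, which measures how consistently two columns are ordered instead of how linearly they are related. The sketch below shows this on toy data (the `toy` frame and its values are assumptions made for illustration). pandas relies on SciPy for this method, so SciPy is assumed to be installed.

```python
import pandas as pd

# Hypothetical toy data: b always rises with a, c always falls with a
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 7, 11, 20],
    'c': [9, 7, 6, 2, 1],
})

# Kendall's tau compares the ordering of value pairs, not their magnitudes
tau = toy.corr(method='kendall')
print(tau)
```

Although b does not grow linearly with a, every pair of rows is ordered the same way in both columns, so their tau is 1; a and c are ordered in opposite ways, so their tau is -1.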
# Define features and target
x = zomato_en[['online_order', 'book_table', 'votes', 'location', 'rest_type', 'type', 'cost']]
y = zomato_en['rate']

# Train and evaluate models


def evaluate_model(model, x_train, x_test, y_train, y_test):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return r2_score(y_test, y_pred)

o/p-
16950 3.9
767 3.7
6750 4.0
9471 3.8
25162 3.7
Name: rate, dtype: float64

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=353)

# Linear Regression

reg = LinearRegression()
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

o/p- 0.2736233722103949
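
Every model in this report is scored with R-squared, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The following sketch computes it by hand with NumPy; `r_squared` is a hypothetical helper and the numbers are invented, not model output from this report.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical ratings and predictions
score = r_squared([3.0, 4.0, 5.0], [2.5, 4.0, 5.5])
print(score)  # 0.75
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean rating.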

# Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.1, random_state=105)
DTree = DecisionTreeRegressor(min_samples_leaf=.0001)
DTree.fit(x_train, y_train)
y_predict = DTree.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)

o/p-0.857606513716891
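
The tree is built with min_samples_leaf=.0001. According to the scikit-learn documentation, a float value is read as a fraction of the training rows, and the minimum leaf size becomes the ceiling of that fraction times the number of samples. The sketch below only reproduces that arithmetic; `leaf_size_from_fraction` is a hypothetical helper and the row counts are assumptions, not the sizes used in this report.

```python
import math

def leaf_size_from_fraction(fraction, n_samples):
    """Minimum samples per leaf when min_samples_leaf is given as a fraction."""
    return math.ceil(fraction * n_samples)

# Hypothetical training-set sizes, chosen only for illustration
for n in (5_000, 35_000):
    print(n, leaf_size_from_fraction(0.0001, n))
```

With such a small fraction the leaves stay very small, so the tree can fit the training data closely.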

# Random Forest Regression

from sklearn.ensemble import RandomForestRegressor

RForest = RandomForestRegressor(n_estimators=500, random_state=329, min_samples_leaf=.0001)
RForest.fit(x_train, y_train)
y_predict = RForest.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)

o/p-0.8774282743423502

# Extra Trees Regression

from sklearn.ensemble import ExtraTreesRegressor

ETree = ExtraTreesRegressor(n_estimators=100)
ETree.fit(x_train, y_train)
y_predict = ETree.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
o/p-0.9402632984892167
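
In the listings above, Linear Regression is scored on a split made with random_state=353, while the tree-based models are scored on a split made with random_state=105, so the four scores do not come from the same test rows. One possible way to compare all models on a single split is to reuse the evaluate_model helper in a loop, as sketched below. The data here is synthetic (generated with make_regression as a stand-in for the encoded features), and the hyperparameters are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def evaluate_model(model, x_train, x_test, y_train, y_test):
    """Fit the model and return its R-squared score on the test split."""
    model.fit(x_train, y_train)
    return r2_score(y_test, model.predict(x_test))

# Synthetic stand-in for the encoded features and ratings (assumption)
x, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=353)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'Extra Trees': ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Every model sees exactly the same training and test rows
scores = {name: evaluate_model(m, x_train, x_test, y_train, y_test)
          for name, m in models.items()}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

The scores printed by this sketch describe the synthetic data only and say nothing about the restaurant dataset.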

# Visualization: Location

# Draw the count plot once and rotate the city labels so they stay readable
sns.countplot(x=zomato['city'])
plt.xticks(rotation=90, ha="right")
fig = plt.gcf()
fig.set_size_inches(13, 13)
plt.title('Location')
o/p-
# Visualization: Location and Rating

loc_plt = pd.crosstab(zomato['rate'], zomato['city'])
loc_plt.plot(kind='bar', stacked=True);
plt.title('Location - Rating', fontsize=15, fontweight='bold')
plt.ylabel('Location', fontsize=10, fontweight='bold')
plt.xlabel('Rating', fontsize=10, fontweight='bold')
plt.xticks(fontsize=10, fontweight='bold')
plt.yticks(fontsize=10, fontweight='bold');
plt.legend().remove();

o/p-
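
The stacked bar chart is built from a cross-tabulation, which counts how often each pair of values occurs together. A minimal sketch on invented data (the `toy` frame is hypothetical) shows the table that plot() receives:

```python
import pandas as pd

# Hypothetical ratings and cities (invented for illustration)
toy = pd.DataFrame({
    'rate': [3.5, 3.5, 4.0, 4.0, 4.0],
    'city': ['BTM', 'HSR', 'BTM', 'BTM', 'HSR'],
})

# Rows are ratings, columns are cities, cells are counts
counts = pd.crosstab(toy['rate'], toy['city'])
print(counts)
```

Each row of the table becomes one bar and each column becomes one stacked segment, so the height of a bar is the number of restaurants with that rating.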
# Visualization: Restaurant Type

sns.countplot(x=zomato['rest_type'])
plt.xticks(rotation=90, ha="right")
fig = plt.gcf()
fig.set_size_inches(15, 15)
plt.title('Restaurant Type')

o/p-
# Visualization: Types of Services

sns.countplot(x=zomato['type'])
plt.xticks(rotation=90, ha="right")
fig = plt.gcf()
fig.set_size_inches(15, 15)
plt.title('Type of Service')

o/p-
# Visualization: Type and Rating

type_plt = pd.crosstab(zomato['rate'], zomato['type'])
type_plt.plot(kind='bar', stacked=True);
plt.title('Type - Rating', fontsize=15, fontweight='bold')
plt.ylabel('Type', fontsize=10, fontweight='bold')
plt.xlabel('Rating', fontsize=10, fontweight='bold')
plt.xticks(fontsize=10, fontweight='bold')
plt.yticks(fontsize=10, fontweight='bold');

o/p-
# Visualization: Most famous restaurant chains

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

zomato = pd.read_csv('zomato.csv')
plt.figure(figsize=(15, 7))
# Count outlets per restaurant name and keep the 20 most frequent
chains = zomato['name'].value_counts()[:20]
sns.barplot(x=chains, y=chains.index, palette='Set1')
plt.title("Most famous restaurant chains in Bengaluru", size=20, pad=20)
plt.xlabel("Number of outlets", size=15)

o/p-

# Visualization: Table booking vs Rate

plt.rcParams['figure.figsize'] = (13, 9)
Y = pd.crosstab(zomato['rate'], zomato['book_table'])
Y.div(Y.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True, color=['red', 'yellow'])
plt.title('table booking vs rate', fontweight=30, fontsize=20)
plt.legend(loc="upper right")
plt.show()
o/p-
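
The table-booking chart divides each row of the cross-tabulation by its row total, so every bar shows shares instead of counts. The sketch below reproduces that step on invented data and compares it with the normalize='index' option of pd.crosstab, which should give the same result; the `toy` frame and its values are assumptions.

```python
import pandas as pd

# Hypothetical ratings and table-booking flags (invented for illustration)
toy = pd.DataFrame({
    'rate': [3.5, 3.5, 3.5, 4.5, 4.5],
    'book_table': ['No', 'No', 'Yes', 'Yes', 'Yes'],
})

counts = pd.crosstab(toy['rate'], toy['book_table'])
# Divide each row by its total so every bar sums to 1
shares = counts.div(counts.sum(axis=1).astype(float), axis=0)
# Shortcut built into crosstab that normalizes over each row
shares_builtin = pd.crosstab(toy['rate'], toy['book_table'], normalize='index')

print(shares)
```

Normalizing by row makes ratings with few restaurants comparable to ratings with many, since only the proportion of restaurants that offer table booking is shown.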
Conclusion
The code cleans and prepares the Zomato dataset by removing unnecessary columns, handling duplicates and
missing values, and converting data types for analysis. Through exploratory data analysis, it visualizes
correlations and the distributions of features such as location, restaurant type, and service type. The
selected features (online_order, book_table, votes, location, rest_type, type, and cost) are encoded and
used to train four regression models: Linear Regression, Decision Tree, Random Forest, and Extra Trees,
each evaluated using the R-squared score. Extra Trees scored highest (about 0.94), followed by Random
Forest (about 0.88), Decision Tree (about 0.86), and Linear Regression (about 0.27). The visualizations
provide insights into the distribution of restaurants by location and by type of service.
