Spotify Data Analysis Python Project 🎼🎧
Harsh veer / Data analyst
INTRODUCTION:
In today's changing world, data analysis has become crucial in fields such as
business, research, and meteorology.
This project illustrates that potential by extracting insights from
music-related datasets using Python. Its subject is Spotify, an audio streaming
giant with features such as seamless song sharing and synchronized lyrics
display.
From analysis to visualization, this project covers the main stages of data
processing. The interactive environment provided by Jupyter Notebook made it
easy to engage with the data and discover patterns.
Tools Used:
• Programming Language: Python
• Libraries: Pandas, NumPy, Matplotlib, Seaborn
• IDE: Jupyter Notebook
IMPORT REQUIRED LIBRARIES
• import numpy as np: This imports the NumPy library and aliases it as 'np'.
NumPy is used for numerical computations and provides support for arrays
and matrices.
• import pandas as pd: This imports the Pandas library and aliases it as 'pd'.
Pandas is used for data manipulation and analysis, providing data structures
like DataFrames for tabular data.
• import matplotlib.pyplot as plt: This imports the Pyplot module from the
Matplotlib library and aliases it as 'plt'. Matplotlib is a popular plotting library
in Python, and Pyplot provides a convenient interface to create
visualizations.
• import seaborn as sns: This imports the Seaborn library and aliases it as 'sns'.
Seaborn is built on top of Matplotlib and offers a higher-level interface for
creating attractive statistical visualizations.
EXPLORING THE DATASET
• sp_tracks = pd.read_csv('D:/spotifydata/tracks.csv')
• sp_feature = pd.read_csv('D:/spotifydata/SpotifyFeatures.csv')
• sp_tracks = pd.read_csv('D:/spotifydata/tracks.csv'): This line reads a CSV
file named 'tracks.csv' located at the 'D:/spotifydata/' directory and loads its
data into a Pandas DataFrame called sp_tracks. This DataFrame is likely to
contain information about tracks.
• sp_feature = pd.read_csv('D:/spotifydata/SpotifyFeatures.csv'): This line
reads another CSV file named 'SpotifyFeatures.csv' from the same directory
and loads its data into a separate Pandas DataFrame called sp_feature.
This DataFrame probably contains additional features or attributes related to
the Spotify tracks.
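As a minimal sketch of the same loading step, the snippet below reads a tiny in-memory CSV instead of the real files. The column names and rows are made up for illustration and are only assumed to resemble part of tracks.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for tracks.csv; the real file has more rows and columns.
csv_text = """name,popularity,duration_ms,release_date
Song A,12,126903,1922-02-22
Song B,95,98200,2021-01-08
Song C,0,181640,1922-06-01
"""

# pd.read_csv accepts a file path or any file-like object.
demo_tracks = pd.read_csv(io.StringIO(csv_text))

print(demo_tracks.shape)   # (number of rows, number of columns)
print(demo_tracks.dtypes)  # column types inferred by pandas
```

With the real data, the io.StringIO object would simply be replaced by the path to the CSV file.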
#viewing the tracks data
sp_tracks.head()
• NOTE: The screenshot shows only part of the table, because the full
output does not fit in a single capture. For the complete table, please
refer to the Jupyter notebook folder within this repository.
• sp_tracks.head(): This line of code calls the head() method on the
'sp_tracks' DataFrame. The head() method is used to display the first few
rows of the DataFrame. This is useful for quickly getting an overview of the
data.
#viewing the feature data
• sp_feature.head()
• #viewing the feature data: This is a comment that indicates the following line
of code is intended to display or view the data in the 'sp_feature'
DataFrame.
• sp_feature.head(): This line of code calls the head() method on the
'sp_feature' DataFrame. The head() method is used to display the first few
rows of the DataFrame. This allows you to quickly inspect the initial records
and get a sense of the data.
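A short sketch of how head() behaves, using a small made-up DataFrame (the data is hypothetical). By default head() returns five rows; passing a number changes that, and tail() is the counterpart for the last rows.

```python
import pandas as pd

# Hypothetical data: eight tracks with made-up popularity scores.
demo = pd.DataFrame({
    "name": [f"Song {i}" for i in range(8)],
    "popularity": [5, 40, 91, 0, 77, 63, 12, 88],
})

first_five = demo.head()     # default: first 5 rows
first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows

print(first_five)
```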
IDENTIFYING NULL VALUES IN THE DATASET
#checking null in tracks data
• pd.isnull(sp_tracks).sum()
pd.isnull(sp_tracks).sum(): This line of code
uses the pd.isnull() function on the 'sp_tracks'
DataFrame to create a boolean DataFrame
where each cell contains True if the
corresponding cell in the original DataFrame is
null and False otherwise. The .sum() function is
then used to count the number of True values in
each column, effectively giving you the count of
missing values in each column.
The same check was applied to the feature data (sp_feature).
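The counting step can be tried on a small made-up DataFrame with a few deliberately missing values. The column names here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing popularity value.
demo = pd.DataFrame({
    "name": ["Song A", None, "Song C", "Song D"],
    "popularity": [12, 95, np.nan, 40],
    "energy": [0.4, 0.9, 0.7, 0.2],
})

# isnull() marks missing cells as True; sum() counts the True values per column.
null_counts = pd.isnull(demo).sum()

print(null_counts)
```

The method form demo.isnull().sum() gives the same result as the function form used above.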
Dataset Overview: Rows, Columns, Data Types, and Memory Usage
#checking info in tracks data
• sp_tracks.info()
• sp_tracks.info(): This line of code calls
the info() method on the 'sp_tracks'
DataFrame. The info() method provides a
concise summary of the DataFrame,
including the data types of each column, the
number of non-null values, and memory
usage. It's a useful way to get a quick
overview of the data and its structure.
• The same was done for the feature data.
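A small sketch of info() on made-up data. Note that info() prints its summary and returns None; to keep the text, it can be written to a buffer through the buf parameter, as assumed below.

```python
import io

import pandas as pd

# Hypothetical three-row DataFrame.
demo = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "popularity": [12, 95, 0],
    "energy": [0.4, 0.9, 0.7],
})

demo.info()  # prints the summary to standard output

# Capture the same summary as a string instead of printing it.
buf = io.StringIO()
demo.info(buf=buf)
summary = buf.getvalue()
```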
Extracting Insights from the Dataset through Analysis📊
1. Exploring the 10 Least Popular Songs in the Spotify Dataset
a = sp_tracks.sort_values('popularity', ascending=True)[0:10]
a[['name', 'popularity']]
a = sp_tracks.sort_values('popularity',
ascending=True)[0:10]: This line of code
creates a new DataFrame a by sorting the
'sp_tracks' DataFrame based on the
'popularity' column in ascending order.
The [0:10] notation selects the first 10 rows
of the sorted DataFrame, effectively selecting
the 10 least popular tracks.
a[['name', 'popularity']]: This line of code
selects specific columns, namely 'name' and
'popularity', from the DataFrame a created in
the previous line. This will show the names of
the 10 least popular tracks along with their
corresponding popularity values.
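The same idea on a small made-up DataFrame, selecting the three least popular rows instead of ten. nsmallest() is shown as an alternative that is assumed to be acceptable here; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical data: six tracks with made-up popularity scores.
demo = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"],
    "popularity": [30, 0, 75, 5, 91, 12],
})

# Sort ascending and keep the first three rows.
least = demo.sort_values("popularity", ascending=True)[0:3]

# Alternative: nsmallest() selects the lowest values without a full sort.
least_alt = demo.nsmallest(3, "popularity")

print(least[["name", "popularity"]])
```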
2. Discovering the Top 10 Popular Songs in the Spotify Dataset
a = sp_tracks
b = a[a['popularity'] > 90].sort_values('popularity', ascending=False)[:10]
b[['name', 'popularity', 'artists']]
a = sp_tracks: This line of code binds the name a to the 'sp_tracks'
DataFrame. No copy is made, so a and sp_tracks refer to the same object.
b = a[a['popularity'] > 90].sort_values('popularity',
ascending=False)[:10]: This line of code creates a new
DataFrame b by selecting rows from the DataFrame a where
the 'popularity' column is greater than 90. The result is
then sorted in descending order based on the 'popularity'
column, and the first 10 rows are selected. This gives the 10
most popular tracks, provided at least 10 tracks score above
90; otherwise fewer rows are returned.
b[['name', 'popularity', 'artists']]: This line of code selects
specific columns ('name', 'popularity', and 'artists') from the
DataFrame b created in the previous line. This will display
the names, popularity values, and artist information of the
top 10 most popular tracks.
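A sketch on made-up data showing how the threshold filter interacts with the row limit. Here only two tracks score above 90, so asking for three rows returns two. nlargest() is shown as an assumed alternative that does not depend on a threshold.

```python
import pandas as pd

# Hypothetical data: eight tracks, only two with popularity above 90.
demo = pd.DataFrame({
    "name": [f"Song {c}" for c in "ABCDEFGH"],
    "popularity": [30, 0, 75, 5, 91, 12, 96, 88],
})

# Filter, sort descending, then keep at most three rows.
top_filtered = demo[demo["popularity"] > 90].sort_values(
    "popularity", ascending=False
)[:3]

# Alternative without a threshold: always returns three rows when available.
top_three = demo.nlargest(3, "popularity")

print(top_filtered[["name", "popularity"]])
print(top_three[["name", "popularity"]])
```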
3. Setting Release Date as the Index Column
sp_tracks.set_index('release_date',inplace=True)
sp_tracks.index=pd.to_datetime(sp_tracks.index)
sp_tracks.head()
• sp_tracks.set_index('release_date', inplace=True): This line of code sets the 'release_date' column as
the index of the 'sp_tracks' DataFrame. The inplace=True argument modifies the DataFrame in place,
meaning the change is applied directly to the original DataFrame.
• sp_tracks.index = pd.to_datetime(sp_tracks.index): This line of code converts the index of the
'sp_tracks' DataFrame to a datetime format using the pd.to_datetime() function. This is often done to
ensure that the index represents dates in a meaningful way, allowing for time-based operations.
• sp_tracks.head(): This line of code calls the head() method on the 'sp_tracks' DataFrame, which will
display the first few rows of the DataFrame with the updated index.
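A minimal sketch of the indexing step on made-up data. The dates below all use the same YYYY-MM-DD format, which is an assumption; the real column may need extra handling if its formats vary. Once the index is a DatetimeIndex, attributes such as .year become available.

```python
import pandas as pd

# Hypothetical data with release dates stored as strings.
demo = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "release_date": ["1922-02-22", "2021-01-08", "1999-11-30"],
})

demo.set_index("release_date", inplace=True)
demo.index = pd.to_datetime(demo.index)  # string index -> DatetimeIndex

# Time-based operations now work on the index.
release_years = demo.index.year
from_2021 = demo[demo.index.year == 2021]

print(demo.head())
```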
4. Converting Song Duration from Milliseconds to Seconds
sp_tracks['duration'] = sp_tracks['duration_ms'].apply(lambda x: round(x/1000))
sp_tracks.drop('duration_ms', inplace=True, axis=1)
sp_tracks.duration.head()
sp_tracks['duration'] = sp_tracks['duration_ms'].apply(lambda x: round(x/1000)): This line of code creates a new column
called 'duration' in the 'sp_tracks' DataFrame. It calculates the duration in seconds by applying a lambda function to the
'duration_ms' column. The lambda function divides the 'duration_ms' value by 1000 and rounds it to get the duration in seconds.
sp_tracks.drop('duration_ms', inplace=True, axis=1): This line of code removes the original 'duration_ms' column from the
'sp_tracks' DataFrame. The inplace=True argument makes the change directly to the DataFrame.
sp_tracks.duration.head(): This line of code displays the first few values from the newly created 'duration' column in the 'sp_tracks'
DataFrame.
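The conversion can also be written without apply(), using column arithmetic. The sketch below uses made-up durations and compares both forms; the vectorized version is an alternative, not the notebook's own code.

```python
import pandas as pd

# Hypothetical durations in milliseconds.
demo = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "duration_ms": [126903, 98200, 181640],
})

# Form used in the project: a Python function applied element by element.
via_apply = demo["duration_ms"].apply(lambda x: round(x / 1000))

# Vectorized alternative: divide the whole column, round, convert to int.
demo["duration"] = (demo["duration_ms"] / 1000).round().astype(int)

# Drop the original column, returning a new DataFrame instead of using inplace.
demo = demo.drop(columns="duration_ms")

print(demo["duration"].head())
```

On large columns the vectorized form usually runs faster, because the arithmetic is done on the whole column at once.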
Visualization: Pearson Correlation Heatmap for Two Variables
td = sp_tracks.drop(['key', 'mode', 'explicit'], axis=1).corr(method='pearson')
plt.figure(figsize=(9, 5))
hmap = sns.heatmap(td, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0,
cmap='Greens', linewidths=0.1, linecolor='black')
hmap.set_title('Correlation HeatMap')
hmap.set_xticklabels(hmap.get_xticklabels(), rotation=90)
• td = sp_tracks.drop(['key', 'mode', 'explicit'], axis=1).corr(method='pearson'): This line
of code creates a correlation matrix by calculating Pearson correlation coefficients
between numeric columns in the 'sp_tracks' DataFrame. It drops the columns 'key',
'mode', and 'explicit' before calculating the correlations. Text columns such as 'name'
cannot be correlated; pandas 2.0 and later no longer skip them silently, so corr() may
need numeric_only=True there.
• plt.figure(figsize=(9, 5)): This line of code sets the figure size for the upcoming
heatmap visualization using Matplotlib.
• hmap = sns.heatmap(td, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0,
cmap='Greens', linewidths=0.1, linecolor='black'): This line of code uses
Seaborn's heatmap() function to create a heatmap visualization of the
correlation matrix. It displays the correlation values as annotations
formatted to one significant digit (fmt='.1g'), uses a color map ('Greens')
to represent the correlation strength, and fixes the color scale to the
range -1 to 1, centered at 0.
• hmap.set_title('Correlation HeatMap'): This line of code sets the title for the heatmap
visualization.
• hmap.set_xticklabels(hmap.get_xticklabels(), rotation=90): This line of
code rotates the x-axis labels of the heatmap for better readability.
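To make the numbers in the heatmap concrete, the sketch below builds a correlation matrix from made-up data and checks one entry against the Pearson formula, r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)). The column values are hypothetical, and select_dtypes() is used here as one assumed way to keep only numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical audio features for five tracks, plus one text column.
demo = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "energy": [0.2, 0.4, 0.5, 0.7, 0.9],
    "loudness": [-20.0, -14.0, -11.0, -8.0, -5.0],
    "acousticness": [0.9, 0.7, 0.6, 0.3, 0.1],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
td = demo.select_dtypes("number").corr(method="pearson")

# Recompute one coefficient by hand from the formula.
x, y = demo["energy"], demo["loudness"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(td)
print(r_manual)
```

The matrix td is what would be passed to sns.heatmap(); each cell is the coefficient for one pair of columns, and the diagonal is always 1.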
Regression Plot of Loudness vs. Energy with Regression Line
plt.figure(figsize=(8,4))
sns.regplot(data=sample_sp, y='loudness', x='energy',
color='#054907').set(title='Regression Plot - Loudness vs Energy Correlation')
• plt.figure(figsize=(8, 4)): This line of code sets the figure size for the
upcoming visualization using Matplotlib.
• sns.regplot(data=sample_sp, y='loudness', x='energy', color='#054907'):
This line of code uses Seaborn's regplot() function to create a regression
plot: a scatter plot of the 'energy' and 'loudness' columns with a fitted
regression line drawn over it. The sample_sp DataFrame is not defined in
this write-up; it is presumably a subset of the track data prepared in the
notebook. The color='#054907' argument sets the color of the plot.
• .set(title='Regression Plot - Loudness vs Energy Correlation'): This line of
code sets the title for the regression plot.
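By default regplot() draws a straight-line least-squares fit. The sketch below computes such a line directly with NumPy on synthetic data, to show what the plotted line represents. Everything here is an assumption for illustration: the data is randomly generated, and sample_sp is defined as a random 20% sample, which may differ from how the notebook builds it.

```python
import numpy as np
import pandas as pd

# Synthetic data: loudness rises roughly linearly with energy, plus noise.
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 500)
loudness = -25 + 20 * energy + rng.normal(0, 2, 500)
demo = pd.DataFrame({"energy": energy, "loudness": loudness})

# Hypothetical definition of sample_sp: a reproducible 20% random sample.
sample_sp = demo.sample(frac=0.2, random_state=0)

# Degree-1 least-squares fit: loudness ~ slope * energy + intercept.
slope, intercept = np.polyfit(
    sample_sp["energy"].to_numpy(), sample_sp["loudness"].to_numpy(), deg=1
)

print(f"loudness = {slope:.2f} * energy + {intercept:.2f}")
```

Plotting a sample rather than every row keeps the scatter readable when the dataset is large.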
Bar Plot: Duration of Songs Over Each Year
total_dr = sp_tracks.duration
fig_dims = (15, 5)
fig, ax = plt.subplots(figsize=fig_dims)
fig = sns.barplot(x=years, y=total_dr, ax=ax, errwidth=False).set(title='Years vs Duration')
plt.xticks(rotation=90)
• total_dr = sp_tracks.duration: This line of code creates a new
variable total_dr and assigns the values from the 'duration' column of the
'sp_tracks' DataFrame to it.
• fig_dims = (15, 5): This line of code sets the dimensions of the figure for the
upcoming visualization.
• fig, ax = plt.subplots(figsize=fig_dims): This line of code uses Matplotlib to create
a subplot figure with the specified dimensions. It returns two variables: fig (the
figure) and ax (the axis).
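• fig = sns.barplot(x=years, y=total_dr, ax=ax, errwidth=False): This line of code uses
Seaborn's barplot() function to create a bar plot. It plots 'years' on the x-axis
and 'total_dr' (duration) on the y-axis, using the provided axis ax. The years
variable is not defined in this write-up; it is presumably the release year of each
track, taken from the datetime index. By default barplot() shows the mean of the
y values for each x category, so each bar is the mean duration for that year.
The errwidth=False argument is treated as an error-bar line width of zero, which
effectively hides the error bars.
• .set(title='Years vs Duration'): This line of code sets the title for the bar plot.
• plt.xticks(rotation=90): This line of code rotates the x-axis tick labels for better
readability.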
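Since barplot() aggregates with the mean by default, the bar heights can be reproduced with a groupby. The sketch below uses made-up durations and assumes that years is the release year taken from a datetime index; that definition is not given in the write-up.

```python
import pandas as pd

# Hypothetical durations (seconds) indexed by release date.
demo = pd.DataFrame(
    {"duration": [120, 180, 200, 240, 210]},
    index=pd.to_datetime(
        ["1999-01-01", "1999-06-01", "2000-03-15", "2000-09-09", "2001-02-02"]
    ),
)

# Assumed definition: one year value per row, taken from the index.
years = demo.index.year

# Mean duration per year, which is what each bar would show.
mean_duration = demo["duration"].groupby(years).mean()

print(mean_duration)
```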
Bar Plot: Top Five Genres by Popularity
sns.set_style(style='darkgrid')
plt.figure(figsize=(8, 4))
Top = sp_feature.sort_values('popularity', ascending=False)[:10]
sns.barplot(y='genre', x='popularity', data=Top).set(title='Genres by Popularity-Top 5')
• sns.set_style(style='darkgrid'): This line of code sets the style of the Seaborn plots to
'darkgrid', which includes a dark grid in the background of the plot.
• plt.figure(figsize=(8, 4)): This line of code sets the figure size for the upcoming
visualization using Matplotlib.
• Top = sp_feature.sort_values('popularity', ascending=False)[:10]: This line of code
creates a new DataFrame Top by sorting the 'sp_feature' DataFrame in descending order
based on the 'popularity' column and selecting the top 10 rows. These rows are
individual tracks, so several of them may share a genre.
• sns.barplot(y='genre', x='popularity', data=Top).set(title='Genres by Popularity-Top 5'):
This line of code uses Seaborn's barplot() function to create a bar plot. It plots the 'genre'
on the y-axis and 'popularity' on the x-axis from the Top DataFrame. Only the genres that
occur among those 10 rows appear, and each bar shows the mean popularity of that genre's
rows. The .set() function sets the title for the plot.
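The genres found among the most popular rows are not necessarily the genres with the highest average popularity. The sketch below contrasts the two on made-up data; the groupby version is an alternative reading of "top genres", not the notebook's method.

```python
import pandas as pd

# Hypothetical tracks with genre labels and popularity scores.
demo = pd.DataFrame({
    "genre": ["Pop", "Pop", "Pop", "Rock", "Rock", "Jazz", "Jazz", "Rap", "Rap"],
    "popularity": [90, 60, 50, 80, 78, 40, 85, 70, 72],
})

# Approach in the project: take the most popular rows, then look at their genres.
top_rows = demo.sort_values("popularity", ascending=False)[:4]

# Alternative: average popularity per genre, then take the highest averages.
top_genres = demo.groupby("genre")["popularity"].mean().nlargest(2)

print(sorted(top_rows["genre"].unique()))
print(top_genres)
```

In this example the top rows contain Pop, Jazz, and Rock, while the two genres with the highest mean popularity are Rock and Rap.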
THANK YOU!