Question # 01
(a) (i)
To clean the "Release_date" field, we can use the following method:
Examine all of the values in the "Release_date" field and identify which are full dates, which
contain only a year, and which are missing.
After identifying these inconsistencies, we will remove the rows whose "Release_date" value is
missing.
If a value contains only the year, we will assume that the release date is the first day of the
first month of that year (1 January).
Any values that cannot be converted to a full date will be removed. Six rows from the
Spotify_database_Ireland file have been removed because their values have the form 1957-09,
where we cannot tell whether 09 is a month or a day.
After that, we will write the cleaned dates back to the "Release_date" field and spot-check a
sample of them for correctness.
Implementing this strategy gives a consistent and accurate "Release_date" field.
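The cleaning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name clean_release_date is hypothetical, full dates are assumed to be stored as YYYY-MM-DD (the real file may use another layout), and missing values are assumed to appear as empty strings or None.

```python
import re
from datetime import date, datetime
from typing import Optional


def clean_release_date(raw: Optional[str]) -> Optional[date]:
    """Return a cleaned date, or None when the row should be dropped."""
    if raw is None:
        return None  # missing value: drop the row
    text = raw.strip()
    if not text:
        return None  # empty string: treated as missing
    if re.fullmatch(r"\d{4}", text):
        # Year only: assume the first day of the first month.
        return date(int(text), 1, 1)
    try:
        # Assumed full-date layout; adjust the format string to the real file.
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        # Anything else (for example "1957-09") cannot be converted: drop it.
        return None


# Illustrative values only, not rows from the actual file.
raw_values = ["2019-05-17", "1984", "1957-09", "", None]
cleaned = [clean_release_date(v) for v in raw_values]
kept = [d for d in cleaned if d is not None]
```

Returning None for every value that cannot be converted keeps the "drop the row" decision in one place, so the rule for ambiguous values such as 1957-09 can be changed later without touching the rest of the pipeline.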
(ii)
We can employ the following strategy to deal with the problem of missing values in the fields
"anger," "anticipation," "disgust," "fear," "joy," "sadness," "surprise," "trust," "n words," and
"LDA Topic":
• The missing values occur in the same rows for every one of the aforementioned fields, so we
will delete those rows entirely.
• The "Released after 2017" field, however, has missing values in different rows from the other
fields, so we will fill in (impute) its missing values instead of deleting the rows.
• Imputation can be done in a number of ways, such as by estimating the missing values using
a predictive model or by using the mean or median of the available data.
• Because the values of this field are categorical (i.e., not numerical), we will use a
mode-based imputation strategy, which replaces each missing value with the most frequent value
in the available data.
We will be able to resolve the issue of missing values in the "anger," "anticipation," "disgust,"
"fear," "joy," "sadness," "surprise," "trust," "n words," "LDA Topic," and "Released after 2017"
fields by implementing this strategy.
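The two steps above (deleting rows that lack the sentiment fields, then mode imputation for the categorical field) could look roughly like the sketch below. The helper names are hypothetical, rows are assumed to be dictionaries with None marking a missing value, and the "Yes"/"No" labels are invented for illustration because the report does not state how the field is coded.

```python
from collections import Counter

# Field name as written in the report; the exact column header is an assumption.
CATEGORICAL_FIELD = "Released after 2017"


def drop_rows_missing(rows, fields):
    """Keep only rows where every listed field is present (not None)."""
    return [row for row in rows if all(row.get(f) is not None for f in fields)]


def impute_mode(rows, field):
    """Fill missing values of one categorical field with its most frequent value."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    if not observed:
        return list(rows)  # nothing to learn the mode from
    mode_value = Counter(observed).most_common(1)[0][0]
    return [row if row.get(field) is not None else {**row, field: mode_value}
            for row in rows]


# Illustrative rows only; a single sentiment field stands in for all ten.
sample = [
    {"anger": 0.10, CATEGORICAL_FIELD: "Yes"},
    {"anger": None, CATEGORICAL_FIELD: "Yes"},
    {"anger": 0.30, CATEGORICAL_FIELD: None},
    {"anger": 0.20, CATEGORICAL_FIELD: "No"},
    {"anger": 0.40, CATEGORICAL_FIELD: "Yes"},
]
complete = drop_rows_missing(sample, ["anger"])
filled = impute_mode(complete, CATEGORICAL_FIELD)
```

Computing the mode after the incomplete rows are dropped means the imputed value reflects only the rows that remain in the analysis.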
(b)
Table 1: Maximum, minimum, average, standard deviation, and standard error of Popularity by genre
GENRE             MAX        MIN   AVERAGE      STD DEV      STD ERROR
BOY BAND          67830.35   2.4   2421.112185  8059.897219  738.8495667
COUNTRY           3359.9     3.2   568.2384615  941.7547279  261.1957662
DANCE/ELECTRONIC  81863.8    1.6   5168.688589  12666.91456  815.9476947
ELSE              64696.8    1.6   2812.230882  7650.813274  1710.773856
FUNK              10184.9    4     699.2525     2246.565392  502.3472933
HIP HOP           87573.85   0.8   3767.849398  10033.16676  449.596953
HOUSE             60212.15   2.4   6372.150855  10946.22717  1011.97906
INDIE             51539.6    0.8   4716.740878  9399.304208  772.6180466
K-POP             135264.2   8     13466.55588  25668.16525  3594.26206
LATIN             8764.6     4.8   2189.391667  3502.113001  1429.731646
METAL             1740.95    1.6   165.5873016  311.269691   39.21629491
POP               146629.4   0.8   6830.350535  14415.4964   361.6326662
R&B/SOUL          66495.2    2.4   5762.954717  11469.53011  909.5929038
RAP               81093.8    1.6   6937.373746  14598.22138  844.2369407
REGGAE            352        10.4  190.4        171.541715   99.03965536
ROCK              62891.85   1.6   1634.642138  5604.730633  314.297687
TRAP              19773.7    5.6   3666.375926  6508.351545  1252.532839
Table 2: Average, maximum, minimum, standard deviation, and standard error of Artist_followers by genre
GENRE             AVERAGE   MAX       MIN      STD DEV   STD ERROR
BOY BAND          7754882   18426467  69457    7166799   656979.4
COUNTRY           1374046   4289740   18580    1386892   384654.7
DANCE/ELECTRONIC  4149964   27670203  3729     6734846   433829.6
ELSE              1084989   8310347   6        1829418   140310
FUNK              3191166   9903432   43767    3237713   723974.6
HIP HOP           14722451  50593376  16745    18096348  810916.8
HOUSE             6868570   24925809  5580     8375604   774324.8
INDIE             4977709   27701635  86735    9647260   792999.9
K-POP             5919622   24755789  756869   7886430   1104321
LATIN             11116259  26265604  2221744  10600946  4327818
METAL             9305144   18354552  315426   6814627   858562.3
POP               12941298  71783101  1897     16609199  416664.7
R&B/SOUL          8094196   25562055  31441    9515997   754667.7
RAP               11802712  29173640  14818    9150355   529178.7
REGGAE            3532420   8781395   246452   4593621   2652129
ROCK              6936113   30723081  7644     7961139   446438.5
TRAP              1824558   2441830   76126    1027723   197785.3
The cardinality of the relationship between the Popularity and Artist_followers fields, split by
genre, is many-to-many.
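Per-genre statistics of the kind shown in Tables 1 and 2 could be computed along the following lines. This is a sketch, not the procedure actually used to build the tables: the function name and the "Genre" key are hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and the standard error is taken as the standard deviation divided by the square root of the group size.

```python
import math
import statistics
from collections import defaultdict


def summarise_by_genre(records, value_key, genre_key="Genre"):
    """Group records by genre and compute max, min, mean, std dev and std error."""
    groups = defaultdict(list)
    for record in records:
        groups[record[genre_key]].append(record[value_key])

    summary = {}
    for genre, values in groups.items():
        n = len(values)
        # Sample standard deviation needs at least two observations.
        std_dev = statistics.stdev(values) if n > 1 else float("nan")
        summary[genre] = {
            "max": max(values),
            "min": min(values),
            "mean": statistics.mean(values),
            "std_dev": std_dev,
            "std_error": std_dev / math.sqrt(n),
        }
    return summary


# Illustrative records only, not rows from the actual file.
records = [
    {"Genre": "house", "Popularity": 2.0},
    {"Genre": "house", "Popularity": 4.0},
    {"Genre": "house", "Popularity": 6.0},
    {"Genre": "metal", "Popularity": 1.0},
    {"Genre": "metal", "Popularity": 3.0},
]
popularity_summary = summarise_by_genre(records, "Popularity")
```

Calling the same function with "Artist_followers" as the value key would give the layout of Table 2.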
(c)
(i) State your hypotheses clearly.
Null Hypothesis (H0): For the new album produced by the independent music recording company,
there is no difference in mean popularity between the house and dance/electronic genres.
Alternative Hypothesis (H1): For the new album produced by the independent music recording
company, the mean popularity of the house genre is greater than that of the dance/electronic
genre.
(ii) State what type of hypothesis test you plan to use and why.
We plan to use a one-tailed, two-sample (independent samples) t-test to compare the mean
popularity of the two genres.
The new album's genre, which can be house or dance/electronic, is the independent variable. The
new album's popularity is the dependent variable.
The two genres contain different songs, so the samples are independent, and we choose the
one-tailed test because we are only interested in whether the popularity of the house genre is
greater than the popularity of the dance/electronic genre.
(iii) Carry out the test and comment on its results.
                             Levene's Test for
                             Equality of Variances                     Significance
Popularity                   F       Sig.     t        df        One-Sided p  Two-Sided p
Equal variances assumed      0.024   0.877    -0.880   356       0.190        0.379
Equal variances not assumed                   -0.926   262.274   0.178        0.355
Levene's test is not significant (Sig. = 0.877), so we read the "equal variances assumed" row.
Because its one-sided p-value (0.190) is greater than the 0.05 level of significance, we fail to
reject the null hypothesis, i.e., there is no significant difference in popularity between the
house and dance/electronic genres for the new album produced by the independent music recording
company. Hence, we conclude that the data provide no evidence that the house genre is more
popular than the dance/electronic genre.
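As a cross-check, the two t statistics in the test output can be reproduced from the summary statistics in Table 1 alone. The sketch below is illustrative and is not the procedure that produced the output. It rests on two assumptions: the group sizes (241 for dance/electronic, 117 for house) are not stated in this report and are inferred from (std dev / std error) squared in Table 1, and the difference is taken as dance/electronic minus house, which is inferred from the negative sign of t. The function names are hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t-test statistic and degrees of freedom (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# (mean, std dev) from Table 1; the group sizes are inferred, not reported.
dance = (5168.688589, 12666.91456, 241)
house = (6372.150855, 10946.22717, 117)

t_pooled, df_pooled = pooled_t(*dance, *house)
t_welch, df_welch = welch_t(*dance, *house)
print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"welch:  t = {t_welch:.3f}, df = {df_welch:.3f}")
```

Under these assumptions the sketch gives t of about -0.880 with 356 degrees of freedom for the pooled test and t of about -0.926 with about 262.27 degrees of freedom for the Welch test, in line with the reported output. The p-values require the t distribution, which is not in the Python standard library, so they are not recomputed here.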
Question # 02
(a)
Data visualization is the process of presenting data and information in visual formats such as graphs, charts,
and maps. This makes it easier for people to understand complex data and find patterns and insights
that may not be immediately apparent.
Dashboards that work well and are easy to use are important for a number of reasons. First, they
give a clear and concise overview of the data. This makes it easy for users to quickly find trends,
patterns, and outliers and make decisions based on the data that are well-informed. Second,
dashboards make it easier for non-technical stakeholders to comprehend the data and gain insight
by providing a simple and visual way to convey complex information. Third, dashboards enable
users to interact with the data in real time, allowing them to examine the data in greater depth and
drill down into specific areas of interest. Finally, user-friendly dashboards have the potential to
enhance the user experience as a whole, increase data and insight adoption and engagement, and
ultimately improve decision-making.
In conclusion, effective dashboards and data visualization are essential tools for understanding
complex data, recognizing patterns and insights, and communicating key findings to stakeholders.
Dashboards can assist users in making informed decisions based on the data by presenting it in a
clear and understandable manner, resulting in improved performance and outcomes.
(b)
(c)
(d)
In this line chart:
• Certain genres are showing consistently higher or lower popularity over time.
• Some genres are showing an unexpected rise or fall in popularity over time.
In this line chart:
• Non-explicit songs have a different trend in popularity compared to explicit songs.
• The popularity of explicit songs shows a sudden rise or fall over time due to external
factors.
In this line chart:
• Certain LDA topics have consistently become more popular over time.
• Certain LDA topics have consistently higher or lower popularity over time.
(e)
This stacked bar chart reveals the overall trend in the number of songs in the Top10 position, year-
wise changes, and genre distribution. This can help music industry professionals understand
market trends and inform their strategies.
This stacked bar chart provides insights into the trends and genre distribution of songs that have
reached the Top50 position over time. This information can be used to inform marketing and
investment strategies in the music industry.
(f)
If a genre has a smaller box plot with a lower median value, it could indicate that songs in that
genre tend to have lower maximum positions. Alternatively, if a genre has a larger box plot with a
higher median value, it could indicate that songs in that genre tend to have higher maximum
positions. Similarly, differences in the box plots of different topics within a genre could indicate
variations in the popularity and success of songs within that genre.
Question # 03
(a)
The process of extracting insights from graphical representations of data, such as maps and charts,
typically involves a series of steps. Firstly, it is necessary to carefully examine the visualization
and identify the key data points being represented. This involves analyzing labels, scales, and
legends, as well as colors and shapes used to represent the data.
After identifying the key data points, the next step is to ask questions to gain insights into the data.
Questions vary depending on the type of visualization and the data being
represented. For example, questions might include: What trends or patterns can be observed in the
data? Are there any outliers or anomalies that need to be investigated further? What relationships
or correlations exist between different data points? What are the possible causes or drivers of
observed trends or patterns?
To extract insights effectively, it is essential to understand the context in which the data was
collected and processed. Therefore, it is important to document the data mining process, including
the data sources, any pre-processing steps, and any assumptions or limitations that were made.
Furthermore, it is critical to document the methods used to analyze the data and extract insights,
including any statistical techniques, algorithms, or machine learning models used. This
documentation should include visualizations that were created, as well as any interpretations or
conclusions drawn from the data.
Finally, it is necessary to communicate the insights effectively to stakeholders, using visualizations
and clear explanations to help them understand the findings. In summary, extracting insights from
graphical representations of data involves careful analysis, asking the right questions, documenting
the data mining process and methods used, and effective communication of the findings to
stakeholders.
(b)
There are several types of models that can be used for predictive analysis, including:
Regression Models: These models are used to predict continuous numerical values, such as
stock prices or temperatures. Regression models use statistical techniques to identify
relationships between independent and dependent variables, and then create a formula that can
be used to make predictions. Simple linear regression (SLR), multiple linear regression
(MLR), and polynomial regression are some regression models.
Classification Models: These models are used to predict categorical outcomes, such as whether
a customer will buy a product or not. Classification models use algorithms to identify patterns
in the data and then assign labels to new data based on these patterns. K-nearest neighbors,
logistic regression (which, despite its name, predicts a categorical outcome), Naïve Bayes (NB),
support vector machines (SVM), and random forest (RF) are some classification models.
Time Series Models: These models are used to analyze data over time, such as stock prices or
website traffic. Time series models use statistical techniques to identify trends, patterns, and
seasonal variations in the data and then make predictions based on these patterns.
Autoregressive, moving average, ARIMA model, SARIMA model, VAR model and BSTS
model are some time series models.
Clustering Models: These models are used to identify groups of similar data points. Clustering
models use algorithms to group data points together based on their similarity, and then assign
labels to these groups. Fuzzy clustering, subspace clustering, hierarchical clustering and k-
means clustering are some clustering models.
Benefits of carrying out predictive data analysis include:
Personalized Recommendations: By using predictive analysis to examine user habits and
patterns, Spotify is able to create customized recommendations for users based on their
listening habits. This may increase user retention and engagement.
Music Discovery: Predictive analysis can also help Spotify discover new music based on user
preferences and behavior. This can help the platform expand its library and offer users more
diverse content.
Targeted Marketing: Predictive analysis can help Spotify identify user segments based on their
behavior, preferences, and demographics. This can enable the platform to target specific groups
with personalized marketing campaigns, improving conversion rates and ROI.
Predictive Maintenance: Predictive analysis can also be used to identify potential issues with
the platform's infrastructure, allowing Spotify to perform proactive maintenance and prevent
downtime.
Improved Business Performance: By using predictive analysis to identify trends and patterns
in user behavior, Spotify can make data-driven decisions that improve its business
performance. For example, it can use predictive analysis to identify popular artists, genres, or
songs, and invest in acquiring more content in those areas.
Competitive Advantage: By using predictive analysis to understand user behavior and
preferences, Spotify can gain a competitive advantage over other music streaming platforms.
This can enable the platform to differentiate itself by offering more personalized
recommendations, a better user experience, or a more diverse library of content.
(c)
There are many mathematical functions that can be used to capture trends in data, but some of the
most popular ones are:
Linear Regression: This function is used to model the association between a response variable
and one or more regressors. It assumes that the relationship is linear, and it finds the best-fit
line that represents the data.
Polynomial Regression: This function is similar to linear regression, but it allows for more
complex relationships between a response variable and regressors. It models the data with a
polynomial equation of a specified degree, which can capture non-linear trends in the data.
Exponential Functions: These functions are used to model data that shows exponential growth
or decay over time. They are often used in finance and economics to model interest rates,
population growth, or the spread of diseases.
Logarithmic Functions: These functions are used to model data that shows a diminishing rate
of change over time. They are often used in engineering, physics, and finance to model
phenomena such as resistance, signal attenuation, or stock prices.
Sigmoid Functions: These functions are used to model data that shows a saturation effect or an
S-shaped curve. They are often used in biology and neuroscience to model the response of
neurons or the growth of organisms.
Fourier Transform: This function is used to decompose a signal into its component frequencies.
It can be used to analyze periodic data such as sound waves, electrical signals, or climate data.
Wavelet Transform: This function is similar to Fourier Transform, but it can analyze non-
periodic signals and capture transient features in the data. It is often used in image processing,
signal processing, and geophysics.
Overall, these mathematical functions are widely used to capture trends in the data and model
complex relationships between variables. The choice of function depends on the type of data and
the research question at hand.
In my opinion, the second-degree polynomial trend line best captures the historical trend in the
variation of the songs' energy levels over time. This is because the polynomial trend line shows a
consistent increase in energy levels over time, which is a reasonable trend to expect given the
increasing popularity of electronic dance music over the years. Additionally, the R-squared value
for the polynomial trend line is higher than those of the other trend lines, indicating that it
explains more of the variation in the data. To support this, I compared the R-squared values of
the four trend lines and chose the one with the highest R-squared value as the best fit for the
data. The R-squared values of the four trend lines are as follows:
Trend line     R-squared value
Polynomial     0.6247
Exponential    0.5425
Logarithmic    0.5266
Linear         0.5241
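The comparison of R-squared values across the four trend lines can be sketched as follows. This is a minimal illustration, not the tool that produced the values above: the function names are hypothetical, the energy values are invented, time is represented as a period index starting at 1 (which the logarithmic fit requires), and R-squared is computed on the original scale for every model, whereas charting tools may compute it differently for the exponential fit.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    size = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(a, b)


def r_squared(ys, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def fit_r2(xs, ys, kind):
    """Fit one trend line and return its R-squared on the original scale."""
    if kind == "linear":
        c = polyfit(xs, ys, 1)
        pred = [c[0] + c[1] * x for x in xs]
    elif kind == "polynomial":  # second degree
        c = polyfit(xs, ys, 2)
        pred = [c[0] + c[1] * x + c[2] * x * x for x in xs]
    elif kind == "logarithmic":  # y = a + b * ln(x), needs x > 0
        log_x = [math.log(x) for x in xs]
        c = polyfit(log_x, ys, 1)
        pred = [c[0] + c[1] * v for v in log_x]
    elif kind == "exponential":  # ln(y) = a + b * x, needs y > 0
        c = polyfit(xs, [math.log(y) for y in ys], 1)
        pred = [math.exp(c[0] + c[1] * x) for x in xs]
    else:
        raise ValueError(f"unknown trend line: {kind}")
    return r_squared(ys, pred)


# Invented energy values per period, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.52, 0.55, 0.59, 0.61, 0.66, 0.68, 0.69, 0.70]
kinds = ["polynomial", "exponential", "logarithmic", "linear"]
scores = {kind: fit_r2(xs, ys, kind) for kind in kinds}
best = max(scores, key=scores.get)
```

One point worth keeping in mind when reading such a comparison: a second-degree polynomial contains the straight line as a special case, so its R-squared can never be lower than that of the linear fit on the same data. A higher R-squared for the polynomial is therefore expected, and the size of the gap matters more than its direction.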
Q3: (d)
(b)
(c)
(d)
The stacked bar charts show the number of positive and negative songs across different genres for
different values of the "Factor" parameter.
When the factor is set to 1, the impact of speech on the sentiment of a song is minimal, meaning
that the sentiment of the song remains mostly positive, even when the speech level is high. As a
result, for a factor of 1, the stacked bar chart shows that the majority of songs in each genre are
positive, while there are very few negative songs. However, as the factor increases, the impact of
speech on the sentiment of a song becomes more pronounced, and the sentiment of the song
becomes increasingly negative as the speech level increases. This is evident in the stacked bar
chart, where the number of negative songs increases as the factor increases. For example, when
the factor is set to 2.5, there are more negative songs than for a factor of 1, but the majority of
songs in each genre are still positive. However, when the factor is set to 5.0, the majority of songs
in each genre become negative, indicating that the impact of speech on the sentiment of a song is
now significant enough to cause the sentiment of the song to become negative, even at relatively
low speech levels.