
Stats Presentation

The document describes a presentation on analyzing music sales data of the best selling artists of all time. It includes an overview of topics to be covered, definitions of key terms like statistics and parameters, and the methodology which involves exploring and cleaning the dataset.

Uploaded by

imaaniashannah
Copyright
© All Rights Reserved
Presentation by Javin Buchanan & Imaani DaCosta

University of the West Indies, Mona Campus Probability & Statistics Project | 2023

Statistics
Project
Javin Buchanan & Imaani DaCosta

Overview
A breakdown of the topics we are going to
cover in our presentation today!

Abstract
Definition of Terms
Introduction
Methodology
Results
Conclusion
Question & Answer Session
Thank you



Abstract
What is our dataset about?
Our dataset, titled "Best Selling Music Artists of All Time", covers the music
sales of the 121 best-selling artists of all time across various genres,
countries and career spans.
Who is responsible for tracking sales?
The Recording Industry Association of America
(RIAA)
How can sales be categorized quantitatively?
Total Certified Units (TCU) and Claimed Sales
Definitions
What is the difference between a statistic and a parameter?
A statistic is a numerical summary calculated from a sample of a population,
while a parameter is the same kind of summary calculated from the entire population.
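
The distinction above can be illustrated with a minimal R sketch. The population values below are made up for illustration and are not taken from the dataset; the names `population`, `sample_units`, `population_mean` and `sample_mean` are hypothetical.

```r
# Hypothetical population: certified units (in millions) for ten made-up artists.
population <- c(290, 235, 182, 160, 139, 120, 101, 88, 76, 65)

# Parameter: calculated from every member of the population.
population_mean <- mean(population)

# Statistic: calculated from a sample drawn from that population.
set.seed(1)  # fixed seed so the same sample is drawn on every run
sample_units <- sample(population, 4)
sample_mean <- mean(sample_units)

print(population_mean)
print(sample_mean)
```

The sample mean will usually differ from the population mean, which is why a statistic is treated as an estimate of a parameter.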

Definitions
Total Certified Units vs Claimed Sales
"Claimed Sales" are sales figures quoted by sources such as Billboard in their
articles. For example, Billboard claims ARTPOP sold 2.5 million copies, but there
are no RIAA statistics to back it up. "Certified Units" are sales that have been
submitted to bodies like the RIAA for a certification, such as platinum, and have
been verified as real sales.

Definitions
What is conditional probability?
The probability that an event, such as music sales being greater than 500
million units, occurs given another event has occurred, such as the artist
being from the Pop Genre.

Definitions
What is normal distribution?

Normal distribution refers to a probability distribution where most values in the
dataset cluster around the mean of the dataset, while the remaining values
thin out symmetrically towards the extremes on both sides.
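
As a small illustration of "most values cluster around the mean", the sketch below uses base R's `pnorm()` to compute how much of a standard normal distribution lies within one and two standard deviations of the mean. The variable names are chosen here for illustration only.

```r
# Share of a normal distribution lying within 1 and 2 standard deviations
# of the mean, using the standard normal cumulative distribution function.
within_one_sd <- pnorm(1) - pnorm(-1)
within_two_sd <- pnorm(2) - pnorm(-2)

print(within_one_sd)  # roughly 0.68
print(within_two_sd)  # roughly 0.95
```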

Example of Normal Distribution
Hypothesis
The hypothesis for this dataset is that an artist located in the more
mainstream areas of society, both geographically (the United States) and
musically (Pop music), is more likely to be successful in the music industry
than an artist outside of these areas.

Problem Statements
PS 1
Using the entire population as the sample, we can see that 65% of the artists in this
dataset are from the United States. What is the probability that a randomly selected
artist from the United States has sold more than 160 million certified units and has a
period of activity lasting less than 15 years?

PS 2
The total certified units for 50 artists are normally distributed with a sample mean of ‘x’
and a sample standard deviation of ‘y’. The threshold to be considered a ‘highly
successful artist’ is ‘P’. Estimate the number of artists who qualify for this title within the
Rock genre. Do the same for the Pop genre and compare which genre would there
more than likely be a highly successful artist.
Methodology
EXPLORING THE DATASET

Retrieving the dataset from online


01

data <- read.csv("https://raw.githubusercontent.com/JavinBuchanan/Best-Selling-Artists/main/Raw%20DataSet%20for%20Best%20Selling%20Artists.csv")

Methodology
EXPLORING THE DATASET

Finding the number of observations and variables


02

dim(data)

Methodology
EXPLORING THE DATASET

Viewing the Data Types of Variables in R


03

str(data)

Methodology
EXPLORING THE DATASET

Checking if there is missing data in R


04

# Count the missing values (NA) in each column
sapply(data, function(m) {
  sum(is.na(m))
})
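
The check above can be tried on a small made-up data frame. This is only a sketch: `toy`, its columns and its values are hypothetical, and `colSums(is.na(...))` is used here as an equivalent base R shorthand for the `sapply()` call.

```r
# Hypothetical three-row data frame with one missing value in two columns.
toy <- data.frame(
  Artist  = c("A", "B", "C"),
  TCU     = c(100.5, NA, 80.2),
  Country = c("United States", "United Kingdom", NA),
  stringsAsFactors = FALSE
)

# Number of NA entries per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that contain no missing values at all.
complete_rows <- toy[complete.cases(toy), ]
print(complete_rows)
```

Note that `is.na()` does not count empty strings. With `read.csv()` defaults, a blank field in a character column is read as `""` rather than `NA`, so a separate check may be worthwhile.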

Methodology CLEANING THE DATASET


In an attempt to make the columns uniform, irregularities and unnecessary
characters were removed from certain columns.

data$TCU <- gsub("million", "", as.character(data$TCU))
data$Sales <- gsub("million", ".", as.character(data$Sales))
data$Sales <- gsub("\\..*", "", as.character(data$Sales))
"million" removed from the TCU and Sales columns

data$Genre <- gsub("\\/.*", "", as.character(data$Genre))
data$Artist <- gsub("é", "", as.character(data$Artist))
Everything after "/" removed from the Genre column, and "é" removed from the
Artist column
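
To see what these `gsub()` patterns do, here is a minimal sketch on made-up strings. The inputs are assumptions about what the raw values might look like, not values copied from the dataset, and the `trimws()` step is an addition that is not in the original code.

```r
# Hypothetical raw values resembling the uncleaned columns.
tcu_raw   <- c("290.8 million", "100 million")
genre_raw <- c("Rock / Pop", "Hip hop")

# Remove the word "million", trim leftover spaces, then convert to numeric.
tcu_clean <- trimws(gsub("million", "", tcu_raw))
tcu_num   <- as.numeric(tcu_clean)

# Keep only the text before the first "/" and trim the trailing space.
genre_untrimmed <- gsub("\\/.*", "", genre_raw)
genre_clean     <- trimws(genre_untrimmed)

print(tcu_num)
print(genre_untrimmed)  # the first entry keeps a trailing space
print(genre_clean)
```

Without trimming, a value such as "Rock / Pop" becomes "Rock " with a trailing space, which may be why the later filters match both 'Pop' and 'Pop '.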
METHODOLOGY
CLEANING THE DATASET

The original dataset provided multiple active periods for certain artists.
This made the data hard to sort and removed the possibility for easy calculations.

data[9, "period_active"] <- "1965-2014"
data[15, "period_active"] <- "1971-present"
data[29, "period_active"] <- "1980-present"
data[31, "period_active"] <- "1972-present"
data[32, "period_active"] <- "1935-1995"
data[45, "period_active"] <- "1963-2012"
data[77, "period_active"] <- "1967-present"
data[117, "period_active"] <- "1977-2008"

For select lines the multiple ranges were just changed to their starting and
ending year
METHODOLOGY
CLEANING THE DATASET

The dataset had to be edited further, seeing as "present" is not a
quantitative entry. All lines that had "present" were changed to "2022",
seeing as that is when the dataset was last updated.

data$period_active <- gsub("present", "2022", as.character(data$period_active))
library(dplyr)
library(forcats)  # provides as_factor()
data[, "Genre", drop = FALSE]
data <-
  data %>%
  mutate(Genre = as_factor(Genre))

# Convert TCU to numeric
data$`TCU` <- as.numeric(data$`TCU`)
str(data)

Code that reads all elements and changes "present" to "2022", as well as
changing the datatypes: Genre from character to factor and TCU from
character to numeric
METHODOLOGY
CLEANING THE DATASET

The original dataset also provided multiple genre assignments to a significant
number of artists, which made organisation difficult.

data[9, "Genre"] <- "Rock"
data[2, "Genre"] <- "Rock"
data[7, "Genre"] <- "Rock"
data[17, "Genre"] <- "Rock"
data[25, "Genre"] <- "Rock"
data[28, "Genre"] <- "Rock"
data[39, "Genre"] <- "Rock"
data[41, "Genre"] <- "Rock"
data[43, "Genre"] <- "Rock"
data[47, "Genre"] <- "Rock"
data[48, "Genre"] <- "Rock"
data[53, "Genre"] <- "Pop"
data[54, "Genre"] <- "Rock"
data[64, "Genre"] <- "Rock"
data[68, "Genre"] <- "Rock"
data[76, "Genre"] <- "Rock"
data[77, "Genre"] <- "Rock"
data[78, "Genre"] <- "Rock"
data[92, "Genre"] <- "Rock"
data[96, "Genre"] <- "Pop"
data[99, "Genre"] <- "Rock"
data[104, "Genre"] <- "Rock"
data[110, "Genre"] <- "Rock"
data[113, "Genre"] <- "Rock"
data[118, "Genre"] <- "Rock"
data[120, "Genre"] <- "Rock"

Research was done to determine each artist's primary genre association, and
then the code was written to show only the main genre.


METHODOLOGY
CLEANING THE DATASET
For ease of documentation and calculation, the "Years active" column
(period_active in the code) was split into "start year" and "end year" and the
original column removed.

library(tidyr)  # provides separate()
# Split the period_active column into start year and end year
data <- data %>% separate(period_active, c('Start Year', 'End Year'))
Making the new 'start year' and 'end year' columns

# Separation of period active column
data$`Start Year` <- as.numeric(data$`Start Year`) # Convert Start Year to numeric
data$`End Year` <- as.numeric(data$`End Year`) # Convert End Year to numeric
Splitting the data in the 'years active' column into the new columns

data$`Years Active` <- (data$`End Year` - data$`Start Year`)
# Changing position of column 'Years Active'
data <- data %>% relocate('Years Active', .before = Year)
Changing the 'years active' column into the number of years the artist has been
active and putting it after the 'start year' and 'end year' columns
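
The same split can be sketched in base R on a few made-up period strings, which makes each step visible without loading any package. This is a swapped-in alternative using `strsplit()`, not the `separate()` call used in the presentation, and the variable names are hypothetical.

```r
# Hypothetical cleaned period strings in "start-end" form.
period_active <- c("1965-2014", "1971-2022", "1980-2022")

# Split each string on the hyphen into a start part and an end part.
parts <- strsplit(period_active, "-", fixed = TRUE)

# Take the first and second piece of every split and convert to numeric.
start_year <- as.numeric(sapply(parts, `[`, 1))
end_year   <- as.numeric(sapply(parts, `[`, 2))

# Number of years each artist has been active.
years_active <- end_year - start_year
print(years_active)
```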
BEFORE
AFTER
Methodology CENTRAL TENDENCIES

Methodology
Measures of Central Tendencies
01

Mean, Mode and Median
Normal Distribution

datamean <- mean(data$TCU)
datamedian <- median(data$TCU)
# Define the Mode function
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Find the mode of the column
datamode <- Mode(data$TCU)

Mean = 104.6 million
Median = 76.6 million
Mode = 99.8 million
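
The behaviour of the `Mode` function is easiest to see on small vectors. The sketch below repeats the function so that it runs on its own; the two input vectors are made up for illustration.

```r
# Same Mode function as above: returns the most frequent value.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# One value repeats, so it is returned as the mode.
repeated <- c(76.6, 99.8, 99.8, 104.6)
mode_repeated <- Mode(repeated)

# Every value occurs once, so all counts tie and which.max() picks the first.
all_distinct <- c(76.6, 99.8, 104.6)
mode_distinct <- Mode(all_distinct)

print(mode_repeated)
print(mode_distinct)
```

For a continuous measure where most values are unique, it may be worth checking how often the reported mode actually occurs before interpreting it.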
Visualizing the Data
BOX AND WHISKER PLOT

boxplot(data$TCU,
main = "Box and Whisker Plot showing
Total Certified Units",
xlab = "Total Certified Units",
ylab = "",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Problem Statement 1
CONDITIONAL PROBABILITY

Using the entire population as the sample, we can see that 65% of the artists in this
dataset are from the United States. What is the probability that a randomly selected
artist from the United States has sold more than 160 million certified units and has a
period of activity lasting less than 15 years?
Problem Statement 1
CONDITIONAL PROBABILITY

This conditional probability has 3 events.


usfilter <- filter(data, Country == 'United States')
View(usfilter)
bothfilters <- filter(usfilter, TCU > 160)
View(bothfilters)
threefilters <- filter(bothfilters, `Years Active` < 15)
View(threefilters)

Event A = the event that the artist chosen is from the United States
Event B = the event that the artist chosen has over 160 million TCU
(filtered from the artists in Event A)
Event C = the event that the artist chosen has been practising for less than
15 years (filtered from the artists in Event B)
Original Population

NB: These results extend all the way down to 121 Artists!
Event A

NB: These results include 79 artists!


Event B

NB: These results include 14 artists!


Event C

NB: These results include 1 artist!


R Code for Conditional Probability
P(B ∩ C | A) = P(B | A) * P(C | A ∩ B)
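# aintersectb, probabilityus and probabilitycandaandb are not defined on this slide
bgivena <- aintersectb / probabilityus                      # P(B | A)
probabilitycgivenaandb <- probabilitycandaandb / bgivena   # P(C | A ∩ B)
answerforprob1 <- probabilitycgivenaandb * bgivena         # P(B ∩ C | A)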
R Code for Conditional Probability
P(B | A) = P(B ∩ A)/P(A)
P(B | A) = 14/79
Represents the probability that an artist has sold more than 160 million units
given they are from the United States

Let A ∩ B = D
P(C | A ∩ B) = P(C | D)
P(C | D) = P(D ∩ C)/P(D)
P(C | D) = 1/14
Represents the probability that an artist has made music for less than 15 years
given they have sold more than 160 million units and they are from the
United States
Answer for Problem Statement 1
P(B ∩ C | A) = P(B | A) * P(C | A ∩ B)
P(B ∩ C | A) = 14/79 * 1/14

P(B ∩ C | A) = 1/79
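
The arithmetic above can be reproduced from the counts reported on the event slides (121 artists in total, 79 in Event A, 14 in Event B, 1 in Event C). This is a minimal sketch; the variable names are chosen here for illustration and are not the ones used in the presentation's code.

```r
# Counts taken from the event slides.
n_total <- 121  # all artists in the dataset
n_a     <- 79   # artists from the United States (Event A)
n_ab    <- 14   # of those, artists with more than 160 million TCU
n_abc   <- 1    # of those, artists active for less than 15 years

p_a          <- n_a / n_total   # P(A)
p_b_given_a  <- n_ab / n_a      # P(B | A)
p_c_given_ab <- n_abc / n_ab    # P(C | A ∩ B)

# Multiplication rule: P(B ∩ C | A) = P(B | A) * P(C | A ∩ B)
p_bc_given_a <- p_b_given_a * p_c_given_ab

# Direct count as a cross-check: 1 qualifying artist out of 79.
p_direct <- n_abc / n_a

print(p_a)
print(p_b_given_a)
print(p_bc_given_a)
```

Both routes give the same value, since the 14 in the numerator of P(B | A) cancels with the 14 in the denominator of P(C | A ∩ B).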
Problem Statement 2 Normal Distribution
The total certified units for 50 artists are normally distributed with a sample mean of ‘x’
and a sample standard deviation of ‘y’. The threshold to be considered a ‘highly
successful artist’ is ‘P’. Estimate the number of artists who qualify for this title within the
Rock genre. Do the same for the Pop genre and compare which genre would there
more than likely be a highly successful artist.

Problem Statement 2
NORMAL DISTRIBUTION

rand_sample<- data[sample(nrow(data), 50), ]


View(rand_sample)

This code first generates a table filled with a random selection of 50 members
of the original dataset. There are no initial criteria, just a total of 50.
Problem Statement 2
NORMAL DISTRIBUTION
This table is then filtered and split into 2 sub-tables: one for Pop and one
for Rock.

Filter code for pop
pop_filter <- filter(rand_sample,
                     rand_sample$Genre == 'Pop' | rand_sample$Genre == 'Pop ')
View(pop_filter)

Filter code for rock
rock_filter <- filter(rand_sample,
                      rand_sample$Genre == 'Rock' | rand_sample$Genre == 'Rock ')
View(rock_filter)
Problem Statement 2
NORMAL DISTRIBUTION
The mean, median and standard deviation for both pop and rock were found

Mean
pop_mean <- mean(pop_filter$TCU)
print(pop_mean)

Median
pop_median <- median(pop_filter$TCU)
print(pop_median)

Standard Deviation
pop_standev <- sd(pop_filter$TCU)
print(pop_standev)
Problem Statement 2
NORMAL DISTRIBUTION

x_value <- 160
pop_z <- (x_value - pop_mean) / pop_standev
print(pop_z)

The z score was then found using the mean, the standard deviation and the
threshold x value, which for this question is 160 (million)
Problem Statement 2
NORMAL DISTRIBUTION

# lower.tail = TRUE returns P(Z < pop_z), the probability of a TCU below the threshold
pop_prob <- pnorm(pop_z, mean = 0, sd = 1, lower.tail = TRUE)
print(pop_prob)

The calculated z score is then used to calculate the probability that the
chosen artist has a TCU less than 160 (million)
Problem Statement 2
NORMAL DISTRIBUTION

morethan_popprob<- 1- pop_prob

However, since we need the probability that the chosen artist has a TCU greater
than 160 (million), we then subtract the probability found from one.
Problem Statement 2
NORMAL DISTRIBUTION

pop_count<-nrow(pop_filter)
print(pop_count)

pop_estim<-(morethan_popprob * pop_count)
print(pop_estim)

To expand this probability to the entire sample, the found probability is then
multiplied by the number of elements.
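
The whole chain (z score, tail probability, expected count) can be sketched in one place. The mean and standard deviation below are the Pop values reported on the histogram slide later in the presentation; the count of 12 Pop artists is an assumption made for illustration, since the slides do not state it. The sketch takes the upper tail directly with `lower.tail = FALSE`, which is an equivalent alternative to computing the lower tail and subtracting it from one.

```r
# Summary values for the Pop sub-sample (from the histogram slide).
pop_mean    <- 115.7583
pop_standev <- 78.15496
x_value     <- 160   # threshold in millions of certified units
pop_count   <- 12    # ASSUMPTION: number of Pop artists in the sample

# Standardise the threshold.
pop_z <- (x_value - pop_mean) / pop_standev

# Lower tail: probability of a TCU below the threshold.
p_below <- pnorm(pop_z)

# Upper tail: probability of a TCU above the threshold, taken directly.
p_above <- pnorm(pop_z, lower.tail = FALSE)

# Expected number of Pop artists above the threshold.
pop_estim <- p_above * pop_count

print(pop_z)
print(p_above)
print(pop_estim)
```

The two tails always sum to one, so mixing up `lower.tail = TRUE` and `lower.tail = FALSE` while also subtracting from one silently returns the opposite probability.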
Histogram and distribution curve for Pop

Mean: 115.7583
Median: 89.75
Standard Deviation: 78.15496
Variance: 6108.19777
Histogram and distribution curve for Rock

Mean: 80.28824
Median: 62.1
Standard Deviation: 48.08145
Variance: 2311.8258
DISCUSSION
PROBLEM STATEMENT 1
DISCUSSION
We used conditional probability to closely investigate very specific
groups within the dataset, such as those artists who have been making
music in the United States.

As the results show, 65% of the best-selling artists are from the United States.
As the majority of the dataset, they provided the perfect group to base our
conditional probability around.
DISCUSSION
The term 'successful' has no set definition and as such, we had to quantify it,
setting the threshold at 160 million units. Conditional probability allows us to
delve further into this data, as we can now calculate how many of those artists
have sold >160 million units given that they are from the United States.

This probability represents the intersection of groups within the dataset, and
can subsequently be used to calculate conditional probability!
DISCUSSION
The probability that an artist has sold more than 160 million Total Certified
Units given that they are from the United States is 18% (14/79).

The data was then filtered or conditioned to those artists who have been making
music for less than 15 years, and only about 1.3% (1/79) of the United States
artists can claim this feat.
DISCUSSION
PROBLEM STATEMENT 2
Discussion in document
From the random sample, ‘Pop’ and ‘Rock’ were chosen as they were the two
most frequent genres in the dataset. Even though the ‘Pop’ genre was less
frequent in this sample than the ‘Rock’ genre, the results throughout were
still in Pop’s favour. The final estimations for Pop and Rock were 3.428052
and 0.8274584 respectively.

This deduction was made with the calculations for the probability that the
artist chosen had a TCU less than 160. However, the intention was to find the
probability that the TCU was greater than 160.
Correct Discussion
The frequency of the Pop artists versus the Rock artists is proportional to their
probabilities of having a TCU above 160. For this random sample, Pop's probability
was 0.5929348 and Rock's probability was 0.8377515.

This shows that the probability of having a 'more successful' career is not directly
related to the genre the artist is in. Even though there are more Pop artists in the
overall population, which means there may be a higher chance of having a TCU
over 160, that was not the case for the sample.

Correct Discussion
The standard deviations for Pop and Rock were 78.15 and 48.08 respectively. This
indicates that the total certified units for Pop are spread more widely around
the mean than those for Rock, and with Pop also having the higher average, it
shows overall promise of success in this genre.

Conclusion
By deploying the models of normal distribution and conditional
probability, along with the methods of central tendency, problem
statement 1 supported our hypothesis while problem statement 2 did not.
