Presentation by Javin Buchanan & Imaani DaCosta
University of the West Indies, Mona Campus Probability & Statistics Project | 2023
Statistics Project
Overview
A breakdown of the topics we are going to cover in our presentation today!
Abstract
Definition of Terms
Introduction
Methodology
Results
Conclusion
Question & Answer Session
Thank you
Abstract
What is our dataset about?
Our dataset, titled "Best Selling Music Artists of All Time", covers the music sales of
the 121 best-selling artists of all time across various genres, countries and career
spans.
Who is responsible for tracking sales?
The Recording Industry Association of America
(RIAA)
How can sales be categorized quantitatively?
Total Certified Units (TCU) and Claimed Sales
Definitions
What is the difference between a statistic and a parameter?
A statistic is a numerical summary calculated from a sample of a population,
while a parameter is the corresponding value calculated from the entire population.
Definitions
Total Certified Units vs Claimed Sales
"Claimed Sales" are sales figures reported by outlets such as Billboard in their
articles. For example, Billboard claims ARTPOP sold 2.5 million copies, but there
are no RIAA statistics to back it up. "Certified Units" are sales that have been
submitted to organisations such as the RIAA for a certification, such as platinum,
and have therefore been verified as real sales.
Definitions
What is conditional probability?
The probability that an event, such as music sales being greater than 500
million units, occurs given another event has occurred, such as the artist
being from the Pop Genre.
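In symbols, taking A as the conditioning event (for instance, the artist being from the Pop genre) and B as the event of interest (music sales greater than 500 million units), the standard textbook definition can be written as follows. The letters A and B here are labels chosen only for this illustration.

```latex
P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A) > 0
```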
Definitions
What is normal distribution?
A normal distribution is a probability distribution in which most values in the
dataset cluster symmetrically around the mean, while progressively fewer values
fall towards the extremes on either side.
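For reference, the standard form of the normal density and the z-score used to standardise a value are shown below. The symbols are the usual textbook ones (mean \mu, standard deviation \sigma, observed value x) and are not taken from the slides.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
z = \frac{x - \mu}{\sigma}
```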
Example of Normal Distribution
Hypothesis
The hypothesis for this dataset is that an artist located in the more
mainstream areas of society, both geographically (the United States) and
musically (Pop music), is more likely to be successful in the music
industry than an artist outside of these areas.
Problem Statements
PS 1
Using the entire population as the sample, we can see that 65% of the artists in this
dataset are from the United States. What is the probability that a randomly selected
artist from the United States has sold more than 160 million certified units and has a
period of activity lasting less than 15 years?
PS 2
The total certified units for 50 artists are normally distributed with a sample mean of ‘x’
and a sample standard deviation of ‘y’. The threshold to be considered a ‘highly
successful artist’ is ‘P’. Estimate the number of artists who qualify for this title within the
Rock genre. Do the same for the Pop genre and compare which genre is more likely
to produce a highly successful artist.
Methodology
EXPLORING THE DATASET
Methodology
EXPLORING THE DATASET
Retrieving the dataset from online
01
data <- read.csv("https://raw.githubusercontent.com/JavinBuchanan/Best-Selling-Artists/main/Raw%20DataSet%20for%20Best%20Selling%20Artists.csv")
Result
Methodology
EXPLORING THE DATASET
Finding the number of observations and variables
02
dim(data)
Result
Methodology
EXPLORING THE DATASET
Viewing the Data Types of Variables in R
03
str(data)
Result
Methodology
EXPLORING THE DATASET
Checking if there is missing data in R
04
# Count the missing values (NAs) in each column
sapply(data, function(m) {
  sum(is.na(m))
})
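An equivalent check can be sketched with base R's colSums() and anyNA(). The data frame `toy` and its values below are hypothetical and stand in for the real dataset, so this is an illustration rather than the project's own code.

```r
# Toy data frame with hypothetical values (not the real dataset)
toy <- data.frame(
  Artist = c("A", "B", "C"),
  TCU = c(120.5, NA, 88.1),
  Country = c("United States", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# anyNA() answers the yes/no question for the whole data frame
any_missing <- anyNA(toy)
print(any_missing)
```

Both approaches give the same per-column counts; colSums() is simply a more compact way to express the same idea.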
Result
Methodology CLEANING THE DATASET
METHODOLOGY
CLEANING THE DATASET
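In an attempt to make the columns uniform, the irregularities and characters that aren't
necessary were removed from certain columns.
# Remove the word "million" from the TCU column
data$TCU <- gsub("million", "", as.character(data$TCU))
# Replace "million" with "." in Sales, then drop everything from the first "." onwards
data$Sales <- gsub("million", ".", as.character(data$Sales))
data$Sales <- gsub("\\..*", "", as.character(data$Sales))
"million" removed from TCU and Sales column
# Keep only the first genre (drop everything from "/" onwards)
data$Genre <- gsub("\\/.*", "", as.character(data$Genre))
# Remove the character "é" from artist names
data$Artist <- gsub("é", "", as.character(data$Artist))
Secondary genres (everything after "/") removed from the Genre column and "é" removed
from the Artist column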
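The effect of this kind of pattern-based cleaning can be sketched on a few toy strings. The vectors `tcu_raw` and `genre_raw` are hypothetical examples, and the trimws() step is an addition for illustration that strips leftover spaces so that values such as "Rock " and "Rock" compare equal.

```r
# Hypothetical raw values in the style of the original columns
tcu_raw <- c("290.8 million", "120 million")
genre_raw <- c("Rock / pop", "Pop")

# Remove the word "million", trim the leftover space, then convert to numeric
tcu_clean <- as.numeric(trimws(gsub("million", "", tcu_raw)))

# Keep only the text before the first "/", then trim the trailing space
genre_clean <- trimws(gsub("/.*", "", genre_raw))

print(tcu_clean)
print(genre_clean)
```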
METHODOLOGY
CLEANING THE DATASET
The original dataset provided multiple active periods for certain artists. This made the
data hard to sort and removed the possibility for easy calculations.
data[9, "period_active"] <- "1965-2014"
data[15, "period_active"] <- "1971-present"
data[29, "period_active"] <- "1980-present"
data[31, "period_active"] <- "1972-present"
data[32, "period_active"] <- "1935-1995"
data[45, "period_active"] <- "1963-2012"
data[77, "period_active"] <- "1967-present"
data[117, "period_active"] <- "1977-2008"
For select lines the multiple ranges were just changed to their starting and ending year
METHODOLOGY
CLEANING THE DATASET
The dataset had to be edited further seeing as "present" is not a quantitative entry. All
lines that had "present" were changed to "2022" seeing as that is when the dataset
was last updated.
data$period_active <- gsub("present", "2022", as.character(data$period_active))
library(dplyr)
library(forcats)  # provides as_factor()
data[, "Genre", drop = FALSE]  # Inspect the Genre column
data <-
  data %>%
  mutate(Genre = as_factor(Genre))
data$`TCU` <- as.numeric(data$`TCU`)  # Convert TCU to numeric
str(data)
Code that reads all elements and changes "present" to 2022, converts Genre from
character to factor and converts TCU from character to numeric
METHODOLOGY
CLEANING THE DATASET
The original dataset also provided multiple genre assignments to a significant number of
artists, which made organisation difficult.
data[9, "Genre"] <- "Rock"
data[2, "Genre"] <- "Rock"
data[7, "Genre"] <- "Rock"
data[17, "Genre"] <- "Rock"
data[25, "Genre"] <- "Rock"
data[28, "Genre"] <- "Rock"
data[39, "Genre"] <- "Rock"
data[41, "Genre"] <- "Rock"
data[43, "Genre"] <- "Rock"
data[47, "Genre"] <- "Rock"
data[48, "Genre"] <- "Rock"
data[53, "Genre"] <- "Pop"
data[54, "Genre"] <- "Rock"
data[64, "Genre"] <- "Rock"
data[68, "Genre"] <- "Rock"
data[76, "Genre"] <- "Rock"
data[77, "Genre"] <- "Rock"
data[78, "Genre"] <- "Rock"
data[92, "Genre"] <- "Rock"
data[96, "Genre"] <- "Pop"
data[99, "Genre"] <- "Rock"
data[104, "Genre"] <- "Rock"
data[110, "Genre"] <- "Rock"
data[113, "Genre"] <- "Rock"
data[118, "Genre"] <- "Rock"
data[120, "Genre"] <- "Rock"
Research was done to determine each artist's primary genre association and then the code
was written to show only the main genre.
METHODOLOGY
CLEANING THE DATASET
For ease of documentation and calculation the "Years active" column was split into
"start year" and "end year" and the original column removed.
library(tidyr)  # provides separate()
# Split the period_active column into start year and end year columns
data <- data %>% separate(period_active, c('Start Year', 'End Year'))
Making the new 'start year' and 'end year' columns
# Separation of period active column
data$`Start Year` <- as.numeric(data$`Start Year`)  # Convert Start Year to numeric
data$`End Year` <- as.numeric(data$`End Year`)  # Convert End Year to numeric
Splitting the data in the 'years active' column into the new columns
data$`Years Active` <- (data$`End Year` - data$`Start Year`)
# Changing position of column 'Years Active'
data <- data %>% relocate(`Years Active`, .before = Year)
Changing the 'years active' column into the number of years the artist has been active
and putting it after the 'start year' and 'end year' columns
BEFORE
AFTER
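The before and after transformation can be sketched in base R on two toy entries. The vector `period_active` below holds hypothetical values in the same "start-end" format, and strsplit() is used here instead of tidyr so that the example runs without extra packages.

```r
# Hypothetical entries in the same "start-end" format as the dataset
period_active <- c("1965-2014", "1971-present")

# Replace "present" with 2022, the year the dataset was last updated
period_active <- gsub("present", "2022", period_active)

# Split each entry on the hyphen and pull out the two parts
parts <- strsplit(period_active, "-", fixed = TRUE)
start_year <- as.numeric(sapply(parts, `[`, 1))
end_year <- as.numeric(sapply(parts, `[`, 2))

# Years active is the difference between the end and start years
years_active <- end_year - start_year
print(years_active)
```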
Methodology CENTRAL TENDENCIES
Methodology
Measures of Central Tendencies
01
Mean, Mode and Median
datamean <- mean(data$TCU)
datamedian <- median(data$TCU)
# Define the Mode function
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Find the mode of the column
datamode <- Mode(data$TCU)
Mean = 104.6 million
Median = 76.6 million
Mode = 99.8 million
Normal Distribution
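How the Mode helper behaves can be sketched on small toy vectors (the numbers below are hypothetical). It returns the most frequent value, and when every value occurs only once it falls back to the first value it encounters, which is worth keeping in mind for a continuous column.

```r
# Same helper as in the slides: most frequent value, first one wins on ties
Mode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]  # value with the highest count
}

# A vector with a clear mode
mode_repeated <- Mode(c(3, 5, 5, 7, 3, 5))

# A vector where every value is unique: the first value is returned
mode_unique <- Mode(c(10.2, 99.8, 76.6))

print(mode_repeated)
print(mode_unique)
```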
Visualizing the Data
BOX AND WHISKER PLOT
boxplot(data$TCU,
        main = "Box and Whisker Plot showing Total Certified Units",
        xlab = "Total Certified Units",
        ylab = "",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)
Problem Statement 1
CONDITIONAL PROBABILITY
Using the entire population as the sample, we can see that 65% of the artists in this
dataset are from the United States. What is the probability that a randomly selected
artist from the United States has sold more than 160 million certified units and has a
period of activity lasting less than 15 years?
Problem Statement 1
CONDITIONAL PROBABILITY
This conditional probability has 3 events.
Event A = the event that the artist chosen is from the United States
Event B = the event that the artist chosen satisfies Event A and has over 160 million TCUs
Event C = the event that the artist chosen satisfies Event B and has been
practising for less than 15 years.
usfilter <- filter(data, Country == 'United States')
View(usfilter)
bothfilters <- filter(usfilter, TCU > 160)
View(bothfilters)
threefilters <- filter(bothfilters, `Years Active` < 15)
View(threefilters)
Original Population
NB: These results extend all the way down to 121 Artists!
Event A
NB: These results include 79 artists!
Event B
NB: These results include 14 artists!
Event C
NB: These results include 1 artist!
R Code for Conditional Probability
P(B ∩ C | A) = P(B | A) * P(C | A ∩ B)
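# Assumes usfilter, bothfilters and threefilters from the filtering step (79, 14 and 1 artists)
probabilityus <- nrow(usfilter)/nrow(data)                  # P(A)
aintersectb <- nrow(bothfilters)/nrow(data)                 # P(A ∩ B)
probabilitycandaandb <- nrow(threefilters)/nrow(data)       # P(A ∩ B ∩ C)
bgivena <- aintersectb/probabilityus                        # P(B | A)
probabilitycgivenaandb <- probabilitycandaandb/aintersectb  # P(C | A ∩ B)
answerforprob1 <- probabilitycgivenaandb*bgivena            # P(B ∩ C | A)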
R Code for Conditional Probability
P(B | A) = P(B ∩ A)/P(A)
P(B | A) = 14/79
Represents the probability that an artist has sold more than 160 million units given
they are from the United States
Let A ∩ B = D
P(C | A ∩ B) = P(C | D)
P(C | D) = P(D ∩ C)/P(D)
P(C | D) = 1/14
Represents the probability that an artist has made music for less than 15 years given
they have sold more than 160 million units and they are from the United States
Answer for Problem Statement 1
P(B ∩ C | A) = P(B | A) * P(C | A ∩ B)
P(B ∩ C | A) = 14/79 * 1/14
P(B ∩ C | A) = 1/79
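The arithmetic above can be checked with a short R sketch that works directly from the counts reported for Events A, B and C (79, 14 and 1 artists). The variable names are chosen for this illustration only.

```r
# Counts reported in the slides for the nested events
n_a <- 79    # artists from the United States (Event A)
n_ab <- 14   # of those, artists with more than 160 million TCU (Event B)
n_abc <- 1   # of those, artists active for less than 15 years (Event C)

# Chain rule: P(B and C | A) = P(B | A) * P(C | A and B)
p_b_given_a <- n_ab / n_a
p_c_given_ab <- n_abc / n_ab
p_bc_given_a <- p_b_given_a * p_c_given_ab

# Direct route: the single qualifying artist out of all United States artists
p_direct <- n_abc / n_a

print(p_bc_given_a)
print(p_direct)
```

Both routes give 1/79, since the count of 14 cancels in the product.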
Problem Statement 2 Normal Distribution
The total certified units for 50 artists are normally distributed with a sample mean of ‘x’
and a sample standard deviation of ‘y’. The threshold to be considered a ‘highly
successful artist’ is ‘P’. Estimate the number of artists who qualify for this title within the
Rock genre. Do the same for the Pop genre and compare which genre is more likely
to produce a highly successful artist.
Problem Statement 2
NORMAL DISTRIBUTION
rand_sample<- data[sample(nrow(data), 50), ]
View(rand_sample)
This code first generates a table filled
with a random selection of 50 members
of the original dataset. There are no
selection criteria, just a total of 50 rows.
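Because sample() draws different rows on every run, the figures obtained from a random sample change between runs. A minimal sketch of a reproducible draw is shown below; the seed value and the `population` data frame are arbitrary stand-ins and are not part of the original analysis.

```r
# Hypothetical stand-in for the 121-row dataset
population <- data.frame(id = 1:121, TCU = seq(50, 290, length.out = 121))

# Fixing the seed makes the random draw repeatable
set.seed(2023)
rand_sample <- population[sample(nrow(population), 50), ]

# Resetting the same seed reproduces exactly the same 50 rows
set.seed(2023)
rand_sample_again <- population[sample(nrow(population), 50), ]

print(nrow(rand_sample))
print(identical(rand_sample, rand_sample_again))
```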
Problem Statement 2
NORMAL DISTRIBUTION
This table is then filtered and split into 2 sub-tables: one for Pop and one for Rock.
Filter code for pop
pop_filter <- filter(rand_sample,
                     rand_sample$Genre == 'Pop' |
                     rand_sample$Genre == 'Pop ')
View(pop_filter)
Filter code for rock
rock_filter <- filter(rand_sample,
                      rand_sample$Genre == 'Rock' |
                      rand_sample$Genre == 'Rock ')
View(rock_filter)
Problem Statement 2
NORMAL DISTRIBUTION
The mean, median and standard deviation for both pop and rock were found
pop_mean <- mean(pop_filter$TCU)      # Mean
print(pop_mean)
pop_median <- median(pop_filter$TCU)  # Median
print(pop_median)
pop_standev <- sd(pop_filter$TCU)     # Standard Deviation
print(pop_standev)
Problem Statement 2
NORMAL DISTRIBUTION
x_value <- 160
pop_z <- (x_value - pop_mean)/pop_standev
print(pop_z)
The z score was then found using the
mean, the standard deviation and the
threshold x value, which for this question
is 160 (million).
Problem Statement 2
NORMAL DISTRIBUTION
pop_prob <- pnorm(pop_z, mean = 0, sd = 1,
                  lower.tail = TRUE)
print(pop_prob)
The calculated z score is then used to
calculate the probability that the chosen
artist has a TCU less than 160 (million);
lower.tail = TRUE returns this lower-tail probability.
Problem Statement 2
NORMAL DISTRIBUTION
morethan_popprob<- 1- pop_prob
However, since we need the probability
that the chosen artist has a TCU greater
than 160 (million), we then subtract the
probability found from one.
Problem Statement 2
NORMAL DISTRIBUTION
pop_count<-nrow(pop_filter)
print(pop_count)
pop_estim<-(morethan_popprob * pop_count)
print(pop_estim)
To expand this probability to the entire
sample, the found probability is then
multiplied by the number of elements.
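The whole estimate can be sketched as one small function. The name `estimate_above` and the toy values are hypothetical, and the sketch uses pnorm() with lower.tail = FALSE, which returns the upper-tail probability directly and is equivalent to one minus the lower-tail probability.

```r
# Estimate how many values exceed a threshold under a normal model
estimate_above <- function(values, threshold) {
  z <- (threshold - mean(values)) / sd(values)  # z score of the threshold
  p_above <- pnorm(z, lower.tail = FALSE)       # P(X > threshold)
  list(
    z = z,
    p_above = p_above,
    expected_count = p_above * length(values)   # probability times group size
  )
}

# Hypothetical TCU values (in millions), not taken from the dataset
toy_tcu <- c(60, 80, 100, 120, 140)
res <- estimate_above(toy_tcu, threshold = 100)
print(res)
```

With the threshold set at the toy mean, half of the distribution lies above it, so the expected count is half of the group size.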
Histogram and distribution curve for Pop
Mean: 115.7583
Median: 89.75
Standard Deviation: 78.15496
Variance: 6108.19777
Histogram and distribution curve for Rock
Mean: 80.28824
Median: 62.1
Standard Deviation: 48.08145
Variance: 2311.8258
DISCUSSION
PROBLEM STATEMENT 1
DISCUSSION
We used conditional probability to closely investigate very specific
groups within the dataset, such as those artists who have been making
music in the United States.
As the results show, 65% of the best-selling artists are from the United States.
As the majority of the dataset, they provided the perfect group to base our
conditional probability around.
DISCUSSION
The term 'successful' has no set definition and as such, we had to quantify it,
setting the threshold at 160 million units. Conditional probability allows us to
delve further into this data, as we can now calculate the proportion of artists who
have sold more than 160 million units given that they are from the United States.
This probability represents the intersection of groups within the dataset, and can
subsequently be used to calculate conditional probability!
DISCUSSION
The probability that an artist has sold more than 160 million Total Certified Units
given that they are from the United States is about 18% (14/79).
The data was then further conditioned on artists who have been making music for less
than 15 years, and only about 1.3% of the United States artists (1/79) can claim this feat.
DISCUSSION
PROBLEM STATEMENT 2
Discussion in document
From the random sample, ‘Pop’ and ‘Rock’ were chosen as they were the two
most frequent genres in the dataset. Even though the ‘Pop’ genre was less
frequent in this sample than the ‘Rock’ genre, the results throughout were
still in Pop’s favour. The final estimations for Pop and Rock were 3.428052
and 0.8274584 respectively.
This deduction was made with the calculations for the probability
that the artist chosen had a TCU less than 160. However, the
intention was to find the probability that the TCU was greater
than 160.
Correct Discussion
The frequency of the Pop artists versus the Rock artists is proportional to their
probabilities of having a TCU above 160. For this random sample Pop's probability
was 0.5929348 and Rock's probability was 0.8377515.
This shows that the probability of having a 'more successful' career is not directly
related to the genre the artist is in. Even though there are more Pop artists in the
overall population, which means there may be a higher chance of a Pop artist having
a TCU over 160, that was not the case for the sample.
Correct Discussion
The standard deviations for Pop and Rock were 78.15 and 48.08 respectively. This
indicates that the total certified units for Pop are spread more widely around
the mean than those for Rock and, with Pop also having the higher average, it shows
overall promise of success in this genre.
Conclusion
By deploying the models of normal distribution and conditional
probability, along with the measures of central tendency, we found that problem
statement 1 supported our hypothesis while problem statement 2 did not.