Basic description and visualisation of multivariate data in R
Categorical variates
Contingency tables
Load the Titanic data set from a csv file (note that dots were used to represent missing data).
Titanic <- read.csv("Titanic.csv",na.strings=".")
#
# Define Sex as factor (0 = Female, 1 = Male)
#
Titanic$Sex <- factor(Titanic$Sex,levels=c(0,1),labels=c("Female","Male"))
Joint distribution table (contingency table, cross tab, 2-way table): absolute frequencies.
table(Titanic$Sex,Titanic$Survived)
No Yes
Female 154 308
Male 708 142
Joint distribution table (contingency table): relative frequencies.
table(Titanic$Sex,Titanic$Survived)/nrow(Titanic)
No Yes
Female 0.1172887 0.2345773
Male 0.5392232 0.1081493
See also ftable function.
Marginal distributions
Given two variates x, which takes values x1 , x2 , . . . , xn , and y, which takes values y1 , y2 , . . . , yn , the marginal
frequency distributions can be obtained as:
n
X n
X
f r(xi ) = f r(xi , yj ) f r(yj ) = f r(xi , yj )
j=1 i=1
Option 1: using table separately for each variate.
# Relative frequencies
table(Titanic$Survived)/length(Titanic$Survived)
No Yes
0.6565118 0.3427266
table(Titanic$Sex)/length(Titanic$Sex)
Female Male
0.3518660 0.6473724
1
Option 2: using addmargin on a contingency table.
tableTitanic <- table(Titanic$Sex,Titanic$Survived)/nrow(Titanic)
addmargins(tableTitanic)
No Yes Sum
Female 0.1172887 0.2345773 0.3518660
Male 0.5392232 0.1081493 0.6473724
Sum 0.6565118 0.3427266 0.9992384
Conditional distributions
f r(xi , yj )
f r(xi |yj ) =
f r(yj )
Example: relative frequency of survival being a woman.
rel.f r(Survived = Y es, Sex = F emale)
rel.f r(Survived = Y es|Sex = F emale) = = 0.1671
rel.f r(Sex = F emale)
Conditional distributions of Survived given Sex:
prop.table(tableTitanic,1)
No Yes
Female 0.3333333 0.6666667
Male 0.8329412 0.1670588
The second argument of prop.table sets the dimension of the table which the conditioning variate is located
at. In this case Sex is located in the first dimension (rows).
Conditional distributions of Sex given Survived:
prop.table(tableTitanic,2)
No Yes
Female 0.1786543 0.6844444
Male 0.8213457 0.3155556
Relationship between pairs of categorical variates
The summary function can be used on a table class object in order to perform a Chi-square test of independence.
The null hypothesis is independence, no association.
summary(table(Titanic$Sex,Titanic$Survived))
Number of cases in table: 1312
Number of factors: 2
Test for independence of all factors:
Chisq = 331.5, df = 1, p-value = 4.444e-74
The p-value results to be extremely low and, hence, it provides support to conclude statistically significant
association between sex and survival.
2
Some graphical representations for categorical variates
Mosaics
plot(tableTitanic,main="Titanic",ylab="Survived",xlab="Sex",col=rainbow(2))
Titanic
Female Male
No
Survived
Yes
Sex
Using the HairEyeColor dataset included in R:
a <- as.table(HairEyeColor[,,"Male"]) # Extract table for Male category
a
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
plot(a,main="Hair and eye colour (Male)",ylab="Eye",xlab="Hair",col=rainbow(4))
3
Hair and eye colour (Male)
Black Brown Red Blond
Brown
Eye
Blue
Green Hazel
Hair
b <- as.table(HairEyeColor[,,"Female"]) # Extract table for Female category
b
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
plot(b,main="Hair and eye colour (Female)",ylab="Eye",xlab="Hair",col=rainbow(4))
Hair and eye colour (Female)
Black Brown Red Blond
Brown
Eye
BlueHazel
Green
Hair
4
Combined barplots
barplot(t(b),main="Hair and eye colour (Female)",col=rainbow(4),beside=T,xlab="Hair color",
legend=T,args.legend=list(x="topleft",bty="n"))
Hair and eye colour (Female)
Brown
60
Blue
Hazel
50
Green
40
30
20
10
0
Black Brown Red Blond
Hair color
Multi-way tables for categorical variates
The ftable function provides a neat format for contingency tables involving more than two variates. It
works on both raw data matrices and table objects. For example, using the HairEyeColor dataset, which is
a list of two tables (Male, Female), we obtain:
ftable(HairEyeColor)
Sex Male Female
Hair Eye
Black Brown 32 36
Blue 11 9
Hazel 10 5
Green 3 2
Brown Brown 53 66
Blue 50 34
Hazel 25 29
Green 15 14
Red Brown 10 16
Blue 10 7
Hazel 7 7
Green 7 7
Blond Brown 3 4
Blue 30 64
Hazel 5 5
5
Green 8 8
From raw data (personal.csv file):
Personal <- read.table("personal.csv",sep=",",header=T)
head(Personal)
Sex Status Education City
1 Female Married University Glasgow
2 Female Married Primary Edinburgh
3 Female Married Secondary Dundee
4 Female Married Primary Glasgow
5 Female Married Secondary Glasgow
6 Female Married Secondary Glasgow
ftable(Personal)
City Dundee Edinburgh Glasgow
Sex Status Education
Female Couple Primary 0 0 0
Secondary 0 0 0
University 0 0 0
Divorced Primary 0 0 0
Secondary 0 0 0
University 0 0 0
Married Primary 0 3 2
Secondary 1 2 5
University 0 1 1
Single Primary 0 2 1
Secondary 0 1 1
University 0 2 0
Male Couple Primary 1 0 0
Secondary 1 0 0
University 1 0 0
Divorced Primary 0 0 0
Secondary 0 1 0
University 0 1 0
Married Primary 0 1 0
Secondary 0 0 0
University 0 0 0
Single Primary 4 0 0
Secondary 2 0 1
University 4 0 2
The variates can be arranged in different ways by the arguments row.vars and col.vars:
ftable(Personal,row.vars=c("Status","Education"))
Sex Female Male
City Dundee Edinburgh Glasgow Dundee Edinburgh Glasgow
Status Education
Couple Primary 0 0 0 1 0 0
Secondary 0 0 0 1 0 0
University 0 0 0 1 0 0
Divorced Primary 0 0 0 0 0 0
Secondary 0 0 0 0 1 0
University 0 0 0 0 1 0
6
Married Primary 0 3 2 0 1 0
Secondary 1 2 5 0 0 0
University 0 1 1 0 0 0
Single Primary 0 2 1 4 0 0
Secondary 0 1 1 2 0 1
University 0 2 0 4 0 2
ftable(Personal,row.vars=c("Status","Education"),col.vars="City")
City Dundee Edinburgh Glasgow
Status Education
Couple Primary 1 0 0
Secondary 1 0 0
University 1 0 0
Divorced Primary 0 0 0
Secondary 0 1 0
University 0 1 0
Married Primary 0 4 2
Secondary 1 2 5
University 0 1 1
Single Primary 4 2 1
Secondary 2 1 2
University 4 2 2
Numerical variates
Linear relationship measures: covariance and correlation
Sample covariance and correlation between two variates x and y:
Pn
i=1 (xi − x̄)(yi − ȳ) Sxy
Sxy = rxy =
n Sx Sy
Using the airquality dataset in R:
help(airquality)
air <- airquality[,1:4] # We focus on experimental measurements
Descriptive measures:
(tip: explore options to deal with missing data when using the following functions)
colMeans(air) # Mean vector
Ozone Solar.R Wind Temp
NA NA 9.957516 77.882353
colMeans(air,na.rm=T) # Mean vector omiting missing values (NA)
Ozone Solar.R Wind Temp
42.129310 185.931507 9.957516 77.882353
apply(air,2,var,na.rm=T) # Variances
Ozone Solar.R Wind Temp
1088.20052 8110.51941 12.41154 89.59133
7
apply(air,2,sd,na.rm=T) # Standard deviations
Ozone Solar.R Wind Temp
32.987885 90.058422 3.523001 9.465270
var(air,na.rm=T) # Covariance matrix
Ozone Solar.R Wind Temp
Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
cov(air,use="complete.obs") # Covariance matrix
Ozone Solar.R Wind Temp
Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
cor(air,use="complete.obs") # Correlation matrix
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
cov2cor(cov(air,use="complete.obs")) # Convert covariance matrix into correlation matrix
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
Visualisation of relationships
Scatter plot
Reveals variables which are positively and negatively correlated, extreme observations, clusters, unusual
patterns, . . .
Plotting multivariate data is imperative before carrying out a formal statistical analysis.
plot(air$Temp,air$Wind,ylab="Wind",xlab="Temperature (ªF)")
8
20
15
Wind
10
5
60 70 80 90
Temperature (ªF)
# Adding smooth loess curve
scatter.smooth(air$Temp,air$Wind,ylab="Wind",xlab="Temperature (ªF)",lpars=list(col="red"))
20
15
Wind
10
5
60 70 80 90
Temperature (ªF)
Matrix of pairwise scatter plots:
plot(air) # (same as using pairs())
9
0 100 200 300 60 70 80 90
150
Ozone
0 50
300
Solar.R
150
0
15
Wind
5
80
Temp
60
0 50 100 150 5 10 15 20
3D scatter plot:
#install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(air$Temp,air$Wind,air$Solar.R,color="blue")
0 50 100 150 200 250 300 350
air$Solar.R
air$Wind
25
20
15
10
5
0
50 60 70 80 90 100
air$Temp
# Using the `iris` dataset in R
scatterplot3d(iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Width,color=as.numeric(iris$Species))
10
iris$Petal.Width
0.0 0.5 1.0 1.5 2.0 2.5
4
5
6
iris$Sepal.Length
7
11
8
2.0
2.5
3.0
3.5
4.0
4.5
iris$Sepal.Width