Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
10 views11 pages

L1 MultivDescriptive

Multivariate Data Analysis
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views11 pages

L1 MultivDescriptive

Multivariate Data Analysis
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Basic description and visualisation of multivariate data in R

Categorical variates

Contingency tables

Load the Titanic data set from a csv file (note that dots were used to represent missing data).
Titanic <- read.csv("Titanic.csv",na.strings=".")
#
# Define Sex as factor (0 = Female, 1 = Male)
#
Titanic$Sex <- factor(Titanic$Sex,levels=c(0,1),labels=c("Female","Male"))

Joint distribution table (contingency table, cross tab, 2-way table): absolute frequencies.
table(Titanic$Sex,Titanic$Survived)

No Yes
Female 154 308
Male 708 142
Joint distribution table (contingency table): relative frequencies.
table(Titanic$Sex,Titanic$Survived)/nrow(Titanic)

No Yes
Female 0.1172887 0.2345773
Male 0.5392232 0.1081493
See also ftable function.

Marginal distributions

Given two variates x, which takes values x1 , x2 , . . . , xn , and y, which takes values y1 , y2 , . . . , yn , the marginal
frequency distributions can be obtained as:

n
X n
X
f r(xi ) = f r(xi , yj ) f r(yj ) = f r(xi , yj )
j=1 i=1

Option 1: using table separately for each variate.


# Relative frequencies
table(Titanic$Survived)/length(Titanic$Survived)

No Yes
0.6565118 0.3427266
table(Titanic$Sex)/length(Titanic$Sex)

Female Male
0.3518660 0.6473724

1
Option 2: using addmargin on a contingency table.
tableTitanic <- table(Titanic$Sex,Titanic$Survived)/nrow(Titanic)
addmargins(tableTitanic)

No Yes Sum
Female 0.1172887 0.2345773 0.3518660
Male 0.5392232 0.1081493 0.6473724
Sum 0.6565118 0.3427266 0.9992384

Conditional distributions

f r(xi , yj )
f r(xi |yj ) =
f r(yj )

Example: relative frequency of survival being a woman.

rel.f r(Survived = Y es, Sex = F emale)


rel.f r(Survived = Y es|Sex = F emale) = = 0.1671
rel.f r(Sex = F emale)

Conditional distributions of Survived given Sex:


prop.table(tableTitanic,1)

No Yes
Female 0.3333333 0.6666667
Male 0.8329412 0.1670588
The second argument of prop.table sets the dimension of the table which the conditioning variate is located
at. In this case Sex is located in the first dimension (rows).
Conditional distributions of Sex given Survived:
prop.table(tableTitanic,2)

No Yes
Female 0.1786543 0.6844444
Male 0.8213457 0.3155556

Relationship between pairs of categorical variates

The summary function can be used on a table class object in order to perform a Chi-square test of independence.
The null hypothesis is independence, no association.
summary(table(Titanic$Sex,Titanic$Survived))

Number of cases in table: 1312


Number of factors: 2
Test for independence of all factors:
Chisq = 331.5, df = 1, p-value = 4.444e-74
The p-value results to be extremely low and, hence, it provides support to conclude statistically significant
association between sex and survival.

2
Some graphical representations for categorical variates

Mosaics
plot(tableTitanic,main="Titanic",ylab="Survived",xlab="Sex",col=rainbow(2))

Titanic

Female Male
No
Survived

Yes

Sex
Using the HairEyeColor dataset included in R:
a <- as.table(HairEyeColor[,,"Male"]) # Extract table for Male category
a

Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
plot(a,main="Hair and eye colour (Male)",ylab="Eye",xlab="Hair",col=rainbow(4))

3
Hair and eye colour (Male)

Black Brown Red Blond


Brown
Eye

Blue
Green Hazel

Hair
b <- as.table(HairEyeColor[,,"Female"]) # Extract table for Female category
b

Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
plot(b,main="Hair and eye colour (Female)",ylab="Eye",xlab="Hair",col=rainbow(4))

Hair and eye colour (Female)

Black Brown Red Blond


Brown
Eye

BlueHazel
Green

Hair

4
Combined barplots
barplot(t(b),main="Hair and eye colour (Female)",col=rainbow(4),beside=T,xlab="Hair color",
legend=T,args.legend=list(x="topleft",bty="n"))

Hair and eye colour (Female)

Brown
60

Blue
Hazel
50

Green
40
30
20
10
0

Black Brown Red Blond

Hair color

Multi-way tables for categorical variates

The ftable function provides a neat format for contingency tables involving more than two variates. It
works on both raw data matrices and table objects. For example, using the HairEyeColor dataset, which is
a list of two tables (Male, Female), we obtain:
ftable(HairEyeColor)

Sex Male Female


Hair Eye
Black Brown 32 36
Blue 11 9
Hazel 10 5
Green 3 2
Brown Brown 53 66
Blue 50 34
Hazel 25 29
Green 15 14
Red Brown 10 16
Blue 10 7
Hazel 7 7
Green 7 7
Blond Brown 3 4
Blue 30 64
Hazel 5 5

5
Green 8 8
From raw data (personal.csv file):
Personal <- read.table("personal.csv",sep=",",header=T)
head(Personal)

Sex Status Education City


1 Female Married University Glasgow
2 Female Married Primary Edinburgh
3 Female Married Secondary Dundee
4 Female Married Primary Glasgow
5 Female Married Secondary Glasgow
6 Female Married Secondary Glasgow
ftable(Personal)

City Dundee Edinburgh Glasgow


Sex Status Education
Female Couple Primary 0 0 0
Secondary 0 0 0
University 0 0 0
Divorced Primary 0 0 0
Secondary 0 0 0
University 0 0 0
Married Primary 0 3 2
Secondary 1 2 5
University 0 1 1
Single Primary 0 2 1
Secondary 0 1 1
University 0 2 0
Male Couple Primary 1 0 0
Secondary 1 0 0
University 1 0 0
Divorced Primary 0 0 0
Secondary 0 1 0
University 0 1 0
Married Primary 0 1 0
Secondary 0 0 0
University 0 0 0
Single Primary 4 0 0
Secondary 2 0 1
University 4 0 2
The variates can be arranged in different ways by the arguments row.vars and col.vars:
ftable(Personal,row.vars=c("Status","Education"))

Sex Female Male


City Dundee Edinburgh Glasgow Dundee Edinburgh Glasgow
Status Education
Couple Primary 0 0 0 1 0 0
Secondary 0 0 0 1 0 0
University 0 0 0 1 0 0
Divorced Primary 0 0 0 0 0 0
Secondary 0 0 0 0 1 0
University 0 0 0 0 1 0

6
Married Primary 0 3 2 0 1 0
Secondary 1 2 5 0 0 0
University 0 1 1 0 0 0
Single Primary 0 2 1 4 0 0
Secondary 0 1 1 2 0 1
University 0 2 0 4 0 2
ftable(Personal,row.vars=c("Status","Education"),col.vars="City")

City Dundee Edinburgh Glasgow


Status Education
Couple Primary 1 0 0
Secondary 1 0 0
University 1 0 0
Divorced Primary 0 0 0
Secondary 0 1 0
University 0 1 0
Married Primary 0 4 2
Secondary 1 2 5
University 0 1 1
Single Primary 4 2 1
Secondary 2 1 2
University 4 2 2

Numerical variates

Linear relationship measures: covariance and correlation

Sample covariance and correlation between two variates x and y:


Pn
i=1 (xi − x̄)(yi − ȳ) Sxy
Sxy = rxy =
n Sx Sy

Using the airquality dataset in R:


help(airquality)
air <- airquality[,1:4] # We focus on experimental measurements

Descriptive measures:
(tip: explore options to deal with missing data when using the following functions)
colMeans(air) # Mean vector

Ozone Solar.R Wind Temp


NA NA 9.957516 77.882353
colMeans(air,na.rm=T) # Mean vector omiting missing values (NA)

Ozone Solar.R Wind Temp


42.129310 185.931507 9.957516 77.882353
apply(air,2,var,na.rm=T) # Variances

Ozone Solar.R Wind Temp


1088.20052 8110.51941 12.41154 89.59133

7
apply(air,2,sd,na.rm=T) # Standard deviations

Ozone Solar.R Wind Temp


32.987885 90.058422 3.523001 9.465270
var(air,na.rm=T) # Covariance matrix

Ozone Solar.R Wind Temp


Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
cov(air,use="complete.obs") # Covariance matrix

Ozone Solar.R Wind Temp


Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
cor(air,use="complete.obs") # Correlation matrix

Ozone Solar.R Wind Temp


Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
cov2cor(cov(air,use="complete.obs")) # Convert covariance matrix into correlation matrix

Ozone Solar.R Wind Temp


Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000

Visualisation of relationships

Scatter plot
Reveals variables which are positively and negatively correlated, extreme observations, clusters, unusual
patterns, . . .
Plotting multivariate data is imperative before carrying out a formal statistical analysis.
plot(air$Temp,air$Wind,ylab="Wind",xlab="Temperature (ªF)")

8
20
15
Wind

10
5

60 70 80 90

Temperature (ªF)
# Adding smooth loess curve
scatter.smooth(air$Temp,air$Wind,ylab="Wind",xlab="Temperature (ªF)",lpars=list(col="red"))
20
15
Wind

10
5

60 70 80 90

Temperature (ªF)
Matrix of pairwise scatter plots:
plot(air) # (same as using pairs())

9
0 100 200 300 60 70 80 90

150
Ozone

0 50
300

Solar.R
150
0

15
Wind

5
80

Temp
60

0 50 100 150 5 10 15 20

3D scatter plot:
#install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(air$Temp,air$Wind,air$Solar.R,color="blue")
0 50 100 150 200 250 300 350
air$Solar.R

air$Wind

25
20
15
10
5
0
50 60 70 80 90 100

air$Temp
# Using the `iris` dataset in R
scatterplot3d(iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Width,color=as.numeric(iris$Species))

10
iris$Petal.Width

0.0 0.5 1.0 1.5 2.0 2.5

4
5
6

iris$Sepal.Length
7

11
8
2.0
2.5
3.0
3.5
4.0
4.5

iris$Sepal.Width

You might also like