
Problem1:: Pharmaceuticals - Csv. For Each Firm, The Following Variables Are Recorded

The document discusses using hierarchical cluster analysis on two datasets: the first contains crime, poverty, and income data for US cities, with the analysis finding 3 clusters; the second contains financial data for pharmaceutical companies, with the analysis identifying 4 clusters based on variables like market capitalization, profit margins, and growth rates. Association rule mining is also applied to course enrollment data, identifying the 6 strongest relationships between combinations of statistics courses.

Uploaded by

sumit kumar
Copyright: © All Rights Reserved

Problem1:

The given data set (raw1) covers crime, poverty and income for different US cities.
Use hierarchical cluster analysis to explore and analyze the data set as follows:
a. Which method are you using, and why?
b. How many clusters are there, and why?
c. What is the distribution of the clusters?
Ans:
I used agglomerative (bottom-up) hierarchical clustering, because the agglomerative
method is less rigid than divisive clustering. The clustering was done with R's hclust
function, using complete linkage on Euclidean distances between the standardized
variables. The R code used is as follows:

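> input <- read.csv("raw1.csv", header = TRUE)
> dim(input)
[1] 41 4
> # Keep the three numeric columns; the City column is used only for labels
> mydata <- input[1:41, 2:4]
> dim(mydata)
[1] 41 3
> # Standardize so that no single variable dominates the distance
> normalized_data <- scale(mydata)
> View(normalized_data)
> d <- dist(normalized_data, method = "euclidean")
> hc <- hclust(d, method = "complete")
> plot(hc)
> # Redraw with city names as leaf labels; a negative hang aligns labels at height 0
> plot(hc, labels = input$City, hang = -4)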
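
Parts a to c can be made more concrete by comparing linkage methods and then cutting the tree. The following is only a minimal sketch: raw1.csv is not available here, so it uses R's built-in USArrests data as a stand-in, and the choice of k = 3 and the list of linkage methods are assumptions for illustration, not results taken from the answer above.

```r
# Stand-in data: R's built-in USArrests (50 states, 4 numeric columns),
# used only because raw1.csv is not available here.
scaled <- scale(USArrests)
d <- dist(scaled, method = "euclidean")

# Compare linkage methods by cophenetic correlation: values closer to 1
# mean the dendrogram preserves the original pairwise distances better.
linkages <- c("complete", "average", "single", "ward.D2")
coph <- sapply(linkages, function(m) cor(d, cophenetic(hclust(d, method = m))))
print(round(coph, 3))

# Cut the complete-linkage tree into k clusters (k = 3 is an assumption).
hc <- hclust(d, method = "complete")
k <- 3
clusters <- cutree(hc, k = k)

# Distribution of observations across the clusters.
cluster_sizes <- table(clusters)
print(cluster_sizes)

# Mark the chosen clusters on the dendrogram.
plot(hc, hang = -1, cex = 0.6)
rect.hclust(hc, k = k, border = "red")
```

With the real data, `USArrests` would be replaced by `mydata`, and the number of clusters would be read off the dendrogram by looking for the largest vertical gap between successive merges.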

Problem2:
Pharmaceutical Industry. An equities analyst is studying the pharmaceutical industry and
would like your help in exploring and understanding the financial data collected by her firm.
Her main objective is to understand the structure of the pharmaceutical industry using some
basic financial measures.
Financial data gathered on 21 firms in the pharmaceutical industry are available in the file
Pharmaceuticals.csv. For each firm, the following variables are recorded:
1. Market capitalization (in billions of dollars)
2. Beta
3. Price/earnings ratio
4. Return on equity
5. Return on assets
6. Asset turnover
7. Leverage
8. Estimated revenue growth
9. Net profit margin
10. Median recommendation (across major brokerages)
11. Location of firm’s headquarters
12. Stock exchange on which the firm is listed
Use hierarchical cluster analysis to explore and analyze the given dataset as follows:
a. Use only the numerical variables (1 to 9) to cluster the 21 firms. Justify the various
choices made in conducting the cluster analysis, such as weights for different variables,
the specific clustering algorithm(s) used, the number of clusters formed, and so on.
b. Interpret the clusters with respect to the numerical variables used in forming the
clusters.
c. Is there a pattern in the clusters with respect to the non-numerical variables (10 to 12),
that is, those not used in forming the clusters?

Ans:

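> data <- read.csv("Pharmaceuticals.csv", header = TRUE)
> data
> dim(data)
> # Columns 3 to 11 hold the nine numerical variables (columns 1 and 2 are identifiers)
> pharma <- data[1:21, 3:11]
> pharma
> # Standardize so that each variable carries equal weight in the distance
> abdata <- scale(pharma)
> View(abdata)
> # hclust() needs a distance object, not the scaled data matrix itself
> d <- dist(abdata, method = "euclidean")
> fit <- hclust(d)   # default linkage is "complete"
> plot(fit)
> # Redraw the dendrogram labelled by market capitalization, firm name and location
> plot(fit, labels = data$Market_Cap, hang = -1)
> plot(fit, labels = data$Name, hang = -1)
> plot(fit, labels = data$Location, hang = -1)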
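
Parts b and c ask for an interpretation of the clusters and for any pattern in the variables left out of the clustering. A hedged sketch of one way to do this is shown here. Pharmaceuticals.csv is not available in this context, so the built-in iris data stands in for it: its four numeric columns play the role of variables 1 to 9, and its Species column plays the role of a held-out categorical variable such as Location. The value k = 4 is an assumption for the sketch.

```r
# Stand-in data: built-in iris (4 numeric columns plus the categorical
# Species column, which is deliberately kept out of the clustering).
num_vars <- scale(iris[, 1:4])
fit <- hclust(dist(num_vars, method = "euclidean"), method = "complete")

k <- 4  # number of clusters; an assumption for this sketch
cluster <- cutree(fit, k = k)

# Part b: mean of each original (unscaled) variable within each cluster.
cluster_means <- aggregate(iris[, 1:4], by = list(cluster = cluster), FUN = mean)
print(cluster_means)

# Part c: cross-tabulate cluster membership against the held-out variable.
pattern <- table(cluster = cluster, Species = iris$Species)
print(pattern)
```

For the pharmaceutical data, the same two steps would use `pharma` for the means and columns such as `data$Location` for the cross-tabulation. A row of the table dominated by one category would suggest a pattern; evenly spread counts would suggest none.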
C.
Problem3:
Identifying Course Combinations. The Institute for Statistics Education at Statistics.com
offers online courses in statistics and analytics, and is seeking information that will help in
packaging and sequencing courses. Consider the data in the file Course-Topics.csv, the first
few rows of which are shown in the table below (Coursetopics.csv). These data are for
purchases of online statistics courses at Statistics.com. Each row represents the courses
attended by a single customer. The firm wishes to assess alternative sequencings and
bundlings of courses. Use association rules to analyze these data: plot the course
frequencies, then find the top 6 rules by lift, using a minimum support of 0.01 and a
minimum confidence of 0.5.

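library(arules)  # provides the transactions class, apriori() and inspect()
ct <- read.csv("Coursetopics.csv")
ct.mat <- as.matrix(ct)
# Coerce the purchase matrix to a transactions object
ct.tran <- as(ct.mat, "transactions")
inspect(ct.tran)
# Course frequency plot
itemFrequencyPlot(ct.tran)
# Rules with at least two items, support >= 0.01 and confidence >= 0.5
rules.all <- apriori(ct.tran, parameter = list(minlen = 2, supp = 0.01, conf = 0.5))
inspect(sort(rules.all, by = "lift")[1:6])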
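
To show what the support, confidence and lift thresholds measure, the sketch below computes them by hand in base R for a single rule. The purchase matrix and the course names are made up for illustration, and `rule_measures` is a hypothetical helper, not part of arules.

```r
# Toy 0/1 purchase matrix (hypothetical data): rows are customers,
# columns are courses.
purchases <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Intro", "Regression", "Forecast"))
)

# Measures for the rule {lhs} => {rhs} (hypothetical helper).
rule_measures <- function(m, lhs, rhs) {
  has_lhs <- m[, lhs] == 1
  has_rhs <- m[, rhs] == 1
  support <- mean(has_lhs & has_rhs)       # share of rows with both courses
  confidence <- support / mean(has_lhs)    # P(rhs | lhs)
  lift <- confidence / mean(has_rhs)       # confidence relative to P(rhs)
  c(support = support, confidence = confidence, lift = lift)
}

res <- rule_measures(purchases, "Intro", "Regression")
print(res)
```

For this toy matrix the rule {Intro} => {Regression} has support 0.6, confidence 0.75 and lift 0.9375. It would pass the 0.01 support and 0.5 confidence thresholds, but a lift just below 1 means buying Intro makes Regression slightly less likely than its baseline rate, which is why the rules are ranked by lift rather than by confidence alone.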
