Data Visualization
Exercise 7: Integrated Analysis - K-means Clustering
using Tableau & R
As you proceed with the assignment, follow the written instructions. Screenshots
are provided ONLY as a reference.
Make sure you submit all screenshots with a clearly visible menu bar including the
date and timestamp.
Note: You must use the following conventions to name objects/systems created in
this exercise.
Objective: The objective of this exercise is to gain hands-on experience with K-means
clustering analysis in R and with visualizing the results in Tableau.
The exercise covers the following parts:
1) Data retrieval
2) Data pre-processing
3) K-means clustering
4) Tableau-R integration by invoking Rserve()
Judd D. Bradbury Page 1
1) Data Retrieval
1. Open a new project in RStudio, using the screenshots below as a reference
2. Select New Project
3. Name the directory Kmean and browse to the folder you are working in.
4. Create the new project by clicking ‘Create Project’ in this new working directory,
and allow it a minute to finish processing.
5. Create a new R script (go to File -> New File -> R Script) and save it as
KmeanFirstNameLastName.R (replace FirstNameLastName by your FirstName
LastName) in the Kmean folder.
6. Place the titanic.csv file in the Kmean folder created in step 3
7. Read the CSV file into an object called titanic as shown below.
**Make sure your titanic.csv file is in the Kmean folder.
The dataset is now in the titanic data frame.
8. Execute the command to check the number of rows and columns as mentioned below.
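As a minimal sketch of steps 7 and 8, the commands could look like the following. The handout's screenshot is not reproduced here, so the exact call is an assumption; the temporary file and its three rows are made up so the sketch runs anywhere. In the exercise, replace csv_path with "titanic.csv" in the Kmean folder.

```r
# Hypothetical stand-in for titanic.csv (made-up rows, not the real data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Survived,Pclass,Sex,Age",
             "1,0,3,male,22",
             "2,1,1,female,38",
             "3,1,3,female,"), csv_path)

# Step 7: read the file into a data frame called titanic
titanic <- read.csv(csv_path, stringsAsFactors = FALSE)

# Step 8: number of rows and columns
dim(titanic)
```

An empty Age field is read as NA, which is why the pre-processing section starts by handling missing ages.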
Q1) Write the output of the above executed command.
Answer:
2) Data Pre-processing
This step involves replacing missing age values with the mean age, adding an age
category column, and replacing coded labels with meaningful names.
1. Execute the command shown in the snapshot below to calculate the average age of the
passengers.
Q2) What is the average age that you get?
Answer:
2. Execute the commands below to round each age to the nearest integer.
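One possible way to carry out steps 1 and 2 is sketched here. The ages are made-up values, not the Titanic data, and the variable name avg_age is hypothetical; the handout's own commands may differ.

```r
# Toy ages with missing values (made-up numbers)
titanic <- data.frame(Age = c(22, 38, NA, 35, NA, 54))

# Step 1: average age, ignoring missing values
avg_age <- mean(titanic$Age, na.rm = TRUE)

# Replace the missing ages with the average
titanic$Age[is.na(titanic$Age)] <- avg_age

# Step 2: round every age to the nearest integer
titanic$Age <- round(titanic$Age)
titanic$Age
```

Without na.rm = TRUE, mean() returns NA as soon as a single age is missing.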
3. Create a new variable (vector) called Age Category and assign a category to every
passenger in the dataset.
4. Replace the integer values 0 and 1 in the survivor variable (vector) with meaningful
labels.
5. Convert the integer and character vectors of the other variables to factor variables, as
shown in the screenshot.
6. Remove the other redundant variables such as Ticket and Cabin from the titanic data
frame.
7. Crosscheck the processed values using the command as mentioned below.
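Steps 3 to 7 might be sketched as follows. The rows are made up, and several details are assumptions rather than the handout's actual code: the column name AgeCat, the age break points (only the 17-32 band is mentioned later in this handout), and the labels "Died" and "Survived".

```r
# Made-up rows standing in for the titanic data frame
titanic <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 4, 60),
  Embarked = c("S", "C", "S", "Q"),
  Ticket   = c("T1", "T2", "T3", "T4"),
  Cabin    = c("", "C1", "", "C2"),
  stringsAsFactors = FALSE
)

# Step 3: age category (break points are assumptions)
titanic$AgeCat <- cut(titanic$Age,
                      breaks = c(0, 16, 32, 48, 64, Inf),
                      labels = c("0-16", "17-32", "33-48", "49-64", "65+"),
                      include.lowest = TRUE)

# Step 4: meaningful labels for the 0/1 survival flag (labels are assumptions)
titanic$Survived <- factor(titanic$Survived, levels = c(0, 1),
                           labels = c("Died", "Survived"))

# Step 5: convert the remaining categorical columns to factors
for (col in c("Pclass", "Sex", "Embarked")) {
  titanic[[col]] <- factor(titanic[[col]])
}

# Step 6: drop the redundant columns
titanic$Ticket <- NULL
titanic$Cabin  <- NULL

# Step 7: cross-check column types and values
str(titanic)
```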
Q3) Paste the screenshot of the above executed command.
Answer:
8. Save the processed titanic dataset
3) K-means Clustering
1. Three important categorical variables in the titanic dataset are
1. Survived
2. Sex
3. Embarked
Execute the command below to convert these 3 categorical variables into numeric codes.
The code creates 3 new numeric variables and saves the result in the
new file “titanicUpdated.csv”.
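A minimal sketch of this conversion, assuming the new columns are named SurvivedNum, SexN and EmbarkedN (the names used later in the Tableau calculated field). The rows are made up, and the file is written to a temporary folder here; in the exercise it would go to the Kmean folder.

```r
# Made-up rows with the three categorical variables as factors
titanic <- data.frame(
  Survived = factor(c("Died", "Survived", "Survived", "Died")),
  Sex      = factor(c("male", "female", "female", "male")),
  Embarked = factor(c("S", "C", "S", "Q"))
)

# Numeric versions of the categorical variables
titanic$SurvivedNum <- as.integer(titanic$Survived == "Survived")  # 0 or 1
titanic$SexN        <- as.integer(titanic$Sex)       # alphabetical levels: female = 1, male = 2
titanic$EmbarkedN   <- as.integer(titanic$Embarked)  # alphabetical levels: C = 1, Q = 2, S = 3

# Save the updated data set
out_path <- file.path(tempdir(), "titanicUpdated.csv")
write.csv(titanic, out_path, row.names = FALSE)
```

Note that as.integer() on a factor returns the position of each level, so the codes depend on the level order.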
Before running the cluster analysis, one has to select the optimum number of clusters.
2. Execute the command below to normalize the data and find the number of clusters
required.
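The normalization and the search over cluster counts can be sketched as below. This is an illustration on made-up two-column data, not the handout's own code; the upper limit max_k = 6 and nstart = 10 are assumptions.

```r
set.seed(1234)  # same value the handout later uses for the Seed parameter

# Toy numeric data (made up) standing in for the clustering columns
dat <- data.frame(x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
                  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5)))

# Normalize: every column gets mean 0 and standard deviation 1
scaled <- scale(dat)

# Total within-cluster sum of squares (WSS) for k = 1..max_k
max_k <- 6
wss <- numeric(max_k)
wss[1] <- (nrow(scaled) - 1) * sum(apply(scaled, 2, var))  # k = 1: total variation
for (k in 2:max_k) {
  wss[k] <- kmeans(scaled, centers = k, nstart = 10)$tot.withinss
}
wss
```

A large drop in WSS from one k to the next means the extra cluster explains a lot of variation; small drops mean it adds little.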
3. Plot the two graphs using the code below
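The handout's plotting code is not reproduced here, so which two graphs it intends is an assumption. The sketch below draws two common choices on made-up data: the within-cluster sum of squares against the number of clusters, and the share of variation explained.

```r
set.seed(1234)

# Toy normalized data (made up): two well-separated groups
scaled <- scale(cbind(x = c(rnorm(20, 0), rnorm(20, 5)),
                      y = c(rnorm(20, 0), rnorm(20, 5))))

ks <- 1:6
wss <- sapply(ks, function(k) {
  if (k == 1) {
    sum(scaled^2)  # columns are centered, so this is the total sum of squares
  } else {
    kmeans(scaled, centers = k, nstart = 10)$tot.withinss
  }
})
explained <- 1 - wss / wss[1]  # share of total variation explained by k clusters

# Graph 1: WSS against the number of clusters
plot(ks, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

# Graph 2: share of variation explained
plot(ks, explained, type = "b",
     xlab = "Number of clusters", ylab = "Share of variation explained")
```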
Q4) Paste a screenshot of the two plots
Answer:
Q5) Write the number of clusters you selected for the analysis
Answer:
(Hint: Keep adding clusters until adding one more no longer explains much additional
variation. This is the point where the slope of the curve changes sharply and forms an
elbow in the graph.)
Now that we know the optimum number of clusters to use in our
integrated analysis, which clustering variables are important? In the next section you
will integrate Tableau with R to visualize the clusters in Tableau.
4) TABLEAU / R INTEGRATION
You will have already used Tableau for previous assignments. In this part of the exercise
we will leverage the modeling power of R along with the visualization power of Tableau to
perform our exercise in integrated analysis.
Step 1: Start Rserve and Make the Connection
Type the following code in your R console:
install.packages('Rserve',,'http://www.rforge.net/')
library(Rserve)
Rserve()
For Mac users:
To start Rserve on a Mac, we need to pass arguments to the Rserve() function.
Rserve(args = "--save")
The connection between Tableau and R is made through Rserve. Rserve is a server: a
program that listens for incoming connections from clients and processes their requests.
1. Open Tableau
2. Go to the Help menu and select Settings and Performance and then “Manage
Analytics Extension Connection”.
3. Enter a server name of “Localhost” (or “127.0.0.1”) and a port of “6311”.
4. Click the “Test Connection” button to make sure everything runs smoothly. You
should see a success message. Click OK to close.
Step 2: Open the titanicUpdated.csv file in Tableau
Open the file “titanicUpdated.csv” as a text file in Tableau. Save your Tableau file as
TitanicViz-Student Name.twbx (replace Student Name with your full name).
Note: Save it with the .twbx file extension.
On the “Data Source” tab in Tableau, select “Extract” as the connection type.
A screenshot is shown below for reference.
Step 3: Forming Clusters
Click on New Worksheet in Tableau. On this worksheet you will see all the
dimensions and measures in your current dataset.
Q6) Paste a screenshot of the current set of tables.
Ensure that the variables “Embarked N”, “Sex N” and “Survived Num” appear alongside Age,
Fare, Parch, Pclass, etc.
Answer:
Parameter Creation:
Right click on the empty space within the tables window and select “Create Parameter”.
Name the parameter “# of Clusters”. Select Integer as the data type and, in the Current
value cell, enter the ideal number of clusters that you obtained earlier in this exercise.
The “All” radio button should be selected under Allowable values. Take a screenshot once
you have populated all fields of this parameter.
Q7) Please paste screenshot once you populate all fields for this parameter.
Answer:
Similarly, create another parameter called “Seed” with data type Integer and set its value to
1234. You will now see two parameters in the parameter window of your
worksheet.
Q8) Paste a screenshot once you populate all fields for this “Seed” parameter.
Answer:
Calculated Field creation:
1. Right click on the empty space within the tables window and select “Create
Calculated Field”.
2. In the window that opens, enter “Cluster” as the name, type the code below
into the window, and click OK.
Script_INT("
## Sets the seed
set.seed(.arg8[1])
## Studentizes the variables
age<-(.arg1-mean(.arg1))/sd(.arg1)
pclass<-(.arg2-mean(.arg2))/sd(.arg2)
embarkedn<-(.arg3-mean(.arg3))/sd(.arg3)
sex<-(.arg4-mean(.arg4))/sd(.arg4)
survived<-(.arg5-mean(.arg5))/sd(.arg5)
sibsp<-(.arg6-mean(.arg6))/sd(.arg6)
dat<-cbind(age,pclass,embarkedn,sex,survived,sibsp)
num<-.arg7[1]
## Creates the clusters
kmeans(dat,num)$cluster
",
SUM([Age]),SUM([Pclass]),SUM([EmbarkedN]),SUM([SexN]),SUM([SurvivedNum]),SUM([Sib Sp]),
[# of Clusters],[Seed])
Note: The highlighted parts above are the actual variable names used in your
dataset. If your dataset uses other names for those measures, use those
names here; otherwise you will see an error. A parsing error may also appear later when
you drag ‘Cluster’ to the Columns shelf. To avoid it, type the calculated field code by hand,
because copying it from MS Word into Tableau can introduce a parsing error.
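To check the R portion of the calculated field before using it in Tableau, the same logic can be run directly in R. The sketch below is an assumption-laden stand-in: the function name cluster_like_tableau and the toy data are hypothetical, and scale() is used in place of the hand-written (x - mean) / sd lines, which it reproduces column by column.

```r
# Hypothetical helper mirroring the body of the Tableau calculated field.
# Arguments correspond to .arg1 ... .arg8 in the same order.
cluster_like_tableau <- function(age, pclass, embarkedn, sex, survived, sibsp,
                                 num, seed) {
  set.seed(seed)                                    # .arg8[1]
  dat <- scale(cbind(age, pclass, embarkedn, sex,   # studentize each column
                     survived, sibsp))
  kmeans(dat, num)$cluster                          # .arg7[1] clusters
}

# Made-up data with the column names the calculated field expects
set.seed(1)
n <- 60
toy <- data.frame(
  Age         = sample(1:70, n, replace = TRUE),
  Pclass      = sample(1:3, n, replace = TRUE),
  EmbarkedN   = sample(1:3, n, replace = TRUE),
  SexN        = sample(1:2, n, replace = TRUE),
  SurvivedNum = sample(0:1, n, replace = TRUE),
  SibSp       = sample(0:4, n, replace = TRUE)
)

clusters <- with(toy, cluster_like_tableau(Age, Pclass, EmbarkedN, SexN,
                                           SurvivedNum, SibSp,
                                           num = 4, seed = 1234))
table(clusters)
```

Two things are worth checking this way: kmeans() stops with an error if any input still contains NA, and a column with zero standard deviation turns into NaN after studentizing.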
Step 4: Visualization of Key variables across Clusters
Objective: The objective of this step is to visualize how different variables vary across
the clusters.
MAKE SURE YOU USE BAR CHARTS FOR ALL VISUALIZATIONS EXCEPT FOR THE
FIRST VISUALIZATION.
Step 4.1: Overall Clusters
1. Name the worksheet “Overall Clusters”. As you configure the sheet, click on the x
to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf. Make sure the Survived field is set to
attribute.
5. Drag the other variables SexN, Age, Pclass, EmbarkedN and SibSp onto the Rows
shelf.
6. Click on “Box and Whisker plot” in the Show Me tab.
7. Drag the “Embarked” field from Dimensions to Color. Right click it and convert it to an
attribute. Make sure it is set to discrete.
8. Make sure the “Cluster” variable is set to Discrete and the Survived field is set to
attribute. These two fields will appear in blue on the Columns shelf.
9. Ensure the sequence of the fields is as below:
10. Go to worksheet – Show Title. Add your name to the title.
Q9) Paste the screenshot of the worksheet “Overall Clusters” below
Answer:
Q10) Provide your insights from cluster 1.
Answer:
Q11) Provide your insights from cluster 4.
Answer:
Key Takeaway: The key takeaway from this worksheet is that Sex is the most important
variable in the formation of the clusters, as most of the clusters are predominantly one
gender.
Step 4.2 : Survival by Cluster
Objective: The objective of this sheet is to see which clusters have the highest survivability.
Task:
1. Click on new Worksheet and name it “Survival by Cluster”. As you configure the
sheet, click on the x to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf, to the left of Cluster. Make sure the
Survived field is set to attribute.
5. Drag the “titanicUpdated.csv(Count)” variable to Rows.
6. Select horizontal bar chart as the visualization.
7. Also, drag the variable “Survived” to Color under the “Marks” pane. Right click
it and convert it to an attribute.
8. Ensure the sequence of the fields is as shown below:
Note: The visualization should be a bar chart
9. Go to worksheet – Show Title. Add your name to the title.
10. Right click on the vertical axis and select “Add reference line”
11. On the “Add Reference Line” screen, select Scope as “Entire table”, Value =
“CNT(titanicUpdated.csv)” and “average” and click OK.
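The counts behind this bar chart can be cross-checked in R. The sketch below uses made-up cluster assignments and survival labels; the column names Cluster and Survived and the labels "Died" and "Survived" are assumptions.

```r
# Made-up cluster assignments and survival labels
toy <- data.frame(
  Cluster  = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Survived = c("Died", "Died", "Survived",
               "Survived", "Survived",
               "Died", "Died", "Died", "Survived"),
  stringsAsFactors = FALSE
)

# Passenger counts per cluster and survival status (the bar heights)
counts <- table(toy$Cluster, toy$Survived)
counts

# Share of survivors within each cluster
survival_rate <- prop.table(counts, margin = 1)[, "Survived"]

# The two clusters with the highest survival share
top_two <- names(sort(survival_rate, decreasing = TRUE))[1:2]
top_two
```

Comparing shares rather than raw counts avoids favoring a cluster simply because it contains more passengers.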
Q 12) Paste the screenshot of your worksheet. Ensure that you have added a Title
and your name is contained in the title. Titles without names shall not be
considered for grading.
Answer:
Q 13) From the bar chart, could you name two clusters where survivability is the
highest?
Answer:
Going forward for our analysis, our focus would be on these two clusters.
Q 14) Enter the top two clusters you’ve identified and enter them against <Cluster
1> and <Cluster 2>. Leave the other rows blank for now.
Attribute | <Cluster 1> | <Cluster 2>
Ideal Gender | |
Ideal Passenger Class | |
Ideal Age Category | |
Ideal Embarked point | |
Ideal number of siblings | |
Key Takeaway: The key takeaway from the above worksheet “Survival by Cluster” is
that in only two clusters does the number of passengers who survived exceed or
come close to the number who didn’t survive. Hence these two clusters are the
most important for our analysis.
Step 4.3: Survival by Gender
Objective: The objective of this section is to understand which gender has the best
survivability in each of the top two clusters identified in the previous step.
Task:
1. Click on new Worksheet and name it “Survival by Gender”. As you configure the
sheet, click on the x to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf. Make sure the Survived field is set to
attribute.
5. Drag titanicUpdated.csv(Count) to Rows.
6. Drag “Sex” to Color under the Marks pane. Right click it and convert it to an
attribute.
7. Ensure the sequence of the fields is as shown below:
Note: The visualization should be a bar chart
8. Go to worksheet – Show Title. Add your name to the title.
Q15) Paste your screenshot of the Tableau worksheet below. Make sure it carries
the Title with your name.
Answer:
Q16) If you are female, which cluster should you belong to, to ensure a higher
probability of survival?
Answer:
Q17) Based on your findings, enter the ideal Gender (that has the best chance to
survive) in your top two clusters. (Keep filling all the rows serially)
Attribute | <Cluster 1> | <Cluster 2>
Ideal Gender | <Enter the gender that has the best survivability in this cluster> | <Enter the gender that has the best survivability in this cluster>
Ideal Passenger Class | |
Ideal Age Category | |
Ideal Embarked point | |
Ideal number of siblings | |
Key Takeaway: The key takeaway from the above worksheet “Survival by Gender” is
that in the two prominent clusters identified in the previous steps, one gender clearly
dominates the survivors. Note that cluster 1 is composed of only male passengers.
Step 4.4: Survival by Passenger Class
Objective: The objective of this section is to understand which is the best gender/class
combination from a survivability perspective, in each of our top two clusters.
Task:
1. Click on new Worksheet and name it “Survival by Passenger Class”. As you
configure the sheet, click on the x to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf. Make sure the Survived field is set to
attribute.
5. Place Sex and titanicUpdated.csv(Count) on Rows. Sex should be set to attribute.
6. Drag “Pclass” to Color under the Marks pane. Right click it and convert it to an
attribute. Make sure it is set to discrete.
7. Ensure the sequence of the fields is as shown below
Note: The visualization should be a bar chart
8. Go to worksheet – Show Title. Add your name to the title.
Q18) Paste your screenshot of the Tableau worksheet below. Ensure you have the
title with your name on it.
Answer:
Q19) Based on your findings, enter the ideal Gender/Passenger Class (that has the
best chance to survive) in your top two clusters. (Keep filling all the rows serially.
Data should be filled for all previous rows)
Attribute | <Cluster 1> | <Cluster 2>
Ideal Gender | |
Ideal Passenger Class | <Please write the ideal class for the gender in above cell and above cluster.> | <Please write the ideal class for the gender in above cell and above cluster.>
Ideal Age Category | |
Ideal Embarked point | |
Ideal number of siblings | |
Key Takeaway: The key takeaway from the above worksheet “Survival by Passenger
class” is that in some key clusters, most of the survivors belonged to a particular class.
Step 4.5: Survival by Age Category
Objective: The objective of this section is to understand which is the best gender/age
category combination from a survivability perspective, in each of our top two clusters.
Task:
1. Click on new Worksheet and name it “Survival by Age Category”. As you configure
the sheet, click on the x to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf. Make sure the Survived field is set to
attribute.
5. Place Sex and titanicUpdated.csv(Count) on Rows. Sex should be set to attribute.
6. Drag Age Category (Age Cat) to Color under the Marks pane. Right click it and
convert it to an attribute.
7. Ensure the sequence of the fields is as shown below
Note: The visualization should be a bar chart
8. Go to worksheet – Show Title. Add your name to the title.
Q20) Paste your screenshot of the Tableau worksheet below. Ensure you add the
title with your name.
Answer:
Q21) Based on your findings, enter the ideal Gender/Age Category (that has
the best chance to survive) in your top two clusters. (Keep filling all the rows
serially. Data should be filled for all previous rows)
Attribute | <Cluster 1> | <Cluster 2>
Ideal Gender | |
Ideal Passenger Class | |
Ideal Age Category | <Please write the ideal age category for the gender in above cell and above cluster.> | <Please write the ideal age category for the gender in above cell and above cluster.>
Ideal Embarked point | |
Ideal number of siblings | |
Key Takeaway: Most of the female survivors were in the 17-32 age category.
Most men who survived in cluster 1 also belonged to the 17-32 age category.
Overall, because the majority of the passengers fall in the 17-32 age group,
that group also stands out among the survivors.
Step 4.6: Survival by Embarked Point
Objective: The objective of this section is to understand which is the best
gender/embarked point combination from a survivability perspective, in each of the top
two clusters.
Task:
1. Click on new Worksheet and name it “Survival by Embarked Point”. As you
configure the sheet, click on the x to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf. Make sure the Survived field is set to
attribute.
5. Place Sex and titanicUpdated.csv(Count) on Rows. Sex should be set to attribute.
6. Drag the Embarked variable to Color under the Marks pane. Right click it and
convert it to an attribute.
Note: The visualization should be a bar chart
7. Ensure the sequence of the fields is as shown below
8. Go to worksheet – Show Title. Add your name to the title.
Q22) Paste your screenshot of the Tableau worksheet below. Ensure you have
the title added with your name.
Answer:
Q23) Output: Based on your findings, enter the ideal Gender/Embarked point
(that has the best chance to survive) in your top two clusters. (Keep filling all the
rows serially. Data should be filled for all previous rows)
Attribute | <Cluster 1> | <Cluster 2>
Ideal Gender | |
Ideal Passenger Class | |
Ideal Age Category | |
Ideal Embarked point | <Please write the embarked point for the gender in above cell and above cluster.> | <Please write the embarked point for the gender in above cell and above cluster.>
Ideal number of siblings | |
Key Takeaway: Most of the passengers embarked from Southampton. Among the
survivors, the majority of the passengers were also from Southampton.
Step 4.7: Survival by number of Siblings
Objective: The objective of this section is to understand which is the best combination of
gender and number of siblings from a survivability perspective, in each of the top two clusters.
Task:
1. Click on new Worksheet and name it “Survival by number of Siblings”. As you
configure the sheet, click on the x to ignore the screen errors.
2. Click on the Analysis menu path and deselect (turn off) the Aggregate Measures
selection.
3. Drag the Cluster field to the Columns shelf. Ensure the Cluster is set to Discrete.
4. Drag the Survived field to the Columns shelf. Make sure the Survived field is set to
attribute.
5. Place Sex and titanicUpdated.csv(Count) on Rows. Sex should be set to attribute.
6. Ensure the sequence of the fields is as shown below
7. Drag Sib Sp (Measure) to Color under the Marks pane. Right click it and
convert it to an attribute. Make sure it is discrete.
Note: The visualization should be a bar chart
8. Go to worksheet – Show Title. Add your name to the title.
Q24) Paste your screenshot of the Tableau worksheet below. Ensure you have
the title with your name on it.
Answer:
Q25) Output: Based on your findings, enter the ideal Gender/# of Sib Sp (that
has the best chance to survive) in your top two clusters.
Attribute | <Cluster 1> | <Cluster 2>
Ideal Gender | |
Ideal Passenger Class | |
Ideal Age Category | |
Ideal Embarked point | |
Ideal number of siblings | <Please write the # of siblings for the gender in above cell and above cluster.> | <Please write the # of siblings for the gender in above cell and above cluster.>
Key Takeaway: The majority of the passengers had zero siblings. Within cluster 5, however,
both male and female survivors show 4 siblings. These passengers
constitute a very small percentage of the total.
Step 5: Summary
At this stage, you should have enough information in the table above to
profile the passengers that had the best chance of survival in your top 2 clusters.
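As an optional cross-check of the table, the most common value of each variable among survivors can be computed per cluster in R. This is a sketch on made-up rows; the helper most_common and the column names Cluster and AgeCat are assumptions, and the Tableau worksheets remain the basis for your answers.

```r
# Made-up passengers with cluster assignments
toy <- data.frame(
  Cluster  = c(1, 1, 1, 2, 2, 2, 2),
  Survived = c("Survived", "Survived", "Survived",
               "Survived", "Survived", "Survived", "Died"),
  Sex      = c("female", "female", "male", "male", "male", "male", "female"),
  Pclass   = c(1, 1, 2, 3, 3, 3, 1),
  AgeCat   = c("17-32", "17-32", "33-48", "17-32", "33-48", "33-48", "17-32"),
  Embarked = c("S", "S", "C", "S", "S", "S", "C"),
  SibSp    = c(0, 0, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value of a vector, as text
most_common <- function(x) names(which.max(table(x)))

# Keep survivors only, then profile each cluster
survivors <- subset(toy, Survived == "Survived")
vars <- c("Sex", "Pclass", "AgeCat", "Embarked", "SibSp")
profile <- t(sapply(split(survivors[vars], survivors$Cluster),
                    function(d) sapply(d, most_common)))
profile
```

Each row of profile is one cluster and each column one row of the summary table. When two values tie, which.max() keeps the first one, so ties should be checked by eye.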
Q26) Please put the profile of passengers from each of the top two clusters in
words. Use the above filled in table to form your passenger profile. Please write
two separate sentences explaining each cluster profile.
Answer:
Attach assignments in elearning
1. Attach the assignment document (only solutions to questions) in Microsoft Word
2. Attach all R program files
3. Attach the Tableau File (.twbx)