Web and Social Media Analytics
Group Assignment
Business Intelligence using Text Mining
Submitted By:
Simran Saha
Srinidhi Narsimhan
Indushree Anandraj
Kalai Anbumani
Problem Statement: A dataset of Shark Tank episodes is made available. It contains 495
entrepreneurs making their pitch to the VC sharks. Evaluate CART Tree, Random Forest and Logistic
Regression models before and after adding the variable ‘Ratio’.
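Reading the data:
# Load the packages used in this analysis
library(tm)            # Corpus, tm_map, DocumentTermMatrix
library(SnowballC)     # stemming support for stemDocument
library(wordcloud)     # wordcloud
library(rpart)         # CART
library(rpart.plot)    # prp
library(randomForest)  # randomForest, varImpPlot
Sharktank = read.csv("Shark+Tank+Companies.csv", stringsAsFactors=FALSE)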
Data transformation and Cleaning
Transform dataset into a corpus with required variable i.e. description
# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))
# Convert to lower-case (content_transformer keeps the result a valid tm corpus)
corpus = tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
# Word cloud before removing stopwords
wordcloud(corpus,colors=rainbow(7),max.words=100)
# Remove stopwords, the, and
corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))
# Remove extra whitespaces if any
corpus = tm_map(corpus, stripWhitespace)
# Stem document
corpus = tm_map(corpus, stemDocument)
# Word cloud after removing stopwords and cleaning
wordcloud(corpus,colors=rainbow(7),max.words=100)
Create Document term Matrix
#Building Document term matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
To reduce the dimensions of the DTM, we remove infrequent terms using
removeSparseTerms with a maximum sparsity of 0.995, i.e. we keep only the terms
that appear in more than 0.5% of the descriptions.
# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
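The effect of the sparsity threshold can be illustrated with a small base-R sketch. It does not call tm; it only mimics the rule that a term is kept when the number of documents containing it exceeds nDocs * (1 - sparsity). The toy matrix, the term names and the 0.6 threshold are made up for illustration.

```r
# Toy document-term matrix: 6 documents (rows) x 3 terms (columns), hypothetical data
dtm <- matrix(c(1, 0, 0, 0, 0, 0,
                1, 1, 0, 1, 0, 0,
                1, 1, 1, 1, 1, 0),
              nrow = 6,
              dimnames = list(NULL, c("rare", "common", "frequent")))

# Number of documents in which each term occurs at least once
doc_freq <- colSums(dtm > 0)

# Keep a term only if it occurs in more than nDocs * (1 - sparsity) documents
max_sparsity <- 0.6
keep <- doc_freq > nrow(dtm) * (1 - max_sparsity)
reduced <- dtm[, keep, drop = FALSE]

# Same rule applied to 495 documents and a sparsity of 0.995:
# smallest number of documents a term must appear in to be kept
min_docs_495 <- floor(495 * (1 - 0.995)) + 1
```

Under this rule, a threshold of 0.995 on 495 descriptions keeps terms that occur in at least 3 descriptions.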
Convert to data frame and final step for data preparation
# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))
# Make all variable names R-friendly
colnames(descSparse) = make.names(colnames(descSparse))
Adding the dependent variable:
# Add dependent variable
descSparse$deal = Sharktank$deal
#Get no of deals
table(descSparse$deal)
Building Models
CART MODEL:
SharktankCart = rpart(deal ~ ., data=descSparse, method="class")
Before:
prp(SharktankCart, extra=2)
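Evaluate the performance of the CART model
# predict() takes newdata, not data; here we score the training data itself
predictCART = predict(SharktankCart, newdata=descSparse, type="class")
# Confusion matrix: actual deal (rows) vs predicted deal (columns)
CART_initial <- table(descSparse$deal, predictCART)
CART_initial
Accuracy of the CART model on the training data
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)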
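The quantity computed from the confusion matrix is the model's accuracy. A true baseline would be the accuracy of always predicting the most frequent class. The sketch below shows both on a small made-up vector of outcomes; the values are hypothetical and are not taken from the Shark Tank data.

```r
# Hypothetical actual and predicted outcomes for 8 pitches
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Confusion matrix: rows = actual, columns = predicted
confusion <- table(actual, predicted)

# Model accuracy: correctly classified cases (diagonal) over all cases
model_accuracy <- sum(diag(confusion)) / sum(confusion)

# Majority-class baseline: accuracy of always predicting the most common outcome
majority_baseline <- max(table(actual)) / length(actual)
```

A model is only useful if its accuracy clearly exceeds the majority-class baseline.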
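Random Forest:
SharktankRF = randomForest(deal ~ ., data=descSparse)
# Make predictions (with no newdata, predict() returns the out-of-bag predictions)
predictRF = predict(SharktankRF)
Evaluate the performance of the Random Forest
RandomForestInitial <- table(descSparse$deal, predictRF>= 0.5)
# Accuracy of the Random Forest model
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)
BaseAccuracyRF
# Variable importance as measured by the Random Forest
varImpPlot(SharktankRF,main='Variable Importance Plot: Shark Tank',type=2)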
Logistic Regression
# family = binomial is needed for a logistic model; glm() defaults to a Gaussian linear model
Sharktanklogistic = glm(deal~., data = descSparse, family = binomial)
# Make predictions (fitted probabilities on the training data):
predictLogistic = predict(Sharktanklogistic, type="response")
Evaluate the performance of the Logistic Regression model
LogisticInitial <- table(descSparse$deal, predictLogistic>= 0.5)
LogisticInitial
# Accuracy on the training data
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)
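In R, glm() fits an ordinary (Gaussian) linear model unless a family is given, so the family argument matters when a logistic regression is intended. The sketch below uses a small made-up data set (the x and y values are hypothetical) to show the difference between the two fits.

```r
# Hypothetical data: one predictor and a binary outcome
toy <- data.frame(x = 1:10,
                  y = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 1))

# Without a family, glm() fits a Gaussian linear model
linear_fit  <- glm(y ~ x, data = toy)
linear_pred <- predict(linear_fit)

# With family = binomial, glm() fits a logistic regression;
# type = "response" returns probabilities instead of log-odds
logit_fit  <- glm(y ~ x, data = toy, family = binomial)
logit_prob <- predict(logit_fit, type = "response")

# Family actually used by each model
linear_family <- family(linear_fit)$family
logit_family  <- family(logit_fit)$family
```

The linear fit can produce values outside the 0 to 1 range, whereas the logistic fit always returns probabilities, which makes the 0.5 cut-off easier to interpret.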
Adding the variable ‘Ratio’
descSparse$ratio = Sharktank$askedFor/Sharktank$valuation
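As a small worked example of the new variable, the sketch below computes the ratio for three made-up pitches. The column names askedFor and valuation follow the dataset; the amounts are hypothetical. The reading of the ratio as the share of the company offered is an assumption, based on the valuation being the one implied by the entrepreneur's ask.

```r
# Hypothetical pitches: amount asked for and the valuation of the company
pitches <- data.frame(askedFor  = c(100000, 50000, 25000),
                      valuation = c(1000000, 200000, 500000))

# Ratio of the amount asked for to the valuation
# (assumed here to correspond to the equity share on offer)
pitches$ratio <- pitches$askedFor / pitches$valuation
```

Here the three pitches have ratios of 0.10, 0.25 and 0.05.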
We will Re-run the models to check the changes.
CART MODEL:
SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")
#CART Diagram
prp(SharktankCartRatio, extra=2)
Evaluate the performance of the CART model
# predict() takes newdata, not data; here we score the training data itself
predictCARTRatio = predict(SharktankCartRatio, newdata=descSparse, type="class")
CART_ratio <- table(descSparse$deal, predictCARTRatio)
# Accuracy on the training data
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)
Random Forest:
SharktankRFRatio = randomForest(deal ~ ., data=descSparse)
# Make predictions (with no newdata, predict() returns the out-of-bag predictions)
predictRFRatio = predict(SharktankRFRatio)
Evaluate the performance of the Random Forest
RandomForestRatio <- table(descSparse$deal, predictRFRatio>= 0.5)
# Accuracy of the Random Forest model
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)
# Variable importance as measured by the Random Forest
varImpPlot(SharktankRFRatio,main='Variable Importance Plot: Shark Tank with Ratio',type=2)
Logistic Regression:
# family = binomial is needed for a logistic model; glm() defaults to a Gaussian linear model
SharktanklogisticRatio = glm(deal~., data = descSparse, family = binomial)
# Make predictions (fitted probabilities on the training data):
predictLogisticRatio = predict(SharktanklogisticRatio, type="response")
Evaluate the performance of the Logistic Regression model
LogisticRatio <- table(descSparse$deal, predictLogisticRatio>= 0.5)
LogisticRatio
# Accuracy on the training data
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)
Accuracy of the models:
                      BEFORE      AFTER
CART MODEL            0.6565657   0.6606061
RANDOM FOREST         0.5535354   0.5636364
LOGISTIC REGRESSION   1           1
With the CART model, we predicted 65.66% of outcomes correctly using the description alone
and 66.06% using description plus ratio. With Random Forest, we predicted 55.35% and 56.36%
correctly using the description alone and description plus ratio respectively.
Logistic regression gave 100% accuracy in both cases. However, the model is scored on the same
data it was fitted to, so this result requires further validation: the model should be refitted
with only the significant variables, removing the unnecessary ones, to derive a measurable output.
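One common way to carry out such validation is to hold out part of the data and measure accuracy only on the held-out rows. The sketch below is a minimal illustration on simulated data, not on the Shark Tank data: the variables x1 and x2, the 70/30 split and the seed are all assumptions chosen for the example.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Simulated data: two numeric predictors and a logical outcome named deal
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$deal <- rbinom(n, 1, plogis(1.5 * toy$x1)) == 1

# Random 70/30 split into training and test rows
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit the logistic regression on the training rows only
fit <- glm(deal ~ ., data = train, family = binomial)

# Score the held-out rows and compute the share of correct predictions
test_prob <- predict(fit, newdata = test, type = "response")
test_accuracy <- mean((test_prob >= 0.5) == test$deal)
```

Accuracy measured this way reflects how the model performs on pitches it has not seen, which is a fairer comparison between CART, Random Forest and logistic regression than accuracy on the fitted data.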