Web and Social Media Analytics

Group Assignment

Business Intelligence using Text Mining

Submitted By:
Simran Saha
Srinidhi Narsimhan
Indushree Anandraj
Kalai Anbumani
Problem Statement: A dataset of Shark Tank episodes is provided, containing 495 entrepreneurs making their pitch to the VC sharks. Evaluate CART, Random Forest and Logistic Regression models before and after adding the variable ‘Ratio’.

• Reading the data:

Sharktank = read.csv("Shark+Tank+Companies.csv", stringsAsFactors=FALSE)
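The rest of the analysis relies on several R packages: tm for the corpus and document-term matrix, SnowballC for stemming, wordcloud for the word clouds, rpart and rpart.plot for the CART model, and randomForest for the random forests. A minimal setup block, assuming these packages are already installed:

# Load the packages used in this analysis
library(tm)            # Corpus, tm_map, DocumentTermMatrix
library(SnowballC)     # stemDocument
library(wordcloud)     # wordcloud
library(rpart)         # rpart (CART)
library(rpart.plot)    # prp
library(randomForest)  # randomForest, varImpPlot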

• Data transformation and Cleaning

• Transform the dataset into a corpus using the required variable, i.e. description

# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))

# Convert to lower-case (wrapped in content_transformer so the corpus structure is preserved)
corpus = tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Word cloud before removing stopwords
wordcloud(corpus,colors=rainbow(7),max.words=100)

# Remove English stopwords (plus "the" and "and" explicitly)
corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))

# Remove extra whitespace, if any
corpus = tm_map(corpus, stripWhitespace)

# Stem document
corpus = tm_map(corpus, stemDocument)
# Word cloud after removing stopwords and cleaning
wordcloud(corpus,colors=rainbow(7),max.words=100)
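To verify that the cleaning and stemming behaved as expected, a couple of processed descriptions can also be inspected directly; a small check (output not reproduced here):

# Look at the first two cleaned, stemmed descriptions
inspect(corpus[1:2])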

• Create Document-Term Matrix

# Build the document-term matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
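Before removing sparse terms, the most frequent words can be listed straight from the DTM; a quick sketch, where the cutoff of 20 occurrences is an arbitrary choice:

# Terms that appear at least 20 times across all descriptions
findFreqTerms(frequencies, lowfreq = 20)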

To reduce the dimensionality of the DTM, we remove infrequent words with removeSparseTerms, using a sparsity threshold of 0.995 (i.e. keeping only terms that appear in at least roughly 0.5% of the descriptions).

# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
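Printing the reduced matrix in the same way as above shows how far the sparsity filter shrank the vocabulary:

# Compare the term count with the full DTM printed earlier
sparse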

Convert to a data frame as the final step of data preparation:


# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly
colnames(descSparse) = make.names(colnames(descSparse))

Adding the dependent variable:


# Add dependent variable
descSparse$deal = Sharktank$deal

# Get the number of deals
table(descSparse$deal)
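Because the accuracy figures below are computed on the training data, it helps to know the simple majority-class baseline they should beat; a minimal sketch:

# Share of deals vs. no-deals; the larger share is the accuracy of
# always predicting the majority class
prop.table(table(descSparse$deal))
max(prop.table(table(descSparse$deal)))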
• Building Models

• CART Model:

SharktankCart = rpart(deal ~ ., data=descSparse, method="class")

# CART diagram (before adding Ratio)
prp(SharktankCart, extra=2)

Evaluate the performance of the CART model

predictCART = predict(SharktankCart, newdata=descSparse, type="class")


CART_initial <- table(descSparse$deal, predictCART)
CART_initial
# Model accuracy on the training data
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)
BaseAccuracyCart

• Random Forest:

SharktankRF = randomForest(deal ~ ., data=descSparse)

Evaluate the performance of the Random Forest

# Make predictions on the training data (regression-style output, thresholded at 0.5)
predictRF = predict(SharktankRF, newdata=descSparse)

RandomForestInitial <- table(descSparse$deal, predictRF >= 0.5)
RandomForestInitial

# Model accuracy on the training data
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)
BaseAccuracyRF

varImpPlot(SharktankRF,main='Variable Importance Plot: Shark Tank',type=2)


• Logistic Regression:

Sharktanklogistic = glm(deal ~ ., data = descSparse, family = "binomial")

# Make predictions (predicted probabilities on the training data)
predictLogistic = predict(Sharktanklogistic, newdata=descSparse, type="response")

Evaluate the performance of the logistic regression model

LogisticInitial <- table(descSparse$deal, predictLogistic> 0.5)

LogisticInitial

# Model accuracy on the training data
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)
BaseAccuracyLogistic

• Adding the variable ‘Ratio’

descSparse$ratio = Sharktank$askedFor/Sharktank$valuation
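A quick sanity check on the new variable before re-fitting (the values are expected to lie between 0 and 1, since the amount asked for is a fraction of the stated valuation):

# Distribution of the asked-for amount relative to valuation
summary(descSparse$ratio)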

We now re-run the models to check how the accuracy changes.

• CART Model:

SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")

# CART diagram (after adding Ratio)
prp(SharktankCartRatio, extra=2)
Evaluate the performance of the CART model
predictCARTRatio = predict(SharktankCartRatio, newdata=descSparse, type="class")

CART_ratio <- table(descSparse$deal, predictCARTRatio)
CART_ratio

# Model accuracy on the training data
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)
BaseAccuracyRatio
• Random Forest:

SharktankRFRatio = randomForest(deal ~ ., data=descSparse)

# Make predictions on the training data
predictRFRatio = predict(SharktankRFRatio, newdata=descSparse)

Evaluate the performance of the Random Forest


RandomForestRatio <- table(descSparse$deal, predictRFRatio >= 0.5)
RandomForestRatio

# Model accuracy on the training data
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)
BaseAccuracyRFRatio

# Variable importance as measured by the Random Forest
varImpPlot(SharktankRFRatio,main='Variable Importance Plot: Shark Tank with Ratio',type=2)
• Logistic Regression:

SharktanklogisticRatio = glm(deal ~ ., data = descSparse, family = "binomial")

# Make predictions (predicted probabilities on the training data)
predictLogisticRatio = predict(SharktanklogisticRatio, newdata=descSparse, type="response")

Evaluate the performance of the logistic regression model


LogisticRatio <- table(descSparse$deal, predictLogisticRatio>= 0.5)
LogisticRatio

# Model accuracy on the training data
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)
BaseAccuracyLogisticRatio

• Accuracy of the models:

MODEL                  BEFORE RATIO   AFTER RATIO
CART MODEL             0.6565657      0.6606061
RANDOM FOREST          0.5535354      0.5636364
LOGISTIC REGRESSION    1              1

With the CART model we were able to predict deals with about 65.65% accuracy using only the description, and 66.06% after adding the ratio. With Random Forest, the corresponding accuracies were 55.35% using only the description and 56.36% with description + ratio.

Logistic regression gave 100% accuracy in both cases; however, with such a large number of term predictors relative to the 495 pitches this almost certainly overfits the training data, and the model requires further validation, retaining only significant variables and removing unnecessary ones, before the output can be considered meaningful.
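As a first step toward such validation, the models can be re-fit on a random training split and scored on held-out pitches; a minimal sketch using base R with an arbitrary 70/30 split and seed, shown for the CART model only (descSparse here already contains the Ratio column, and the resulting accuracy is not reported here):

# Hold out 30% of the pitches as a test set
set.seed(123)
trainIdx = sample(nrow(descSparse), round(0.7 * nrow(descSparse)))
train = descSparse[trainIdx, ]
test  = descSparse[-trainIdx, ]

# Re-fit the CART model on the training split only
cartSplit = rpart(deal ~ ., data = train, method = "class")

# Accuracy on unseen pitches
predTest = predict(cartSplit, newdata = test, type = "class")
confTest = table(test$deal, predTest)
sum(diag(confTest)) / sum(confTest)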
