Shark Tank Data Analysis Report

The document evaluates three models (CART tree, random forest, logistic regression) for predicting funding deals on the TV show Shark Tank, using a dataset of 495 pitches. The models are first built on the pitch description text alone and then retrained after adding a "Ratio" variable, and accuracy is compared before and after. CART accuracy increased slightly from 65.7% to 66.1%, and random forest accuracy from 55.4% to 56.4%. Logistic regression accuracy was 100% both times.

Uploaded by

Simran Saha


Web and Social Media Analytics

Group Assignment

Business Intelligence using Text Mining

Submitted By:
Simran Saha
Srinidhi Narsimhan
Indushree Anandraj
Kalai Anbumani
Problem Statement: A dataset of Shark Tank episodes is made available. It contains 495
entrepreneurs making their pitch to the VC sharks. Evaluate CART Tree, Random Forest and Logistic
Regression models before and after adding the variable ‘Ratio’.

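• Reading the data:

# Packages used in this report
library(tm)            # corpus handling and document-term matrix
library(SnowballC)     # stemming backend for stemDocument
library(wordcloud)     # word clouds
library(rpart)         # CART
library(rpart.plot)    # prp() tree plots
library(randomForest)  # random forest

Sharktank = read.csv("Shark+Tank+Companies.csv", stringsAsFactors=FALSE)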

• Data transformation and Cleaning

• Transform dataset into a corpus with required variable i.e. description

# Create corpus
corpus = Corpus(VectorSource(Sharktank$description))

# Convert to lower-case (content_transformer keeps the documents in tm's format)
corpus = tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Word cloud before removing stopwords

wordcloud(corpus,colors=rainbow(7),max.words=100)

# Remove stopwords, the, and

corpus = tm_map(corpus, removeWords, c("the", "and", stopwords("english")))

# Remove extra whitespaces if any

corpus = tm_map(corpus, stripWhitespace)

# Stem document
corpus = tm_map(corpus, stemDocument)
# Word cloud after removing stopwords and cleaning
wordcloud(corpus,colors=rainbow(7),max.words=100)
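
To make the cleaning steps above concrete, here is a minimal base-R sketch of the same lower-casing, punctuation removal, stop-word removal and whitespace stripping applied to a single description. The sample sentence and the three-word stop-word vector are assumptions for illustration; the report itself uses the tm functions with stopwords("english"), and stemming is left out here because it needs an extra package.

```r
# Hypothetical pitch description (not taken from the dataset)
description <- "The BEST, and the most portable,   phone charger!"

# Tiny illustrative stop-word list; tm's stopwords("english") is much longer
stop_words <- c("the", "and", "most")

clean <- tolower(description)                 # lower-case
clean <- gsub("[[:punct:]]", "", clean)       # drop punctuation
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean)             # drop stop words
clean <- trimws(gsub("\\s+", " ", clean))     # collapse repeated whitespace

print(clean)
```

Only the content words of the description remain, which is what the second word cloud is drawn from.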

• Create Document Term Matrix

# Build the document-term matrix

frequencies = DocumentTermMatrix(corpus)
frequencies
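
A document-term matrix has one row per description and one column per term, and each cell holds how often that term occurs in that description. The base-R sketch below builds one by hand for three made-up, already-cleaned descriptions; the texts are assumptions for illustration, and DocumentTermMatrix() does the equivalent work for the real corpus.

```r
# Three hypothetical cleaned descriptions
docs <- c("phone charger", "dog toy", "phone case phone")

tokens <- strsplit(docs, " ")            # split each description into words
terms  <- sort(unique(unlist(tokens)))   # vocabulary, used as column names

# Count how often each vocabulary term occurs in each description
dtm <- t(sapply(tokens, function(words) table(factor(words, levels = terms))))
rownames(dtm) <- paste0("doc", seq_along(docs))

print(dtm)
```

Most cells are zero even in this tiny case, which is why the real matrix is sparse and worth pruning.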

To reduce the dimensions of the DTM, we remove infrequent words using
removeSparseTerms with a sparsity threshold of 0.995. This keeps only the terms
that appear in more than 0.5% of the descriptions.

# Remove sparse terms

sparse = removeSparseTerms(frequencies, 0.995)
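
The effect of the 0.995 threshold can be checked by hand. The sketch below, assuming the 495 pitches given in the problem statement, computes the smallest number of descriptions a term must appear in to be kept; the variable names are illustrative.

```r
n_docs    <- 495      # number of pitches in the dataset
threshold <- 0.995    # sparsity cut-off passed to removeSparseTerms

# A term found in k descriptions has sparsity 1 - k / n_docs.
# It is kept when that sparsity is strictly below the threshold.
k <- 1:10
sparsity <- 1 - k / n_docs
min_docs <- min(k[sparsity < threshold])

print(min_docs)
```

Under these assumptions a term has to occur in at least 3 descriptions to stay in the matrix.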

Convert to a data frame as the final step of data preparation

# Convert to a data frame
descSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly

colnames(descSparse) = make.names(colnames(descSparse))
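
Column names matter here because the model formulas later use `deal ~ .`, which needs every column to be a syntactically valid R name. The short sketch below shows what make.names does to a few awkward terms; the terms themselves are hypothetical examples, not taken from the dataset.

```r
# Hypothetical stemmed terms that are awkward as R variable names
terms <- c("3d", "100", "next", "function", "charger")

# Names starting with a digit get an "X" prefix; reserved words get a trailing dot
safe <- make.names(terms)

print(safe)
```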

Adding the dependent variable:

# Add dependent variable
descSparse$deal = Sharktank$deal

# Count pitches with and without a deal
table(descSparse$deal)
• Building Models

• CART MODEL:

SharktankCart = rpart(deal ~ ., data=descSparse, method="class")

Before:
prp(SharktankCart, extra=2)

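Evaluate the performance of the CART model

# Predict on the data the tree was trained on
predictCART = predict(SharktankCart, newdata=descSparse, type="class")

# Confusion matrix: rows are actual outcomes, columns are predictions
CART_initial <- table(descSparse$deal, predictCART)
CART_initial

# Accuracy on the training data
BaseAccuracyCart = sum(diag(CART_initial))/sum(CART_initial)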
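
An accuracy figure is easier to judge next to a naive baseline that always predicts the more common outcome. The sketch below computes both from a confusion matrix; the counts are invented for illustration and are not the report's results.

```r
# Hypothetical confusion matrix: rows = actual deal, columns = predicted deal
conf <- matrix(c(28, 20,
                 15, 37),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))

# Share of correct predictions: the diagonal over the grand total
accuracy <- sum(diag(conf)) / sum(conf)

# Majority-class baseline: always predict the more frequent actual class
baseline <- max(rowSums(conf)) / sum(conf)

print(c(accuracy = accuracy, baseline = baseline))
```

The same two lines can be applied to the table produced by table(descSparse$deal, ...) to see how far a model improves on guessing the majority class.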

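• Random Forest:

SharktankRF = randomForest(deal ~ ., data=descSparse)

# Make predictions: with no newdata argument, predict() returns the
# out-of-bag prediction for each training row
predictRF = predict(SharktankRF)

Evaluate the performance of the Random Forest

RandomForestInitial <- table(descSparse$deal, predictRF>= 0.5)

# Accuracy
BaseAccuracyRF = sum(diag(RandomForestInitial))/sum(RandomForestInitial)
BaseAccuracyRF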

varImpPlot(SharktankRF,main='Variable Importance Plot: Shark Tank',type=2)

• Logistic Regression

# family = binomial requests logistic regression (glm defaults to gaussian)
Sharktanklogistic = glm(deal~., data = descSparse, family = binomial)

# Make predictions: type = "response" returns probabilities
predictLogistic = predict(Sharktanklogistic, newdata=descSparse, type="response")

Evaluate the performance of the Logistic Regression

LogisticInitial <- table(descSparse$deal, predictLogistic>= 0.5)

LogisticInitial

# Accuracy on the training data
BaseAccuracyLogistic = sum(diag(LogisticInitial))/sum(LogisticInitial)

• Adding the variable ‘Ratio’

descSparse$ratio = Sharktank$askedFor/Sharktank$valuation
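
As a quick worked example of the new variable, the sketch below computes the ratio for a few made-up pitches. The dollar figures are assumptions; only the column names askedFor and valuation come from the code above. The ratio is the fraction of the stated valuation that the entrepreneur asks for.

```r
# Hypothetical pitches (amounts in dollars)
pitches <- data.frame(
  askedFor  = c(100000, 250000, 50000),
  valuation = c(1000000, 1000000, 200000)
)

# Same calculation as in the report: amount requested over company valuation
pitches$ratio <- pitches$askedFor / pitches$valuation

print(pitches)
```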

We re-run the models to check how the results change.

• CART MODEL:

SharktankCartRatio = rpart(deal ~ ., data=descSparse, method="class")

# CART diagram
prp(SharktankCartRatio, extra=2)

Evaluate the performance of the CART model

predictCARTRatio = predict(SharktankCartRatio, newdata=descSparse, type="class")

CART_ratio <- table(descSparse$deal, predictCARTRatio)

# Accuracy on the training data
BaseAccuracyRatio = sum(diag(CART_ratio))/sum(CART_ratio)
• Random Forest:

SharktankRFRatio = randomForest(deal ~ ., data=descSparse)

# Make predictions: with no newdata argument, predict() returns the
# out-of-bag prediction for each training row
predictRFRatio = predict(SharktankRFRatio)

Evaluate the performance of the Random Forest

RandomForestRatio <- table(descSparse$deal, predictRFRatio>= 0.5)

# Accuracy
BaseAccuracyRFRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)

# Variable importance as measured by the Random Forest

varImpPlot(SharktankRFRatio,main='Variable Importance Plot: Shark Tank with Ratio',type=2)
• Logistic Regression:

# family = binomial requests logistic regression (glm defaults to gaussian)
SharktanklogisticRatio = glm(deal~., data = descSparse, family = binomial)

# Make predictions: type = "response" returns probabilities
predictLogisticRatio = predict(SharktanklogisticRatio, newdata=descSparse, type="response")

Evaluate the performance of the Logistic Regression

LogisticRatio <- table(descSparse$deal, predictLogisticRatio>= 0.5)
LogisticRatio

# Accuracy on the training data
BaseAccuracyLogisticRatio = sum(diag(LogisticRatio))/sum(LogisticRatio)

• Accuracy of the models:

MODEL                  BEFORE      AFTER
CART MODEL             0.6565657   0.6606061
RANDOM FOREST          0.5535354   0.5636364
LOGISTIC REGRESSION    1           1

With the CART model, 65.66% of predictions were correct using only the description and 66.06% using
description plus ratio. With Random Forest, 55.35% and 56.36% of predictions were correct using only
the description and description plus ratio respectively.

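Logistic regression gave 100% accuracy in both cases. However, accuracy was measured on the same
data the model was trained on, with a large number of term variables, so this result most likely
reflects overfitting. It requires further validation, keeping only the significant variables and
removing unnecessary ones, to derive a measurable output.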
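
One way to carry out that validation is to hold out part of the data and compare accuracy on the training rows with accuracy on the unseen rows. The sketch below does this on synthetic data with many uninformative predictors, so it only illustrates the procedure; the data, the 70/30 split and all variable names are assumptions, not the Shark Tank results.

```r
set.seed(1)

# Synthetic data: 120 rows, 40 noise predictors, outcome unrelated to them
n <- 120; p <- 40
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
X$deal <- rbinom(n, 1, 0.5)

# 70/30 split into training and test rows
train_idx <- sample(n, size = 84)
train <- X[train_idx, ]
test  <- X[-train_idx, ]

# Logistic regression fitted on the training rows only
# (warnings about fitted probabilities of 0 or 1 can occur when overfitting)
fit <- suppressWarnings(glm(deal ~ ., data = train, family = binomial))

# Share of rows whose predicted class (probability >= 0.5) matches the outcome
accuracy <- function(model, data) {
  prob <- suppressWarnings(predict(model, newdata = data, type = "response"))
  mean((prob >= 0.5) == (data$deal == 1))
}

train_acc <- accuracy(fit, train)
test_acc  <- accuracy(fit, test)
print(c(train = train_acc, test = test_acc))
```

With predictors that carry no signal, training accuracy will typically be much higher than test accuracy, and the gap between the two is the part of the reported accuracy that comes from overfitting.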
