Part 1: Yelp Dataset Profiling and Understanding
1. Profile the data by finding the total number of records for
each of the tables below:
i. Attribute table = 10000
ii. Business table = 10000
iii. Category table = 10000
iv. Checkin table = 10000
v. elite_years table = 10000
vi. friend table = 10000
vii. hours table = 10000
viii. photo table = 10000
ix. review table = 10000
x. tip table = 10000
xi. user table = 10000
***********SQL CODE**********
SELECT COUNT(*)
FROM table
*****************************
2. Find the total distinct records by either the foreign key or
primary key for each table. If two foreign keys are listed in
the table, please specify which foreign key.
i. Business = id: 10000
ii. Hours = business_id: 1562
iii. Category = business_id: 2643
iv. Attribute = business_id: 1115
v. Review = id: 10000, business_id: 8090, user_id: 9581
vi. Checkin = business_id: 493
vii. Photo = id: 10000, photo: 6493
viii. Tip = user_id: 537, business_id: 3979
ix. User = id: 10000
x. Friend = user_id: 11
xi. Elite_years = user_id: 2780
Note: Primary Keys are denoted in the ER-Diagram with a yellow
key icon.
***********SQL CODE**********
SELECT COUNT(DISTINCT Keys)
FROM table
*****************************
3. Are there any columns with null values in the Users table?
Indicate "yes," or "no."
Answer: no
SQL code used to arrive at answer:
SELECT COUNT(*)
FROM user
WHERE id IS NULL
OR name IS NULL
OR review_count IS NULL
OR yelping_since IS NULL
OR useful IS NULL
OR funny IS NULL
OR cool IS NULL
OR fans IS NULL
OR average_stars IS NULL
OR compliment_hot IS NULL
OR compliment_more IS NULL
OR compliment_profile IS NULL
OR compliment_cute IS NULL
OR compliment_list IS NULL
OR compliment_note IS NULL
OR compliment_plain IS NULL
OR compliment_cool IS NULL
OR compliment_funny IS NULL
OR compliment_writer IS NULL
OR compliment_photos IS NULL
4. For each table and column listed below, display the smallest
(minimum), largest (maximum), and average (mean) value for the
following fields:
i. Table: Review, Column: Stars
min: 1 max: 5 avg: 3.7082
ii. Table: Business, Column: Stars
min: 1.0 max: 5.0 avg: 3.6549
iii. Table: Tip, Column: Likes
min: 0 max: 2 avg: 0.0144
iv. Table: Checkin, Column: Count
min: 1 max: 53 avg: 1.9414
v. Table: User, Column: Review_count
min: 0 max: 2000 avg: 24.2995
***********SQL CODE**********
SELECT MIN(Column),MAX(Column),AVG(Column)
FROM table
*****************************
5. List the cities with the most reviews in descending order:
SQL code used to arrive at answer:
SELECT city,SUM(review_count) AS VALUE
FROM business
GROUP BY city
ORDER BY VALUE DESC
Copy and Paste the Result Below:
+-----------------+-------+
| city | VALUE |
+-----------------+-------+
| Las Vegas | 82854 |
| Phoenix | 34503 |
| Toronto | 24113 |
| Scottsdale | 20614 |
| Charlotte | 12523 |
| Henderson | 10871 |
| Tempe | 10504 |
| Pittsburgh | 9798 |
| Montréal | 9448 |
| Chandler | 8112 |
| Mesa | 6875 |
| Gilbert | 6380 |
| Cleveland | 5593 |
| Madison | 5265 |
| Glendale | 4406 |
| Mississauga | 3814 |
| Edinburgh | 2792 |
| Peoria | 2624 |
| North Las Vegas | 2438 |
| Markham | 2352 |
| Champaign | 2029 |
| Stuttgart | 1849 |
| Surprise | 1520 |
| Lakewood | 1465 |
| Goodyear | 1155 |
+-----------------+-------+
(Output limit exceeded, 25 of 362 total rows shown)
6. Find the distribution of star ratings to the business in the
following cities:
i. Avon
SQL code used to arrive at answer:
SELECT SUM(review_count) AS Numbers, stars
FROM business
WHERE city == "Avon"
GROUP BY stars
Copy and Paste the Resulting Table Below (2 columns – star
rating and count):
+---------+-------+
| Numbers | stars |
+---------+-------+
| 10 | 1.5 |
| 6 | 2.5 |
| 88 | 3.5 |
| 21 | 4.0 |
| 31 | 4.5 |
| 3 | 5.0 |
+---------+-------+
ii. Beachwood
SQL code used to arrive at answer:
SELECT SUM(review_count) AS Numbers, stars
FROM business
WHERE city == "Beachwood"
GROUP BY stars
Copy and Paste the Resulting Table Below (2 columns – star
rating and count):
+---------+-------+
| Numbers | stars |
+---------+-------+
| 8 | 2.0 |
| 3 | 2.5 |
| 11 | 3.0 |
| 6 | 3.5 |
| 69 | 4.0 |
| 17 | 4.5 |
| 23 | 5.0 |
+---------+-------+
7. Find the top 3 users based on their total number of reviews:
SQL code used to arrive at answer:
SELECT review_count, name
FROM user
ORDER BY review_count DESC
LIMIT 3
+--------------+--------+
| review_count | name |
+--------------+--------+
| 2000 | Gerald |
| 1629 | Sara |
| 1339 | Yuri |
+--------------+--------+
8. Does posing more reviews correlate with more fans?
Please explain your findings and interpretation of the
results:
Not necessarily correlated. Amy, who has the most
fans, only has 609 reviews.
Yuri has only 76 fans, but has the
third most reviews.This factor alone is not sufficient for arriving at decision.
***********SQL CODE**********
SELECT name,review_count,fans
FROM user
ORDER BY fans DESC
*****************************
+-----------+--------------+------+
| name | review_count | fans |
+-----------+--------------+------+
| Amy | 609 | 503 |
| Mimi | 968 | 497 |
| Harald | 1153 | 311 |
| Gerald | 2000 | 253 |
| Christine | 930 | 173 |
| Lisa | 813 | 159 |
| Cat | 377 | 133 |
| William | 1215 | 126 |
| Fran | 862 | 124 |
| Lissa | 834 | 120 |
| Mark | 861 | 115 |
| Tiffany | 408 | 111 |
| bernice | 255 | 105 |
| Roanna | 1039 | 104 |
| Angela | 694 | 101 |
| .Hon | 1246 | 101 |
| Ben | 307 | 96 |
| Linda | 584 | 89 |
| Christina | 842 | 85 |
| Jessica | 220 | 84 |
| Greg | 408 | 81 |
| Nieves | 178 | 80 |
| Sui | 754 | 78 |
| Yuri | 1339 | 76 |
| Nicole | 161 | 73 |
+-----------+--------------+------+
(Output limit exceeded, 25 of 10000 total rows shown)
9. Are there more reviews with the word "love" or with the word
"hate" in them?
Answer: Yes. There are 1780 reviews with “love” and 232
reviews with “hate”
SQL code used to arrive at answer:
SELECT COUNT(*)
FROM review
WHERE text LIKE "%love%"
+----------+
| COUNT(*) |
+----------+
| 1780 |
+----------+
SELECT COUNT(*)
FROM review
WHERE text LIKE “%hate%”
+----------+
| COUNT(*) |
+----------+
| 232 |
+----------+
10. Find the top 10 users with the most fans:
SQL code used to arrive at answer:
SELECT name,fans
FROM user
ORDER BY fans DESC
LIMIT 10
Copy and Paste the Result Below:
+-----------+------+
| name | fans |
+-----------+------+
| Amy | 503 |
| Mimi | 497 |
| Harald | 311 |
| Gerald | 253 |
| Christine | 173 |
| Lisa | 159 |
| Cat | 133 |
| William | 126 |
| Fran | 124 |
| Lissa | 120 |
+-----------+------+
11. Is there a strong relationship (or correlation) between
having a high number of fans and being listed as "useful" or
"funny?" Out of the top 10 users with the highest number of
fans, what percent are also listed as “useful” or “funny”?
Key:
0% - 25% - Low relationship
26% - 75% - Medium relationship
76% - 100% - Strong relationship
SQL code used to arrive at answer:
SELECT name,fans,useful,funny
FROM user
ORDER BY fans DESC
LIMIT 10
Copy and Paste the Result Below:
+-----------+------+--------+--------+
| name | fans | useful | funny |
+-----------+------+--------+--------+
| Amy | 503 | 3226 | 2554 |
| Mimi | 497 | 257 | 138 |
| Harald | 311 | 122921 | 122419 |
| Gerald | 253 | 17524 | 2324 |
| Christine | 173 | 4834 | 6646 |
| Lisa | 159 | 48 | 13 |
| Cat | 133 | 1062 | 672 |
| William | 126 | 9363 | 9361 |
| Fran | 124 | 9851 | 7606 |
| Lissa | 120 | 455 | 150 |
+-----------+------+--------+--------+
Please explain your findings and interpretation of the
results:
Out of the top 10 users with the highest number of fans, 100%
are also listed as either “useful” or “funny”. So there is a
strong correlation between having a high number of fans and
being listed as “useful” or “funny”.
Part 2: Inferences and Analysis
1. Pick one city and category of your choice and group the
businesses in that city or category by their overall star
rating. Compare the businesses with 2-3 stars to the businesses
with 4-5 stars and answer the following questions. Include your
code.
i. Do the two groups you chose to analyze have a different
distribution of hours?
I picked Toronto and Bars for this question. Yes. The
restaurants with only 2.5 stars open from on at 10:00-2:00
Saturday. The places with higher rating stars open at 11:00-21:00
select a.city,b.category,a.stars,c.hours from business a ,category b,hours c where a.id=b.business_id and
a.id=c.business_id and a.city='Toronto' and b.category='Bars' group by a.stars;
+---------+----------+-------+----------------------+
| city | category | stars | hours |
+---------+----------+-------+----------------------+
| Toronto | Bars | 2.5 | Saturday|10:00-2:00 |
| Toronto | Bars | 3.5 | Saturday|18:00-2:00 |
| Toronto | Bars | 4.0 | Saturday|11:00-21:00 |
| Toronto | Bars | 4.5 | Saturday|16:00-2:00 |
ii. Do the two groups you chose to analyze have a different
number of reviews?
Yes. The group with 2-3 stars has more review (35) compared with
the group with higher rating stars.
select a.city,b.category,a.review_count,a.stars,c.hours from business a ,category b,hours c where
a.id=b.business_id and a.id=c.business_id and a.city='Toronto' and b.category='Bars' group by a.stars;
+---------+----------+--------------+-------+----------------------+
| city | category | review_count | stars | hours |
+---------+----------+--------------+-------+----------------------+
| Toronto | Bars | 35 | 2.5 | Saturday|10:00-2:00 |
| Toronto | Bars | 10 | 3.5 | Saturday|18:00-2:00 |
| Toronto | Bars | 15 | 4.0 | Saturday|11:00-21:00 |
| Toronto | Bars | 26 | 4.5 | Saturday|16:00-2:00 |
+---------+----------+--------------+-------+----------------------+
iii. Are you able to infer anything from the location data
provided between these two groups? Explain.
No . They have different locations.
select a.city,b.category,a.review_count,a.stars,c.hours,a.postal_code from business a ,category b,hours
c where a.id=b.business_id and a.id=c.business_id and a.city='Toronto' and b.category='Bars' group by
a.stars;
+---------+----------+--------------+-------+----------------------+-------------+
| city | category | review_count | stars | hours | postal_code |
+---------+----------+--------------+-------+----------------------+-------------+
| Toronto | Bars | 35 | 2.5 | Saturday|10:00-2:00 | M4K 1P7 |
| Toronto | Bars | 10 | 3.5 | Saturday|18:00-2:00 | M5V 2H5 |
| Toronto | Bars | 15 | 4.0 | Saturday|11:00-21:00 | M6H 1V5 |
| Toronto | Bars | 26 | 4.5 | Saturday|16:00-2:00 | M6P 1A6 |
+---------+----------+--------------+-------+----------------------+-------------+
2. Group business based on the ones that are open and the ones that
are closed. What
differences can you find between the ones that are still open and the
ones that are
closed? List at least two differences and the SQL code you used to
arrive at your
answer.
i. Difference 1:The businesses that are open tend to have more reviews than ones that are closed on average.
Open: AVG(review_count) = 31.757
Closed: AVG(review_count0 = 23.198
ii.Difference 2:The average star rating is higher for businesses that are open than businesses that are closed.Open:
AVG(stars) = 3.679 Closed: AVG(stars) = 3.520
SQL code used for analysis:
SELECT COUNT(DISTINCT(id)),
AVG(review_count),
SUM(review_count),
AVG(stars),
is_open
FROM business
GROUP BY is_open
+---------------------+-------------------+-------------------+---------------+-------
--+
| COUNT(DISTINCT(id)) | AVG(review_count) | SUM(review_count) | AVG(stars) | is_open
|
+---------------------+-------------------+-------------------+---------------+-------
--+
| 1520 | 23.1980263158 | 35261 | 3.52039473684 | 0 |
| 8480 | 31.7570754717 | 269300 | 3.67900943396 | 1 |
+---------------------+-------------------+-------------------+---------------+-------
--+
3. For this last part of your analysis, you are going to choose
the type of analysis you want to conduct on the Yelp dataset and
are going to prepare the data for analysis.
Ideas for analysis include: Parsing out keywords and business
attributes for sentiment analysis, clustering businesses to find
commonalities or anomalies between them, predicting the overall
star rating for a business, predicting the number of fans a user
will have, and so on. These are just a few examples to get you
started, so feel free to be creative and come up with your own
problem you want to solve. Provide answers, in-line, to all of
the following:
i. Indicate the type of analysis you chose to do:
I want to see if there is a correlation between the number of reviews a person writes and the number of fans
he/she has
The table above shows the top ten users that have the most fans
ii. Write 1-2 brief paragraphs on the type of data you will need
for your analysis and why you chose that data:
I discovered from the dataset , not really. It is true to say that more review count you have the more possibility to make more
fans, but the two categories are not positively correlated, meaning the more review count won’t guarantee more number of
fans.
iii. Output of your finished dataset:
+------------------------+-----------+------+--------------+
| id | name | fans | review_count |
+------------------------+-----------+------+--------------+
| -9I98YbNQnLdAmcYfb324Q | Amy | 503 | 609 |
| -8EnCioUmDygAbsYZmTeRQ | Mimi | 497 | 968 |
| --2vR0DIsmQ6WfcSzKWigw | Harald | 311 | 1153 |
| -G7Zkl1wIWBBmD0KRy_sCw | Gerald | 253 | 2000 |
| -0IiMAZI2SsQ7VmyzJjokQ | Christine | 173 | 930 |
| -g3XIcCb2b-BD0QBCcq2Sw | Lisa | 159 | 813 |
| -9bbDysuiWeo2VShFJJtcw | Cat | 133 | 377 |
| -FZBTkAZEXoP7CYvRV2ZwQ | William | 126 | 1215 |
| -9da1xk7zgnnfO1uTVYGkA | Fran | 124 | 862 |
| -lh59ko3dxChBSZ9U7LfUw | Lissa | 120 | 834 |
+------------------------+-----------+------+--------------+
iv. Provide the SQL code you used to create your final dataset:
SELECT id, name, fans, review_count
FROM user
ORDER BY fans desc
limit 10;