SQL FOR
DATA SCIENCE
AN INTRODUCTION
Let's start at the top...
SQL stands for Structured Query Language
Referred to as "S-Q-L" as well as "Sequel"
It is known to be the easiest programming
language to learn & use due to the "common
sense" nature of the commands
Used to store, extract & manipulate data in
relational databases
Relational Database?
A relational database is a collection of
tabular datasets (think columns & rows) that
relate to each other through shared columns
country_populations
continent country population
Europe UK 67,220,000
shared Europe Germany 83,240,000
column North America USA 329,500,000
... ... ...
continent_areas
shared
column
continent area
Europe UK country_temps
Europe Germany country max_temp
North America USA UK 38.7
... ... Germany 41.2
USA 56.7
... ...
database
What can we do?
A good way to think about what we can using SQL is with the
acronym C.R.U.D
Create Delete
C.R.U.D
Read
Update
While this might seem like a slightly informal acronym it’s
actually a really good way to describe the core functions or
operations that can be performed on a relational database...
Let's take a look!
C
CREATE: We can create databases, schemas
(which are almost like a partitioned area to help
keep things organised) and of course we can
create tables as well!
R
READ: This is mainly about querying the data,
so essentially grabbing the relevant rows
and columns from tables that will provide us
with the information we need
U
UPDATE: We can add more rows & columns to
tables that already exist, as well as modify
records within tables
D
DELETE: This is kinda what you’d expect - we
can delete specific rows and columns, or we
can delete whole tables, schemas and even
databases!
SQL in Data Science 1
While all of these C.R.U.D processes can be undertaken
using SQL - Data Scientists and Data Analysts will typically
spend most of their time in the "Read" area...
C.R.U.D
Read
In a lot of companies the management of the databases
themselves (so the Create, Update, and Delete functions)
are often taken care of by a specific database team, or by
Data Engineers.
In saying that however, a great Data Scientist or Analyst
should have an understanding of how the data they’re using
is being imported & created as well as how it’s being
managed and changed over time - so knowing at least the
fundamentals of the other functions can be very useful
SQL in Data Science 2
In Data Science - common tasks that use SQL will be...
Querying & exploring data to extract useful
business insights
Gathering & aggregating data for business
reporting
Selecting data for a specific treatment, e.g.
selecting customers to receive a targeted
promotion
Extracting data for Machine Learning tasks
or other predictive modelling
A simple code example...
We are the owner of Rolex, and we’re looking for a new
spokesperson for our very elite range of watches.
player_details
first_name last_name sport net_worth
Roger Federer Tennis $900m
Novak Djokovic Tennis $220m
Sachin Tendulkar Cricket $170m
Yao Ming Basketball $120m
LeBron James Basketball $500m
Lewis Hamilton Motorsport $280m
For our simple example, we a single table of data called
player_details that contains 6 famous sports people.
We want to create a shortlist of names who are worth over
$250m dollars - we only want the wealthiest of athletes
representing our product of course!
What would the SQL query for this look like?
A simple code example...
We use the SELECT statement to specify which
columns from the original dataset we want
returned. We only needed the names, so we've
listed those columns with a comma seperating
them
SELECT
first_name, We use the FROM
last_name statement to specify
FROM the name of the table
player_details that this information
resides in
WHERE
net_worth > $250m;
The WHERE statement is used to apply any row level
filters. Our only requirement was to limited the
results to sportspeople worth over $250m - so this is
where we apply that rule!
A simple code example...
player_details
first_name last_name sport net_worth
Roger Federer Tennis $900m
Novak Djokovic Tennis $220m
Sachin Tendulkar Cricket $170m
Yao Ming Basketball $120m
LeBron James Basketball $500m
Lewis Hamilton Motorsport $280m
SELECT first_name last_name
first_name, Roger Federer
last_name
LeBron James
FROM
Lewis Hamilton
player_details
WHERE
net_worth > $250m;
Voila! Our shortlist of
potential spokespeople
for our new range of
watches!
What else can we do?
Our example covered a very simple query - there is much,
much more flexibility with SQL that means we can do a
whole lot more in terms of processing and manipulating
data, such as...
Task SQL Clause
Find Unique Values DISTINCT
Merge Multiple Tables JOIN
SUM, MAX, COUNT
Aggregation
( + GROUP BY )
Appending UNION, UNION ALL
Conditional Logic CASE WHEN
Apply logic to a RANK, NTILE, LAG, LEAD
set of rows ( Window Functions)
Do you want to learn more
about this topic - and how
to apply it in the real
world?
Do you want to land an
incredible role in the
exciting, future-proof, and
lucrative field of Data
Science?
LEARN THE
RIGHT SKILLS
A curriculum based on
input from hundreds of
leaders, hiring managers,
and recruiters
https://data-science-infinity.teachable.com
BUILD YOUR
PORTFOLIO
Create your professionally
made portfolio site that
includes 10 pre-built
projects
https://data-science-infinity.teachable.com
EARN THE
CERTIFICATION
Prove your skills with the
DSI Data Science
Professional Certification
https://data-science-infinity.teachable.com
LAND AN
AMAZING ROLE
Get guidance & support
based upon hundreds of
interviews at top tech
companies
https://data-science-infinity.teachable.com
Taught by former Amazon
& Sony PlayStation Data
Scientist Andrew Jones
What do DSI
students say?
"I had over 40 interviews without an offer.
After DSI I quickly got 7 offers including
one at KPMG and my amazing new role
at Deloitte!"
- Ritesh
"The best program I've been a part of,
hands down"
- Christian
"DSI is incredible - everything is taught in
such a clear and simple way, even the
more complex concepts!"
- Arianna
"I got it! Thank you so much for all your
advice & help with preparation - it truly
gave me the confidence to go in and
land the job!"
- Marta
"I've taken a number of Data Science
courses, and without doubt, DSI is the
best"
- William
"One of the best purchases towards
learning I have ever made"
- Scott
"I learned more than on any other
course, or reading entire books!"
- Erick
"I started a bootcamp last summer
through a well respected University, but I
didn't learn half as much from them!"
- GA
"100% worth it, it is amazing. I have
never seen such a good course and I
have done plenty of them!"
- Khatuna
"This is a world-class Data Science
experience. I would recommend this
course to every aspiring or professional
Data Scientist"
- David
"Andrew's guidance with my Resume &
throughout the interview process helped
me land my amazing new role (and at a
much higher salary than I expected!)"
- Barun
"DSI is a fantastic community & Andrew
is one of the best instructors!"
- Keith
"I'm now at University, and my Data
Science related subjects are a piece of
cake after completing this course!
I'm so glad I enrolled!" - Jose
"In addition to the great content,
Andrew's dedication to the growing DSI
community is amazing"
- Sophie
"The course has such high quality
content - you get your ROI even from the
first module"
- Donabel
"The Statistics 101 section was awesome!
I have now started to get confidence in
Statistics!"
- Shrikant
"I can't emphasise how good this
programme is...well worth the
investment!"
- Dejan
"I'd completed my Master's in Business
Analytics, but DSI was the first time I felt
I had a solid foundation in Data Science
to go forward with"
- Scott
...come and join the
hundreds & hundreds of
other students getting the
results they want!
https://data-science-infinity.teachable.com