- Foundation: Understanding Databases
- Basic SELECT Queries
- Filtering and Conditions
- Sorting and Limiting Results
- Working with Multiple Tables
- Grouping and Aggregation
- Advanced Filtering with Subqueries
- Window Functions
- Data Modification
- Advanced Techniques
- Performance and Optimization
- Practice Exercises
Before we dive into writing queries, let's establish what we're working with. Think of a database like a digital filing cabinet, but infinitely more organized and powerful.
A relational database stores information in tables, much like spreadsheets, but with strict rules about how data relates to each other. Each table represents a specific type of entity (like customers, orders, or products), and tables can reference each other through relationships.
Key Concepts:
- Table: A collection of related data organized in rows and columns
- Row (Record): A single entry in a table representing one instance of the entity
- Column (Field): A specific attribute or property of the entity
- Primary Key: A unique identifier for each row in a table
- Foreign Key: A reference to a primary key in another table, creating relationships
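The key concepts above can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module and an in-memory database as an assumption for illustration (the course itself does not prescribe a database), with trimmed-down versions of the customers and orders tables and made-up rows. It shows a primary key rejecting a duplicate identifier and a foreign key rejecting a reference to a customer that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Trimmed-down tables (illustrative subset of the course schema).
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid: customer 1 exists


def try_insert(sql):
    """Return True if the insert succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Primary key: a second row with customer_id = 1 is rejected.
duplicate_key_ok = try_insert("INSERT INTO customers VALUES (1, 'Grace')")
# Foreign key: an order pointing at a non-existent customer is rejected.
orphan_order_ok = try_insert("INSERT INTO orders VALUES (101, 999)")

print(duplicate_key_ok, orphan_order_ok)
```

Other database systems enforce these constraints by default; the pragma is a SQLite-specific detail.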
Throughout this course, we'll use a fictional e-commerce database with these tables:
-- Customers table
customers (
customer_id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
registration_date DATE,
city VARCHAR(50),
country VARCHAR(50)
)
-- Products table
products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2),
stock_quantity INT,
supplier_id INT
)
-- Orders table
orders (
order_id INT PRIMARY KEY,
customer_id INT, -- Foreign key to customers
order_date DATE,
total_amount DECIMAL(10,2),
status VARCHAR(20)
)
-- Order_items table (junction table for many-to-many relationship)
order_items (
order_item_id INT PRIMARY KEY,
order_id INT, -- Foreign key to orders
product_id INT, -- Foreign key to products
quantity INT,
unit_price DECIMAL(10,2)
)
Think of this structure like a real business: customers place orders, orders contain multiple products, and each product has its own details. The relationships between these tables mirror real-world connections.
The SELECT statement is your primary tool for retrieving information from a database. Think of it as asking the database a question - you specify what information you want and from which table.
SELECT column1, column2, column3 -- What data do you want?
FROM table_name; -- Which table contains this data?
Let's start with the most basic query - retrieving all data from a table:
-- Get all information about all customers
SELECT * FROM customers;
The asterisk (*) is a wildcard that means "give me all columns." While convenient for exploration, it's generally better to specify exactly which columns you need.
-- Get just the names and email addresses of customers
SELECT first_name, last_name, email
FROM customers;
This approach has several advantages: it's faster (less data transferred), clearer about your intentions, and more maintainable if the table structure changes.
You can rename columns in your output using aliases, which is especially useful for calculated fields or when column names aren't user-friendly:
-- Create more readable column headers
SELECT
first_name AS "First Name",
last_name AS "Last Name",
email AS "Email Address"
FROM customers;
SQL can perform calculations on numeric data:
-- Calculate total value of each product in stock
SELECT
product_name,
price,
stock_quantity,
price * stock_quantity AS total_inventory_value
FROM products;
Notice how we created a new calculated column. The database multiplies the price by stock quantity for each row and displays the result under our chosen alias.
Real-world scenarios rarely require all data from a table. The WHERE clause allows you to specify conditions that rows must meet to be included in your results.
-- Find customers from a specific city
SELECT first_name, last_name, city
FROM customers
WHERE city = 'New York';
SQL provides various operators for different types of comparisons:
-- Products with price greater than $50
SELECT product_name, price
FROM products
WHERE price > 50.00;
-- Orders from 2024 or later
SELECT order_id, order_date, total_amount
FROM orders
WHERE order_date >= '2024-01-01';
-- Products that are NOT in the Electronics category
SELECT product_name, category
FROM products
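WHERE category != 'Electronics'; -- or use <> instead of !=
Whether text comparisons are case-sensitive depends on the database and its collation settings (PostgreSQL compares case-sensitively by default, while MySQL's default collations do not), and you have several options for pattern matching:
-- Exact match (case sensitivity depends on the collation)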
SELECT * FROM customers
WHERE last_name = 'Smith';
-- Pattern matching with LIKE
SELECT * FROM customers
WHERE last_name LIKE 'Sm%'; -- Names starting with 'Sm'
-- Case-insensitive search (using UPPER or LOWER)
SELECT * FROM customers
WHERE UPPER(last_name) = 'SMITH';
LIKE Pattern Wildcards:
- % matches any sequence of characters (including zero characters)
- _ matches exactly one character
-- Names with exactly 5 characters
SELECT * FROM customers WHERE first_name LIKE '_____';
-- Email addresses from Gmail
SELECT * FROM customers WHERE email LIKE '%@gmail.com';
-- Products with 'phone' anywhere in the name
SELECT * FROM products WHERE LOWER(product_name) LIKE '%phone%';
Real queries often need multiple conditions. SQL provides logical operators to combine them:
-- AND: Both conditions must be true
SELECT product_name, price, category
FROM products
WHERE price > 100 AND category = 'Electronics';
-- OR: Either condition can be true
SELECT customer_id, first_name, last_name
FROM customers
WHERE city = 'New York' OR city = 'Los Angeles';
-- Complex combinations using parentheses
SELECT product_name, price, category
FROM products
WHERE (price > 100 AND category = 'Electronics')
OR (price > 200 AND category = 'Clothing');
Think of parentheses like mathematical equations - they control the order of evaluation and make your intentions clear.
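To see why the parentheses matter, here is a small sketch using Python's built-in sqlite3 module with an in-memory table and made-up products (both are assumptions for illustration). Without parentheses, AND binds more tightly than OR, so the price filter applies only to the second category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Budget Cable", 10, "Electronics"),
        ("Laptop", 900, "Electronics"),
        ("Designer Coat", 250, "Clothing"),
        ("T-Shirt", 15, "Clothing"),
    ],
)

# No parentheses: evaluated as
#   category = 'Electronics' OR (category = 'Clothing' AND price > 100)
unparenthesized = [
    row[0]
    for row in conn.execute("""
        SELECT product_name FROM products
        WHERE category = 'Electronics' OR category = 'Clothing' AND price > 100
        ORDER BY product_name
    """)
]

# Parentheses apply the price filter to both categories.
parenthesized = [
    row[0]
    for row in conn.execute("""
        SELECT product_name FROM products
        WHERE (category = 'Electronics' OR category = 'Clothing') AND price > 100
        ORDER BY product_name
    """)
]

print(unparenthesized)  # the cheap cable slips through
print(parenthesized)    # only products over 100
```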
NULL represents missing or unknown data, and it requires special handling:
-- Find customers without a city listed
SELECT first_name, last_name
FROM customers
WHERE city IS NULL;
-- Find customers WITH a city listed
SELECT first_name, last_name, city
FROM customers
WHERE city IS NOT NULL;
Important: You cannot use = NULL or != NULL. A comparison with NULL never evaluates to true, so such a condition matches no rows. NULL checks always require IS NULL or IS NOT NULL.
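The behaviour is easy to verify. The sketch below assumes Python's built-in sqlite3 module and a two-row illustrative table: the `= NULL` and `!= NULL` conditions silently return nothing, while IS NULL finds the row with the missing city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "London"), ("Grace", None)],  # None is stored as NULL
)

# Comparisons against NULL yield "unknown", which WHERE treats as not true.
equals_null = conn.execute(
    "SELECT first_name FROM customers WHERE city = NULL"
).fetchall()
not_equals_null = conn.execute(
    "SELECT first_name FROM customers WHERE city != NULL"
).fetchall()

# IS NULL is the correct test for missing values.
is_null = conn.execute(
    "SELECT first_name FROM customers WHERE city IS NULL"
).fetchall()

print(equals_null, not_equals_null, is_null)
```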
When you need to check if a value matches any item in a list, IN is more concise than multiple OR conditions:
-- Traditional approach with OR
SELECT * FROM products
WHERE category = 'Electronics' OR category = 'Clothing' OR category = 'Books';
-- More elegant approach with IN
SELECT * FROM products
WHERE category IN ('Electronics', 'Clothing', 'Books');
-- NOT IN for exclusion
SELECT * FROM products
WHERE category NOT IN ('Electronics', 'Clothing');
For checking if a value falls within a range:
-- Products priced between $20 and $100
SELECT product_name, price
FROM products
WHERE price BETWEEN 20.00 AND 100.00;
-- Orders from the first quarter of 2024
SELECT order_id, order_date, total_amount
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';
BETWEEN is inclusive on both ends, meaning the boundary values are included in the results.
Once you've filtered your data, you'll often want to control how it's presented and how much of it you see.
-- Sort customers alphabetically by last name
SELECT first_name, last_name, email
FROM customers
ORDER BY last_name;
-- Sort products by price, highest first
SELECT product_name, price
FROM products
ORDER BY price DESC; -- DESC for descending, ASC for ascending (default)
You can sort by multiple columns, with each subsequent column serving as a "tie-breaker":
-- Sort by category first, then by price within each category
SELECT product_name, category, price
FROM products
ORDER BY category ASC, price DESC;
This query sorts products alphabetically by category and, within each category, shows the most expensive products first.
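As a quick check of the tie-breaker behaviour, the following sketch runs the same ORDER BY on a handful of made-up products. It assumes Python's built-in sqlite3 module and an in-memory table purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, category TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Cable", "Electronics", 10),
        ("Novel", "Books", 12),
        ("Laptop", "Electronics", 900),
        ("Textbook", "Books", 80),
    ],
)

# category decides the overall order; price only breaks ties within a category.
sorted_rows = conn.execute("""
    SELECT product_name, category, price
    FROM products
    ORDER BY category ASC, price DESC
""").fetchall()

for row in sorted_rows:
    print(row)
```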
When dealing with large datasets, you often want just a subset of results:
-- Get the 5 most expensive products
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 5;
-- Get products 11-20 when sorted by price (pagination)
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 10 OFFSET 10; -- Skip first 10, then take next 10
Note: Different database systems use different syntax for limiting results:
- MySQL/PostgreSQL: LIMIT n or LIMIT n OFFSET m
- SQL Server: TOP n or OFFSET m ROWS FETCH NEXT n ROWS ONLY
- Oracle: ROWNUM <= n or FETCH FIRST n ROWS ONLY
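The pagination pattern can be wrapped in a small helper. This is a sketch under stated assumptions: it uses Python's built-in sqlite3 module (which accepts the LIMIT n OFFSET m form), a hypothetical `fetch_page` function, and 25 generated products. The extra `product_id` sort key is an addition so that rows with equal prices keep a stable order across pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, price INTEGER)"
)
# 25 illustrative products priced 1..25.
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(i, f"Product {i}", i) for i in range(1, 26)],
)


def fetch_page(page_number, page_size=10):
    """Return one page of products, most expensive first (pages start at 1)."""
    offset = (page_number - 1) * page_size
    return conn.execute(
        """
        SELECT product_name, price
        FROM products
        ORDER BY price DESC, product_id  -- product_id makes the order deterministic
        LIMIT ? OFFSET ?
        """,
        (page_size, offset),
    ).fetchall()


page_two_prices = [price for _, price in fetch_page(2)]
last_page = fetch_page(3)
print(page_two_prices)
print(len(last_page))
```

Large offsets force the database to walk past all skipped rows, so this approach suits small to medium result sets best.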
Real-world data is spread across multiple related tables. Joins allow you to combine data from different tables based on their relationships.
Before diving into joins, let's understand how tables relate:
- A customer can have many orders (one-to-many)
- An order can contain many products, and a product can be in many orders (many-to-many, handled through the order_items table)
An INNER JOIN returns only rows where matching records exist in both tables:
-- Get customer information along with their orders
SELECT
c.first_name,
c.last_name,
o.order_id,
o.order_date,
o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date;
Key Points:
- Table aliases (c for customers, o for orders) make queries more readable
- The ON clause specifies how tables are related
- Only customers who have placed orders will appear in results
Sometimes you want all records from one table, even if they don't have matches in the other:
-- Get all customers, including those who haven't placed orders
SELECT
c.first_name,
c.last_name,
c.email,
o.order_id,
o.order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name;
Customers without orders will show NULL values for the order columns. This is useful for finding inactive customers or analyzing customer engagement.
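The difference between the two join types, and the "inactive customers" use case, can be sketched as follows. The example assumes Python's built-in sqlite3 module and a few made-up rows. Filtering the LEFT JOIN result on `o.order_id IS NULL` keeps only customers with no matching order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Linus")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 1), (101, 1), (102, 3)],  # Grace has no orders
)

# INNER JOIN: only customer/order pairs that match.
inner_rows = conn.execute("""
    SELECT c.first_name, o.order_id
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# LEFT JOIN: every customer, with NULL (None in Python) where no order exists.
left_rows = conn.execute("""
    SELECT c.first_name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# Keeping only the unmatched rows finds customers who never ordered.
inactive = conn.execute("""
    SELECT c.first_name
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").fetchall()

print(inner_rows)
print(left_rows)
print(inactive)
```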
-- RIGHT JOIN: All orders, even if customer data is missing (rare in practice)
SELECT
c.first_name,
c.last_name,
o.order_id,
o.total_amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;
-- FULL OUTER JOIN: All customers and all orders (not supported in all databases)
SELECT
c.first_name,
c.last_name,
o.order_id,
o.total_amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
Real queries often involve three or more tables:
-- Get detailed order information: customer, order, and product details
SELECT
c.first_name,
c.last_name,
o.order_date,
p.product_name,
oi.quantity,
oi.unit_price,
(oi.quantity * oi.unit_price) AS line_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
ORDER BY c.last_name, o.order_date;
This query tells a complete story: who bought what, when, and for how much.
Sometimes a table needs to be joined with itself. This is common with hierarchical data:
-- If we had an employees table with manager relationships
SELECT
e.first_name AS employee_name,
m.first_name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
Aggregation allows you to summarize data - counting records, calculating totals, finding averages, and more.
-- Count total number of customers
SELECT COUNT(*) AS total_customers
FROM customers;
-- Count customers with email addresses (excludes NULLs)
SELECT COUNT(email) AS customers_with_email
FROM customers;
-- Basic statistics on product prices
SELECT
COUNT(*) AS total_products,
MIN(price) AS cheapest_price,
MAX(price) AS most_expensive_price,
AVG(price) AS average_price,
SUM(stock_quantity) AS total_inventory_units
FROM products;
GROUP BY allows you to create summaries for each group of related records:
-- Count customers by city
SELECT
city,
COUNT(*) AS customer_count
FROM customers
WHERE city IS NOT NULL
GROUP BY city
ORDER BY customer_count DESC;
-- Total sales by product category
SELECT
p.category,
COUNT(oi.order_item_id) AS items_sold,
SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;
Think of GROUP BY this way: Imagine sorting all your data into separate piles based on the grouping column(s), then calculating statistics for each pile.
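The "piles" picture can be made literal. The sketch below, which assumes Python's built-in sqlite3 module and made-up customer rows, builds the piles by hand with a dictionary and compares the result with what GROUP BY returns.

```python
import sqlite3
from collections import defaultdict

rows = [
    ("Ada", "New York"),
    ("Grace", "Chicago"),
    ("Linus", "New York"),
    ("Edsger", "Miami"),
    ("Barbara", "Chicago"),
    ("Alan", "New York"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Manual version: one pile per city, then a statistic (the count) per pile.
piles = defaultdict(list)
for first_name, city in rows:
    piles[city].append(first_name)
manual_counts = sorted((city, len(names)) for city, names in piles.items())

# SQL version: GROUP BY builds the piles, COUNT(*) summarizes each one.
sql_counts = conn.execute("""
    SELECT city, COUNT(*) AS customer_count
    FROM customers
    GROUP BY city
    ORDER BY city
""").fetchall()

print(manual_counts)
print(sql_counts)
```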
WHERE filters individual rows before grouping, but HAVING filters groups after aggregation:
-- Find cities with more than 5 customers
SELECT
city,
COUNT(*) AS customer_count
FROM customers
WHERE city IS NOT NULL
GROUP BY city
HAVING COUNT(*) > 5
ORDER BY customer_count DESC;
-- Product categories with average price over $50
SELECT
category,
COUNT(*) AS product_count,
AVG(price) AS average_price
FROM products
GROUP BY category
HAVING AVG(price) > 50.00;
-- Monthly sales summary for 2024
-- (YEAR() and MONTH() are MySQL/SQL Server functions; PostgreSQL uses EXTRACT)
SELECT
YEAR(o.order_date) AS year,
MONTH(o.order_date) AS month,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.total_amount) AS monthly_revenue
FROM orders o
WHERE o.order_date >= '2024-01-01'
GROUP BY YEAR(o.order_date), MONTH(o.order_date)
ORDER BY year, month;
-- Customer purchase behavior analysis
SELECT
c.customer_id,
c.first_name,
c.last_name,
COUNT(o.order_id) AS total_orders,
SUM(o.total_amount) AS total_spent,
AVG(o.total_amount) AS average_order_value,
MIN(o.order_date) AS first_order_date,
MAX(o.order_date) AS last_order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY total_spent DESC;
Subqueries are queries within queries, allowing you to use the results of one query as input to another. They're powerful tools for complex filtering and data analysis.
-- Find products that cost more than the average price
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
-- Find customers who have placed orders
SELECT first_name, last_name
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
);
The subquery in parentheses executes first, and its result is used by the outer query.
In correlated subqueries, the inner query references columns from the outer query:
-- Find customers whose latest order was in the last 30 days
SELECT c.first_name, c.last_name, c.email
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= CURRENT_DATE - INTERVAL 30 DAY -- MySQL syntax; PostgreSQL: INTERVAL '30 days'
);
-- Find products that have never been ordered
SELECT product_name, price
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.product_id = p.product_id
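);
EXISTS vs IN: EXISTS is often at least as efficient as IN, and NOT EXISTS behaves more predictably than NOT IN when the subquery can return NULL values: a single NULL in the subquery result makes NOT IN match no rows at all.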
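The NULL pitfall is easiest to see with NOT IN. In this sketch (Python's built-in sqlite3 module and made-up rows, both assumptions for illustration), one order item has a NULL product_id. The NOT IN query then returns nothing, while the NOT EXISTS query still finds the products that were never ordered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT)")
conn.execute("CREATE TABLE order_items (order_item_id INTEGER PRIMARY KEY, product_id INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Keyboard"), (2, "Mouse"), (3, "Monitor")],
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [(10, 1), (11, None)],  # one item has a missing (NULL) product_id
)

# NOT IN: "2 NOT IN (1, NULL)" is unknown rather than true, so no row qualifies.
not_in_result = conn.execute("""
    SELECT product_name FROM products
    WHERE product_id NOT IN (SELECT product_id FROM order_items)
    ORDER BY product_id
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so unordered products appear.
not_exists_result = conn.execute("""
    SELECT product_name FROM products p
    WHERE NOT EXISTS (
        SELECT 1 FROM order_items oi WHERE oi.product_id = p.product_id
    )
    ORDER BY product_id
""").fetchall()

print(not_in_result)
print(not_exists_result)
```

If NOT IN is preferred, adding `WHERE product_id IS NOT NULL` inside the subquery avoids the problem.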
You can use subqueries to add calculated columns:
-- Show each customer with their total number of orders
SELECT
c.first_name,
c.last_name,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id) AS total_orders,
(SELECT MAX(order_date)
FROM orders o
WHERE o.customer_id = c.customer_id) AS last_order_date
FROM customers c
ORDER BY total_orders DESC;
Common Table Expressions (CTEs) provide a cleaner way to write complex queries with subqueries:
-- Find customers who spent more than the average customer
WITH customer_totals AS (
SELECT
c.customer_id,
c.first_name,
c.last_name,
SUM(o.total_amount) AS total_spent
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
),
overall_average AS (
SELECT AVG(total_spent) AS avg_spent
FROM customer_totals
)
SELECT
ct.first_name,
ct.last_name,
ct.total_spent,
oa.avg_spent
FROM customer_totals ct
CROSS JOIN overall_average oa
WHERE ct.total_spent > oa.avg_spent
ORDER BY ct.total_spent DESC;
CTEs make complex logic more readable by breaking it into named, reusable pieces.
Window functions perform calculations across a set of rows related to the current row, without collapsing the results into groups like aggregate functions do.
-- Add row numbers to products ordered by price
SELECT
product_name,
price,
ROW_NUMBER() OVER (ORDER BY price DESC) as price_rank
FROM products;
-- Show each order with running total of sales
SELECT
order_id,
order_date,
total_amount,
SUM(total_amount) OVER (ORDER BY order_date) as running_total
FROM orders
ORDER BY order_date;
The PARTITION BY clause creates separate "windows" for different groups:
-- Rank products by price within each category
SELECT
product_name,
category,
price,
RANK() OVER (PARTITION BY category ORDER BY price DESC) as category_price_rank
FROM products;
-- Show each customer's orders with order sequence
SELECT
c.first_name,
c.last_name,
o.order_date,
o.total_amount,
ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as order_sequence
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date;
-- Compare each order to the previous order for the same customer
SELECT
c.first_name,
c.last_name,
o.order_date,
o.total_amount,
LAG(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as previous_order_amount,
o.total_amount - LAG(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as change_from_previous
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date;
-- Find top 3 products in each category
SELECT *
FROM (
SELECT
product_name,
category,
price,
DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) as price_rank
FROM products
) ranked_products
WHERE price_rank <= 3
ORDER BY category, price_rank;
Key Window Functions:
- ROW_NUMBER(): Assigns unique sequential numbers
- RANK(): Assigns ranks with gaps for ties
- DENSE_RANK(): Assigns ranks without gaps for ties
- LAG()/LEAD(): Access previous/next row values
- FIRST_VALUE()/LAST_VALUE(): Get first/last values in window
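The difference between the three ranking functions only shows up when values tie. Here is a minimal sketch with two products at the same price, assuming Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first version with window functions) and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A", 100), ("B", 100), ("C", 80), ("D", 50)],  # A and B tie on price
)

ranked = conn.execute("""
    SELECT
        product_name,
        -- product_name is added as a tie-breaker so row numbers are deterministic
        ROW_NUMBER() OVER (ORDER BY price DESC, product_name) AS row_num,
        RANK()       OVER (ORDER BY price DESC)               AS price_rank,
        DENSE_RANK() OVER (ORDER BY price DESC)               AS dense_price_rank
    FROM products
    ORDER BY row_num
""").fetchall()

for row in ranked:
    print(row)
```

After the tie, RANK() jumps to 3 for product C while DENSE_RANK() continues with 2.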
Beyond querying data, SQL allows you to insert, update, and delete records.
-- Insert a single customer
INSERT INTO customers (first_name, last_name, email, registration_date, city, country)
VALUES ('John', 'Doe', 'john.doe@example.com', '2024-06-09', 'Chicago', 'USA');
-- Insert multiple customers at once
INSERT INTO customers (first_name, last_name, email, registration_date, city, country)
VALUES
('Jane', 'Smith', 'jane.smith@example.com', '2024-06-09', 'Miami', 'USA'),
('Bob', 'Johnson', 'bob.johnson@example.com', '2024-06-09', 'Seattle', 'USA');
-- Insert from a query (copying data)
INSERT INTO archived_orders (order_id, customer_id, order_date, total_amount)
SELECT order_id, customer_id, order_date, total_amount
FROM orders
WHERE order_date < '2023-01-01';
-- Update a single record
UPDATE customers
SET city = 'New Chicago', country = 'USA'
WHERE customer_id = 1;
-- Update multiple records with conditions
UPDATE products
SET price = price * 1.10 -- 10% price increase
WHERE category = 'Electronics';
-- Update using data from other tables
UPDATE customers c
SET city = 'Updated City'
WHERE c.customer_id IN (
SELECT DISTINCT o.customer_id
FROM orders o
WHERE o.order_date >= '2024-01-01'
);
Warning: Always use WHERE clauses with UPDATE statements unless you intend to modify every row in the table.
-- Delete specific records
DELETE FROM customers
WHERE registration_date IS NULL;
-- Delete based on related data
DELETE FROM products
WHERE product_id NOT IN (
SELECT DISTINCT product_id
FROM order_items
);
-- Delete all records (use with extreme caution)
DELETE FROM temp_table;
Some databases support "upsert" operations (insert or update):
-- MySQL example: INSERT ... ON DUPLICATE KEY UPDATE
INSERT INTO products (product_id, product_name, price)
VALUES (1, 'Updated Product', 29.99)
ON DUPLICATE KEY UPDATE
product_name = 'Updated Product',
price = 29.99;
-- PostgreSQL example: INSERT ... ON CONFLICT
INSERT INTO products (product_id, product_name, price)
VALUES (1, 'Updated Product', 29.99)
ON CONFLICT (product_id)
DO UPDATE SET
product_name = 'Updated Product',
price = 29.99;
CASE statements allow you to implement conditional logic within queries:
-- Categorize customers by order frequency
SELECT
c.first_name,
c.last_name,
COUNT(o.order_id) as order_count,
CASE
WHEN COUNT(o.order_id) >= 10 THEN 'High Value'
WHEN COUNT(o.order_id) >= 5 THEN 'Medium Value'
WHEN COUNT(o.order_id) >= 1 THEN 'Low Value'
ELSE 'No Orders'
END as customer_category
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC;
-- Create pivot-like reports
SELECT
category,
SUM(CASE WHEN price < 50 THEN 1 ELSE 0 END) as budget_products,
SUM(CASE WHEN price BETWEEN 50 AND 100 THEN 1 ELSE 0 END) as mid_range_products,
SUM(CASE WHEN price > 100 THEN 1 ELSE 0 END) as premium_products
FROM products
GROUP BY category;
-- Extract date parts (MySQL syntax)
SELECT
order_id,
order_date,
YEAR(order_date) as order_year,
MONTH(order_date) as order_month,
DAYNAME(order_date) as order_day_name,
QUARTER(order_date) as order_quarter
FROM orders;
-- Date arithmetic (MySQL syntax)
SELECT
customer_id,
registration_date,
DATEDIFF(CURRENT_DATE, registration_date) as days_since_registration,
DATE_ADD(registration_date, INTERVAL 1 YEAR) as one_year_anniversary
FROM customers;
-- Time-based analysis (PostgreSQL syntax: DATE_TRUNC)
SELECT
DATE_TRUNC('month', order_date) as month,
COUNT(*) as orders_count,
SUM(total_amount) as monthly_revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
-- Text manipulation
SELECT
first_name,
last_name,
CONCAT(first_name, ' ', last_name) as full_name,
UPPER(email) as email_upper,
LENGTH(email) as email_length,
SUBSTRING(email, 1, POSITION('@' IN email) - 1) as username
FROM customers;
-- Pattern matching and replacement
SELECT
product_name,
REPLACE(product_name, 'iPhone', 'Phone') as generic_name,
CASE
WHEN product_name LIKE '%Pro%' THEN 'Professional'
WHEN product_name LIKE '%Mini%' THEN 'Compact'
ELSE 'Standard'
END as product_tier
FROM products;
-- UNION: Combine results from multiple queries
SELECT city FROM customers WHERE country = 'USA'
UNION
SELECT city FROM suppliers WHERE country = 'USA';
-- INTERSECT: Find common values (not supported in all databases)
SELECT city FROM customers
INTERSECT
SELECT city FROM suppliers;
-- EXCEPT/MINUS: Find values in first query but not second
SELECT city FROM customers
EXCEPT
SELECT city FROM suppliers;
Understanding query performance is crucial for working with large datasets.
Most databases provide tools to show how queries are executed:
-- Show execution plan (syntax varies by database)
EXPLAIN SELECT c.first_name, c.last_name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;
Indexes speed up queries but slow down modifications:
-- Create indexes on frequently queried columns
CREATE INDEX idx_customers_email ON customers(email);
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
CREATE INDEX idx_products_category_price ON products(category, price);
-- Composite indexes for multi-column queries
CREATE INDEX idx_order_items_lookup ON order_items(order_id, product_id);
-- Use specific columns instead of SELECT *
SELECT customer_id, first_name, last_name -- Good
FROM customers;
SELECT * -- Avoid when possible
FROM customers;
-- Use LIMIT when you don't need all results
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 10; -- Only get top 10
-- Use EXISTS instead of IN for correlated subqueries
SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS ( -- More efficient
SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
-- Instead of
SELECT c.first_name, c.last_name
FROM customers c
WHERE c.customer_id IN ( -- Can be slower
SELECT customer_id FROM orders
);
-- Avoid functions on columns in WHERE clauses
-- Slow:
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Better:
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Be careful with wildcards at the beginning of LIKE patterns
-- Slow (can't use indexes):
SELECT * FROM customers WHERE last_name LIKE '%son';
-- Better (can use indexes):
SELECT * FROM customers WHERE last_name LIKE 'John%';
- Basic Selection: Write a query to find all products in the 'Electronics' category with a price under $100.
- Customer Analysis: Find all customers who registered in 2024, sorted by registration date.
- Order Summary: Count the total number of orders and calculate the total revenue from all orders.
- Join Practice: Create a report showing customer names, their order dates, and order totals for orders placed in the last 6 months.
- Grouping Challenge: Find the top 5 best-selling products by total quantity sold, including the product name, total quantity, and total revenue generated.
- Subquery Practice: Find customers who have spent more than the average customer spending amount.
- Window Functions: Create a report showing each customer's orders with a running total of their spending over time.
- Complex Analysis: Find products that have been ordered in every month of 2024 (if any).
- Performance Challenge: Write an optimized query to find the most popular product in each category for the current year.
- Business Intelligence: Create a comprehensive customer segmentation analysis that categorizes customers as:
  - VIP: Top 10% by total spending
  - Regular: Next 40% by total spending
  - Occasional: Remaining customers with orders
  - Inactive: Customers with no orders
Exercise 1: Basic Selection
SELECT product_name, price, category
FROM products
WHERE category = 'Electronics' AND price < 100
ORDER BY price;
Exercise 2: Customer Analysis
SELECT first_name, last_name, email, registration_date
FROM customers
WHERE registration_date >= '2024-01-01' AND registration_date < '2025-01-01'
ORDER BY registration_date;
Exercise 3: Order Summary
SELECT
COUNT(*) as total_orders,
SUM(total_amount) as total_revenue,
AVG(total_amount) as average_order_value
FROM orders;
Exercise 4: Join Practice
SELECT
c.first_name,
c.last_name,
o.order_date,
o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 6 MONTH)
ORDER BY c.last_name, o.order_date;
Exercise 5: Grouping Challenge
SELECT
p.product_name,
SUM(oi.quantity) as total_quantity_sold,
SUM(oi.quantity * oi.unit_price) as total_revenue
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name
ORDER BY total_quantity_sold DESC
LIMIT 5;
Exercise 6: Subquery Practice
WITH customer_spending AS (
SELECT
c.customer_id,
c.first_name,
c.last_name,
SUM(o.total_amount) as total_spent
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
)
SELECT
first_name,
last_name,
total_spent
FROM customer_spending
WHERE total_spent > (
SELECT AVG(total_spent) FROM customer_spending
)
ORDER BY total_spent DESC;
Exercise 7: Window Functions
SELECT
c.first_name,
c.last_name,
o.order_date,
o.total_amount,
SUM(o.total_amount) OVER (
PARTITION BY c.customer_id
ORDER BY o.order_date
ROWS UNBOUNDED PRECEDING
) as running_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_date;
Exercise 8: Complex Analysis
WITH monthly_products AS (
SELECT DISTINCT
p.product_id,
p.product_name,
MONTH(o.order_date) as order_month
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
INNER JOIN orders o ON oi.order_id = o.order_id
WHERE YEAR(o.order_date) = 2024
),
product_month_counts AS (
SELECT
product_id,
product_name,
COUNT(DISTINCT order_month) as months_sold
FROM monthly_products
GROUP BY product_id, product_name
)
SELECT product_name
FROM product_month_counts
WHERE months_sold = 12; -- All 12 months
Exercise 10: Business Intelligence - Customer Segmentation
WITH customer_totals AS (
SELECT
c.customer_id,
c.first_name,
c.last_name,
c.email,
COALESCE(SUM(o.total_amount), 0) as total_spent,
COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.email
),
spending_percentiles AS (
SELECT
*,
PERCENT_RANK() OVER (ORDER BY total_spent DESC) as spending_percentile
FROM customer_totals
WHERE total_spent > 0
)
SELECT
customer_id,
first_name,
last_name,
email,
total_spent,
order_count,
CASE
WHEN total_spent = 0 THEN 'Inactive'
WHEN spending_percentile <= 0.10 THEN 'VIP'
WHEN spending_percentile <= 0.50 THEN 'Regular'
ELSE 'Occasional'
END as customer_segment,
ROUND(spending_percentile * 100, 1) as spending_percentile_rank
FROM (
SELECT
ct.*,
COALESCE(sp.spending_percentile, 1.0) as spending_percentile
FROM customer_totals ct
LEFT JOIN spending_percentiles sp ON ct.customer_id = sp.customer_id
) segmented_customers
ORDER BY total_spent DESC;
-- Daily sales dashboard
WITH daily_metrics AS (
SELECT
DATE(order_date) as sale_date,
COUNT(DISTINCT order_id) as orders,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(total_amount) as revenue,
AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY DATE(order_date)
)
SELECT
sale_date,
orders,
unique_customers,
revenue,
avg_order_value,
revenue - LAG(revenue) OVER (ORDER BY sale_date) as revenue_change,
ROUND(
((revenue - LAG(revenue) OVER (ORDER BY sale_date)) /
LAG(revenue) OVER (ORDER BY sale_date)) * 100, 2
) as revenue_change_percent
FROM daily_metrics
ORDER BY sale_date DESC;
-- Customer cohort analysis (simplified)
WITH customer_first_orders AS (
SELECT
customer_id,
MIN(order_date) as first_order_date,
DATE_FORMAT(MIN(order_date), '%Y-%m') as cohort_month
FROM orders
GROUP BY customer_id
),
monthly_activity AS (
SELECT
cfo.customer_id,
cfo.cohort_month,
DATE_FORMAT(o.order_date, '%Y-%m') as activity_month,
TIMESTAMPDIFF(MONTH, cfo.first_order_date, o.order_date) as period_number
FROM customer_first_orders cfo
INNER JOIN orders o ON cfo.customer_id = o.customer_id
)
SELECT
cohort_month,
period_number,
COUNT(DISTINCT customer_id) as customers
FROM monthly_activity
WHERE period_number <= 12 -- First 12 months
GROUP BY cohort_month, period_number
ORDER BY cohort_month, period_number;
-- Products needing restock alert
WITH product_velocity AS (
SELECT
p.product_id,
p.product_name,
p.stock_quantity,
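-- Count an item only when its order matched the 30-day filter in the join below;
-- otherwise the LEFT JOIN would keep items from older orders in the sum
COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) as units_sold_30_days,
COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END) / 30.0, 0) as avg_daily_sales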
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
AND o.order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY p.product_id, p.product_name, p.stock_quantity
)
SELECT
product_name,
stock_quantity,
units_sold_30_days,
ROUND(avg_daily_sales, 2) as avg_daily_sales,
CASE
WHEN avg_daily_sales > 0
THEN ROUND(stock_quantity / avg_daily_sales, 0)
ELSE 999
END as days_of_inventory,
CASE
WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 7 THEN 'URGENT'
WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 14 THEN 'LOW'
WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 30 THEN 'NORMAL'
ELSE 'HIGH'
END as inventory_status
FROM product_velocity
WHERE avg_daily_sales > 0
ORDER BY days_of_inventory ASC;
-- MySQL date functions
SELECT
order_date,
DATE_FORMAT(order_date, '%Y-%m') as order_year_month, -- YEAR_MONTH is a reserved word in MySQL
WEEKDAY(order_date) as day_of_week,
STR_TO_DATE('2024-06-09', '%Y-%m-%d') as parsed_date
FROM orders;
-- MySQL string functions
SELECT
CONCAT(first_name, ' ', last_name) as full_name,
CHAR_LENGTH(email) as email_length,
SUBSTRING_INDEX(email, '@', 1) as username
FROM customers;
-- PostgreSQL date functions
SELECT
order_date,
EXTRACT(YEAR FROM order_date) as year,
DATE_TRUNC('month', order_date) as month_start,
order_date + INTERVAL '30 days' as future_date
FROM orders;
-- PostgreSQL arrays and JSON (if supported)
SELECT
customer_id,
ARRAY_AGG(product_name) as purchased_products,
JSON_AGG(
JSON_BUILD_OBJECT(
'product', product_name,
'quantity', quantity
)
) as order_details
FROM customers c
JOIN orders o USING (customer_id)
JOIN order_items oi USING (order_id)
JOIN products p USING (product_id)
GROUP BY customer_id;
-- SQL Server TOP and OFFSET/FETCH
SELECT TOP 10 * FROM products ORDER BY price DESC;
SELECT * FROM products
ORDER BY price DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;
-- SQL Server date functions
SELECT
GETDATE() as current_datetime,
DATEPART(YEAR, order_date) as year,
DATEDIFF(DAY, order_date, GETDATE()) as days_ago
FROM orders;
- Use clear, descriptive aliases
-- Good
SELECT c.first_name, c.last_name, o.order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
-- Avoid
SELECT a.first_name, a.last_name, b.order_date
FROM customers a, orders b
WHERE a.customer_id = b.customer_id;
- Format queries for readability
SELECT
c.first_name,
c.last_name,
COUNT(o.order_id) as total_orders,
SUM(o.total_amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.registration_date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING COUNT(o.order_id) > 0
ORDER BY total_spent DESC
LIMIT 10;
- Comment complex logic
-- Calculate customer lifetime value with 30-day recency weighting
SELECT
customer_id,
total_spent *
CASE
WHEN days_since_last_order <= 30 THEN 1.0
WHEN days_since_last_order <= 90 THEN 0.8
ELSE 0.5
END as weighted_clv
FROM customer_metrics;
Understanding query performance requires thinking like the database engine. Every query goes through multiple phases: parsing, optimization, execution planning, and finally execution. Let's explore how to write queries that work with the optimizer rather than against it.
The most impactful performance optimization is proper indexing, but it's not just about "adding indexes to queried columns." The order of columns in composite indexes matters enormously, and understanding this can transform your query performance.
Composite Index Column Ordering
-- If you frequently query: WHERE customer_id = ? AND order_date BETWEEN ? AND ?
-- Create index in this specific order:
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
-- NOT: CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id);
-- The first version allows the database to quickly find all orders for a customer,
-- then scan just that subset for the date range
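-- The second version would need to scan all orders in date range, then filter by customer
The rule here is to put columns compared with equality first, followed by columns used in range conditions; among several equality columns, lead with the one that eliminates the most rows. Once the index reaches a range condition, it can no longer narrow the search on any column that comes after it. Think of an index like a phone book, which is sorted by last name and then by first name: you can quickly find all "Smiths" and then scan through them for "John Smith," but you can't efficiently find all "Johns" without reading the entire book.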
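To see the effect of column order for yourself, you can ask a database for its query plan. The sketch below is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in engine (an assumption made for portability; MySQL, PostgreSQL, and SQL Server word their plans differently). The helper name plan_with_index is hypothetical. It builds the same empty orders table twice, once per index ordering, and prints what SQLite would do for the customer-plus-date-range query.

```python
import sqlite3

QUERY = """
    SELECT order_id FROM orders
    WHERE customer_id = ? AND order_date BETWEEN ? AND ?
"""

def plan_with_index(index_sql):
    """Build an empty orders table with one index and return the plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "order_date TEXT, total_amount REAL)"
    )
    conn.execute(index_sql)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, (42, "2024-01-01", "2024-03-31")
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable description.
    return " | ".join(row[-1] for row in rows)

customer_first = plan_with_index(
    "CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date)"
)
date_first = plan_with_index(
    "CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id)"
)

print("customer_id first:", customer_first)
print("order_date first: ", date_first)
```

With customer_id leading, the plan lists both the equality on customer_id and the date bounds as index constraints. With order_date leading, the customer_id equality no longer appears as a seek constraint, which is the behavior the phone book analogy describes.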
Covering Indexes: Eliminating Table Lookups
-- Instead of just indexing the WHERE clause columns:
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Include columns needed in SELECT to avoid going back to the table:
CREATE INDEX idx_orders_customer_covering ON orders(customer_id, order_date, total_amount);
-- Now this query can be satisfied entirely from the index:
SELECT order_date, total_amount
FROM orders
WHERE customer_id = 12345;
This technique, called a "covering index," means the database never needs to access the actual table data after finding the index entries. It's particularly powerful for frequently-run reporting queries, at the cost of a larger index that must be maintained on every write.
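One way to confirm that an index really covers a query is to compare plans before and after widening the index. This is a small sketch using Python's sqlite3 module as an assumed stand-in engine; the helper plan_with_index is hypothetical, and the exact plan wording is specific to SQLite.

```python
import sqlite3

QUERY = "SELECT order_date, total_amount FROM orders WHERE customer_id = ?"

def plan_with_index(index_sql):
    """Create an empty orders table with a single index and return the plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "order_date TEXT, total_amount REAL, status TEXT)"
    )
    conn.execute(index_sql)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (12345,)).fetchall()
    conn.close()
    return " | ".join(row[-1] for row in rows)

# Index on the WHERE column only: the engine must visit the table for the other columns.
narrow_plan = plan_with_index(
    "CREATE INDEX idx_orders_customer ON orders(customer_id)"
)
# Index that also carries the selected columns: the table is never touched.
covering_plan = plan_with_index(
    "CREATE INDEX idx_orders_customer_covering "
    "ON orders(customer_id, order_date, total_amount)"
)

print("narrow:  ", narrow_plan)
print("covering:", covering_plan)
```

SQLite labels the second plan with the words COVERING INDEX. Other engines signal the same thing differently, for example "Using index" in MySQL's EXPLAIN output or an "Index Only Scan" node in PostgreSQL.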
Sometimes the same logical query can be written in dramatically different ways with vastly different performance characteristics.
Transforming Correlated Subqueries
-- Slow: Correlated subquery that runs once per customer
SELECT c.first_name, c.last_name
FROM customers c
WHERE (
SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2024-01-01'
) > 5;
-- Fast: Join with aggregation (runs aggregation once, then joins)
SELECT c.first_name, c.last_name
FROM customers c
INNER JOIN (
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5
) frequent_customers ON c.customer_id = frequent_customers.customer_id;
When the optimizer cannot flatten it, the first query executes the subquery once for each customer. The second query does the aggregation work once and then performs a simple join. With 10,000 customers, this could be the difference between 10,000 aggregations and one.
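Before swapping one form for the other, it is worth checking on a small data set that both return the same rows. The following sketch does that with Python's sqlite3 module (an assumed stand-in engine); the three customers and their orders are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada', 'Lovelace'), (2, 'Grace', 'Hopper'), (3, 'Alan', 'Turing');
""")

# Customer 1: six orders in 2024. Customer 2: three in 2023 and three in 2024. Customer 3: none.
sample_orders = [(1, f"2024-02-{day:02d}") for day in range(1, 7)]
sample_orders += [(2, f"2023-12-{day:02d}") for day in range(1, 4)]
sample_orders += [(2, f"2024-03-{day:02d}") for day in range(1, 4)]
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)", sample_orders
)

correlated = conn.execute("""
    SELECT c.first_name, c.last_name
    FROM customers c
    WHERE (
        SELECT COUNT(*) FROM orders o
        WHERE o.customer_id = c.customer_id
          AND o.order_date >= '2024-01-01'
    ) > 5
    ORDER BY c.customer_id
""").fetchall()

joined = conn.execute("""
    SELECT c.first_name, c.last_name
    FROM customers c
    INNER JOIN (
        SELECT customer_id
        FROM orders
        WHERE order_date >= '2024-01-01'
        GROUP BY customer_id
        HAVING COUNT(*) > 5
    ) frequent_customers ON c.customer_id = frequent_customers.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(correlated)
print(joined)
```

Both queries return only the customer with six orders in 2024. A check like this does not prove the rewrite is faster on your engine, only that it is safe to benchmark.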
EXISTS vs IN: When It Really Matters
-- EXISTS only tests whether a matching row is present, so NULLs in the subquery cannot change the result:
SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
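-- IN returns the same rows here; the NULL trap is NOT IN, which returns no rows at all if the subquery yields a NULL:
SELECT c.first_name, c.last_name
FROM customers c
WHERE c.customer_id IN (
SELECT customer_id FROM orders -- Under NOT IN, one NULL customer_id here would empty the result
);
The EXISTS version also allows the database to stop searching as soon as it finds one matching order, while IN might need to build the entire set of customer IDs first. Many modern optimizers execute both forms as the same semi-join, so compare execution plans before rewriting one into the other.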
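NULL handling is easiest to understand by running all four variants side by side. The sketch below uses Python's sqlite3 module as an assumed stand-in engine, with made-up data in which one order has a NULL customer_id; the helper ids is hypothetical. It shows that IN and EXISTS agree, while NOT IN and NOT EXISTS do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    -- One real order for customer 1, and one orphaned order with no customer.
    INSERT INTO orders VALUES (1, 1), (2, NULL);
""")

def ids(sql):
    """Run a query and return the customer_id values as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

in_ids = ids("""
    SELECT c.customer_id FROM customers c
    WHERE c.customer_id IN (SELECT o.customer_id FROM orders o)
""")
exists_ids = ids("""
    SELECT c.customer_id FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
""")
not_in_ids = ids("""
    SELECT c.customer_id FROM customers c
    WHERE c.customer_id NOT IN (SELECT o.customer_id FROM orders o)
""")
not_exists_ids = ids("""
    SELECT c.customer_id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
""")

print("IN:        ", in_ids)
print("EXISTS:    ", exists_ids)
print("NOT IN:    ", not_in_ids)
print("NOT EXISTS:", not_exists_ids)
```

The NOT IN query returns nothing because comparing any value with NULL yields unknown, so no row can be proven absent from the list. NOT EXISTS returns customers 2 and 3 as expected.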
The order in which you write your joins can affect performance in complex multi-table queries, mainly on engines or in situations where the optimizer does not reorder joins for you.
Join Order Strategy
-- Less efficient: Starting with the largest table
SELECT p.product_name, c.category_name, s.supplier_name
FROM products p -- 1 million rows
INNER JOIN categories c ON p.category_id = c.category_id -- 50 rows
INNER JOIN suppliers s ON p.supplier_id = s.supplier_id; -- 1000 rows
-- More efficient: Start with smaller, more selective tables
SELECT p.product_name, c.category_name, s.supplier_name
FROM categories c -- 50 rows - start here
INNER JOIN products p ON c.category_id = p.category_id
INNER JOIN suppliers s ON p.supplier_id = s.supplier_id
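WHERE c.category_name = 'Electronics'; -- Very selective condition
While modern optimizers often reorder joins automatically, understanding this principle helps you write queries that work with the optimizer rather than forcing it to work harder. (The categories and suppliers tables and the category_id column are illustrative; they are not part of the sample schema.)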
Choosing Between JOIN Types Based on Data Distribution
-- When you need all customers regardless of orders, but want order info where available:
-- LEFT JOIN is appropriate and efficient
SELECT c.first_name, c.last_name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;
-- But if you know most customers have orders, this might be faster:
-- Get customers with orders, then UNION customers without orders
SELECT c.first_name, c.last_name, COUNT(o.order_id) as order_count
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
UNION ALL
SELECT c.first_name, c.last_name, 0 as order_count
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);
Function Calls in WHERE Clauses
-- Performance killer: Function prevents index usage
SELECT * FROM orders
WHERE YEAR(order_date) = 2024; -- Index on order_date cannot be used
-- Index-friendly version:
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
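AND order_date < '2025-01-01'; -- Index on order_date can be used efficiently
When you wrap a column in a function, the database generally can't use an ordinary index on that column: the index stores the raw column values, so the function would have to be evaluated for every row before the comparison. Some databases support indexes on expressions, but rewriting the condition as a range on the bare column works everywhere.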
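The difference shows up directly in a query plan. In this sketch, Python's sqlite3 module is an assumed stand-in engine, and SQLite's strftime('%Y', ...) plays the role of YEAR(); the helper name plan is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, total_amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_date ON orders(order_date)")

def plan(sql):
    """Return the human-readable part of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# The column is wrapped in a function, so the index on order_date is not used.
function_plan = plan(
    "SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'"
)
# The bare column is compared against a half-open range, so the index applies.
range_plan = plan(
    "SELECT * FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'"
)

print("function:", function_plan)
print("range:   ", range_plan)
```

SQLite reports a SCAN of the whole table for the first query and a SEARCH using idx_orders_date for the second.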
Implicit Type Conversions
-- Hidden performance problem: If customer_id is INT but you pass a string
SELECT * FROM orders WHERE customer_id = '12345'; -- Implicit conversion; if the engine converts the column instead of the literal, the index is bypassed
-- Better: Match the data type exactly
SELECT * FROM orders WHERE customer_id = 12345; -- Direct comparison
-- Even worse: This forces conversion of ALL customer_id values
SELECT * FROM orders WHERE CAST(customer_id AS VARCHAR) = '12345';
Premature DISTINCT Usage
-- Expensive: DISTINCT requires sorting/hashing entire result set
SELECT DISTINCT c.first_name, c.last_name
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
-- Often better: Use EXISTS to avoid duplicates in the first place
SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);
Window functions are powerful but can be resource-intensive. Understanding their performance characteristics helps you use them wisely.
Partitioning Strategy
-- Less efficient: Large partitions mean more sorting work
SELECT
product_name,
price,
ROW_NUMBER() OVER (ORDER BY price) as overall_rank -- Sorts ALL products
FROM products;
-- More efficient when per-category ranks are what you need: Smaller partitions reduce sorting overhead
SELECT
product_name,
category,
price,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY price) as category_rank -- Sorts within category
FROM products;
Frame Specification Impact
-- Expensive: UNBOUNDED PRECEDING with large datasets
SELECT
order_date,
total_amount,
SUM(total_amount) OVER (
ORDER BY order_date
ROWS UNBOUNDED PRECEDING -- Frame covers every row from the first up to the current one
) as running_total
FROM orders;
-- More efficient for recent data analysis: Limited window
SELECT
order_date,
total_amount,
SUM(total_amount) OVER (
ORDER BY order_date
-- A 30-day calendar window (MySQL 8.0+ syntax); ROWS 29 PRECEDING would span 30 orders, not 30 days
RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT ROW
) as rolling_30_day_total
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 90 DAY);
Learning to read execution plans is crucial for performance tuning. Here's what to look for:
Identifying Expensive Operations
-- Use EXPLAIN to see the execution plan
EXPLAIN SELECT c.first_name, c.last_name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.registration_date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, c.last_name;
In the execution plan, watch for these red flags:
- Table scans on large tables (should use indexes)
- Hash joins where you expected a few indexed lookups (nested loop joins can be better when an index makes the inner side selective)
- Sorting operations on large result sets (consider whether you really need ORDER BY, or whether an index can supply the order)
- Row count estimates that differ sharply from the actual row counts (statistics might be outdated)
Understanding Cost Estimates
The database optimizer makes decisions based on statistics about your data. If these statistics are wrong, the optimizer makes poor choices. Regular statistics updates are crucial:
-- Update table statistics (syntax varies by database)
ANALYZE TABLE customers; -- MySQL
ANALYZE customers; -- PostgreSQL
UPDATE STATISTICS customers; -- SQL Server
This is especially important after large data loads or significant changes to data distribution.
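If you want to see what a statistics refresh actually records, SQLite makes it easy to inspect. This sketch assumes Python's sqlite3 module as a stand-in engine; the sqlite_stat1 table is specific to SQLite, and other databases keep their statistics in their own catalogs. The helper stats_table_exists is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# Simulate a bulk load: 100 orders spread evenly across 10 customers.
conn.executemany(
    "INSERT INTO orders (customer_id) VALUES (?)",
    [(i % 10,) for i in range(100)],
)

def stats_table_exists(connection):
    """SQLite creates the sqlite_stat1 table the first time ANALYZE runs."""
    row = connection.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()
    return row[0] > 0

before = stats_table_exists(conn)
conn.execute("ANALYZE")
after = stats_table_exists(conn)

# The stat column begins with the number of rows the planner now assumes.
stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'idx_orders_customer'"
).fetchone()[0]

print("statistics before ANALYZE:", before)
print("statistics after ANALYZE: ", after)
print("recorded for the index:   ", stat)
```

Until ANALYZE runs, the planner has no row counts for the freshly loaded table and falls back on default guesses, which is why refreshing statistics after a bulk load matters.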
- Use parameterized queries to prevent SQL injection
- Implement proper access controls at the database level
- Audit sensitive operations like DELETE and UPDATE
- Take regular backups, especially before major data modifications
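The first point in the list above is the easiest to demonstrate. The sketch below uses Python's sqlite3 module (an assumed client; every mainstream driver offers an equivalent placeholder mechanism) with made-up sample emails. It contrasts a query built by string formatting with the same query using a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("ada@example.com",), ("grace@example.com",), ("alan@example.com",)],
)

# Input crafted so that, once pasted into the SQL text, it adds an always-true condition.
user_input = "nobody@example.com' OR '1'='1"

# Unsafe: the input becomes part of the SQL statement itself.
unsafe_sql = f"SELECT customer_id FROM customers WHERE email = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the driver sends the input as a value, never as SQL.
safe_rows = conn.execute(
    "SELECT customer_id FROM customers WHERE email = ?", (user_input,)
).fetchall()

print("string-built query returned", len(unsafe_rows), "rows")
print("parameterized query returned", len(safe_rows), "rows")
```

The string-built query returns every customer because the injected OR clause is always true. The parameterized query looks for an email equal to the whole odd-looking string and finds nothing.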