Complete SQL Query Revision

Foundation: Understanding Databases

Before we dive into writing queries, let's establish what we're working with. Think of a database as a digital filing cabinet, one that enforces structure on what it stores and can search, sort, and combine its contents on demand.

What is a Relational Database?

A relational database stores information in tables, much like spreadsheets, but with strict rules about how data relates to each other. Each table represents a specific type of entity (like customers, orders, or products), and tables can reference each other through relationships.

Key Concepts:

Table: A collection of related data organized in rows and columns
Row (Record): A single entry in a table representing one instance of the entity
Column (Field): A specific attribute or property of the entity
Primary Key: A unique identifier for each row in a table
Foreign Key: A reference to a primary key in another table, creating relationships

Sample Database Schema

Throughout this course, we'll use a fictional e-commerce database with these tables:

-- Customers table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    registration_date DATE,
    city VARCHAR(50),
    country VARCHAR(50)
);

-- Products table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10,2),
    stock_quantity INT,
    supplier_id INT
);

-- Orders table
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,  -- Foreign key to customers
    order_date DATE,
    total_amount DECIMAL(10,2),
    status VARCHAR(20)
);

-- Order_items table (junction table for many-to-many relationship)
CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,
    order_id INT,     -- Foreign key to orders
    product_id INT,   -- Foreign key to products
    quantity INT,
    unit_price DECIMAL(10,2)
);

Think of this structure like a real business: customers place orders, orders contain multiple products, and each product has its own details. The relationships between these tables mirror real-world connections.
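To see a foreign key doing its job, here is a minimal sketch that uses Python's built-in sqlite3 module and an in-memory database. It keeps only two trimmed-down tables from the schema, the sample rows are invented for illustration, and it assumes SQLite, which enforces foreign keys only after PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(50)
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customers(customer_id),
    total_amount DECIMAL(10,2)
);
INSERT INTO customers VALUES (1, 'Ada');    -- illustrative row
INSERT INTO orders VALUES (100, 1, 25.00);  -- valid: customer 1 exists
""")

# An order that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.00)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(fk_rejected, order_count)
```

The rejected insert leaves the orders table unchanged, which is the guarantee a foreign key is meant to provide.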

Basic SELECT Queries

The SELECT statement is your primary tool for retrieving information from a database. Think of it as asking the database a question - you specify what information you want and from which table.

The Anatomy of a SELECT Statement

SELECT column1, column2, column3    -- What data do you want?
FROM table_name;                    -- Which table contains this data?

Your First Query

Let's start with the most basic query - retrieving all data from a table:

-- Get all information about all customers
SELECT * FROM customers;

The asterisk (*) is a wildcard that means "give me all columns." While convenient for exploration, it's generally better to specify exactly which columns you need.

Selecting Specific Columns

-- Get just the names and email addresses of customers
SELECT first_name, last_name, email
FROM customers;

This approach has several advantages: it's faster (less data transferred), clearer about your intentions, and more maintainable if the table structure changes.

Column Aliases - Making Output Readable

You can rename columns in your output using aliases, which is especially useful for calculated fields or when column names aren't user-friendly:

-- Create more readable column headers
SELECT
    first_name AS "First Name",
    last_name AS "Last Name",
    email AS "Email Address"
FROM customers;

Simple Calculations

SQL can perform calculations on numeric data:

-- Calculate total value of each product in stock
SELECT
    product_name,
    price,
    stock_quantity,
    price * stock_quantity AS total_inventory_value
FROM products;

Notice how we created a new calculated column. The database multiplies the price by stock quantity for each row and displays the result under our chosen alias.

Filtering and Conditions

Real-world scenarios rarely require all data from a table. The WHERE clause allows you to specify conditions that rows must meet to be included in your results.

Basic WHERE Conditions

-- Find customers from a specific city
SELECT first_name, last_name, city
FROM customers
WHERE city = 'New York';

Comparison Operators

SQL provides various operators for different types of comparisons:

-- Products with price greater than $50
SELECT product_name, price
FROM products
WHERE price > 50.00;

-- Orders from 2024 or later
SELECT order_id, order_date, total_amount
FROM orders
WHERE order_date >= '2024-01-01';

-- Products that are NOT in the Electronics category
SELECT product_name, category
FROM products
WHERE category != 'Electronics';  -- or use <> instead of !=

Working with Text Data

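Whether text comparisons are case-sensitive depends on the database and its collation: PostgreSQL and Oracle compare case-sensitively by default, while MySQL and SQL Server default to case-insensitive collations. You also have several options for pattern matching:

-- Exact match (case sensitivity depends on the database's collation)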
SELECT * FROM customers
WHERE last_name = 'Smith';

-- Pattern matching with LIKE
SELECT * FROM customers
WHERE last_name LIKE 'Sm%';  -- Names starting with 'Sm'

-- Case-insensitive search (using UPPER or LOWER)
SELECT * FROM customers
WHERE UPPER(last_name) = 'SMITH';

LIKE Pattern Wildcards:

% matches any sequence of characters (including zero characters)
_ matches exactly one character

-- Names with exactly 5 characters
SELECT * FROM customers WHERE first_name LIKE '_____';

-- Email addresses from Gmail
SELECT * FROM customers WHERE email LIKE '%@gmail.com';

-- Products with 'phone' anywhere in the name
SELECT * FROM products WHERE LOWER(product_name) LIKE '%phone%';

Combining Conditions

Real queries often need multiple conditions. SQL provides logical operators to combine them:

-- AND: Both conditions must be true
SELECT product_name, price, category
FROM products
WHERE price > 100 AND category = 'Electronics';

-- OR: Either condition can be true
SELECT customer_id, first_name, last_name
FROM customers
WHERE city = 'New York' OR city = 'Los Angeles';

-- Complex combinations using parentheses
SELECT product_name, price, category
FROM products
WHERE (price > 100 AND category = 'Electronics')
   OR (price > 200 AND category = 'Clothing');

Parentheses work here as they do in arithmetic: they control the order of evaluation. Without them, AND is evaluated before OR, so adding parentheses makes your intentions explicit and guards against surprises.
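The effect of parentheses is easiest to see by running the same conditions with and without them. The sketch below uses Python's built-in sqlite3 module with a handful of invented product rows; only the grouping of the conditions differs between the two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Laptop", "Electronics", 900.0),  # illustrative rows
        ("Cable", "Electronics", 10.0),
        ("Coat", "Clothing", 150.0),
        ("Socks", "Clothing", 5.0),
    ],
)

# Without parentheses AND binds tighter than OR, so this reads as:
# category = 'Electronics' OR (category = 'Clothing' AND price > 100)
without_parens = [row[0] for row in conn.execute(
    "SELECT product_name FROM products "
    "WHERE category = 'Electronics' OR category = 'Clothing' AND price > 100 "
    "ORDER BY product_name"
)]

# With parentheses the price filter applies to both categories.
with_parens = [row[0] for row in conn.execute(
    "SELECT product_name FROM products "
    "WHERE (category = 'Electronics' OR category = 'Clothing') AND price > 100 "
    "ORDER BY product_name"
)]

print(without_parens)
print(with_parens)
```

The first query lets the cheap cable through because the price filter only applies to clothing; the second filters both categories by price.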

Working with NULL Values

NULL represents missing or unknown data, and it requires special handling:

-- Find customers without a city listed
SELECT first_name, last_name
FROM customers
WHERE city IS NULL;

-- Find customers WITH a city listed
SELECT first_name, last_name, city
FROM customers
WHERE city IS NOT NULL;

Important: Comparing anything to NULL with = or != yields UNKNOWN rather than TRUE, so a condition such as city = NULL never matches any row. NULL checks always require IS NULL or IS NOT NULL.
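A short experiment makes the NULL rules concrete. This sketch uses Python's built-in sqlite3 module and three invented customers, two of them without a city, and counts how many rows each condition matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "London"), ("Grace", None), ("Linus", None)],  # None is stored as NULL
)

def count_where(condition):
    """Count the customers matching a WHERE condition (illustrative helper)."""
    return conn.execute(f"SELECT COUNT(*) FROM customers WHERE {condition}").fetchone()[0]

equals_null = count_where("city = NULL")         # comparison is UNKNOWN for every row
is_null = count_where("city IS NULL")            # the correct NULL test
not_london = count_where("city != 'London'")     # NULL cities are not counted here either

print(equals_null, is_null, not_london)
```

Note the third result: rows with a NULL city are excluded even from a "not equal" filter, so add OR city IS NULL when you want them included.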

The IN Operator

When you need to check if a value matches any item in a list, IN is more concise than multiple OR conditions:

-- Traditional approach with OR
SELECT * FROM products
WHERE category = 'Electronics' OR category = 'Clothing' OR category = 'Books';

-- More elegant approach with IN
SELECT * FROM products
WHERE category IN ('Electronics', 'Clothing', 'Books');

-- NOT IN for exclusion
SELECT * FROM products
WHERE category NOT IN ('Electronics', 'Clothing');

Range Queries with BETWEEN

For checking if a value falls within a range:

-- Products priced between $20 and $100
SELECT product_name, price
FROM products
WHERE price BETWEEN 20.00 AND 100.00;

-- Orders from the first quarter of 2024
SELECT order_id, order_date, total_amount
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';

BETWEEN is inclusive on both ends, meaning the boundary values are included in the results.

Sorting and Limiting Results

Once you've filtered your data, you'll often want to control how it's presented and how much of it you see.

Sorting with ORDER BY

-- Sort customers alphabetically by last name
SELECT first_name, last_name, email
FROM customers
ORDER BY last_name;

-- Sort products by price, highest first
SELECT product_name, price
FROM products
ORDER BY price DESC;  -- DESC for descending, ASC for ascending (default)

Multi-Level Sorting

You can sort by multiple columns, with each subsequent column serving as a "tie-breaker":

-- Sort by category first, then by price within each category
SELECT product_name, category, price
FROM products
ORDER BY category ASC, price DESC;

This query sorts products alphabetically by category and, within each category, lists the most expensive products first. Despite the wording, no GROUP BY is involved: every product still appears as its own row.

Limiting Results

When dealing with large datasets, you often want just a subset of results:

-- Get the 5 most expensive products
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 5;

-- Get products 11-20 when sorted by price (pagination)
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 10 OFFSET 10;  -- Skip first 10, then take next 10

Note: Different database systems use different syntax for limiting results:

MySQL/PostgreSQL: LIMIT n or LIMIT n OFFSET m
SQL Server: TOP n, or OFFSET m ROWS FETCH NEXT n ROWS ONLY (requires an ORDER BY clause)
Oracle: ROWNUM <= n, or FETCH FIRST n ROWS ONLY (Oracle 12c and later)
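Pagination with LIMIT and OFFSET usually ends up wrapped in a small helper. Here is a minimal sketch using Python's built-in sqlite3 module; the fetch_page function and the 25 generated product rows are hypothetical, and the extra product_id sort key is an assumption added so that rows with equal prices cannot shuffle between pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(i, f"Product {i}", float(i)) for i in range(1, 26)],  # 25 rows priced 1.0 to 25.0
)

def fetch_page(page, page_size=10):
    """Return one page of products, most expensive first (pages start at 1)."""
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT product_name, price FROM products "
        "ORDER BY price DESC, product_id "  # tie-breaker keeps page order stable
        "LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()

page_2 = fetch_page(2)  # rows 11-20
page_3 = fetch_page(3)  # the last, partial page
print(page_2[0], page_2[-1], len(page_3))
```

Passing the page size and offset as parameters rather than formatting them into the SQL string keeps the query safe to reuse with user-supplied page numbers.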

Working with Multiple Tables

Real-world data is spread across multiple related tables. Joins allow you to combine data from different tables based on their relationships.

Understanding Relationships

Before diving into joins, let's understand how tables relate:

A customer can have many orders (one-to-many)
An order can contain many products, and a product can be in many orders (many-to-many, handled through the order_items table)

INNER JOIN - Finding Matching Records

An INNER JOIN returns only rows where matching records exist in both tables:

-- Get customer information along with their orders
SELECT
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date;

Key Points:

Table aliases (c for customers, o for orders) make queries more readable
The ON clause specifies how tables are related
Only customers who have placed orders will appear in results

LEFT JOIN - Including All Records from the First Table

Sometimes you want all records from one table, even if they don't have matches in the other:

-- Get all customers, including those who haven't placed orders
SELECT
    c.first_name,
    c.last_name,
    c.email,
    o.order_id,
    o.order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name;

Customers without orders will show NULL values for the order columns. This is useful for finding inactive customers or analyzing customer engagement.
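To list only the customers who have never ordered, filter the LEFT JOIN result on a NULL order column. The sketch below runs that pattern with Python's built-in sqlite3 module against a few invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');  -- illustrative rows
INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 3, 75.0);
""")

inactive = [row[0] for row in conn.execute("""
    SELECT c.first_name
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL      -- keep only customers with no matching order
    ORDER BY c.first_name
""")]

print(inactive)
```

The test uses o.order_id because a primary key is never NULL in a real order row, so a NULL there can only come from the missing match.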

RIGHT JOIN and FULL OUTER JOIN

-- RIGHT JOIN: All orders, even if customer data is missing (rare in practice)
SELECT
    c.first_name,
    c.last_name,
    o.order_id,
    o.total_amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;

-- FULL OUTER JOIN: All customers and all orders (not supported in all databases)
SELECT
    c.first_name,
    c.last_name,
    o.order_id,
    o.total_amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;

Joining Multiple Tables

Real queries often involve three or more tables:

-- Get detailed order information: customer, order, and product details
SELECT
    c.first_name,
    c.last_name,
    o.order_date,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
ORDER BY c.last_name, o.order_date;

This query tells a complete story: who bought what, when, and for how much.

Self-Joins

Sometimes a table needs to be joined with itself. This is common with hierarchical data:

-- If we had an employees table with manager relationships
SELECT
    e.first_name AS employee_name,
    m.first_name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

Grouping and Aggregation

Aggregation allows you to summarize data - counting records, calculating totals, finding averages, and more.

Basic Aggregate Functions

-- Count total number of customers
SELECT COUNT(*) AS total_customers
FROM customers;

-- Count customers with email addresses (excludes NULLs)
SELECT COUNT(email) AS customers_with_email
FROM customers;

-- Basic statistics on product prices
SELECT
    COUNT(*) AS total_products,
    MIN(price) AS cheapest_price,
    MAX(price) AS most_expensive_price,
    AVG(price) AS average_price,
    SUM(stock_quantity) AS total_inventory_units
FROM products;

GROUP BY - Summarizing by Categories

GROUP BY allows you to create summaries for each group of related records:

-- Count customers by city
SELECT
    city,
    COUNT(*) AS customer_count
FROM customers
WHERE city IS NOT NULL
GROUP BY city
ORDER BY customer_count DESC;

-- Total sales by product category
SELECT
    p.category,
    SUM(oi.quantity) AS items_sold,  -- units sold; COUNT(oi.order_item_id) would count order lines instead
    SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;

Think of GROUP BY this way: Imagine sorting all your data into separate piles based on the grouping column(s), then calculating statistics for each pile.

HAVING - Filtering Groups

WHERE filters individual rows before grouping, but HAVING filters groups after aggregation:

-- Find cities with more than 5 customers
SELECT
    city,
    COUNT(*) AS customer_count
FROM customers
WHERE city IS NOT NULL
GROUP BY city
HAVING COUNT(*) > 5
ORDER BY customer_count DESC;

-- Product categories with average price over $50
SELECT
    category,
    COUNT(*) AS product_count,
    AVG(price) AS average_price
FROM products
GROUP BY category
HAVING AVG(price) > 50.00;
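The split between WHERE and HAVING can be checked directly. This sketch uses Python's built-in sqlite3 module with invented customers; it runs the city count with both clauses and then shows that SQLite refuses an aggregate inside WHERE (other databases report a similar error with different wording).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "New York"), ("Grace", "New York"), ("Linus", "New York"),
     ("Ken", "Boston"), ("Dennis", None)],  # illustrative rows
)

# WHERE removes rows before grouping; HAVING removes groups after counting.
busy_cities = conn.execute("""
    SELECT city, COUNT(*) AS customer_count
    FROM customers
    WHERE city IS NOT NULL
    GROUP BY city
    HAVING COUNT(*) > 1
    ORDER BY city
""").fetchall()

# An aggregate in WHERE is rejected, because WHERE runs before any groups exist.
try:
    conn.execute("SELECT city FROM customers WHERE COUNT(*) > 1 GROUP BY city")
    aggregate_in_where_allowed = True
except sqlite3.OperationalError:
    aggregate_in_where_allowed = False

print(busy_cities, aggregate_in_where_allowed)
```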

Complex Grouping Examples

-- Monthly sales summary for 2024
-- (YEAR and MONTH are MySQL/SQL Server functions; PostgreSQL uses EXTRACT(YEAR FROM ...))
SELECT
    YEAR(o.order_date) AS year,
    MONTH(o.order_date) AS month,
    COUNT(DISTINCT o.order_id) AS total_orders,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    SUM(o.total_amount) AS monthly_revenue
FROM orders o
WHERE o.order_date >= '2024-01-01' AND o.order_date < '2025-01-01'
GROUP BY YEAR(o.order_date), MONTH(o.order_date)
ORDER BY year, month;

-- Customer purchase behavior analysis
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS total_spent,
    AVG(o.total_amount) AS average_order_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY total_spent DESC;

Advanced Filtering with Subqueries

Subqueries are queries within queries, allowing you to use the results of one query as input to another. They're powerful tools for complex filtering and data analysis.

Simple Subqueries

-- Find products that cost more than the average price
SELECT product_name, price
FROM products
WHERE price > (
    SELECT AVG(price)
    FROM products
);

-- Find customers who have placed orders
SELECT first_name, last_name
FROM customers
WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
);

Conceptually, the subquery in parentheses is evaluated first and its result is used by the outer query, although the optimizer is free to rewrite it (for example, as a join).

Correlated Subqueries

In correlated subqueries, the inner query references columns from the outer query:

-- Find customers who placed at least one order in the last 30 days
-- (MySQL interval syntax; PostgreSQL writes INTERVAL '30 days')
SELECT c.first_name, c.last_name, c.email
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_date >= CURRENT_DATE - INTERVAL 30 DAY
);

-- Find products that have never been ordered
SELECT product_name, price
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
);

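EXISTS vs IN: Most modern optimizers produce similar plans for EXISTS and IN, so prefer whichever reads more clearly. The difference that matters is with negation: NOT IN returns no rows at all if the subquery yields even one NULL, whereas NOT EXISTS is unaffected by NULLs.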
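The NULL handling difference shows up most clearly with negation. In this sketch, built on Python's sqlite3 module with invented rows, one order item has a NULL product_id; the same "never ordered" question is then asked with NOT IN and with NOT EXISTS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE order_items (order_item_id INTEGER PRIMARY KEY, product_id INTEGER);
INSERT INTO products VALUES (1, 'Laptop'), (2, 'Cable'), (3, 'Coat');  -- illustrative rows
INSERT INTO order_items VALUES (1, 1), (2, NULL);                     -- one NULL product_id
""")

# NOT IN: the NULL makes every comparison UNKNOWN, so no product qualifies.
never_ordered_not_in = [row[0] for row in conn.execute("""
    SELECT product_name FROM products
    WHERE product_id NOT IN (SELECT product_id FROM order_items)
    ORDER BY product_name
""")]

# NOT EXISTS: only real matches count, so the NULL row is harmless.
never_ordered_not_exists = [row[0] for row in conn.execute("""
    SELECT p.product_name FROM products p
    WHERE NOT EXISTS (
        SELECT 1 FROM order_items oi WHERE oi.product_id = p.product_id
    )
    ORDER BY p.product_name
""")]

print(never_ordered_not_in)
print(never_ordered_not_exists)
```

If you must use NOT IN, adding WHERE product_id IS NOT NULL inside the subquery restores the expected result.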

Subqueries in SELECT Clauses

You can use subqueries to add calculated columns:

-- Show each customer with their total number of orders
SELECT
    c.first_name,
    c.last_name,
    (SELECT COUNT(*)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS total_orders,
    (SELECT MAX(order_date)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS last_order_date
FROM customers c
ORDER BY total_orders DESC;

Common Table Expressions (CTEs)

CTEs provide a cleaner way to write complex queries with subqueries:

-- Find customers who spent more than the average customer
WITH customer_totals AS (
    SELECT
        c.customer_id,
        c.first_name,
        c.last_name,
        SUM(o.total_amount) AS total_spent
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name
),
overall_average AS (
    SELECT AVG(total_spent) AS avg_spent
    FROM customer_totals
)
SELECT
    ct.first_name,
    ct.last_name,
    ct.total_spent,
    oa.avg_spent
FROM customer_totals ct
CROSS JOIN overall_average oa
WHERE ct.total_spent > oa.avg_spent
ORDER BY ct.total_spent DESC;

CTEs make complex logic more readable by breaking it into named, reusable pieces.

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing the results into groups like aggregate functions do.

Basic Window Functions

-- Add row numbers to products ordered by price
SELECT
    product_name,
    price,
    ROW_NUMBER() OVER (ORDER BY price DESC) as price_rank
FROM products;

-- Show each order with running total of sales
SELECT
    order_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (ORDER BY order_date) as running_total
FROM orders
ORDER BY order_date;
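One subtlety of running totals is what happens when several rows share the same ORDER BY value. The sketch below assumes Python's sqlite3 module linked against SQLite 3.25 or later (the first version with window functions) and uses four invented orders, two of them on the same date. It compares the default window frame with an explicit ROWS frame plus a tie-breaking order_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, total_amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-01", 10), (2, "2024-01-02", 20),
     (3, "2024-01-02", 30), (4, "2024-01-03", 40)],  # orders 2 and 3 share a date
)

running = conn.execute("""
    SELECT
        order_id,
        -- Default frame: rows with the same order_date are summed together
        SUM(total_amount) OVER (ORDER BY order_date) AS default_frame_total,
        -- Explicit ROWS frame with a tie-breaker: a strictly row-by-row total
        SUM(total_amount) OVER (
            ORDER BY order_date, order_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS row_by_row_total
    FROM orders
    ORDER BY order_date, order_id
""").fetchall()

for row in running:
    print(row)
```

With the default frame, both orders from 2024-01-02 report the total that includes each other; the explicit frame accumulates one row at a time.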

Partitioning with OVER

The PARTITION BY clause creates separate "windows" for different groups:

-- Rank products by price within each category
SELECT
    product_name,
    category,
    price,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) as category_price_rank
FROM products;

-- Show each customer's orders with order sequence
SELECT
    c.first_name,
    c.last_name,
    o.order_date,
    o.total_amount,
    ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as order_sequence
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date;

Advanced Window Functions

-- Compare each order to the previous order for the same customer
SELECT
    c.first_name,
    c.last_name,
    o.order_date,
    o.total_amount,
    LAG(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as previous_order_amount,
    o.total_amount - LAG(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as change_from_previous
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date;

-- Find top 3 products in each category
SELECT *
FROM (
    SELECT
        product_name,
        category,
        price,
        DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) as price_rank
    FROM products
) ranked_products
WHERE price_rank <= 3
ORDER BY category, price_rank;

Key Window Functions:

ROW_NUMBER(): Assigns unique sequential numbers
RANK(): Assigns ranks with gaps for ties
DENSE_RANK(): Assigns ranks without gaps for ties
LAG()/LEAD(): Access previous/next row values
FIRST_VALUE()/LAST_VALUE(): Get first/last values in window
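The three ranking functions only differ when there are ties, so a tiny table with one tie shows the contrast. This is a sketch using Python's sqlite3 module, assuming SQLite 3.25 or later for window function support; the four products are invented, with B and C sharing a price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A", 30.0), ("B", 20.0), ("C", 20.0), ("D", 10.0)],  # B and C tie on price
)

ranked = conn.execute("""
    SELECT
        product_name,
        ROW_NUMBER() OVER (ORDER BY price DESC, product_name) AS row_num,
        RANK()       OVER (ORDER BY price DESC)               AS price_rank,
        DENSE_RANK() OVER (ORDER BY price DESC)               AS dense_price_rank
    FROM products
    ORDER BY price DESC, product_name
""").fetchall()

for row in ranked:
    print(row)
```

ROW_NUMBER gets an extra sort key here because, on ties, the numbering is otherwise arbitrary. RANK skips to 4 after the tie, while DENSE_RANK continues with 3.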

Data Modification

Beyond querying data, SQL allows you to insert, update, and delete records.

INSERT - Adding New Data

-- Insert a single customer
-- (assumes customer_id is generated automatically, e.g. AUTO_INCREMENT or an identity column)
INSERT INTO customers (first_name, last_name, email, registration_date, city, country)
VALUES ('John', 'Doe', 'john.doe@example.com', '2024-06-09', 'Chicago', 'USA');

-- Insert multiple customers at once
INSERT INTO customers (first_name, last_name, email, registration_date, city, country)
VALUES
    ('Jane', 'Smith', 'jane.smith@example.com', '2024-06-09', 'Miami', 'USA'),
    ('Bob', 'Johnson', 'bob.johnson@example.com', '2024-06-09', 'Seattle', 'USA');

-- Insert from a query (copying data)
INSERT INTO archived_orders (order_id, customer_id, order_date, total_amount)
SELECT order_id, customer_id, order_date, total_amount
FROM orders
WHERE order_date < '2023-01-01';

UPDATE - Modifying Existing Data

-- Update a single record
UPDATE customers
SET city = 'New Chicago', country = 'USA'
WHERE customer_id = 1;

-- Update multiple records with conditions
UPDATE products
SET price = price * 1.10  -- 10% price increase
WHERE category = 'Electronics';

-- Update using data from other tables
UPDATE customers c
SET city = 'Updated City'
WHERE c.customer_id IN (
    SELECT DISTINCT o.customer_id
    FROM orders o
    WHERE o.order_date >= '2024-01-01'
);

Warning: Always use WHERE clauses with UPDATE statements unless you intend to modify every row in the table.
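One way to protect yourself is to run the UPDATE inside a transaction and commit only if it touched the number of rows you expected. The sketch below does this with Python's built-in sqlite3 module; the guarded_update helper and the sample rows are hypothetical, and it relies on the module's default behavior of opening a transaction before an UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Electronics", 100.0), (2, "Electronics", 200.0), (3, "Books", 10.0)],
)
conn.commit()

def guarded_update(sql, params, expected_rows):
    """Run an UPDATE and keep it only if it changed the expected number of rows."""
    cursor = conn.execute(sql, params)
    if cursor.rowcount == expected_rows:
        conn.commit()
        return True
    conn.rollback()  # undo the change: it touched more or fewer rows than planned
    return False

# Intended change: double the price of the two Electronics products.
applied = guarded_update(
    "UPDATE products SET price = price * 2 WHERE category = ?", ("Electronics",), 2
)

# Mistake: the WHERE clause is missing, so all three rows would change.
applied_without_where = guarded_update("UPDATE products SET price = 0", (), 2)

prices = [row[0] for row in conn.execute("SELECT price FROM products ORDER BY product_id")]
print(applied, applied_without_where, prices)
```

The second statement is rolled back, so the Books price survives the missing WHERE clause.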

DELETE - Removing Data

-- Delete specific records
DELETE FROM customers
WHERE registration_date IS NULL;

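-- Delete based on related data (products that were never ordered)
-- NOT EXISTS is used rather than NOT IN, which would delete nothing
-- if any order_items.product_id were NULL
DELETE FROM products
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = products.product_id
);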

-- Delete all records (use with extreme caution)
DELETE FROM temp_table;

UPSERT Operations

Some databases support "upsert" operations (insert or update):

-- MySQL example: INSERT ... ON DUPLICATE KEY UPDATE
INSERT INTO products (product_id, product_name, price)
VALUES (1, 'Updated Product', 29.99)
ON DUPLICATE KEY UPDATE
    product_name = 'Updated Product',
    price = 29.99;

-- PostgreSQL example: INSERT ... ON CONFLICT
INSERT INTO products (product_id, product_name, price)
VALUES (1, 'Updated Product', 29.99)
ON CONFLICT (product_id)
DO UPDATE SET
    product_name = 'Updated Product',
    price = 29.99;
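SQLite also accepts the PostgreSQL-style ON CONFLICT clause, which makes it easy to try an upsert locally. The sketch below uses Python's sqlite3 module and assumes SQLite 3.24 or later; the excluded alias refers to the row that the INSERT tried to add, so the values do not have to be repeated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL)"
)

upsert_sql = """
INSERT INTO products (product_id, product_name, price)
VALUES (?, ?, ?)
ON CONFLICT (product_id) DO UPDATE SET
    product_name = excluded.product_name,  -- excluded = the row that failed to insert
    price = excluded.price
"""

conn.execute(upsert_sql, (1, "Widget", 19.99))           # no conflict: plain insert
conn.execute(upsert_sql, (1, "Updated Product", 29.99))  # conflict: row is updated

rows = conn.execute("SELECT product_id, product_name, price FROM products").fetchall()
print(rows)
```

The table ends up with a single row for product 1 carrying the second set of values.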

Advanced Techniques

CASE Statements - Conditional Logic

CASE expressions (commonly called CASE statements) let you implement conditional logic within queries:

-- Categorize customers by order frequency
SELECT
    c.first_name,
    c.last_name,
    COUNT(o.order_id) as order_count,
    CASE
        WHEN COUNT(o.order_id) >= 10 THEN 'High Value'
        WHEN COUNT(o.order_id) >= 5 THEN 'Medium Value'
        WHEN COUNT(o.order_id) >= 1 THEN 'Low Value'
        ELSE 'No Orders'
    END as customer_category
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC;

-- Create pivot-like reports
SELECT
    category,
    SUM(CASE WHEN price < 50 THEN 1 ELSE 0 END) as budget_products,
    SUM(CASE WHEN price BETWEEN 50 AND 100 THEN 1 ELSE 0 END) as mid_range_products,
    SUM(CASE WHEN price > 100 THEN 1 ELSE 0 END) as premium_products
FROM products
GROUP BY category;

Working with Dates and Times

-- Extract date parts (MySQL functions; other databases use EXTRACT or their own equivalents)
SELECT
    order_id,
    order_date,
    YEAR(order_date) as order_year,
    MONTH(order_date) as order_month,
    DAYNAME(order_date) as order_day_name,
    QUARTER(order_date) as order_quarter
FROM orders;

-- Date arithmetic (MySQL syntax)
SELECT
    customer_id,
    registration_date,
    DATEDIFF(CURRENT_DATE, registration_date) as days_since_registration,
    DATE_ADD(registration_date, INTERVAL 1 YEAR) as one_year_anniversary
FROM customers;

-- Time-based analysis (PostgreSQL syntax: DATE_TRUNC is not available in MySQL)
SELECT
    DATE_TRUNC('month', order_date) as month,
    COUNT(*) as orders_count,
    SUM(total_amount) as monthly_revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

String Functions

-- Text manipulation
SELECT
    first_name,
    last_name,
    CONCAT(first_name, ' ', last_name) as full_name,
    UPPER(email) as email_upper,
    LENGTH(email) as email_length,
    SUBSTRING(email, 1, POSITION('@' IN email) - 1) as username
FROM customers;

-- Pattern matching and replacement
SELECT
    product_name,
    REPLACE(product_name, 'iPhone', 'Phone') as generic_name,
    CASE
        WHEN product_name LIKE '%Pro%' THEN 'Professional'
        WHEN product_name LIKE '%Mini%' THEN 'Compact'
        ELSE 'Standard'
    END as product_tier
FROM products;

Set Operations

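-- These examples assume a suppliers table with city and country columns
-- UNION: Combine results from multiple queries, removing duplicate rows (UNION ALL keeps them)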
SELECT city FROM customers WHERE country = 'USA'
UNION
SELECT city FROM suppliers WHERE country = 'USA';

-- INTERSECT: Find common values (not supported in all databases)
SELECT city FROM customers
INTERSECT
SELECT city FROM suppliers;

-- EXCEPT (called MINUS in Oracle): Find values in first query but not second
SELECT city FROM customers
EXCEPT
SELECT city FROM suppliers;

Performance and Optimization

Understanding query performance is crucial for working with large datasets.

Query Execution Plans

Most databases provide tools to show how queries are executed:

-- Show execution plan (syntax varies by database)
EXPLAIN SELECT c.first_name, c.last_name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

Indexing Strategy

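Indexes speed up queries that filter, join, or sort on the indexed columns, but every INSERT, UPDATE, and DELETE must also maintain them, so they slow down modifications and take extra storage: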

-- Single-column index on a frequently filtered column
CREATE INDEX idx_customers_email ON customers(email);

-- Composite indexes for queries that filter or sort on several columns
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
CREATE INDEX idx_products_category_price ON products(category, price);
CREATE INDEX idx_order_items_lookup ON order_items(order_id, product_id);
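You can watch an index change a query plan by asking for the plan before and after creating it. This sketch uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the query_plan helper and the generated rows are hypothetical, and the exact plan wording varies between SQLite versions and differs entirely in other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)",
    [(i % 100, "2024-01-01") for i in range(1000)],  # 1000 illustrative rows
)

def query_plan(sql):
    """Return SQLite's plan for a statement as one string (detail text is the 4th column)."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT order_id FROM orders WHERE customer_id = 42"

plan_before = query_plan(query)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date)")
plan_after = query_plan(query)   # the new index can now satisfy the filter

print(plan_before)
print(plan_after)
```

The composite index helps here because customer_id is its leading column; a filter on order_date alone would not benefit in the same way.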

Query Optimization Tips

-- Use specific columns instead of SELECT *
SELECT customer_id, first_name, last_name  -- Good
FROM customers;

SELECT *  -- Avoid when possible
FROM customers;

-- Use LIMIT when you don't need all results
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 10;  -- Only get top 10

-- EXISTS and IN often compile to the same plan on modern optimizers;
-- compare both with EXPLAIN before assuming one is faster
SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS (  -- Can stop at the first matching order
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

-- Equivalent query using IN
SELECT c.first_name, c.last_name
FROM customers c
WHERE c.customer_id IN (  -- May be slower on older optimizers
    SELECT customer_id FROM orders
);

Common Performance Pitfalls

-- Avoid functions on columns in WHERE clauses
-- Slow:
SELECT * FROM orders WHERE YEAR(order_date) = 2024;

-- Better:
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

-- Be careful with wildcards at the beginning of LIKE patterns
-- Slow (a leading wildcard usually prevents index use):
SELECT * FROM customers WHERE last_name LIKE '%son';

-- Fast (a prefix pattern can use an index on last_name):
SELECT * FROM customers WHERE last_name LIKE 'John%';

Practice Exercises

Beginner Level

1. Basic Selection: Write a query to find all products in the 'Electronics' category with a price under $100.
2. Customer Analysis: Find all customers who registered in 2024, sorted by registration date.
3. Order Summary: Count the total number of orders and calculate the total revenue from all orders.

Intermediate Level

4. Join Practice: Create a report showing customer names, their order dates, and order totals for orders placed in the last 6 months.
5. Grouping Challenge: Find the top 5 best-selling products by total quantity sold, including the product name, total quantity, and total revenue generated.
6. Subquery Practice: Find customers who have spent more than the average customer spending amount.

Advanced Level

7. Window Functions: Create a report showing each customer's orders with a running total of their spending over time.
8. Complex Analysis: Find products that have been ordered in every month of 2024 (if any).
9. Performance Challenge: Write an optimized query to find the most popular product in each category for the current year.

Expert Level

10. Business Intelligence: Create a comprehensive customer segmentation analysis that categorizes customers as:

VIP: Top 10% by total spending
Regular: Next 40% by total spending
Occasional: Remaining customers with orders
Inactive: Customers with no orders

Solutions to Practice Exercises

Beginner Solutions

Exercise 1: Basic Selection

SELECT product_name, price, category
FROM products
WHERE category = 'Electronics' AND price < 100
ORDER BY price;

Exercise 2: Customer Analysis

SELECT first_name, last_name, email, registration_date
FROM customers
WHERE registration_date >= '2024-01-01' AND registration_date < '2025-01-01'
ORDER BY registration_date;

Exercise 3: Order Summary

SELECT
    COUNT(*) as total_orders,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as average_order_value
FROM orders;

Intermediate Solutions

Exercise 4: Join Practice

SELECT
    c.first_name,
    c.last_name,
    o.order_date,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 6 MONTH)  -- MySQL date syntax
ORDER BY c.last_name, o.order_date;

Exercise 5: Grouping Challenge

SELECT
    p.product_name,
    SUM(oi.quantity) as total_quantity_sold,
    SUM(oi.quantity * oi.unit_price) as total_revenue
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name
ORDER BY total_quantity_sold DESC
LIMIT 5;

Exercise 6: Subquery Practice

WITH customer_spending AS (
    SELECT
        c.customer_id,
        c.first_name,
        c.last_name,
        SUM(o.total_amount) as total_spent
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name
)
SELECT
    first_name,
    last_name,
    total_spent
FROM customer_spending
WHERE total_spent > (
    SELECT AVG(total_spent) FROM customer_spending
)
ORDER BY total_spent DESC;

Advanced Solutions

Exercise 7: Window Functions

SELECT
    c.first_name,
    c.last_name,
    o.order_date,
    o.total_amount,
    SUM(o.total_amount) OVER (
        PARTITION BY c.customer_id
        ORDER BY o.order_date
        ROWS UNBOUNDED PRECEDING
    ) as running_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_date;

Exercise 8: Complex Analysis

WITH monthly_products AS (
    SELECT DISTINCT
        p.product_id,
        p.product_name,
        MONTH(o.order_date) as order_month
    FROM products p
    INNER JOIN order_items oi ON p.product_id = oi.product_id
    INNER JOIN orders o ON oi.order_id = o.order_id
    WHERE o.order_date >= '2024-01-01' AND o.order_date < '2025-01-01'  -- range filter can use an index
),
product_month_counts AS (
    SELECT
        product_id,
        product_name,
        COUNT(DISTINCT order_month) as months_sold
    FROM monthly_products
    GROUP BY product_id, product_name
)
SELECT product_name
FROM product_month_counts
WHERE months_sold = 12;  -- All 12 months

Expert Solution

Exercise 10: Business Intelligence - Customer Segmentation

WITH customer_totals AS (
    SELECT
        c.customer_id,
        c.first_name,
        c.last_name,
        c.email,
        COALESCE(SUM(o.total_amount), 0) as total_spent,
        COUNT(o.order_id) as order_count
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name, c.email
),
spending_percentiles AS (
    SELECT
        *,
        PERCENT_RANK() OVER (ORDER BY total_spent DESC) as spending_percentile
    FROM customer_totals
    WHERE total_spent > 0
)
SELECT
    customer_id,
    first_name,
    last_name,
    email,
    total_spent,
    order_count,
    CASE
        WHEN total_spent = 0 THEN 'Inactive'
        WHEN spending_percentile <= 0.10 THEN 'VIP'
        WHEN spending_percentile <= 0.50 THEN 'Regular'
        ELSE 'Occasional'
    END as customer_segment,
    ROUND(spending_percentile * 100, 1) as spending_percentile_rank
FROM (
    SELECT
        ct.*,
        COALESCE(sp.spending_percentile, 1.0) as spending_percentile
    FROM customer_totals ct
    LEFT JOIN spending_percentiles sp ON ct.customer_id = sp.customer_id
) segmented_customers
ORDER BY total_spent DESC;

Real-World Application Scenarios

E-commerce Analytics

-- Daily sales dashboard (MySQL date functions)
WITH daily_metrics AS (
    SELECT
        DATE(order_date) as sale_date,
        COUNT(DISTINCT order_id) as orders,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(total_amount) as revenue,
        AVG(total_amount) as avg_order_value
    FROM orders
    WHERE order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
    GROUP BY DATE(order_date)
)
SELECT
    sale_date,
    orders,
    unique_customers,
    revenue,
    avg_order_value,
    revenue - LAG(revenue) OVER (ORDER BY sale_date) as revenue_change,
    ROUND(
        ((revenue - LAG(revenue) OVER (ORDER BY sale_date)) /
         LAG(revenue) OVER (ORDER BY sale_date)) * 100, 2
    ) as revenue_change_percent
FROM daily_metrics
ORDER BY sale_date DESC;

Customer Retention Analysis

-- Customer cohort analysis (simplified)
WITH customer_first_orders AS (
    SELECT
        customer_id,
        MIN(order_date) as first_order_date,
        DATE_FORMAT(MIN(order_date), '%Y-%m') as cohort_month
    FROM orders
    GROUP BY customer_id
),
monthly_activity AS (
    SELECT
        cfo.customer_id,
        cfo.cohort_month,
        DATE_FORMAT(o.order_date, '%Y-%m') as activity_month,
        TIMESTAMPDIFF(MONTH, cfo.first_order_date, o.order_date) as period_number
    FROM customer_first_orders cfo
    INNER JOIN orders o ON cfo.customer_id = o.customer_id
)
SELECT
    cohort_month,
    period_number,
    COUNT(DISTINCT customer_id) as customers
FROM monthly_activity
WHERE period_number <= 12  -- First 12 months
GROUP BY cohort_month, period_number
ORDER BY cohort_month, period_number;

Inventory Management

-- Products needing restock alert
WITH product_velocity AS (
    SELECT
        p.product_id,
        p.product_name,
        p.stock_quantity,
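        -- Count only items whose order matched the 30-day join condition below;
        -- a plain SUM(oi.quantity) would include sales from every date
        COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) as units_sold_30_days,
        COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) / 30.0 as avg_daily_sales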
    FROM products p
    LEFT JOIN order_items oi ON p.product_id = oi.product_id
    LEFT JOIN orders o ON oi.order_id = o.order_id
        AND o.order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
    GROUP BY p.product_id, p.product_name, p.stock_quantity
)
SELECT
    product_name,
    stock_quantity,
    units_sold_30_days,
    ROUND(avg_daily_sales, 2) as avg_daily_sales,
    CASE
        WHEN avg_daily_sales > 0
        THEN ROUND(stock_quantity / avg_daily_sales, 0)
        ELSE 999
    END as days_of_inventory,
    CASE
        WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 7 THEN 'URGENT'
        WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 14 THEN 'LOW'
        WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 30 THEN 'NORMAL'
        ELSE 'HIGH'
    END as inventory_status
FROM product_velocity
WHERE avg_daily_sales > 0
ORDER BY days_of_inventory ASC;

Database-Specific Considerations

MySQL Specifics

-- MySQL date functions
SELECT
    order_date,
    DATE_FORMAT(order_date, '%Y-%m') as order_year_month,  -- YEAR_MONTH is a reserved word in MySQL
    WEEKDAY(order_date) as day_of_week,                    -- 0 = Monday ... 6 = Sunday
    STR_TO_DATE('2024-06-09', '%Y-%m-%d') as parsed_date
FROM orders;

-- MySQL string functions
SELECT
    CONCAT(first_name, ' ', last_name) as full_name,
    CHAR_LENGTH(email) as email_length,
    SUBSTRING_INDEX(email, '@', 1) as username
FROM customers;

PostgreSQL Specifics

-- PostgreSQL date functions
SELECT
    order_date,
    EXTRACT(YEAR FROM order_date) as year,
    DATE_TRUNC('month', order_date) as month_start,
    order_date + INTERVAL '30 days' as future_date
FROM orders;

-- PostgreSQL arrays and JSON (if supported)
SELECT
    customer_id,
    ARRAY_AGG(product_name) as purchased_products,
    JSON_AGG(
        JSON_BUILD_OBJECT(
            'product', product_name,
            'quantity', quantity
        )
    ) as order_details
FROM customers c
JOIN orders o USING (customer_id)
JOIN order_items oi USING (order_id)
JOIN products p USING (product_id)
GROUP BY customer_id;

SQL Server Specifics

-- SQL Server TOP and OFFSET/FETCH
SELECT TOP 10 * FROM products ORDER BY price DESC;

SELECT * FROM products
ORDER BY price DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;

-- SQL Server date functions
SELECT
    GETDATE() as current_datetime,
    DATEPART(YEAR, order_date) as year,
    DATEDIFF(DAY, order_date, GETDATE()) as days_ago
FROM orders;

Best Practices Summary

Writing Maintainable SQL

Use clear, descriptive aliases

-- Good
SELECT c.first_name, c.last_name, o.order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;

-- Avoid
SELECT a.first_name, a.last_name, b.order_date
FROM customers a, orders b
WHERE a.customer_id = b.customer_id;

Format queries for readability

SELECT
    c.first_name,
    c.last_name,
    COUNT(o.order_id) as total_orders,
    SUM(o.total_amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.registration_date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING COUNT(o.order_id) > 0
ORDER BY total_spent DESC
LIMIT 10;

Comment complex logic

-- Calculate customer lifetime value with 30-day recency weighting
SELECT
    customer_id,
    total_spent *
    CASE
        WHEN days_since_last_order <= 30 THEN 1.0
        WHEN days_since_last_order <= 90 THEN 0.8
        ELSE 0.5
    END as weighted_clv
FROM customer_metrics;

Advanced Performance Guidelines

Understanding query performance requires thinking like the database engine. Every query goes through multiple phases: parsing, optimization, execution planning, and finally execution. Let's explore how to write queries that work with the optimizer rather than against it.

Index Strategy: Beyond the Basics

The most impactful performance optimization is proper indexing, but it's not just about "adding indexes to queried columns." The order of columns in composite indexes matters enormously, and understanding this can transform your query performance.

Composite Index Column Ordering

-- If you frequently query: WHERE customer_id = ? AND order_date BETWEEN ? AND ?
-- Create index in this specific order:
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- NOT: CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id);
-- The first version allows the database to quickly find all orders for a customer,
-- then scan just that subset for the date range
-- The second version would need to scan all orders in date range, then filter by customer

The rule here is to put columns compared with equality first, followed by the column used in a range condition; among several equality columns, leading with the one that eliminates the most rows is a sensible default. Think of an index like a phone book - you can quickly find all "Smiths" and then scan through them for "John Smith," but you can't efficiently find all "Johns" without reading the entire book.
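
One way to check the column-order claim yourself is to ask an engine for its plan. The sketch below uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in; the trimmed orders table and the helper name plan_for are assumptions made for illustration, and other engines word their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        total_amount REAL
    )
""")

QUERY = """
    SELECT * FROM orders
    WHERE customer_id = 42
      AND order_date BETWEEN '2024-01-01' AND '2024-03-31'
"""

def plan_for(index_sql):
    """Create one index, capture SQLite's plan for QUERY, then drop the index."""
    conn.execute(index_sql)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    index_name = index_sql.split()[2]  # CREATE INDEX <name> ON ...
    conn.execute("DROP INDEX " + index_name)
    # The last column of each plan row is the human-readable detail text.
    return " | ".join(row[-1] for row in rows)

customer_first = plan_for(
    "CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date)")
date_first = plan_for(
    "CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id)")

print("customer_id first:", customer_first)
print("order_date first: ", date_first)
```

With customer_id leading, the plan text should list both the equality on customer_id and the range on order_date as index constraints. With order_date leading, only the date range can be used to seek, and the customer filter is applied to whatever rows that range returns.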

Covering Indexes: Eliminating Table Lookups

-- Instead of just indexing the WHERE clause columns:
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Include columns needed in SELECT to avoid going back to the table:
CREATE INDEX idx_orders_customer_covering ON orders(customer_id, order_date, total_amount);

-- Now this query can be satisfied entirely from the index:
SELECT order_date, total_amount 
FROM orders 
WHERE customer_id = 12345;

This technique, called a "covering index," means the database never needs to access the actual table data after finding the index entries. It's particularly powerful for frequently-run reporting queries.
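
To see the difference a covering index makes, you can compare plans before and after adding the extra columns. This is a minimal sketch using Python's sqlite3 module and an in-memory SQLite database; the extra status column and the explain helper are assumptions for illustration. SQLite happens to label the covering case explicitly, while other engines signal it differently (for example "Using index" in MySQL or "Index Only Scan" in PostgreSQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        total_amount REAL,
        status TEXT
    )
""")

QUERY = "SELECT order_date, total_amount FROM orders WHERE customer_id = 12345"

def explain(sql):
    """Return SQLite's plan detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Index on the WHERE column only: matching rows still require a table lookup.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
narrow_plan = explain(QUERY)
conn.execute("DROP INDEX idx_orders_customer")

# Index that also carries the selected columns: no table lookup needed.
conn.execute(
    "CREATE INDEX idx_orders_customer_covering "
    "ON orders(customer_id, order_date, total_amount)")
covering_plan = explain(QUERY)

print("narrow index:  ", narrow_plan)
print("covering index:", covering_plan)
```

The trade-off is that wider indexes take more space and add work to every INSERT and UPDATE, so they tend to pay off mainly for queries that run often.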

Query Rewriting for Performance

Sometimes the same logical query can be written in dramatically different ways with vastly different performance characteristics.

Transforming Correlated Subqueries

-- Slow: Correlated subquery that runs once per customer
SELECT c.first_name, c.last_name
FROM customers c
WHERE (
    SELECT COUNT(*) 
    FROM orders o 
    WHERE o.customer_id = c.customer_id 
    AND o.order_date >= '2024-01-01'
) > 5;

-- Fast: Join with aggregation (runs aggregation once, then joins)
SELECT c.first_name, c.last_name
FROM customers c
INNER JOIN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
    HAVING COUNT(*) > 5
) frequent_customers ON c.customer_id = frequent_customers.customer_id;

Executed naively, the first query runs the subquery once for each customer. The second query does the aggregation work once and then performs a simple join. With 10,000 customers, this could be the difference between 10,000 aggregations versus one. Some optimizers rewrite the correlated form into a join on their own, so compare the execution plans before assuming a difference.
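
Before swapping one form for the other, it is worth confirming that both return the same rows. The sketch below does that on a tiny, made-up dataset using Python's sqlite3 module; the three customers, their orders, and the lowered threshold of 2 (instead of 5) are assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
""")
# Ana: 3 orders in 2024. Ben: 1 in 2024 and 2 in 2023. Cy: no orders.
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)",
    [(1, "2024-01-05"), (1, "2024-02-10"), (1, "2024-03-15"),
     (2, "2024-01-20"), (2, "2023-11-01"), (2, "2023-12-24")])

# Correlated form: the COUNT subquery refers to the outer customer row.
correlated = conn.execute("""
    SELECT c.first_name FROM customers c
    WHERE (SELECT COUNT(*) FROM orders o
           WHERE o.customer_id = c.customer_id
             AND o.order_date >= '2024-01-01') > 2
    ORDER BY c.first_name
""").fetchall()

# Join form: aggregate once, then join the qualifying customer_ids.
joined = conn.execute("""
    SELECT c.first_name FROM customers c
    INNER JOIN (
        SELECT customer_id FROM orders
        WHERE order_date >= '2024-01-01'
        GROUP BY customer_id
        HAVING COUNT(*) > 2
    ) frequent_customers ON c.customer_id = frequent_customers.customer_id
    ORDER BY c.first_name
""").fetchall()

print(correlated)  # [('Ana',)]
print(joined)      # [('Ana',)]
```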

EXISTS vs IN: When It Really Matters

-- EXISTS only asks whether a matching row exists, so NULLs in orders.customer_id cannot affect it:
SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

-- IN returns the same rows here, but its negation is a trap:
-- NOT IN returns no rows at all if the subquery yields even one NULL
SELECT c.first_name, c.last_name
FROM customers c
WHERE c.customer_id IN (
    SELECT customer_id FROM orders  -- A NULL customer_id is harmless for IN, fatal for NOT IN
);

The EXISTS version lets the database stop searching as soon as it finds one matching order, while IN may build the entire set of customer IDs first. Many optimizers execute both forms as the same semi-join, so the larger practical difference is correctness: prefer NOT EXISTS over NOT IN whenever the subquery column is nullable.
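
The NULL behaviour is easiest to believe once you have seen it. The following sketch uses Python's sqlite3 module with made-up rows (three customers, and one order whose customer_id is NULL); the data and the ids helper are assumptions for illustration, but the three-valued logic shown is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    -- Hypothetical data: one order has lost its customer reference.
    INSERT INTO orders VALUES (100, 1), (101, NULL);
""")

def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

with_in = ids("""
    SELECT customer_id FROM customers
    WHERE customer_id IN (SELECT customer_id FROM orders)
    ORDER BY customer_id""")

with_not_in = ids("""
    SELECT customer_id FROM customers
    WHERE customer_id NOT IN (SELECT customer_id FROM orders)
    ORDER BY customer_id""")

with_not_exists = ids("""
    SELECT c.customer_id FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.customer_id""")

print(with_in)          # [1]
print(with_not_in)      # []
print(with_not_exists)  # [2, 3]
```

NOT IN comes back empty because `2 NOT IN (1, NULL)` evaluates to NULL rather than TRUE, and WHERE keeps only rows where the condition is TRUE.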

Join Optimization: Order and Type Matter

The order in which tables are joined can significantly impact performance, especially with complex multi-table queries. Whether the order you write is the order that runs depends on the optimizer.

Join Order Strategy

-- Less efficient: Starting with the largest table
-- (categories and suppliers are illustrative tables, not part of the sample schema)
SELECT p.product_name, c.category_name, s.supplier_name
FROM products p  -- 1 million rows
INNER JOIN categories c ON p.category_id = c.category_id  -- 50 rows
INNER JOIN suppliers s ON p.supplier_id = s.supplier_id;  -- 1000 rows

-- More efficient: Start with smaller, more selective tables
SELECT p.product_name, c.category_name, s.supplier_name
FROM categories c  -- 50 rows - start here
INNER JOIN products p ON c.category_id = p.category_id
INNER JOIN suppliers s ON p.supplier_id = s.supplier_id
WHERE c.category_name = 'Electronics';  -- Very selective condition

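Note that the second query also adds a WHERE filter, so it returns fewer rows than the first; much of its advantage comes from that filter rather than from the FROM order. Modern optimizers often reorder inner joins automatically, but understanding this principle helps you write queries that work with the optimizer rather than forcing it to work harder.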

Choosing Between JOIN Types Based on Data Distribution

-- When you need all customers regardless of orders, but want order info where available:
-- LEFT JOIN is appropriate and efficient
SELECT c.first_name, c.last_name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

-- But if you know most customers have orders, this might be faster:
-- Get customers with orders, then UNION customers without orders
SELECT c.first_name, c.last_name, COUNT(o.order_id) as order_count
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
UNION ALL
SELECT c.first_name, c.last_name, 0 as order_count
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

Avoiding Performance Killers

Function Calls in WHERE Clauses

-- Performance killer: Function prevents index usage
SELECT * FROM orders 
WHERE YEAR(order_date) = 2024;  -- Index on order_date cannot be used

-- Index-friendly version:
SELECT * FROM orders 
WHERE order_date >= '2024-01-01' 
AND order_date < '2025-01-01';  -- Index on order_date can be used efficiently

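When you wrap a column in a function, the database generally can't seek into an ordinary index on that column: the index stores the raw column values, not the function results, so the function has to be evaluated for every row. Some engines offer expression (function-based) indexes as an alternative when the query can't be rewritten as a range.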
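
The effect shows up directly in an execution plan. This sketch uses Python's sqlite3 module with an in-memory SQLite database; because SQLite has no YEAR() function, strftime('%Y', ...) stands in for it, and the index name and explain helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        total_amount REAL
    );
    CREATE INDEX idx_orders_order_date ON orders(order_date);
""")

def explain(sql):
    """Return SQLite's plan detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Column wrapped in a function: the index on order_date is not usable for seeking.
wrapped_plan = explain(
    "SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'")

# Bare column compared to a range: the index can be searched directly.
range_plan = explain(
    "SELECT * FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

print("function on column:", wrapped_plan)
print("range on column:   ", range_plan)
```

In SQLite's wording, the first plan is a SCAN of the whole table and the second is a SEARCH using idx_orders_order_date.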

Implicit Type Conversions

-- Hidden performance problem: customer_id is INT but the query passes a string
SELECT * FROM orders WHERE customer_id = '12345';  -- Implicit conversion on every execution

-- Better: Match the data type exactly
SELECT * FROM orders WHERE customer_id = 12345;  -- Direct comparison

-- Even worse: This forces conversion of ALL customer_id values
-- and prevents use of an index on customer_id
SELECT * FROM orders WHERE CAST(customer_id AS VARCHAR) = '12345';  -- MySQL spells this CAST(... AS CHAR)

Premature DISTINCT Usage

-- Expensive: DISTINCT requires sorting/hashing entire result set
SELECT DISTINCT c.first_name, c.last_name
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;

-- Often better: Use EXISTS to avoid duplicates in the first place
-- (one row per customer; unlike DISTINCT, two customers who share a name both appear)
SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

Window Function Performance Considerations

Window functions are powerful but can be resource-intensive. Understanding their performance characteristics helps you use them wisely.

Partitioning Strategy

-- Less efficient: Large partitions mean more sorting work
SELECT 
    product_name,
    price,
    ROW_NUMBER() OVER (ORDER BY price) as overall_rank  -- Sorts ALL products
FROM products;

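-- More efficient: Smaller partitions reduce sorting overhead
-- (this answers a different question, rank within category, so use it only when that is what you need)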
SELECT 
    product_name,
    category,
    price,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY price) as category_rank  -- Sorts within category
FROM products;

Frame Specification Impact

-- Expensive: UNBOUNDED PRECEDING with large datasets
SELECT 
    order_date,
    total_amount,
    SUM(total_amount) OVER (
        ORDER BY order_date 
        ROWS UNBOUNDED PRECEDING  -- Frame covers every row from the start up to the current one
    ) as running_total
FROM orders;

-- More efficient for recent data analysis: Limited window
SELECT 
    order_date,
    total_amount,
    SUM(total_amount) OVER (
        ORDER BY order_date 
        ROWS 29 PRECEDING  -- Current row plus the 29 before it: 30 orders, not 30 days
    ) as rolling_30_order_total
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 90 DAY);
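
A ROWS frame counts rows, not calendar days, which matters whenever a day has several orders or none. The sketch below contrasts the two on four made-up orders using Python's sqlite3 module (window functions need SQLite 3.25 or later); the data, the 2-row and 2-day windows, and the self-join used for the calendar version are all assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, order_date TEXT, total_amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-01', 10), (2, '2024-01-01', 20),
        (3, '2024-01-02', 30), (4, '2024-01-05', 40);
""")

# Row-based frame: the current row plus the one before it, whatever their dates.
row_based = conn.execute("""
    SELECT order_id,
           SUM(total_amount) OVER (
               ORDER BY order_date, order_id
               ROWS 1 PRECEDING
           ) AS rolling_total
    FROM orders
    ORDER BY order_id
""").fetchall()

# Calendar-based: total per day, then add each day to the day before it.
day_based = conn.execute("""
    WITH daily AS (
        SELECT order_date AS sale_date, SUM(total_amount) AS revenue
        FROM orders
        GROUP BY order_date
    )
    SELECT d.sale_date, SUM(prev.revenue) AS rolling_2_day_total
    FROM daily d
    JOIN daily prev
      ON prev.sale_date BETWEEN date(d.sale_date, '-1 day') AND d.sale_date
    GROUP BY d.sale_date
    ORDER BY d.sale_date
""").fetchall()

print(row_based)  # [(1, 10.0), (2, 30.0), (3, 50.0), (4, 70.0)]
print(day_based)  # [('2024-01-01', 30.0), ('2024-01-02', 60.0), ('2024-01-05', 40.0)]
```

Order 4 on January 5 picks up order 3 from January 2 in the row-based version, while the calendar version correctly treats January 5 as having no sales the day before.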

Query Plan Analysis and Optimization

Learning to read execution plans is crucial for performance tuning. Here's what to look for:

Identifying Expensive Operations

-- Use EXPLAIN to see the execution plan
EXPLAIN SELECT c.first_name, c.last_name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.registration_date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, c.last_name;

In the execution plan, watch for these red flags:

Table scans on large tables (should use indexes)
Hash joins where an indexed nested loop join would touch far fewer rows (hash joins are often the right choice for large, unfiltered inputs)
Sorting operations on large result sets (consider if you really need ORDER BY)
High row count estimates that don't match reality (statistics might be outdated)

Understanding Cost Estimates

The database optimizer makes decisions based on statistics about your data. If these statistics are wrong, the optimizer makes poor choices. Regular statistics updates are crucial:

-- Update table statistics (syntax varies by database)
ANALYZE TABLE customers;       -- MySQL
ANALYZE customers;             -- PostgreSQL
UPDATE STATISTICS customers;   -- SQL Server

This is especially important after large data loads or significant changes to data distribution.
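
If "statistics" feels abstract, it can help to look at what an engine actually stores. As a rough illustration, SQLite keeps its statistics in an ordinary table named sqlite_stat1 that only exists after ANALYZE has run. The sketch below uses Python's sqlite3 module; the two-column customers table and its 90/10 country split are made-up assumptions, and other engines keep comparable information in their own catalogs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE INDEX idx_customers_country ON customers(country);
""")
# Skewed, hypothetical data: 90 customers in one country, 10 in another.
conn.executemany(
    "INSERT INTO customers (country) VALUES (?)",
    [("US",)] * 90 + [("NZ",)] * 10)

# Gather statistics, then read back what the planner will consult.
conn.execute("ANALYZE")
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'customers'"
).fetchall()

for tbl, idx, stat in stats:
    print(tbl, idx, stat)
```

The stat column summarizes row counts per index. If the table later grows or its distribution shifts, these numbers stay as they were until ANALYZE runs again, which is how estimates drift away from reality.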

Security Considerations

Use parameterized queries to prevent SQL injection
Implement proper access controls at the database level
Audit sensitive operations like DELETE and UPDATE
Regular backups before major data modifications
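
The first point is the one most often gotten wrong in application code. Here is a minimal sketch of the difference, using Python's sqlite3 module and its ? placeholder; the two-row customers table and the hostile input string are made up for illustration, and other drivers use different placeholder styles (such as %s or named parameters).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO customers (email) VALUES ('ana@example.com'), ('ben@example.com');
""")

# Hypothetical hostile input, such as a value typed into a search form.
user_input = "nobody@example.com' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and changes the query's logic.
unsafe_sql = "SELECT email FROM customers WHERE email = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the ? placeholder sends the input as a value, never as SQL.
safe_rows = conn.execute(
    "SELECT email FROM customers WHERE email = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # 2
print(len(safe_rows))    # 0
```

The concatenated query leaks every row because the injected `OR '1'='1'` is always true; the parameterized query simply finds no customer with that odd-looking email address.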
