PWC DATA ANALYST EXPERIENCE (1-3 yoe)
Guesstimate Questions:
1. Estimate the number of smartphones sold in India annually.
To guesstimate the annual number of smartphones sold in India, we can break the problem
into logical steps using assumptions and available population data. Here's a structured
approach:
Step 1: Population of India
India's population is approximately 1.4 billion people.
Step 2: Target Population (Smartphone Users)
Not everyone in India uses or purchases a smartphone. Let's segment the population:
• Assume 70% of the population is in the age group 15-60, which is the primary
smartphone user base.
1.4 billion × 70% ≈ 980 million
• Assume 70% of this group can afford a smartphone or actively use one.
980 million × 70% ≈ 686 million
Step 3: Replacement Cycle and New Users
• On average, a smartphone is replaced every 2-3 years. This implies that
approximately one-third of smartphone users replace their phones annually.
686 million ÷ 3 ≈ 229 million
• Assume an additional 5-10% of the target population (new users) buys a
smartphone each year. Using 10%: 686 million × 10% ≈ 68.6 million
Step 4: Total Annual Sales
Adding the replacements and new users:
229 million + 68.6 million ≈ 300 million
Final Estimate:
Approximately 300 million smartphones are sold annually in India.
Assumptions Recap:
1. 70% of the population is in the primary age group for smartphone users.
2. 70% of this segment can afford smartphones.
3. Replacement cycle is 2-3 years.
4. Annual new users account for ~10% of the total target population.
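As a quick sanity check, the same arithmetic can be reproduced in a few lines of Python (the figures are the assumptions above, not real market data):
population = 1_400_000_000                  # Step 1: India's population
target_users = population * 0.70 * 0.70     # Step 2: age filter, then affordability filter (~686 million)
replacements = target_users / 3             # Step 3: ~one-third replace their phone each year
new_users = target_users * 0.10             # Step 3: ~10% first-time buyers
annual_sales = replacements + new_users     # Step 4: total annual sales
print(round(annual_sales / 1e6))            # ~297, i.e. roughly 300 million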
Refinement:
This estimate can be refined further with real data on:
• Smartphone penetration rates.
• Market growth rates.
• Urban vs. rural demographics.
2. How would you estimate the daily revenue generated by roadside tea stalls across
India?
To estimate the daily revenue generated by roadside tea stalls across India, let's approach
this systematically using assumptions and logical reasoning:
Step 1: Population of India
India's population is approximately 1.4 billion.
Step 2: Target Population for Tea Consumption
Not everyone consumes tea from roadside stalls. Let’s segment the population:
• Assume 70% of the population (adults and older teens) regularly drink tea.
1.4 billion × 70% ≈ 980 million
• Out of these, assume 50% of tea drinkers consume tea from roadside stalls (the rest
may make tea at home, go to cafes, or other sources).
980 million × 50% ≈ 490 million
Step 3: Daily Tea Consumption
• On average, tea drinkers consume 2 cups of tea daily.
• Not all cups are purchased from roadside stalls; assume 1 cup per person per day
is bought from such stalls.
490 million cups per day
Step 4: Price of Tea
• The average price of tea at roadside stalls is approximately ₹10 per cup.
Step 5: Daily Revenue
• Multiply the daily consumption by the price per cup:
490 million cups × ₹10 = ₹4.9 billion
Final Estimate:
The daily revenue generated by roadside tea stalls across India is approximately ₹4.9
billion.
Assumptions Recap:
1. 70% of the population drinks tea.
2. 50% of tea drinkers buy from roadside stalls.
3. One cup per person is consumed daily at roadside stalls.
4. Average price per cup is ₹10.
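Again, a short Python sketch just to verify the arithmetic under these assumptions:
population = 1_400_000_000                  # Step 1: India's population
tea_drinkers = population * 0.70            # Step 2: 70% drink tea (~980 million)
stall_customers = tea_drinkers * 0.50       # Step 2: 50% buy from roadside stalls (~490 million)
cups_per_day = stall_customers * 1          # Step 3: 1 cup per customer per day
daily_revenue = cups_per_day * 10           # Steps 4-5: ₹10 per cup
print(daily_revenue / 1e9)                  # 4.9, i.e. ₹4.9 billion per day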
Refinement:
To improve this estimate:
• Factor in rural vs. urban consumption patterns (higher urban roadside tea stall
density).
• Adjust for regional variations in tea prices and consumption habits.
• Account for occasional tea drinkers or seasonal demand changes.
Python Questions:
1. Write a Python function to find all unique pairs of integers in a list that sum up to a given
target value.
Find All Unique Pairs That Sum to a Target
def find_pairs(nums, target):
    seen = set()
    pairs = set()
    for num in nums:
        complement = target - num
        if complement in seen:
            pairs.add((min(num, complement), max(num, complement)))
        seen.add(num)
    return list(pairs)

# Example usage:
nums = [2, 4, 3, 7, 5, 8, -1]
target = 7
print(find_pairs(nums, target))  # Output (order may vary): [(3, 4), (2, 5), (-1, 8)]
2. Given a string, write a function to check if it’s a palindrome, ignoring spaces,
punctuation, and case sensitivity.
Check if a String Is a Palindrome
def is_palindrome(s):
    # Remove spaces and punctuation, and convert to lowercase
    filtered = ''.join(c for c in s if c.isalnum()).lower()
    return filtered == filtered[::-1]

# Example usage:
s = "A man, a plan, a canal, Panama!"
print(is_palindrome(s))  # Output: True
3. Explain the difference between deep copy and shallow copy in Python. When would you
use each?
Deep Copy vs. Shallow Copy
• Shallow Copy:
o Creates a new object but does not create copies of nested objects.
o Changes to nested mutable objects in the original are reflected in the copy,
because both share references to the same inner objects.
o Example: Using copy.copy() or the copy() method of a list.
• Deep Copy:
o Creates a new object along with copies of all objects it contains, recursively.
o Changes to the original object do not affect the copied object.
o Example: Using copy.deepcopy().
Example:
import copy
original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)
original[0][0] = 99
print(shallow) # Output: [[99, 2], [3, 4]]
print(deep) # Output: [[1, 2], [3, 4]]
Use Cases:
• Use shallow copy when you want to duplicate a structure but allow shared mutable
data.
• Use deep copy when creating a fully independent copy is necessary.
4. What are decorators in Python, and how do they work? Provide an example of a scenario
where a decorator would be useful.
Decorators in Python
Decorators are functions that modify the behavior of other functions or methods. They take
a function as input, add functionality to it, and return it.
Example of a Decorator:
def logger(func):
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with {args} and {kwargs}")
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper

@logger
def add(a, b):
    return a + b

# Example usage:
print(add(3, 5))
Output:
Calling add with (3, 5) and {}
add returned 8
8
When to Use:
Decorators are useful for:
1. Logging: Automatically log function calls.
2. Authentication: Check user permissions before executing a function.
3. Caching: Store results of expensive computations for reuse (see the sketch below).
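As an illustration of the caching use case, here is a minimal memoization decorator, a sketch that stores results in a plain dictionary keyed on positional arguments (in practice, functools.lru_cache provides the same behaviour):
def memoize(func):
    cache = {}
    def wrapper(*args):
        if args not in cache:          # compute only on a cache miss
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def slow_square(n):
    print(f"Computing {n} squared ...")
    return n * n

# Example usage:
print(slow_square(4))  # Computes and caches: prints the message, then 16
print(slow_square(4))  # Served from the cache: prints only 16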
SQL Questions:
1. Write a query to find the cumulative revenue by month for each product category in
a sales table.
Step 1: Create the Sales Table
CREATE TABLE sales (
id INT AUTO_INCREMENT PRIMARY KEY,
product_category VARCHAR(50),
revenue DECIMAL(10, 2),
sale_date DATE
);
Step 2: Insert Sample Records
INSERT INTO sales (product_category, revenue, sale_date) VALUES
('Electronics', 5000.00, '2024-01-15'),
('Electronics', 7000.00, '2024-02-10'),
('Electronics', 4000.00, '2024-03-05'),
('Clothing', 2000.00, '2024-01-20'),
('Clothing', 3000.00, '2024-02-15'),
('Clothing', 1500.00, '2024-03-01'),
('Groceries', 1000.00, '2024-01-10'),
('Groceries', 1200.00, '2024-02-12'),
('Groceries', 1300.00, '2024-03-08');
Step 3: Write the Query for Cumulative Revenue
SELECT
product_category,
DATE_FORMAT(sale_date, '%Y-%m') AS month,
SUM(revenue) AS monthly_revenue,
SUM(SUM(revenue)) OVER (PARTITION BY product_category ORDER BY
DATE_FORMAT(sale_date, '%Y-%m')) AS cumulative_revenue
FROM
sales
GROUP BY
product_category, DATE_FORMAT(sale_date, '%Y-%m')
ORDER BY
product_category, month;
Explanation:
1. DATE_FORMAT(sale_date, '%Y-%m'): Extracts the year and month from the
sale_date for grouping.
2. SUM(SUM(revenue)) OVER (PARTITION BY product_category ORDER BY
DATE_FORMAT(sale_date, '%Y-%m')): Calculates the cumulative revenue for each
product category by summing the monthly revenues in the specified order.
3. GROUP BY product_category, DATE_FORMAT(sale_date, '%Y-%m'): Groups the
data by product category and month.
Sample Output:
Product_Category Month Monthly_Revenue Cumulative_Revenue
Electronics 2024-01 5000.00 5000.00
Electronics 2024-02 7000.00 12000.00
Electronics 2024-03 4000.00 16000.00
Clothing 2024-01 2000.00 2000.00
Clothing 2024-02 3000.00 5000.00
Clothing 2024-03 1500.00 6500.00
Groceries 2024-01 1000.00 1000.00
Groceries 2024-02 1200.00 2200.00
Groceries 2024-03 1300.00 3500.00
2. How would you retrieve the top 5 products by sales volume, excluding any products that
had zero sales in the past 3 months?
Step 1: Create the Products Table
CREATE TABLE product_sales (
product_id INT AUTO_INCREMENT PRIMARY KEY,
product_name VARCHAR(50),
sales_volume INT,
sale_date DATE
);
Step 2: Insert Sample Records
INSERT INTO product_sales (product_name, sales_volume, sale_date) VALUES
('Product A', 150, '2024-10-01'),
('Product A', 200, '2024-11-01'),
('Product A', 180, '2024-12-01'),
('Product B', 100, '2024-10-01'),
('Product B', 0, '2024-11-01'),
('Product B', 50, '2024-12-01'),
('Product C', 250, '2024-10-15'),
('Product C', 300, '2024-11-15'),
('Product C', 400, '2024-12-15'),
('Product D', 0, '2024-10-10'),
('Product D', 0, '2024-11-10'),
('Product D', 0, '2024-12-10'),
('Product E', 500, '2024-10-05'),
('Product E', 600, '2024-11-05'),
('Product E', 700, '2024-12-05');
Step 3: Write the Query
WITH recent_sales AS (
SELECT
product_name,
SUM(sales_volume) AS total_sales,
MAX(CASE WHEN sale_date >= CURDATE() - INTERVAL 3 MONTH THEN sales_volume
ELSE 0 END) AS recent_sales_flag
FROM
product_sales
WHERE
sale_date >= CURDATE() - INTERVAL 3 MONTH
GROUP BY
product_name
),
valid_products AS (
SELECT
product_name,
total_sales
FROM
recent_sales
WHERE
recent_sales_flag > 0
)
SELECT
product_name,
total_sales
FROM
valid_products
ORDER BY
total_sales DESC
LIMIT 5;
Explanation:
1. recent_sales CTE:
o Calculates the total sales for each product.
o Uses CASE to flag whether a product had non-zero sales in the past 3
months.
2. valid_products CTE:
o Filters out products with zero sales in all the past 3 months using
recent_sales_flag > 0.
3. Final Query:
o Retrieves the top 5 products by total sales from valid_products.
o Orders the results in descending order of total sales and limits the output to
the top 5 products.
Expected Output:
Product_Name Total_Sales
Product E 1800
Product C 950
Product A 530
Product B 150
(Only four products qualify: Product D is excluded because all of its sales in the past 3
months were zero. The figures assume the query runs within 3 months of the sample sale
dates.)
3. Given a table of customer transactions, identify all customers who made purchases in
two or more consecutive months.
To solve this, we'll assume the following table structure for customer transactions:
Step 1: Create the Transactions Table
CREATE TABLE customer_transactions (
transaction_id INT AUTO_INCREMENT PRIMARY KEY,
customer_id INT,
transaction_date DATE,
amount DECIMAL(10, 2)
);
Step 2: Insert Sample Records
INSERT INTO customer_transactions (customer_id, transaction_date, amount) VALUES
(1, '2024-01-15', 100.00),
(1, '2024-02-10', 200.00),
(1, '2024-04-05', 150.00),
(2, '2024-01-20', 300.00),
(2, '2024-02-15', 400.00),
(2, '2024-03-01', 500.00),
(3, '2024-01-25', 250.00),
(3, '2024-03-10', 300.00),
(4, '2024-02-05', 150.00),
(4, '2024-03-07', 200.00),
(4, '2024-04-15', 250.00);
Step 3: Write the Query
WITH monthly_transactions AS (
SELECT
customer_id,
DATE(DATE_FORMAT(transaction_date, '%Y-%m-01')) AS transaction_month
FROM
customer_transactions
GROUP BY
customer_id, DATE(DATE_FORMAT(transaction_date, '%Y-%m-01'))
),
consecutive_months AS (
SELECT
t1.customer_id,
t1.transaction_month AS month1,
t2.transaction_month AS month2
FROM
monthly_transactions t1
JOIN
monthly_transactions t2
ON
t1.customer_id = t2.customer_id
AND t2.transaction_month = t1.transaction_month + INTERVAL 1 MONTH
)
SELECT DISTINCT
customer_id
FROM
consecutive_months;
Explanation:
1. monthly_transactions CTE:
o Groups transactions by customer and month, using
DATE(DATE_FORMAT(transaction_date, '%Y-%m-01')) so each month is represented by its first day.
o Ensures we have a unique list of months in which a customer made
purchases.
2. consecutive_months CTE:
o Joins monthly_transactions with itself to find customers with consecutive
months.
o Checks whether t1.transaction_month + INTERVAL 1 MONTH equals
t2.transaction_month, i.e. whether the customer was also active in the
immediately following month.
3. Final Query:
o Selects unique customer IDs from the consecutive_months CTE.
Sample Output:
Customer_ID
1
2
4
Notes:
• Customer 1: Purchased in January and February.
• Customer 2: Purchased in January, February, and March.
• Customer 4: Purchased in February, March, and April.
• Customer 3: Skipped February, so they are not included in the output.
4. Write a query to calculate the retention rate of users on a monthly basis.
Retention Rate Definition
The retention rate is the percentage of users who return in a subsequent month after their
initial activity.
Assumptions
• We have a table called user_activity with the following structure:
o user_id: Unique identifier for each user.
o activity_date: Date of the user's activity.
Step 1: Create the Table
CREATE TABLE user_activity (
user_id INT,
activity_date DATE
);
Step 2: Insert Sample Records
INSERT INTO user_activity (user_id, activity_date) VALUES
(1, '2024-01-15'),
(1, '2024-02-10'),
(1, '2024-03-20'),
(2, '2024-01-20'),
(2, '2024-02-15'),
(3, '2024-02-05'),
(3, '2024-03-10'),
(4, '2024-01-25'),
(5, '2024-02-18'),
(5, '2024-03-15'),
(6, '2024-03-01');
Step 3: Query to Calculate Retention Rate
WITH first_month_activity AS (
SELECT
user_id,
DATE_FORMAT(MIN(activity_date), '%Y-%m') AS first_active_month
FROM
user_activity
GROUP BY
user_id
),
monthly_retention AS (
SELECT
fma.first_active_month,
DATE_FORMAT(ua.activity_date, '%Y-%m') AS active_month,
COUNT(DISTINCT ua.user_id) AS retained_users
FROM
user_activity ua
JOIN
first_month_activity fma
ON
ua.user_id = fma.user_id
GROUP BY
fma.first_active_month, DATE_FORMAT(ua.activity_date, '%Y-%m')
),
monthly_cohort AS (
SELECT
first_active_month,
COUNT(DISTINCT user_id) AS cohort_size
FROM
first_month_activity
GROUP BY
first_active_month
)
SELECT
mr.first_active_month,
mr.active_month,
mr.retained_users,
mc.cohort_size,
ROUND((mr.retained_users / mc.cohort_size) * 100, 2) AS retention_rate
FROM
monthly_retention mr
JOIN
monthly_cohort mc
ON
mr.first_active_month = mc.first_active_month
ORDER BY
mr.first_active_month, mr.active_month;
Explanation
1. first_month_activity CTE:
o Determines the first active month for each user.
2. monthly_retention CTE:
o Counts the number of users retained for each combination of their first
active month and subsequent activity months.
3. monthly_cohort CTE:
o Calculates the size of the cohort for each first active month (the total number
of users who first became active in that month).
4. Final Query:
o Joins monthly_retention and monthly_cohort to calculate the retention rate
as: Retention Rate = (Retained Users / Cohort Size) × 100
o Orders the results by the first active month and the active month.
Sample Output
First_Active_Month Active_Month Retained_Users Cohort_Size Retention_Rate
2024-01 2024-01 3 3 100.00
2024-01 2024-02 2 3 66.67
2024-01 2024-03 1 3 33.33
2024-02 2024-02 2 2 100.00
2024-02 2024-03 2 2 100.00
2024-03 2024-03 1 1 100.00
Interpretation
• First Active Month: The cohort of users who became active in that month.
• Active Month: The months in which users returned.
• Retention Rate: The percentage of the cohort that returned in subsequent months.
5. Find the nth highest salary from an employee table, where n is a parameter passed
dynamically to the query.
To find the nth highest salary dynamically, we can order the distinct salaries in descending
order and use the LIMIT clause: skip the first n − 1 salaries and retrieve the next one.
Because MySQL's LIMIT clause does not accept user variables in an ordinary statement, the
offset is bound through a prepared statement. Here's how:
Table Creation and Sample Data
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(50),
salary DECIMAL(10, 2)
);
INSERT INTO employees (employee_id, employee_name, salary) VALUES
(1, 'Alice', 60000.00),
(2, 'Bob', 75000.00),
(3, 'Charlie', 85000.00),
(4, 'David', 50000.00),
(5, 'Eve', 85000.00);
Query for nth Highest Salary
SET @n := 2; -- Set the value of n dynamically
SET @offset := @n - 1; -- Skip the top n - 1 distinct salaries
PREPARE stmt FROM 'SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT ?, 1';
EXECUTE stmt USING @offset;
DEALLOCATE PREPARE stmt;
Explanation
1. @n Variable:
o Dynamically sets the rank n for the desired salary; @offset holds n - 1.
2. DISTINCT salary:
o Ensures unique salaries are considered in case of duplicates.
3. ORDER BY salary DESC:
o Orders salaries in descending order, ranking the highest salary first.
4. LIMIT ?, 1 (bound to @offset via the prepared statement):
o Skips the top n - 1 salaries and retrieves the next one.
Alternative Query Using Window Functions (MySQL 8.0+)
If the database supports window functions, you can use the DENSE_RANK() function:
SET @n := 2; -- Set the value of n dynamically
WITH ranked_salaries AS (
SELECT
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM
employees
)
SELECT salary
FROM ranked_salaries
WHERE salary_rank = @n;
Explanation (Window Functions)
1. DENSE_RANK():
o Assigns a unique rank to each salary in descending order. Duplicate salaries
get the same rank.
2. WITH ranked_salaries:
o Creates a temporary table with salaries and their respective ranks.
3. WHERE salary_rank = @n:
o Filters the result to return only the nth rank (the alias salary_rank avoids
MySQL's reserved word RANK).
Sample Output
For n = 2:
Salary
75000.00
Key Notes
• Use the DISTINCT keyword to handle duplicate salaries for the LIMIT method.
• Use DENSE_RANK() if you want to consider duplicate salaries as a single rank.
6. Explain how indexing works in SQL and how to decide which columns should be indexed
for optimal performance.
How Indexing Works in SQL
An index is a database structure that improves the speed of data retrieval operations on a
table. It works like an optimized lookup table for the database, allowing it to quickly locate
rows without scanning the entire table.
• Structure: Most indexes are implemented as balanced tree structures (e.g., B-trees)
or hash tables. These structures allow efficient searching, insertion, and deletion
operations.
• Function: When a query is executed, the database engine checks if an index is
available for the columns involved in the query’s filters or joins. If so, the engine
uses the index to locate the rows, reducing the need for a full table scan (a short
runnable demonstration follows below).
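As a quick, self-contained demonstration of this behaviour, the sketch below uses SQLite through Python's standard library (rather than MySQL, purely so it runs without a database server; the table, column, and index names are made up for the example). The query plan switches from a full table scan to an index search once an index exists:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
cur.executemany("INSERT INTO users (email) VALUES (?)",
                [(f"user{i}@example.com",) for i in range(10000)])

# Without an index, the planner falls back to a full table scan.
print(cur.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
                  ("[email protected]",)).fetchall())   # e.g. 'SCAN users'

# After adding an index, the same lookup becomes an index search.
cur.execute("CREATE INDEX idx_email ON users(email)")
print(cur.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
                  ("[email protected]",)).fetchall())   # e.g. 'SEARCH users USING INDEX idx_email (email=?)'
conn.close()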
Types of Indexes
1. Primary Index:
o Automatically created for the primary key column.
o Ensures unique values and quick lookups for primary key operations.
2. Unique Index:
o Ensures that all values in the indexed column are unique.
3. Clustered Index:
o Reorders the physical storage of table data to match the index order.
o A table can have only one clustered index.
4. Non-clustered Index:
o Creates a separate structure to store the index and points to the table rows.
o A table can have multiple non-clustered indexes.
5. Composite Index:
o Indexes multiple columns together.
6. Full-Text Index:
o Optimized for searching text data, such as finding words or phrases in large
text fields.
Benefits of Indexing
• Faster Query Execution: Speeds up SELECT, JOIN, and WHERE clause operations.
• Reduced I/O Operations: Fewer rows are read from the disk.
• Sorted Data Retrieval: Helps with ORDER BY and GROUP BY clauses.
Drawbacks of Indexing
• Slower Write Operations: INSERT, UPDATE, and DELETE operations become slower
because the index must also be updated.
• Storage Overhead: Indexes consume additional disk space.
• Maintenance Overhead: Indexes need to be maintained, especially in tables with
frequent data modifications.
How to Decide Which Columns to Index
1. Frequently Queried Columns:
o Index columns that appear frequently in WHERE, JOIN, ON, ORDER BY, or
GROUP BY clauses.
2. Primary Keys and Unique Constraints:
o Always index primary key columns as they uniquely identify rows.
3. Foreign Keys:
o Index foreign key columns to improve JOIN performance.
4. High-Selectivity Columns:
o Choose columns with a wide range of unique values (e.g., a user_id column)
because indexes work best with high selectivity.
5. Composite Indexes:
o Use composite indexes when multiple columns are often queried together.
For example, for queries like:
SELECT * FROM sales WHERE year = 2023 AND region = 'North';
A composite index on (year, region) will perform better than individual indexes.
6. Avoid Low-Selectivity Columns:
o Avoid indexing columns with few distinct values (e.g., gender or status with
values like 'Active' or 'Inactive').
7. Read-Heavy Tables:
o Index columns in tables where SELECT operations are more frequent than
INSERT/UPDATE/DELETE.
Examples
Scenario 1: Searching by email in a user table
CREATE INDEX idx_email ON users(email);
• Improves performance for queries like:
SELECT * FROM users WHERE email = '[email protected]';
Scenario 2: Composite index for a sales table
CREATE INDEX idx_year_region ON sales(year, region);
• Optimizes queries with:
SELECT * FROM sales WHERE year = 2023 AND region = 'North';
Scenario 3: Indexing a foreign key
CREATE INDEX idx_customer_id ON orders(customer_id);
• Speeds up JOINs like:
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
Monitoring and Tuning
1. EXPLAIN Plan:
o Use EXPLAIN to analyze how the database executes a query and whether it
uses an index.
2. Query Performance Metrics:
o Monitor slow queries and identify columns for potential indexing.
3. Index Maintenance:
o Periodically rebuild or reorganize indexes to ensure they remain efficient.
Summary
• Use indexes on frequently queried, high-selectivity columns.
• Avoid excessive indexing on write-heavy tables.
• Analyze query patterns and use tools like EXPLAIN to make data-driven decisions
about indexing.
7. Describe the differences between LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN and
when to use each one in a complex query.
Differences Between LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN
In SQL, JOIN operations combine rows from two or more tables based on a related column.
The differences among LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN lie in how
unmatched rows are handled.
1. LEFT JOIN
• Definition: Returns all rows from the left table and the matched rows from the right
table. If no match is found, the result contains NULL for columns from the right
table.
• Use Case: Use when you want all records from the left table regardless of whether
there is a match in the right table.
Syntax
SELECT columns
FROM table1
LEFT JOIN table2
ON table1.common_column = table2.common_column;
Example
• Tables:
o Customers:
CustomerID Name
1 Alice
2 Bob
3 Charlie
o Orders:
OrderID CustomerID
101 1
102 2
• Query:
SELECT c.Name, o.OrderID
FROM Customers c
LEFT JOIN Orders o
ON c.CustomerID = o.CustomerID;
• Result:
Name OrderID
Alice 101
Bob 102
Charlie NULL
2. RIGHT JOIN
• Definition: Returns all rows from the right table and the matched rows from the left
table. If no match is found, the result contains NULL for columns from the left table.
• Use Case: Use when you want all records from the right table regardless of whether
there is a match in the left table.
Syntax
SELECT columns
FROM table1
RIGHT JOIN table2
ON table1.common_column = table2.common_column;
Example
• Query:
SELECT c.Name, o.OrderID
FROM Customers c
RIGHT JOIN Orders o
ON c.CustomerID = o.CustomerID;
• Result:
Name OrderID
Alice 101
Bob 102
3. FULL OUTER JOIN
• Definition: Combines the results of LEFT JOIN and RIGHT JOIN. Returns all rows
from both tables, with NULL in columns where no match exists.
• Use Case: Use when you want to include all records from both tables, showing
unmatched rows with NULL values.
Syntax
SELECT columns
FROM table1
FULL OUTER JOIN table2
ON table1.common_column = table2.common_column;
Example
• Query:
SELECT c.Name, o.OrderID
FROM Customers c
FULL OUTER JOIN Orders o
ON c.CustomerID = o.CustomerID;
• Result:
Name OrderID
Alice 101
Bob 102
Charlie NULL
When to Use Each Join in Complex Queries
1. LEFT JOIN:
o When the left table contains a primary set of data and you want to include all
rows, even if they have no matching data in the right table.
o Example: Listing all customers, including those who haven't made any
orders.
2. RIGHT JOIN:
o When the right table contains a primary set of data and you want to include
all rows, even if they have no matching data in the left table.
o Example: Listing all orders, including those made by unregistered
customers.
3. FULL OUTER JOIN:
o When both tables are equally important, and you want to analyze all data
points, even unmatched rows.
o Example: Creating a comprehensive report that includes all customers and
all orders, showing unmatched customers or orders.
Key Differences in a Nutshell
Feature LEFT JOIN RIGHT JOIN FULL OUTER JOIN
Rows from Left Table Always Included Only if Matched Always Included
Rows from Right Table Only if Matched Always Included Always Included
Unmatched Rows NULL in Right Columns NULL in Left Columns NULL in Both Columns
Visual Representation
If A represents rows from the left table and B represents rows from the right table:
• LEFT JOIN: A ∪ (A ∩ B)
• RIGHT JOIN: B ∪ (A ∩ B)
• FULL OUTER JOIN: A ∪ B
Performance Tips
• Use LEFT JOIN or RIGHT JOIN instead of FULL OUTER JOIN if you only need one
side's unmatched rows, as it reduces computation.
• Always use indexes on the columns used in the ON clause to improve performance
in joins.
8. What is the difference between HAVING and WHERE clauses in SQL, and when would
you use each?
Difference Between HAVING and WHERE Clauses in SQL
1. WHERE Clause
• Purpose: Filters rows before any aggregation.
• Scope: Applied before any GROUP BY operation.
• Use Case: Used to filter rows based on conditions applied to individual columns.
Syntax:
SELECT column1, column2, ...
FROM table_name
WHERE condition;
Example:
SELECT customer_name, total_orders
FROM orders
WHERE total_orders > 50;
• Explanation: Filters out rows before aggregation (i.e., filters orders with total_orders
> 50).
2. HAVING Clause
• Purpose: Filters the aggregated results (after applying GROUP BY).
• Scope: Applied after the GROUP BY operation.
• Use Case: Used to filter aggregated data based on conditions applied to aggregate
functions like SUM, AVG, COUNT, etc.
Syntax:
SELECT column1, aggregate_function(column2)
FROM table_name
GROUP BY column1
HAVING condition;
Example:
SELECT category, COUNT(order_id) AS total_orders
FROM orders
GROUP BY category
HAVING total_orders > 10;
• Explanation: Filters the aggregated results (only categories with total_orders > 10
are included).
Key Differences
Feature WHERE Clause HAVING Clause
Purpose Filters rows before aggregation Filters aggregated results
Scope Applied to table rows individually Applied to the grouped results
Usage Used to filter individual rows Used to filter aggregated results
Conditions Applies to non-aggregated columns (before grouping) Applies to aggregated columns (after grouping)
Example WHERE total_orders > 50 HAVING COUNT(order_id) > 10
When to Use Each
1. Use WHERE when:
o You need to filter rows based on conditions before performing any
aggregation.
o Example: Filtering customer records where the order count is more than 50.
2. Use HAVING when:
o You need to filter the results of an aggregation.
o Example: Counting orders by category and filtering categories with more than
10 orders.
Practical Scenario
-- Example using both WHERE and HAVING
SELECT category, COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= '2024-01-01' -- Filtering based on date before aggregation
GROUP BY category
HAVING total_orders > 10; -- Filtering aggregated results
This returns categories with more than 10 orders placed on or after January 1, 2024.