Thanks to visit codestin.com
Credit goes to datadriven.io

The SQL Round

The SQL round shows up in roughly nine of every ten data engineering loops, and on most of those it shows up twice (phone screen and onsite). It's the gating round; if you don't pass it, the loop ends. Worked solutions, the patterns interviewers score on, and the failure modes that lose the round. Part of the our data engineer interview prep hub.

What the SQL round actually tests

The patterns are consistent across companies. The depth and the level of follow-up isn't. Frequencies are approximate, weighted by what shows up in verified interview reports.

PatternRoughly how oftenWhere it lands hardest
INNER and LEFT JOINHighestEvery loop, all levels
GROUP BY with HAVINGHighestEvery loop, all levels
Window functions (PARTITION BY)HighL4 and up, FAANG
Date trunc and date mathHighEvery loop
ROW_NUMBER deduplicationHighEvery loop
NULLIF and COALESCE patternsCommonEvery loop
RANK and DENSE_RANKCommonL4 and up
Self join on inequalityCommonL4 and up, FAANG
Gap-and-island problemsCommon at seniorFAANG, modeling-heavy roles
Rolling and moving averagesCommonAnalytics-heavy roles
Pivot with conditional aggregationCommonEvery loop
Recursive CTEOccasionalSenior, modeling-heavy roles
EXISTS vs INOccasionalOptimization rounds
EXPLAIN plan readingSenior signalOptimization rounds at L5 and up

Five Worked Solutions From Real Loops

Each problem appeared in at least three reported interview loops in the last 18 months. Every solution shows the query, the gotchas the interviewer is testing for, and the follow-up question they will ask next.

Pattern: Window Dedup

Keep the most recent row per user

A user_events table has duplicate rows because of an at-least-once pipeline. Return one row per user_id, keeping the row with the latest event_ts. Ties broken by event_id descending.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_timestamp DESC, event_id DESC
    ) AS rn
  FROM event_data
)
SELECT user_id, event_timestamp, event_id, payload
FROM ranked
WHERE rn = 1;
Pattern: Gap and Island

Find users active for 3+ consecutive days

The trick: subtract a per user ROW_NUMBER from the date column. Consecutive dates produce the same difference, so they collapse into one group. The interviewer wants you to show the math out loud with a 5-row example.
WITH daily AS (
  SELECT DISTINCT user_id, substr(event_timestamp, 1, 10) AS event_date
  FROM event_data
),
grouped AS (
  SELECT
    user_id,
    event_date,
    julianday(event_date) -
      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date)
      AS streak_key
  FROM daily
)
SELECT user_id, MIN(event_date) AS streak_start, COUNT(*) AS streak_days
FROM grouped
GROUP BY user_id, streak_key
HAVING COUNT(*) >= 3;
Pattern: Rolling Window

7-day rolling revenue with explicit partial window handling

Most candidates write the window then stop. The follow-up is always: what does the first 6 rows show, and why is that wrong for a dashboard? Mention partial windows before being asked.
WITH daily_revenue_summary AS (
  SELECT
    substr(transaction_date, 1, 10) AS order_date,
    SUM(total_amount) AS daily_revenue
  FROM transactions
  GROUP BY substr(transaction_date, 1, 10)
)
SELECT
  order_date,
  daily_revenue,
  CASE
    WHEN ROW_NUMBER() OVER (ORDER BY order_date) >= 7
    THEN AVG(daily_revenue) OVER (
      ORDER BY order_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    )
  END AS rolling_7d_avg
FROM daily_revenue_summary
ORDER BY order_date;
Pattern: Self Join

Pairs of employees who share a manager

The inequality e1.id < e2.id is the only thing the interviewer cares about. It eliminates self-pairs and reverse duplicates in one condition. Candidates who use !=, then dedupe in a CTE, lose points for inefficiency.
SELECT
  e1.employee_id AS employee_a,
  e2.employee_id AS employee_b,
  e1.manager_id
FROM employees e1
JOIN employees e2
  ON e1.manager_id = e2.manager_id
  AND e1.employee_id < e2.employee_id;
Pattern: Recursive CTE

All reports under a manager, including indirect

The interviewer is checking three things: that you write the base case correctly, that you JOIN the recursive step on the previous iteration not the source table, and that you have a depth guard against cycles in the org chart.
WITH RECURSIVE org AS (
  SELECT employee_id, manager_id, emp_name, 0 AS depth
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  SELECT e.employee_id, e.manager_id, e.emp_name, o.depth + 1
  FROM employees e
  JOIN org o ON e.manager_id = o.employee_id
  WHERE o.depth < 20
)
SELECT employee_id, emp_name, depth FROM org WHERE depth > 0;

What Interviewers Watch For (And Score You On)

  1. 01

    Explicit NULL handling

    Every JOIN, every WHERE, every aggregate. Saying 'COUNT ignores NULL but COUNT(*) does not' before being asked is a strong signal. Most candidates think about NULL only after a wrong answer.
  2. 02

    Grain awareness

    Before the first SELECT, state the grain of your output: 'one row per user per day'. This single sentence tells the interviewer you understand aggregation. Skipping it forces them to ask, which costs points.
  3. 03

    Window vs aggregate choice

    If the result needs a per-row context (running total, rank, lag), the answer is a window function. If the result collapses rows, it is GROUP BY. Mixing the two without a clear reason is the most common L4+ rejection signal.
  4. 04

    ROW_NUMBER vs RANK vs DENSE_RANK

    ROW_NUMBER assigns 1, 2, 3, 4. RANK assigns 1, 1, 3, 4 (skips). DENSE_RANK assigns 1, 1, 2, 3 (no skip). Pick based on whether ties matter for your use case. Saying which you chose and why is required for senior roles.
  5. 05

    Edge cases stated proactively

    Empty table, all NULL inputs, single-row groups, ties at the boundary, partial windows. Saying 'on an empty input this returns an empty set, which I think is correct here' is the experience signal interviewers are looking for.
  6. 06

    Query readability

    CTEs over nested subqueries. Lowercase columns, uppercase keywords. One CTE per logical step with a name that reads like English. Interviewers explicitly score readability for L5+ roles.
  7. 07

    Performance instinct

    If a problem says 'over 5 billion rows', mention partition keys, predicate pushdown, and avoiding cross joins. You do not need to optimize. You need to show that performance is on your radar.

Why GROUP BY beats window functions when both fit

Most candidates rush to window functions because they look senior. They mostly aren't. If the problem reduces to one row per group with no per-row context, GROUP BY is correct, faster to write, and easier to defend.

Use the simplest construct that produces the right answer. Window functions earn their complexity when you need a value from a non-aggregated row alongside an aggregate: last login alongside total logins, current order amount alongside the previous one. The interviewer wants the right tool, not the fanciest. The same judgment shows up in vanilla Python patterns interviewers test, where reaching for pandas on a problem that needs a dict is the equivalent failure mode.

Prepare for the interview
01 / Open invite
02min.

Know the patterns before the interviewer asks them.

a SQL query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1SELECT user_id,
2 COUNT(*) AS sessions
3FROM events
4WHERE ts >= NOW() - INTERVAL '7 day'
5
Execute your solution0.4s avg.
MicrosoftInterview question
Solve a problem

How the SQL Round Connects to the Rest of the Loop

SQL fluency is the entry condition for the rest of the loop, but the patterns recur in surprising places. Gap-and-island problems reappear inside vanilla Python patterns interviewers test as sessionization questions. ROW_NUMBER deduplication shows up again inside star schema and SCD round prep when you argue for SCD Type 2 over Type 1. EXPLAIN-plan reading and partition pruning are the natural lead-in to system design framework for data engineers, where the same instincts scale up to data warehouse design.

Take-home assignments often bundle SQL with a small modeling task, which is why the data engineer take-home prep overlaps here. And if you're targeting a specific company, the live SQL round at Stripe leans hard on financial-precision SQL while the loop at Airbnb tests experimentation-platform SQL. Read the company-specific guides after this one.

Tools and Technologies Tested in the SQL Round

The SQL dialect of the round usually mirrors the company stack. Snowflake-heavy companies test Snowflake-specific functions (QUALIFY, FLATTEN). BigQuery-heavy companies test ARRAY_AGG and STRUCT manipulation. Postgres companies stay portable. The underlying patterns transfer, but the syntax does not. If you already know the company stack, drill the right dialect: see BigQuery internals interview prep, Redshift internals interview prep, PostgreSQL internals interview prep, or our SQL question reference for Data Engineer interviews reference for the full set.

How to Prepare in Four Weeks

  1. 01

    Week 1: Joins and aggregation grain

    30 problems on INNER, LEFT, FULL, and self join. Every problem requires you to state the output grain before writing. Use the joins-practice and group-by-practice problem sets on DataDriven. Goal: medium problems in under 12 minutes.
  2. 02

    Week 2: Window functions end to end

    Every window function: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTILE, SUM OVER, AVG OVER. 25 problems. Practice ROWS vs RANGE frame clauses. Goal: explain partial-window behavior before the interviewer asks.
  3. 03

    Week 3: Patterns and edge cases

    Gap and island, deduplication, sessionization, top-N per group, percentile calculations, pivot with conditional aggregation. 20 problems. Time yourself: medium under 15 minutes, hard under 25.
  4. 04

    Week 4: Live mock interviews

    10 mock interview rounds with the SQL mock interview tool on DataDriven. Speak out loud the entire time, including state-the-grain, state-edge-cases, and state-the-trade-off. Record yourself and replay. The gap between silent thinking and verbal thinking is what gets candidates rejected.

SQL Round FAQ

What SQL dialect is used in data engineering interviews?+
PostgreSQL syntax is the most common default in 2026, with about 70% of loops accepting it. Some companies use BigQuery (Google), Redshift (Amazon), Snowflake (data-warehouse-heavy roles), or Spark SQL. Write portable ANSI SQL when in doubt and ask the interviewer if a specific dialect is required.
How long should I take to write a medium SQL problem?+
12 to 15 minutes for a medium, 20 to 25 minutes for a hard. Faster than this is fine if your query is correct. Slower than this and the interviewer will start helping, which lowers your score even if you finish.
Can I use ChatGPT to practice SQL interview questions?+
Generating questions with an LLM is fine. Generating answers is counterproductive because you skip the muscle memory of writing the query. Practice on a tool that actually executes your SQL against real schemas. Reading correct queries does not build the speed you need to write them under interview pressure.
Do I need to memorize SQL window functions?+
Yes. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, AVG OVER, FIRST_VALUE, LAST_VALUE, and the ROWS BETWEEN frame clause should be reflexive. If you have to look up syntax during the interview, the interviewer will note it and lower your fluency score.
What if I do not finish the SQL problem in time?+
Get the structure right, even if it does not run. A correctly structured query with one syntax bug shows 95% of what the interviewer is evaluating. A complete but incorrect approach shows nothing. Always state your approach in 30 seconds before writing, so the interviewer can give partial credit if you run out of time.
Are SQL questions different at FAANG vs startups?+
FAANG SQL rounds are more rigorous on edge cases (NULL behavior, ties, empty inputs) and on optimization (EXPLAIN plans, predicate pushdown). Startups care more about practical correctness on messy real-world data. The query patterns are identical. The depth of follow-up is different.
Should I use CTEs or subqueries?+
CTEs, almost always. They are easier to read, easier to debug, and easier to extend. Some legacy MySQL versions do not support them, so know the subquery form too. In any modern interview, CTE is the default.
How important is performance in a SQL interview?+
For L3 to L4 roles, mentioning performance once is enough. For L5+ roles, you should be able to read an EXPLAIN plan, identify a full table scan, and propose a fix. For L6+ data infrastructure roles, expect a dedicated optimization round.
02 / Why practice

Drill the round against a real grader

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

More data engineer interview prep reading

More data engineer interview prep guides