Resources

The short list of what's on the site that's worth your time, organized by the decision you're trying to make.

Not everything on the site is listed here, and that is on purpose. The deeper material lives on its own pages; the cheat sheets and one-off references are linked from the relevant guide.

If you're new and want one place to start, the interview prep pillar is the right page. If you're two weeks out and want one thing to practice, the SQL problem bank is the highest-yield use of an hour.

Prepare for the interview

The five round walkthroughs

Most data engineering loops test the same five rounds: SQL, Python, data modeling, system design, and behavioral. There is one guide per round, and each walks the round end to end with the patterns interviewers score on and the failure modes that lose the round. If you read one round guide and find it useful, the others are written in the same voice.

Practice catalogs

Live problem banks with real graders. Free, no login required to start.

SQL practice problems→

Graded against a real Postgres instance with randomized fixtures.

Pipeline architecture canvas→

Draw the pipeline, pick the tools, get scored against the SLA.

Data modeling practice→

Schema design problems with grain-first rubrics.

Python practice problems→

Data-engineering-shaped Python, not LeetCode algorithms.

Pillar guides for the four domains

Long reads on each domain, written as the version you'd hand to a friend before their loop: SQL interview questions with fifteen worked solutions, Python interview questions, data modeling, and pipeline architecture. If you want the hundred questions weighted by interview frequency, that's the top 100.

Company interview guides

The loops that bend the standard template enough to be worth knowing in advance. Round structure, what's actually asked, where the loop deviates.

Meta→

SQL- and modeling-heavy. E3 to E7 leveling.

Stripe→

System design depth. Second design round at senior.

Databricks→

Delta Lake and PySpark are graded strictly here.

Netflix→

Streaming and large-scale event processing.

Airbnb→

Metrics and experimentation rounds.

Uber→

Heavy data modeling. Live schema critique.

Google→

Standard big-tech loop with a BigQuery dialect bias.

Amazon→

Leadership Principles plus L4-L7 technical bar.

Snowflake→

SQL optimization and Snowflake-specific syntax.

Decision guides

Comparisons for the role and tool decisions that change what you should study. For the role question: data engineer vs data analyst. For the tool questions interviewers actually pose: ETL vs ELT, batch vs streaming, dbt vs Airflow, Snowflake vs Databricks, Kafka vs Kinesis. If you're switching in from another role, the analyst-to-engineer transition guide is the one to read first.

Live Viewers, Live Billing

> We run a live video platform where creators broadcast to thousands of viewers at once. The product team wants real-time viewer counts and chat activity for creators, and the ads team needs accurate impression data for billing. Design a data pipeline for our livestream events.
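
One way to read the prompt above is as two pipelines with different correctness bars: a fast path where viewer counts only need to be fresh, and a billing path where each impression must be counted exactly once. The sketch below is a minimal in-memory illustration of that split, not a reference answer. The event fields (`type`, `stream`, `viewer`, `impression_id`, `ad`, `ts`), the idea that clients send periodic `heartbeat` events, and both function names are assumptions made for the example; in a real design the loops would be replaced by a stream processor and a warehouse job.

```python
from collections import defaultdict


def viewer_counts(events, window_s=10):
    """Fast path: distinct viewers per (stream, tumbling window start)."""
    viewers = defaultdict(set)
    for e in events:
        if e["type"] == "heartbeat":
            # Align the timestamp to the start of its tumbling window.
            window_start = e["ts"] - e["ts"] % window_s
            viewers[(e["stream"], window_start)].add(e["viewer"])
    return {key: len(ids) for key, ids in viewers.items()}


def billable_impressions(events):
    """Billing path: count each impression_id once, even if redelivered."""
    seen = set()
    per_ad = defaultdict(int)
    for e in events:
        if e["type"] == "impression" and e["impression_id"] not in seen:
            seen.add(e["impression_id"])
            per_ad[e["ad"]] += 1
    return dict(per_ad)


# Hypothetical sample events, including one redelivered impression.
events = [
    {"type": "heartbeat", "stream": "s1", "viewer": "v1", "ts": 3},
    {"type": "heartbeat", "stream": "s1", "viewer": "v2", "ts": 7},
    {"type": "heartbeat", "stream": "s1", "viewer": "v1", "ts": 9},
    {"type": "heartbeat", "stream": "s1", "viewer": "v1", "ts": 12},
    {"type": "impression", "impression_id": "i1", "ad": "ad1", "ts": 4},
    {"type": "impression", "impression_id": "i1", "ad": "ad1", "ts": 4},
    {"type": "impression", "impression_id": "i2", "ad": "ad1", "ts": 8},
]

print(viewer_counts(events))
print(billable_impressions(events))
```

The design point is that the two consumers tolerate different errors: a viewer count that is briefly off is acceptable, while a double-counted impression is a billing dispute, so only the second path pays for deduplication by a stable ID.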

Career artifacts

The boring but high-leverage stuff: the resume guide with examples by level, the roadmap for what to learn and in what order, salary by level, and portfolio projects that actually move the needle in a recruiter screen. The recruiter screen is the easiest part of the loop to win and the easiest to underestimate.

Tool-specific references

Open these only when you know the company's stack and you're targeting that specific dialect: PySpark, Kafka, Airflow, dbt, Snowflake, Databricks. The full hub is the tools page. Most candidates over-study these; the loop tests them less often than the job descriptions suggest.

Why practice

Stop reading. Solve one problem.

01
Active recall beats re-reading by 50%
Dunlosky et al. (2013), a review of ten learning techniques, rates practice testing as high utility and rates rereading and highlighting as low utility.
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
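
To make three of those shapes concrete, here is a minimal plain-Python sketch of dedup, top-N-per-group, and sessionization. The function names, the row fields (`user`, `ts`, `amount`), and the 30-minute session gap are all assumptions chosen for illustration; an interview would more often ask for the same logic in SQL with window functions, and this is only one way to write it.

```python
from itertools import groupby
from operator import itemgetter


def dedup_latest(rows, key, order_by):
    """Dedup: keep one row per key, the one with the greatest order_by."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


def top_n_per_group(rows, group, order_by, n):
    """Top-N-per-group: the n highest rows in each group."""
    result = []
    ordered = sorted(rows, key=itemgetter(group))  # groupby needs sorted input
    for _, members in groupby(ordered, key=itemgetter(group)):
        ranked = sorted(members, key=itemgetter(order_by), reverse=True)
        result.extend(ranked[:n])
    return result


def sessionize(events, gap):
    """Sessionization: start a new session for a user when the time since
    that user's previous event exceeds `gap` seconds."""
    out = []
    ordered = sorted(events, key=itemgetter("user", "ts"))
    for _, user_events in groupby(ordered, key=itemgetter("user")):
        session, prev_ts = 0, None
        for e in user_events:
            if prev_ts is not None and e["ts"] - prev_ts > gap:
                session += 1
            out.append({**e, "session": session})
            prev_ts = e["ts"]
    return out


# Hypothetical sample rows.
rows = [
    {"user": "a", "ts": 0, "amount": 5},
    {"user": "a", "ts": 100, "amount": 9},
    {"user": "a", "ts": 4000, "amount": 7},
    {"user": "b", "ts": 50, "amount": 3},
]

print(dedup_latest(rows, key="user", order_by="ts"))
print(top_n_per_group(rows, group="user", order_by="amount", n=2))
print(sessionize(rows, gap=1800))
```

All three follow the same skeleton: partition by a key, order within the partition, then apply a rule to the ordered rows. That shared skeleton is what makes the shapes recognizable once you have written them a few times.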
