Perfect 👍 Let’s start from the first question in the list:
Q1. What is the difference between OLTP and OLAP systems?
👉 Answer (Short & Crisp for Interview):
OLTP (Online Transaction Processing):
o Handles day-to-day transactions.
o Focus: Insert, Update, Delete operations.
o Data is highly normalized.
o Example: Banking system, flight booking.
OLAP (Online Analytical Processing):
o Used for analysis and reporting.
o Focus: Read-heavy, Aggregation, Complex Queries.
o Data is denormalized for faster queries.
o Example: Business Intelligence dashboards, sales analysis.
👉 Analogy:
OLTP = Cashier in a store (quick transactions).
OLAP = Store manager analyzing monthly sales trends.
👉 Scenario-based Follow-up Questions:
1. If you have to design a data warehouse, which system will you use?
(OLAP)
2. Why should we not run complex analytical queries on OLTP? (It slows
down transactions.)
3. How do OLTP and OLAP work together in a real company? (ETL pipelines
extract data from OLTP → load into OLAP for reporting.)
⚡ Would you like me to continue in this step-by-step format (Q → Answer →
Analogy → Follow-ups) for each question?
Great 👍 Let’s move to the second question.
Q2. Explain primary key vs unique key vs foreign key.
👉 Answer (Interview-ready):
1. Primary Key
o Uniquely identifies each row in a table.
o Cannot have NULL values.
o Only one primary key allowed per table.
o Example: EmployeeID in Employees table.
2. Unique Key
o Ensures all values in a column are unique.
o Can contain one NULL value.
o A table can have multiple unique keys.
o Example: Email in Users table.
3. Foreign Key
o Establishes relationship between two tables.
o Points to primary key in another table.
o Enforces referential integrity.
o Example: Order.CustomerID references Customer.CustomerID.
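Here is a minimal DDL sketch (table, column, and constraint names are illustrative) showing all three keys in one place:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,      -- primary key: unique, NOT NULL, only one per table
    Email VARCHAR(255) UNIQUE,       -- unique key: uniqueness enforced, a NULL is still allowed
    Name VARCHAR(100)
);
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
        ON DELETE SET NULL           -- controls what happens to child rows when the parent row is deleted
);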
👉 Analogy:
Primary key = Your Aadhaar number (unique, mandatory).
Unique key = Your mobile number (unique, but you may not have one, i.e., NULL is allowed).
Foreign key = A link between you and your father’s ID (relation between
tables).
👉 Follow-up Questions Interviewer May Ask:
1. Can a table have multiple primary keys? (No, but it can have a
composite primary key.)
2. What happens if you try to insert duplicate values in a unique key
column? (Error.)
3. If a record is deleted from the parent table, what happens to foreign key
references? (Depends: CASCADE, SET NULL, RESTRICT.)
⚡Would you like me to go deeper into SQL query examples for these keys, or
continue with the next interview question?
Awesome 🚀 Let’s continue.
Q3. What is indexing? How does it improve performance, and what are
its downsides?
👉 Answer (Interview-ready):
Indexing is a technique used in databases to speed up data retrieval.
An index is like a book’s table of contents – instead of scanning all
pages, you directly jump to the section.
✅ How it improves performance:
Reduces I/O operations by avoiding full table scans.
Improves performance of SELECT queries, WHERE, JOIN, ORDER BY.
❌ Downsides of Indexing:
1. Slower INSERT/UPDATE/DELETE (because index also needs to be
updated).
2. Takes extra storage space.
3. Too many indexes can confuse the optimizer → performance
degradation.
👉 Analogy:
Imagine a dictionary – without index, you read every word to find “Data.”
With index, you jump straight to “D” section. Fast lookup, but creating that
index takes effort.
👉 Follow-up Questions Interviewer May Ask:
1. What is the difference between Clustered vs Non-clustered index? (See the example after this list.)
o Clustered Index: Stores the actual table rows in key order. (Only one per table.)
o Non-clustered Index: Creates a separate structure with pointers to the rows. (Multiple allowed.)
2. What queries benefit from indexing?
o Queries with WHERE conditions, joins, sorting, grouping.
3. What happens if you create an index on a column with low cardinality
(like gender: M/F)?
o Index won’t help much – better for high cardinality columns (like
user IDs, emails).
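Here is a SQL Server-style sketch of the two index types (index and table names are illustrative; in MySQL's InnoDB the table is clustered on the primary key automatically, and Postgres only has secondary indexes):
CREATE CLUSTERED INDEX IX_Employees_EmployeeID ON Employees (EmployeeID);   -- reorders the table itself; one per table
CREATE NONCLUSTERED INDEX IX_Employees_Email ON Employees (Email);          -- separate structure with row pointers; many allowed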
⚡ Do you want me to also show you a SQL example of creating clustered
and non-clustered indexes, or should I directly move to the next question?
Perfect — got your preference. From now on I’ll explain each answer in detail,
include the internal reasons (what the DB engine is actually doing), and keep
the language simple.
I’ll continue with the next question now.
Q4 — How do you handle slow queries in SQL?
(Full explanation with internals, reasons, and practical fixes)
1) Quick summary (one-liner)
To fix a slow query: measure what the database is doing, find the root cause
(bad plan, missing index, stale stats, big IO, bad join), then apply the appropriate
fix (index, rewrite, update stats, partition, tune memory or schema). Repeat until
the plan is fast.
2) First principle — how the DB decides what to do
The query optimizer builds a query plan using:
Statistics (how many rows, distinct values, histograms),
Costs (estimated I/O, CPU, memory),
Available indexes and constraints.
If statistics are wrong or an index is missing, the optimizer picks a bad plan
(e.g., full table scan, nested-loop join when hash join would be better), and the
query becomes slow.
3) How to measure — reproduce & inspect
Always start by running the DB’s plan tool:
Postgres: EXPLAIN / EXPLAIN ANALYZE <query> (shows estimated vs
actual).
MySQL: EXPLAIN SELECT ... and EXPLAIN ANALYZE (newer versions).
SQL Server: Execution Plan, SET STATISTICS IO ON, SET STATISTICS TIME
ON.
What to look for in the plan:
Seq Scan / Table Scan — table fully read (bad when table is large).
Index Scan — index used (good for selective filters).
Join type — nested loop / hash / merge (some are expensive for large
inputs).
Row estimates vs actual rows — big mismatch → bad statistics →
wrong plan.
Spills to disk (sorts or hashes using temp files) → memory insufficient.
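A minimal Postgres-style example of measuring a query (table and column names are illustrative):
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC;
-- In the output: compare estimated rows vs actual rows, check whether an Index Scan
-- or a Seq Scan was chosen, and look for sorts whose Sort Method reports a disk spill.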
4) Common root causes, internals, and fixes
A. Missing or wrong indexes
Internals: Without an index, the engine must read every row (full table scan).
Fixes:
Create index on columns used in WHERE, JOIN ON, ORDER BY (when
selective).
Example: CREATE INDEX idx_orders_customerid ON orders(customer_id);
Use covering index when query can be answered from the index alone
(avoid fetching table rows).
Why it helps: Index narrows I/O to only relevant pages; less disk read,
less CPU.
B. Stale/incorrect statistics
Internals: Optimizer uses statistics (row counts, distinct values) to choose plan.
If stats are old, optimizer mis-estimates cost.
Fixes:
Refresh stats: ANALYZE table; (Postgres), UPDATE STATISTICS (SQL Server),
ANALYZE TABLE (MySQL).
Why it helps: Correct stats → better plan selection (e.g., pick hash join
instead of nested loop).
C. Non-sargable queries (functions on columns)
Internals: Using functions on a column (e.g., WHERE lower(email) = 'x')
prevents use of normal B-tree index.
Fixes:
Avoid function on column; transform constant instead. Or create a
functional index if DB supports it:
CREATE INDEX idx_lower_email ON users (lower(email));
Or store normalized value in a separate column (email_norm) and index
that.
Why it helps: Allows index seek instead of scanning all rows.
D. Bad JOIN strategies or join order
Internals: Joins can use nested-loop (fast for small outer), hash (build hash table
on smaller side), or merge (both sorted). If optimizer chooses the wrong one or
chooses a large table as inner, cost explodes.
Fixes:
Ensure join keys are indexed on the correct side.
Rewrite query (push predicates earlier), or use hints (sparingly) to guide
join type.
Why it helps: Proper join strategy reduces data shuffling and CPU.
E. Large sorts / GROUP BY spilling to disk
Internals: Sorting or hashing large datasets may exceed memory (work_mem /
sort_buffer), causing disk spills (very slow).
Fixes:
Increase memory for sort (work_mem in Postgres) for that session.
Create index matching the ORDER BY to avoid sorting.
Use partial aggregation or materialized views.
Why it helps: Keeps work in RAM and avoids slow disk I/O.
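A Postgres-style sketch of both fixes above (the memory value and index name are illustrative):
SET work_mem = '256MB';   -- session-level bump so the sort/hash stays in RAM
CREATE INDEX idx_orders_created_at ON orders (created_at DESC);   -- matches ORDER BY created_at DESC, so no sort is needed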
F. Scanning too many rows (low selectivity)
Internals: Index helps only when filter is selective (returns small fraction). For
low-selectivity columns (e.g., boolean), index may not help.
Fixes:
Reconsider indexing strategy: compound indexes (multi-column) for
combined filters work better.
Use partitioning to prune data (e.g., by date).
Why it helps: Narrow scanning to relevant partitions or combinations.
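A sketch of a compound index for a combined filter (names are illustrative):
CREATE INDEX idx_orders_status_date ON orders (order_status, order_date);
-- Helps queries like WHERE order_status = 'SHIPPED' AND order_date >= '2025-01-01'.
-- The low-cardinality status column alone is not selective, but combined with the date it is.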
G. Partitioning / pruning not used
Internals: With partitioned tables, optimizer can skip irrelevant partitions
(partition pruning). If query doesn’t filter on partition key, pruning fails.
Fixes:
Partition large tables by date or another logical key.
Ensure queries include the partition key (or use constraint exclusion).
Why it helps: Reduces scanned rows from billions to a small subset.
H. Correlated subqueries vs joins
Internals: Correlated subqueries execute per-row (expensive).
Fixes:
Convert correlated subqueries to joins or use LATERAL/WITH where
appropriate.
Why it helps: Single pass operations with joins are usually faster.
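A before/after sketch (table and column names are illustrative):
-- Correlated subquery: the inner SELECT runs once per customer row
SELECT c.customer_id,
       (SELECT MAX(o.order_date) FROM orders o WHERE o.customer_id = c.customer_id) AS last_order
FROM customers c;
-- Join rewrite: one aggregation pass over orders, then a single join
SELECT c.customer_id, o.last_order
FROM customers c
LEFT JOIN (
    SELECT customer_id, MAX(order_date) AS last_order
    FROM orders
    GROUP BY customer_id
) o ON o.customer_id = c.customer_id;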
I. Data skew in distributed systems (Spark, distributed DBs)
Internals: If one key has huge volume, one worker gets overloaded (hot
partition).
Fixes:
Use salting, custom partitioning, broadcast join for small side, or pre-
aggregate heavy keys.
Why it helps: Redistributes work evenly across nodes.
J. Locking / long transactions
Internals: Long-running transactions can hold locks, causing other queries to
wait.
Fixes:
Shorten transactions, use proper isolation level (read-committed), or use
snapshot reads if DB supports it.
Why it helps: Reduces blocking and waiting.
K. Resource/configuration & hardware limits
Internals: Poor memory, slow disks, small buffer pool lead to excessive I/O.
Fixes:
Tune DB config (buffer pool, work memory, parallel workers).
Add more RAM, faster disks (SSDs), or scale horizontally.
Why it helps: Decreases physical I/O and can speed many queries.
5) Practical diagnosis checklist (step-by-step)
1. Reproduce slowness with a test query.
2. Run EXPLAIN / EXPLAIN ANALYZE and inspect plan.
3. Check estimated vs actual rows — big mismatch → stale stats.
4. See if plan uses an index or does Seq Scan.
5. Check I/O vs CPU (is it disk-bound?).
6. Check for locks / long transactions.
7. Try rewriting the query (remove SELECT *, remove functions on columns).
8. Add or adjust indexes, update statistics.
9. Consider partitioning or materialized views for repeated heavy
queries.
10. Tune DB memory for sorts/hashes or change join strategy.
6) Simple example (real-life style)
Slow query:
SELECT *
FROM orders
WHERE LOWER(customer_email) = '[email protected]';
Problem: LOWER(customer_email) prevents index use → full table scan.
Fix options:
1. Store normalized email in a separate column and index it:
ALTER TABLE orders ADD COLUMN email_norm text;
UPDATE orders SET email_norm = lower(customer_email);
CREATE INDEX idx_orders_email_norm ON orders(email_norm);
-- query uses:
SELECT * FROM orders WHERE email_norm = '[email protected]';
2. Or create a functional index (if DB supports):
CREATE INDEX idx_orders_lower_email ON orders (lower(customer_email));
Why this helps: now DB can perform an index seek — it reads far fewer pages
and returns results much faster.
7) When to use materialized views or caching
If the query is an expensive aggregation run frequently and data freshness
allows: create a materialized view or pre-computed table and refresh
periodically. That trades off storage and update complexity for very fast reads.
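A Postgres-style sketch (view and column names are illustrative):
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, SUM(total) AS revenue
FROM orders
GROUP BY order_date;
REFRESH MATERIALIZED VIEW daily_sales;   -- run on a schedule that matches the freshness requirement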
8) When you can’t fix the query — scale
If the query is inherently heavy (big analytics over petabytes), consider:
Pre-aggregate data,
Use data warehouse optimized for analytic workloads (columnar storage,
vectorized execution),
Use distributed compute or OLAP engines.
9) Wrap-up (practical mindset)
Measure first (EXPLAIN).
Smallest change that fixes the cause — often index or stats.
Always understand why a query plan was chosen (stats/cardinality) — fix
the root not the symptom.
Want me to:
Re-answer the previous question (indexing) in this same in-depth style, or
Move to the next interview question and continue in this detailed +
simple format?
Perfect 👍 Let’s move to the next question with the same style (detailed, internal
reasoning, simple words).
Q5 — What is the difference between partitioning and bucketing in
Hive?
1) Quick summary (one-liner)
Partitioning = divide big table into sub-directories based on column
values → reduces scan.
Bucketing = divide data into fixed number of files using hashing →
balances data distribution and enables efficient joins.
Both aim to optimize query performance, but they solve different problems.
2) First principle — how Hive stores data
Hive tables are stored as files in HDFS.
When you query, Hive reads these files (MapReduce/Spark job under the hood).
If the table is huge (billions of rows) and you don’t organize data → Hive
may scan every file = slow.
Partitioning and bucketing are strategies to organize files, so Hive
reads less data or distributes work evenly.
3) Partitioning — explained
What it is
Partitioning means splitting a table into separate directories based on
values of a partition column.
Example: Partition by year and month → Hive stores files like:
/sales/year=2023/month=01/part-0001
/sales/year=2023/month=02/part-0002
/sales/year=2024/month=01/part-0003
When you query with filter:
SELECT * FROM sales WHERE year = 2023 AND month = 02;
Hive only scans that directory instead of all files.
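A HiveQL sketch of a partitioned table (names are illustrative; the dynamic-partition settings vary by Hive version):
CREATE TABLE sales (
    order_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
STORED AS ORC;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (year, month)
SELECT order_id, customer_id, amount, year, month FROM staging_sales;   -- each (year, month) lands in its own directory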
Internals (why it helps)
Hive uses partition pruning: only directories matching the filter are read.
Reduces I/O massively when partition filter is present.
Downsides
Too many small partitions (high cardinality columns like user_id) → too
many small files/directories → metadata overhead in Hive Metastore.
If query does not filter by partition column, Hive scans all partitions
(no gain).
4) Bucketing — explained
What it is
Bucketing divides data in each partition/table into a fixed number of files
(buckets) based on a hash function on a column.
Example:
CLUSTERED BY (customer_id) INTO 8 BUCKETS;
Hive computes hash(customer_id) % 8 → decides which bucket file to place the
row in.
Internals (why it helps)
Ensures even distribution of rows across files → avoids data skew.
Joins on bucketed columns are faster:
o If two tables are bucketed by the same column and same number of
buckets, Hive can do a map-side join (no shuffle needed).
Sampling is efficient:
SELECT * FROM customers TABLESAMPLE(BUCKET 1 OUT OF 8);
Reads only 1/8th of the data directly.
Downsides
Number of buckets is fixed at table creation (can’t change easily).
Data loading requires proper hashing → not as flexible as partitioning.
If bucket count is too high/low → performance issues (imbalanced files).
5) Partitioning vs Bucketing — Side by Side
Feature           | Partitioning                          | Bucketing
How data is split | By column values → directories        | By hash function → fixed number of files
Best for          | Reducing scan (filter queries)        | Optimizing joins, sampling, balance
Flexibility       | Easy to add/drop partitions           | Fixed number of buckets
Storage layout    | Each partition = folder in HDFS       | Each bucket = file inside folder
Downside          | Too many small partitions = overhead  | Hard to change #buckets later
6) Simple analogy
Partitioning = Library shelves labeled by year (only go to 2023 shelf if
searching for 2023 books).
Bucketing = Inside each shelf, books are evenly distributed into 8 boxes
using author’s name hash (so books are balanced for faster lookup).
7) Follow-up questions interviewer may ask
1. Can we use partitioning and bucketing together?
o Yes. Example: Partition by year and bucket by customer_id.
o Helps when you want both partition pruning + balanced
distribution inside each partition.
2. How to choose between them?
o Use partitioning when queries often filter by that column (date,
region).
o Use bucketing when you need balanced joins/sampling
(customer_id, product_id).
3. What happens if queries don’t filter by partition column?
o Hive scans all partitions → no benefit.
4. How does bucketing help joins?
o If two tables are bucketed on the same column and number of
buckets, Hive can perform bucketed map join (avoids shuffle).
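A sketch of the Hive settings and query shape typically involved (the exact flags vary by Hive version; table names are illustrative):
SET hive.optimize.bucketmapjoin = true;               -- allow bucket map joins
SET hive.optimize.bucketmapjoin.sortedmerge = true;   -- sort-merge variant when buckets are also sorted
SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;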
✅ Wrap-up:
Partitioning = directory-level split by column values (pruning)
Bucketing = file-level split by hashing (balanced joins)
Would you like me to go deeper into how Hive execution engine
(MapReduce/Tez/Spark) uses partitioning and bucketing internally
during query execution, or move to the next question (Q6: Write a SQL
query to find the second highest salary)?
Perfect 👍 Let’s move to Q6.
Q6 — Write a SQL query to find the second highest salary in an
employee table.
1) Problem Restated
We have a table Employees(emp_id, name, salary) and we need the second
highest salary.
This is a classic SQL interview question because:
It tests your knowledge of ordering, subqueries, ranking functions.
Also checks if you understand NULLs, duplicates, performance.
2) Different Approaches
Approach 1 — Using LIMIT / OFFSET (MySQL, Postgres, etc.)
SELECT DISTINCT salary
FROM Employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
👉 Explanation:
ORDER BY salary DESC → highest first.
LIMIT 1 OFFSET 1 → skip the first row (highest) and return next one.
DISTINCT ensures duplicate top salaries don’t cause wrong results.
⚠️ Downside: without DISTINCT, duplicate salaries at the top would make OFFSET 1 return the highest salary again; DISTINCT collapses duplicates, so the query skips one distinct value and returns the true second highest.
Approach 2 — Using Subquery
SELECT MAX(salary) AS second_highest_salary
FROM Employees
WHERE salary < (SELECT MAX(salary) FROM Employees);
👉 Explanation:
Inner query finds the maximum salary.
Outer query finds the maximum salary less than that → second
highest.
⚠️ Simple, but assumes at least two distinct salaries exist (with only one, it returns NULL).
Approach 3 — Using Window Functions (ANSI SQL Standard)
SELECT DISTINCT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM Employees
)t
WHERE rnk = 2;
👉 Explanation:
DENSE_RANK() assigns ranking based on salary.
Highest salary = rank 1, second highest = rank 2.
Handles ties properly → if two employees share highest salary, the next
one is rank 2.
✅ Most reliable and recommended in modern SQL (works in Postgres, Oracle,
SQL Server, Hive, Spark SQL, etc.).
Approach 4 — Using TOP (SQL Server)
SELECT TOP 1 salary
FROM (
SELECT DISTINCT TOP 2 salary
FROM Employees
ORDER BY salary DESC
)t
ORDER BY salary ASC;
👉 Explanation:
Select top 2 salaries, order descending.
From those, pick the lower one = second highest.
3) Internals — How DB executes
For Approach 2 (subquery):
1. First pass: compute max salary (SELECT MAX).
2. Second pass: scan table again to find max salary < first max.
For Approach 3 (window function):
1. DB sorts by salary (ORDER BY).
2. Assigns ranks (DENSE_RANK).
3. Filters rows with rank = 2.
✅ Efficient in analytic databases (columnar, partitioned).
4) Edge Cases Interviewer May Ask
1. What if there’s only one employee?
o Subquery/Window function → returns NULL.
o Must mention handling with COALESCE(..., 'Not Available').
2. What if multiple employees have same second highest salary?
o DENSE_RANK handles it automatically.
3. What if interviewer asks for N-th highest salary?
o Modify rnk = N in window function query.
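A sketch of the N-th highest variant with a NULL-safe wrapper (N = 3 here; the sentinel value 0 is illustrative):
SELECT COALESCE(
         (SELECT DISTINCT salary
          FROM (
              SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
              FROM Employees
          ) t
          WHERE rnk = 3),
         0) AS third_highest_salary;   -- returns 0 when fewer than 3 distinct salaries exist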
5) Recommended Answer in Interview
If interviewer allows modern SQL → Window Function (Approach 3) is the
best, most robust, and scalable answer.
If they want old-school SQL (MySQL, basic) → Subquery (Approach 2).
✅ Wrap-up:
LIMIT/OFFSET: quick, but not portable.
Subquery: simple, but less flexible.
Window Function (DENSE_RANK): clean, handles ties, scalable → best.
Would you like me to continue to Q7 (How to handle data skew in joins?)
with the same detailed breakdown, or should I show you some more tricky
variations of “second highest salary” that interviewers often ask?
Great 👍 Let’s move to the next one.
Q6. What is a Data Lake and how is it different from a Data Warehouse?
Answer (Interview-ready, detailed + simple words):
👉 Data Lake:
A storage repository that can hold raw data in any format (structured,
semi-structured, unstructured).
Example: Logs, images, JSON, XML, CSV, IoT sensor data.
Data is stored in its original format until it’s needed.
Technologies: HDFS, AWS S3, Azure Data Lake, GCP Cloud Storage.
👉 Data Warehouse:
A centralized system optimized for analytics and reporting.
Stores structured, cleaned, transformed data.
Example: Sales dashboards, BI reports, financial analytics.
Technologies: Snowflake, Redshift, BigQuery, Teradata.
Key Differences (Internals):
1. Schema Approach:
o Data Lake → Schema-on-Read
Data is dumped first; the schema is applied only when you query.
Example: you dump logs as JSON, and at query time you define how the JSON should be read (see the sketch after this list).
o Data Warehouse → Schema-on-Write
Data is cleaned + structured before loading.
Example: ETL pipeline ensures only proper tables and
columns go into warehouse.
2. Type of Data:
o Data Lake → Can handle all data types.
o Data Warehouse → Only structured/tabular data.
3. Cost & Storage:
o Data Lake → Cheap (stores everything in raw form, often on cloud
object storage).
o Data Warehouse → Expensive (optimized storage, query
performance).
4. Performance:
o Data Lake → Slower queries (raw, unoptimized data).
o Data Warehouse → Fast queries (indexes, pre-computed
aggregations).
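A HiveQL sketch of schema-on-read (the SerDe class availability and the storage path are assumptions for illustration): the raw JSON already sits in the lake, and the schema is declared only when an external table is laid over it.
CREATE EXTERNAL TABLE raw_clicks (
    user_id STRING,
    page_url STRING,
    event_ts STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-data-lake/raw/clicks/';   -- data stays in its original JSON form until queried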
Analogy (Simple Words):
Data Lake = Your entire house store room. You dump everything
(clothes, tools, toys, documents). No structure, but cheap and flexible.
Data Warehouse = A library. Every book is arranged in sections,
catalogued, indexed. Perfect for fast search, but only books (structured
data), no random stuff like clothes.
Follow-up Questions Interviewer May Ask:
1. Can you use a Data Lake + Data Warehouse together?
o Yes → Modern companies use a Lakehouse Architecture
(Databricks, Snowflake with external tables).
o Store raw data in lake, move transformed data into warehouse.
2. What problems can arise with a Data Lake?
o If governance is missing → it becomes a Data Swamp (nobody
knows what data exists, duplicate, inconsistent formats).
3. Give real-world examples:
o Data Lake: Storing raw customer clicks, social media images.
o Data Warehouse: Generating a sales report of Q1 revenue.
👉 Would you like me to go deeper into Lakehouse architecture (modern
trend) as the next part, or should I continue to Q7 (ETL vs ELT)?
Great 👍 Let’s continue.
Q7. Explain Data Partitioning. Why do we use it in Data Engineering
pipelines?
Answer (Detailed + Simple Explanation):
👉 What is Partitioning?
Partitioning means breaking a large dataset into smaller, manageable
pieces (called partitions) based on some column values (e.g., date, region,
customer ID).
Each partition is stored separately (different folder, file, or block).
Instead of scanning the entire dataset, queries read only the relevant
partitions.
Why Do We Use Partitioning?
1. Performance Boost (Less Scanning):
o Without partitioning → query engine scans all data.
o With partitioning → only the relevant partition is scanned.
Example: A table with 1 TB of logs, partitioned by date.
o Query: WHERE date = '2025-08-19' → scans only 1 day’s data, not 1
TB.
2. Parallelism:
o Each partition can be processed in parallel across nodes in
Spark/Hive/BigQuery.
o Improves throughput in distributed systems.
3. Better Data Organization:
o Partitions are stored in a structured directory format.
Example in Hive/Spark:
logs/
  date=2025-08-17/
  date=2025-08-18/
  date=2025-08-19/
Easy to manage + optimized queries.
Types of Partitioning (Internals):
1. Range Partitioning:
o Data divided by a range of values.
Example: Sales by year (2019, 2020, 2021).
2. Hash Partitioning:
o Data distributed based on a hash function of a column.
Example: Hash of customer_id decides partition → balances load.
3. List Partitioning:
o Partitions created for specific values.
Example: region = 'APAC', region = 'EU'.
4. Composite Partitioning:
o Combination of two partitioning techniques.
Example: First by region, then by year.
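A Postgres-style sketch of range and list partitioning (table names and boundaries are illustrative):
CREATE TABLE sales (
    sale_id BIGINT,
    region TEXT,
    sale_date DATE,
    amount NUMERIC
) PARTITION BY RANGE (sale_date);
CREATE TABLE sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');   -- range partition: one year of data
CREATE TABLE customers (
    customer_id BIGINT,
    region TEXT
) PARTITION BY LIST (region);
CREATE TABLE customers_apac PARTITION OF customers
    FOR VALUES IN ('APAC');   -- list partition: specific values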
Drawbacks of Partitioning:
Too many partitions → overhead in metadata management.
Skewed partitions → one partition is very large, others small (causes
slow queries).
Requires good choice of partition key – otherwise performance gain is
minimal.
Analogy (Simple):
Think of a library with 1 million books.
Without partitioning → You search every shelf for "Data Engineering"
books.
With partitioning → Books are organized by subject (Computer Science,
Math, History). You only check Computer Science → much faster.
Follow-up Questions Interviewer May Ask:
1. How do partitioning and bucketing differ in Hive?
o Partitioning = splits data into directories (prunes irrelevant data).
o Bucketing = splits data inside partitions into a fixed number of buckets (helps joins).
2. What happens if you choose a bad partition key?
o Example: Partition by gender (M/F) → only 2 partitions → queries still
scan a lot.
3. How does Spark optimize partitioned queries?
o Spark uses partition pruning → only reads partitions needed.
👉 Do you want me to move to the next question (Bucketing in Hive/Spark),
since it’s a natural continuation after partitioning?
Perfect 👍 Let’s continue.
Q8 — What is Bucketing in Hive/Spark and why do we use it?
1) Quick Overview
Bucketing is a way to divide a table into a fixed number of files
(buckets) based on a hash function on a column.
Unlike partitioning, bucketing does not create directories, it creates
multiple files inside a table or partition.
Mainly used to balance data, optimize joins, and enable sampling.
2) How Bucketing Works (Internals)
1. Choose a column to bucket on (e.g., customer_id).
2. Decide number of buckets (e.g., 8 buckets).
3. Hash each row based on the column: bucket_number = hash(customer_id) % 8
4. The row goes into the corresponding bucket file.
Example in Hive:
CREATE TABLE customers_bucketed(
customer_id INT,
name STRING,
email STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;
Hive will create 8 files for this table, distributed based on customer_id.
3) Why Bucketing Helps (Internals)
1. Balanced Data Distribution:
o Large tables may have skewed keys → some nodes get more data →
slower processing.
o Bucketing ensures roughly equal rows per file → parallel
processing is balanced.
2. Optimized Joins:
o If two tables are bucketed on the same column and number of
buckets, Hive/Spark can do bucketed map joins.
o No shuffle needed → faster join.
3. Efficient Sampling:
o Query only 1 out of 8 buckets → reduces scan time:
o SELECT * FROM customers_bucketed TABLESAMPLE(BUCKET 1 OUT OF 8);
4) Difference Between Partitioning and Bucketing
Feature           | Partitioning                     | Bucketing
How data is split | By column values → directories   | By hash function → files inside table
Best for          | Reducing scan (filters)          | Balanced joins, sampling
Flexibility       | Can add/drop partitions easily   | Fixed number of buckets
Metadata          | Each partition = metadata entry  | Buckets stored in table metadata
Performance       | Good for pruning                 | Good for parallelism & joins
5) Drawbacks / Considerations
Fixed number of buckets → changing number requires rebuilding table.
Skewed hash → uneven distribution if hash function is poor.
Requires careful selection of bucket column (high-cardinality columns
preferred).
6) Analogy (Simple Words)
Partitioning = splitting books in a library by floor (genre).
Bucketing = inside each floor, books are split into boxes by author’s
hash → easier to retrieve, evenly distributed.
7) Follow-up Questions Interviewer May Ask
1. Can we use partitioning and bucketing together?
o Yes. Partition by year, bucket by customer_id → combine filtering
+ balanced joins.
2. How does bucketing improve join performance in Spark/Hive?
o Bucketed columns allow map-side joins without shuffling data.
3. What column should you choose for bucketing?
o High-cardinality columns → ensures even distribution.
✅ Wrap-up:
Partitioning = directory-level split (filtering)
Bucketing = file-level split (balanced processing & joins)
Both are used together in large-scale data pipelines for performance.
I can continue with Q9 — Difference between ETL and ELT, which is another
common data engineering question.
Do you want me to go ahead?