BigQuery Data Engineer Interview Questions (3+ Years Experience)
Core BigQuery Concepts
1. What are the different types of tables in BigQuery?
- Standard table
- Partitioned table
- Clustered table
- External table
- Temporary table
- Materialized view
2. How does BigQuery store and query data?
- Columnar storage
- Dremel execution engine
- Massively parallel processing (MPP)
3. What is the difference between partitioning and clustering?
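- Partitioning: Splits a table into segments by one column (date/timestamp or integer range) or by ingestion time
- Clustering: Sorts rows by up to four columns, within each partition if the table is also partitioned
- Both reduce bytes scanned, which lowers query cost and improves performance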
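A minimal sketch of how the two features combine in DDL, written as a small Python helper that assembles the statement. The table name, column names, and the helper itself are hypothetical and only illustrate the PARTITION BY / CLUSTER BY shape.

```python
def build_create_table_ddl(table, columns, partition_expr, cluster_cols):
    """Return a CREATE TABLE statement with partitioning and optional clustering."""
    # BigQuery allows at most four clustering columns per table.
    if len(cluster_cols) > 4:
        raise ValueError("at most four clustering columns are allowed")
    col_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = (
        f"CREATE TABLE `{table}` (\n  {col_defs}\n)\n"
        f"PARTITION BY {partition_expr}"
    )
    if cluster_cols:
        ddl += "\nCLUSTER BY " + ", ".join(cluster_cols)
    return ddl


# Hypothetical table: partition by day, cluster by customer within each day.
ddl = build_create_table_ddl(
    "my_project.sales.orders",
    [("order_id", "STRING"), ("customer_id", "STRING"), ("order_ts", "TIMESTAMP")],
    "DATE(order_ts)",
    ["customer_id"],
)
print(ddl)
```

A query that filters on DATE(order_ts) can then skip whole partitions, and a filter on customer_id can skip blocks inside the partitions that remain.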
4. How would you implement incremental loading in BigQuery?
- Use MERGE statement
- Load only rows whose updated_at is newer than the last loaded watermark
- Use audit columns or a metadata tracking table
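One way to sketch the MERGE-plus-watermark pattern is a helper that generates the statement. The function name, table names, and the @last_watermark query parameter are assumptions for illustration; the watermark value would come from the audit column or metadata table mentioned above.

```python
def build_merge_sql(target, staging, key_cols, update_cols, watermark_col="updated_at"):
    """Build a MERGE that upserts only rows newer than the last watermark."""
    all_cols = key_cols + update_cols + [watermark_col]
    col_list = ", ".join(all_cols)
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_cols)
    set_clause = ", ".join(f"{c} = S.{c}" for c in update_cols + [watermark_col])
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    return (
        f"MERGE `{target}` T\n"
        f"USING (\n"
        # Only read staging rows changed since the previous run.
        f"  SELECT {col_list} FROM `{staging}`\n"
        f"  WHERE {watermark_col} > @last_watermark\n"
        f") S\n"
        f"ON {on_clause}\n"
        # Skip updates when the target already holds a newer version.
        f"WHEN MATCHED AND S.{watermark_col} > T.{watermark_col} THEN\n"
        f"  UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({col_list}) VALUES ({insert_vals})"
    )


# Hypothetical tables and columns.
merge_sql = build_merge_sql(
    "my_project.dw.customers",
    "my_project.staging.customers",
    key_cols=["customer_id"],
    update_cols=["name", "email"],
)
print(merge_sql)
```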
SQL & Query Optimization
5. How do you optimize a slow BigQuery query?
- Inspect the query plan (Execution details / execution graph in the console; BigQuery has no EXPLAIN statement)
- Avoid SELECT *
- Filter on partition column
- Use clustering
- Break queries into stages with temp tables
6. What does the WITH clause do in BigQuery?
- Common Table Expressions (CTEs)
- Helps modularize and simplify queries
7. How do you avoid scanning too much data?
- Use partition filters
- Select only required columns
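- Do not rely on LIMIT to cut cost: on non-clustered tables it does not reduce bytes scanned; use table preview or TABLESAMPLE for testing
- Use bq query --dry_run to estimate bytes scanned before running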
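A dry run reports bytes that would be processed, not a price. A rough sketch of turning that byte count into an on-demand cost estimate follows; the price per TiB is a made-up placeholder, so check the current rate for your region.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_on_demand_cost(bytes_processed, price_per_tib):
    """Convert a dry-run byte count into an approximate on-demand cost."""
    return bytes_processed / TIB * price_per_tib


# Hypothetical inputs: a dry run reporting 512 GiB, at an assumed 5.00 per TiB.
dry_run_bytes = 512 * 1024 ** 3
cost = estimate_on_demand_cost(dry_run_bytes, price_per_tib=5.0)
print(f"Estimated cost: {cost:.2f}")
```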
Pipeline Design & ETL
8. Explain a pipeline you built using BigQuery.
- Example: GCS -> Staging Table -> Transform with SQL -> Final Table
- Orchestrated using Airflow
- Stored procedures for modular logic
9. How do you handle schema evolution in BigQuery?
- Use ALTER TABLE to add columns
- Avoid SELECT *
- Backfill new columns for existing rows, or set default values
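As an illustration of the add-columns approach, the sketch below compares the current table schema with an incoming one and emits ALTER TABLE statements for the new columns. The schemas, table name, and helper are hypothetical; type changes and dropped columns are deliberately out of scope.

```python
def build_add_column_statements(table, current_schema, incoming_schema):
    """Return ALTER TABLE statements for columns present only in incoming_schema.

    Both schemas are dicts of column name -> BigQuery type.
    """
    statements = []
    for name, dtype in incoming_schema.items():
        if name not in current_schema:
            statements.append(
                f"ALTER TABLE `{table}` ADD COLUMN IF NOT EXISTS {name} {dtype}"
            )
    return statements


# Hypothetical schemas: the source started sending a new "channel" column.
current = {"order_id": "STRING", "amount": "NUMERIC"}
incoming = {"order_id": "STRING", "amount": "NUMERIC", "channel": "STRING"}
alter_statements = build_add_column_statements("my_project.dw.orders", current, incoming)
for stmt in alter_statements:
    print(stmt)
```

Added columns are NULL for existing rows until they are backfilled.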
10. Have you worked with dbt or Airflow?
- Yes: Used BigQueryInsertJobOperator in Airflow
- dbt for SQL model management, testing, documentation
11. How do you track BigQuery job failures?
- Query INFORMATION_SCHEMA.JOBS and filter on error_result
- Use Cloud Logging
- Alerts via Airflow callbacks
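A sketch of the INFORMATION_SCHEMA.JOBS approach: a helper that builds a query listing recently failed jobs. The region qualifier and lookback window are assumptions to adjust for your project; the helper name is hypothetical.

```python
def build_failed_jobs_query(region="region-us", lookback_hours=24):
    """Build a query listing jobs that ended with an error in the lookback window."""
    hours = int(lookback_hours)  # guard against non-numeric input in the SQL text
    return (
        "SELECT job_id, user_email, creation_time,\n"
        "       error_result.reason AS reason,\n"
        "       error_result.message AS message\n"
        f"FROM `{region}`.INFORMATION_SCHEMA.JOBS\n"
        "WHERE creation_time >= "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours} HOUR)\n"
        # error_result is only populated for jobs that failed.
        "  AND error_result IS NOT NULL\n"
        "ORDER BY creation_time DESC"
    )


failed_jobs_sql = build_failed_jobs_query(lookback_hours=6)
print(failed_jobs_sql)
```

The resulting query can be scheduled and its output fed into whatever alerting channel the team already uses.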
Cost Management & Security
12. How is BigQuery pricing calculated?
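- Storage cost: by data volume per month, with a lower rate for long-term storage (data unmodified for 90 days)
- Query cost: on-demand (per TiB scanned) or capacity-based (reserved slots; flat-rate has been replaced by BigQuery editions)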
13. How do you reduce BigQuery costs?
- Partition & cluster tables
- Use --dry_run to check bytes scanned before running expensive queries
- Materialized views
- Archive or expire unused data (table and partition expiration)
14. How would you secure a BigQuery dataset?
- IAM roles: e.g., roles/bigquery.dataViewer, roles/bigquery.dataEditor (grant least privilege)
- Dataset-level access controls
- Column-level and row-level security
Scenario & Behavioral Questions
15. Tell me about a time you fixed a broken pipeline.
- Describe: Issue -> Root cause -> Resolution -> Preventive step
16. How do you monitor data quality in BigQuery?
- Data validation queries
- dbt tests
- Airflow sensors or alerts
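The validation queries above usually boil down to a few counts. The sketch below shows the same logic in plain Python on in-memory sample rows, assuming a key column that should be unique and a list of required columns; in practice the equivalent checks would run as SQL or dbt tests against the table.

```python
def run_quality_checks(rows, key, required):
    """Return row count, number of duplicate keys, and NULL counts per required column."""
    keys = [r[key] for r in rows]
    duplicate_keys = len(keys) - len(set(keys))
    null_counts = {
        col: sum(1 for r in rows if r.get(col) is None) for col in required
    }
    return {
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "null_counts": null_counts,
    }


# Hypothetical sample: one duplicated order_id and one missing amount.
sample_rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]
report = run_quality_checks(sample_rows, key="order_id", required=["amount"])
print(report)
```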
17. How do you test BigQuery transformations?
- Unit tests on sample data
- Staging vs final table validation
- Use assertions or row comparisons
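A small sketch of the row-comparison idea: given expected rows and the rows a transformation actually produced (for example, fetched from a test dataset), report what is missing and what is unexpected. The helper and sample data are hypothetical, and values are assumed to be hashable.

```python
from collections import Counter


def diff_rows(expected, actual):
    """Compare two lists of row dicts as multisets, ignoring row order."""
    # Normalize each row to a sorted tuple of (column, value) pairs.
    exp = Counter(tuple(sorted(r.items())) for r in expected)
    act = Counter(tuple(sorted(r.items())) for r in actual)
    missing = list((exp - act).elements())      # expected but not produced
    unexpected = list((act - exp).elements())   # produced but not expected
    return missing, unexpected


# Hypothetical fixture: the transformation got one total wrong.
expected_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
actual_rows = [{"id": 2, "total": 25}, {"id": 1, "total": 10}]
missing, unexpected = diff_rows(expected_rows, actual_rows)
print("missing:", missing)
print("unexpected:", unexpected)
```

Using a multiset rather than a set means duplicated rows are caught as well.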
Bonus Advanced Questions
- How does BigQuery handle joins internally? Broadcast vs shuffle joins?
- Difference between TEMP tables, CTEs, and materialized views?
- How do you handle late-arriving data in partitioned tables?
- What are the performance implications of using UNNEST()?