Practice Test One
Question 1
Correct
Which of the following commands can a data engineer use to compact small data files
of a Delta table into larger ones?
PARTITION BY
ZORDER BY
COMPACT
VACUUM
Overall explanation
Delta Lake can improve the speed of read queries from a table. One way to improve
this speed is by compacting small files into larger ones. You trigger compaction by
running the OPTIMIZE command.
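For example, a minimal sketch run from a notebook (the table name sales and the Z-order column are assumptions):
# Compact the small data files of the Delta table into larger ones
spark.sql("OPTIMIZE sales")
# Optionally co-locate related data in the same files to further speed up reads
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")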
Reference: https://docs.databricks.com/sql/language-manual/delta-optimize.html
Lecture
Hands-on
Domain
Databricks Lakehouse Platform
Question 2
Correct
A data engineer is trying to use Delta time travel to roll back a table to a previous version, but the data engineer received an error that the data files are no longer present.
Which of the following commands was run on the table, causing the data files to be deleted?
OPTIMIZE
ZORDER BY
DEEP CLONE
DELETE
Overall explanation
Running the VACUUM command on a Delta table deletes the unused data files older
than a specified data retention period. As a result, you lose the ability to time
travel back to any version older than that retention threshold.
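For example, a minimal sketch (the table name sales is an assumption):
# Remove unused data files older than the retention period (default: 7 days)
spark.sql("VACUUM sales")
# An explicit retention period can also be given in hours
spark.sql("VACUUM sales RETAIN 168 HOURS")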
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Lecture
Hands-on
Domain
Databricks Lakehouse Platform
Question 3
Correct
In Delta Lake tables, which of the following is the primary format for the data
files?
Delta
JSON
Hive-specific format
Overall explanation
Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Parquet format, along with a transaction log in JSON format.
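A quick way to see this from a notebook (the path below assumes a managed table named sales in the default warehouse directory):
# The table directory contains *.parquet data files plus a _delta_log/ folder of JSON commit files
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/sales"))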
Reference: https://docs.databricks.com/delta/index.html
Lecture
Hands-on
Domain
Databricks Lakehouse Platform
Question 4
Correct
Which of the following locations hosts the Databricks web application?
Data plane
Databricks Filesystem
Databricks-managed cluster
Overall explanation
According to the Databricks Lakehouse architecture, the Databricks workspace is deployed in the control plane, along with Databricks services such as the web application (UI), the cluster manager, the workflow service, and notebooks.
Reference: https://docs.databricks.com/getting-started/overview.html
Lecture
Domain
Databricks Lakehouse Platform
Question 5
Correct
In Databricks Repos (Git folders), which of the following operations can a data engineer use to update the local version of a repo from its remote Git repository?
Clone
Commit
Merge
Push
Overall explanation
The Git pull operation is used to fetch and download content from a remote repository and immediately update the local repository to match that content.
References:
https://docs.databricks.com/repos/index.html
https://github.com/git-guides/git-pull
Hands-on
Domain
Databricks Lakehouse Platform
Question 6
Correct
According to the Databricks Lakehouse architecture, which of the following is
located in the customer's cloud account?
Notebooks
Repos
Workflows
Overall explanation
When the customer sets up a Spark cluster, the cluster virtual machines are
deployed in the data plane in the customer's cloud account.
Reference: https://docs.databricks.com/getting-started/overview.html
Lecture
Domain
Databricks Lakehouse Platform
Question 7
Correct
Which of the following best describes Databricks Lakehouse?
Platform that helps reduce the costs of storing organization’s open-format data
files in the cloud.
Overall explanation
Databricks Lakehouse is a unified analytics platform that combines the best
elements of data lakes and data warehouses. So, in the Lakehouse, you can work on
data engineering, analytics, and AI, all in one platform.
Reference: https://www.databricks.com/glossary/data-lakehouse
Lecture
Domain
Databricks Lakehouse Platform
Question 8
Correct
If the default notebook language is SQL, which of the following options can a data engineer use to run Python code in this SQL notebook?
This is not possible! They need to change the default language of the notebook to
Python
Databricks detects each cell's language automatically, so they can write Python syntax in any cell
They can add %language magic command at the start of a cell to force language
detection.
Overall explanation
By default, cells use the default language of the notebook. You can override the
default language in a cell by using the language magic command at the beginning of
a cell. The supported magic commands are: %python, %sql, %scala, and %r.
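For example, a cell in a SQL notebook can run Python like this (the printed message is just illustrative):
%python
print("This cell runs Python inside a SQL notebook")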
Reference: https://docs.databricks.com/notebooks/notebooks-code.html
Hands-on
Domain
Databricks Lakehouse Platform
Question 9
Correct
Which of the following tasks is not supported by Databricks Repos (Git folders) and must be performed in your Git provider?
Overall explanation
The following tasks are not supported by Databricks Repos, and must be performed in
your Git provider:
Delete branches
* NOTE: Recently, merging and rebasing branches have become supported in Databricks Repos. However, this may not yet be reflected in the current exam version.
Reference: https://docs.databricks.com/repos/index.html
Hands-on
Domain
Databricks Lakehouse Platform
Question 10
Correct
Which of the following statements is Not true about Delta Lake?
Overall explanation
It is not true that Delta Lake builds upon XML format. It builds upon Parquet and
JSON formats
Reference: https://docs.databricks.com/delta/index.html
Lecture
Hands-on
Domain
Databricks Lakehouse Platform
Question 11
Correct
How long is the default retention period of the VACUUM command?
0 days
30 days
90 days
365 days
Overall explanation
By default, the retention threshold of the VACUUM command is 7 days. This means that the VACUUM operation will not delete files less than 7 days old, to ensure that no long-running operations are still referencing any of the files to be deleted.
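For example, a minimal sketch (the table name sales is an assumption); DRY RUN only lists the files that would be removed under the default 7-day threshold:
spark.sql("VACUUM sales DRY RUN")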
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Lecture
Hands-on
Domain
Databricks Lakehouse Platform
Question 12
Incorrect
The data engineering team has a Delta table called employees that contains the employees' personal information, including their gross salaries.
Which of the following code blocks will keep in the table only the employees having a salary greater than 3000?
SELECT CASE WHEN salary <= 3000 THEN DELETE ELSE UPDATE END FROM employees;
Correct answer
DELETE FROM employees WHERE salary <= 3000;
Overall explanation
In order to keep only the employees having a salary greater than 3000, we must delete the employees having a salary less than or equal to 3000. To do so, use the DELETE statement:
DELETE FROM employees WHERE salary <= 3000;
Reference: https://docs.databricks.com/sql/language-manual/delta-delete-from.html
Domain
ELT with Spark SQL and Python
Question 13
Correct
A data engineer wants to create a relational object by pulling data from two tables. The relational object must be used by other data engineers in other sessions on the same cluster only. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
Temporary view
External table
Managed table
View
Overall explanation
In order to avoid copying and storing physical data, the data engineer must create
a view object. A view in databricks is a virtual table that has no physical data.
It’s just a saved SQL query against actual tables.
The view type should be Global Temporary view that can be accessed in other
sessions on the same cluster. Global Temporary views are tied to a cluster
temporary database called global_temp.
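A minimal sketch (the table, column, and view names are assumptions):
# A global temporary view is registered under the cluster-scoped global_temp schema
spark.sql("""
  CREATE GLOBAL TEMPORARY VIEW employee_dept AS
  SELECT e.id, e.name, d.dept_name
  FROM employees e
  JOIN departments d ON e.dept_id = d.dept_id
""")
# Other sessions on the same cluster can query it through the global_temp schema
spark.sql("SELECT * FROM global_temp.employee_dept").show()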
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-
create-view.html
Lecture
Hands-on
Domain
ELT with Spark SQL and Python
Question 14
Correct
A data engineer has developed a code block to completely reprocess data based on
the following if-condition in Python:
Which of the following changes should be made to the code block to fix this error?
Overall explanation
Python if-statement syntax:
if <expr>:
    <statement>
Comparison operators:
Equals: a == b
Not Equals: a != b
Less/greater than: <, <=, >, >=
To combine conditional statements, you can use the logical operators and, or.
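For example (variable names and values are assumptions):
total_count = 1500
source = "daily_feed"
# Combine a comparison operator with a logical operator in an if-condition
if total_count > 1000 and source == "daily_feed":
    print("Reprocessing data")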
Reference: https://www.w3schools.com/python/python_conditions.asp
Domain
ELT with Spark SQL and Python
Question 15
Incorrect
Fill in the blank below to successfully create a table in Databricks using data from an existing PostgreSQL database:
DELTA
dbserver
cloudfiles
Overall explanation
Using the JDBC library, Spark SQL can extract data from any existing relational database that supports JDBC. Examples include MySQL, PostgreSQL, SQLite, and more.
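A minimal sketch of such a statement via spark.sql (the connection details, credentials, and table names are assumptions):
spark.sql("""
  CREATE TABLE employees_pg
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:postgresql://dbserver:5432/hr_db",
    dbtable "employees",
    user "<username>",
    password "<password>"
  )
""")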
Reference: https://learn.microsoft.com/en-us/azure/databricks/external-data/jdbc
Lecture
Domain
ELT with Spark SQL and Python
Question 16
Correct
Which of the following commands can a data engineer use to create a new table along
with a comment?
Overall explanation
Syntax:
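For example, a minimal sketch via spark.sql (the table name, columns, and comment text are assumptions):
spark.sql("""
  CREATE TABLE payments (
    payment_id INT,
    amount DOUBLE
  )
  COMMENT 'This table contains sensitive payment information'
""")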
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-
create-table-using.html
Lecture
Domain
ELT with Spark SQL and Python
Question 17
Correct
A junior data engineer usually uses the INSERT INTO command to write data into a Delta table. A senior data engineer suggested using another command that avoids writing duplicate records.
Which of the following commands is the one suggested by the senior data engineer?
UPDATE
COPY INTO
INSERT OR OVERWRITE
Overall explanation
MERGE INTO allows you to merge a set of updates, insertions, and deletions from a source table into a target Delta table. With MERGE INTO, you can avoid inserting duplicate records when writing into Delta tables.
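A minimal sketch of the deduplicating insert pattern (the source table employees_updates and the key column employee_id are assumptions):
spark.sql("""
  MERGE INTO employees t
  USING employees_updates s
  ON t.employee_id = s.employee_id
  WHEN NOT MATCHED THEN INSERT *
""")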
References:
https://docs.databricks.com/sql/language-manual/delta-merge-into.html
https://docs.databricks.com/delta/merge.html#data-deduplication-when-writing-into-
delta-tables
Hands-on
Domain
ELT with Spark SQL and Python
Question 18
Correct
A data engineer is designing a Delta Live Tables pipeline. The source system generates files containing changes captured in the source data. Each change event has metadata indicating whether the specified record was inserted, updated, or deleted, in addition to a timestamp column indicating the order in which the changes happened. The data engineer needs to update a target table based on these change events.
Which of the following commands can the data engineer use to best solve this
problem?
MERGE INTO
UPDATE
COPY INTO
cloud_files
Overall explanation
The events described in the question represent a Change Data Capture (CDC) feed. CDC is logged at the source as events that contain both the data of the records and metadata information:
An Operation column indicating whether the specified record was inserted, updated, or deleted
A Sequence column, usually a timestamp, indicating the order in which the changes happened
You can use the APPLY CHANGES INTO statement to use the Delta Live Tables CDC functionality.
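For reference, a minimal Python sketch of the corresponding Delta Live Tables CDC API, dlt.apply_changes (the dataset, key, and column names are assumptions):
import dlt
from pyspark.sql.functions import col, expr

# Target streaming table updated from the change feed
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target = "customers",
    source = "customers_cdc_raw",                     # dataset containing the change events
    keys = ["customer_id"],                           # key used to match records
    sequence_by = col("event_timestamp"),             # ordering of the change events
    apply_as_deletes = expr("operation = 'DELETE'")   # treat these events as deletes
)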
Reference: https://docs.databricks.com/workflows/delta-live-tables/delta-live-
tables-cdc.html
Lecture
Hands-on
Domain
ELT with Spark SQL and Python
Question 19
Correct
In PySpark, which of the following commands can you use to query the Delta table
employees created in Spark SQL?
spark.sql("employees")
spark.format("sql").read("employees")
Overall explanation
The spark.table() function returns the specified Spark SQL table as a PySpark DataFrame.
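For example:
df = spark.table("employees")   # returns the Delta table employees as a PySpark DataFrame
display(df)                     # or df.show()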
Reference:
https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/
session.html#SparkSession.table
Hands-on
Domain
ELT with Spark SQL and Python
Question 20
Correct
Which of the following code blocks can a data engineer use to create a user defined
function (UDF)?
RETURN value +1
RETURNS INTEGER
RETURNS INTEGER
RETURNS INTEGER
value +1;
Overall explanation
The correct syntax to create a UDF is:
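For example, a minimal sketch via spark.sql (the function name plus_one is an assumption):
spark.sql("""
  CREATE FUNCTION plus_one(value INTEGER)
  RETURNS INTEGER
  RETURN value + 1
""")
spark.sql("SELECT plus_one(5)").show()   # returns 6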
Reference: https://docs.databricks.com/udf/index.html
Hands-on
Domain
ELT with Spark SQL and Python
Question 21
Correct
When dropping a Delta table, which of the following explains why only the table's
metadata will be deleted, while the data files will be kept in storage?
The user running the command has no permission to delete the data files
Delta prevents deleting files less than retention threshold, just to ensure that no
long-running operations are still referencing any of the files to be deleted
Overall explanation
External (unmanaged) tables are tables whose data is stored in an external storage
path by using a LOCATION clause.
When you run DROP TABLE on an external table, only the table's metadata is deleted,
while the underlying data files are kept.
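A minimal sketch (the table name and LOCATION path are assumptions):
# Create an external table whose data lives in an external storage path
spark.sql("""
  CREATE TABLE sales_ext (id INT, amount DOUBLE)
  USING DELTA
  LOCATION 'dbfs:/mnt/external/sales'
""")
# Dropping it removes only the metadata; the files under the LOCATION path remain
spark.sql("DROP TABLE sales_ext")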
Reference: https://docs.databricks.com/lakehouse/data-objects.html#what-is-an-
unmanaged-table
Lecture
Hands-on
Domain
ELT with Spark SQL and Python
Question 22
Correct
Given the two tables students_course_1 and students_course_2, which of the following commands can a data engineer use to get all the students from the above two tables without duplicate records?
Syntax:
subquery1
UNION [ ALL | DISTINCT ]
subquery2
If DISTINCT is specified the result does not contain any duplicate rows. This is
the default.
Note that both subqueries must have the same number of columns and share a least
common type for each respective column.
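Applied to the tables from the question:
# UNION (DISTINCT is the default) removes duplicate rows across the two result sets
spark.sql("""
  SELECT * FROM students_course_1
  UNION
  SELECT * FROM students_course_2
""").show()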
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-
select-setops.html
Hands-on
Domain
ELT with Spark SQL and Python
Question 23
Correct
Given the following command:
dbfs:/user/hive/db_hr
dbfs:/user/hive/databases/db_hr.db
dbfs:/user/hive/databases
dbfs:/user/hive
Overall explanation
Since we are creating the database here without specifying a LOCATION clause, the
database will be created in the default warehouse directory under
dbfs:/user/hive/warehouse
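A minimal sketch (the schema name db_hr comes from the answer options; the CREATE statement itself is assumed):
spark.sql("CREATE SCHEMA IF NOT EXISTS db_hr")
# Without a LOCATION clause, the Location row points at dbfs:/user/hive/warehouse/db_hr.db
spark.sql("DESCRIBE SCHEMA db_hr").show(truncate=False)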
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-
create-schema.html
Lecture
Hands-on
Domain
ELT with Spark SQL and Python
Question 24
Correct
Given the following table faculties:
Fill in the blank below to get the students enrolled in fewer than 3 courses from the array column students:
SELECT
faculty_id,
students,
___________ AS few_courses_students
FROM faculties
ELSE NULL
END
Overall explanation
filter(input_array, lambda_function) is a higher-order function that returns an output array from an input array by extracting the elements for which the predicate of a lambda function holds.
Example:
output: [1, 3]
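A minimal sketch consistent with that output (the input array and the lambda are assumptions):
# Keep only the elements for which the lambda predicate holds
spark.sql("SELECT filter(array(1, 2, 3, 4), x -> x % 2 = 1) AS odd_values").show()   # [1, 3]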
References:
https://docs.databricks.com/sql/language-manual/functions/filter.html
https://docs.databricks.com/optimizations/higher-order-lambda-functions.html
Hands-on
Domain
ELT with Spark SQL and Python
Question 25
Correct
Given the following Structured Streaming query:
(spark.table("orders")
.withColumn("total_after_tax", col("total")+col("tax"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("append")
.______________
.table("new_orders")
)
Fill in the blank to make the query execute a micro-batch to process data every 2 minutes.
trigger(once="2 minutes")
processingTime("2 minutes")
trigger("2 minutes")
trigger()
Overall explanation
In Spark Structured Streaming, in order to process data in micro-batches at the
user-specified intervals, you can use processingTime keyword. It allows to specify
a time duration as a string.
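With the blank filled in, the relevant line of the query above reads:
.trigger(processingTime="2 minutes")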
Reference:
https://docs.databricks.com/structured-streaming/triggers.html#configure-
structured-streaming-trigger-intervals
Lecture
Hands-on
Domain
Incremental Data Processing
Question 26
Incorrect
Which of the following is used by Auto Loader to load data incrementally?
DEEP CLONE
Multi-hop architecture
COPY INTO
Correct answer
Spark Structured Streaming
Overall explanation
Auto Loader is based on Spark Structured Streaming. It provides a Structured
Streaming source called cloudFiles.
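A minimal sketch of the cloudFiles source (source_path and schema_path are placeholder variables, and the file format is an assumption):
df = (spark.readStream
        .format("cloudFiles")                              # Auto Loader source
        .option("cloudFiles.format", "json")               # format of the incoming files
        .option("cloudFiles.schemaLocation", schema_path)  # where Auto Loader tracks the inferred schema
        .load(source_path))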
Reference: https://docs.databricks.com/ingestion/auto-loader/index.html
Lecture
Hands-on
Domain
Incremental Data Processing
Question 27
Correct
Which of the following statements best describes Auto Loader ?
Auto loader allows applying Change Data Capture (CDC) feed to update tables based
on changes captured in source data.
Auto loader defines data quality expectations on the contents of a dataset, and
reports the records that violate these expectations in metrics.
Auto loader enables efficient insert, update, deletes, and rollback capabilities by
adding a storage layer that provides better data reliability to data lakes.
Overall explanation
Auto Loader incrementally and efficiently processes new data files as they arrive
in cloud storage.
Reference: https://docs.databricks.com/ingestion/auto-loader/index.html
Lecture
Hands-on
Domain
Incremental Data Processing
Question 28
Incorrect
A data engineer has defined the following data quality constraint in a Delta Live
Tables pipeline:
Fill in the blank above so that records violating this constraint will be added to the target table and reported in metrics.
Correct answer
There is no need to add an ON VIOLATION clause. By default, records violating the constraint will be kept and reported as invalid in the event log
Overall explanation
By default, records that violate the expectation are added to the target dataset
along with valid records, but violations will be reported in the event log
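For reference, a minimal Python DLT sketch of this default behavior (the dataset and constraint names are assumptions); @dlt.expect keeps violating records and reports them in the event log:
import dlt

@dlt.table
@dlt.expect("valid_timestamp", "timestamp > '2020-01-01'")   # violating records are kept but reported
def cleaned_orders():
    return dlt.read("raw_orders")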
Reference:
https://learn.microsoft.com/en-us/azure/databricks/workflows/delta-live-tables/
delta-live-tables-expectations
Hands-on
Domain
Incremental Data Processing
Question 29
Correct
The data engineering team has a DLT pipeline that updates all the tables once and then stops. The compute resources of the pipeline continue running to allow for quick testing.
Which of the following best describes the execution modes of this DLT pipeline?
The DLT pipeline executes in Continuous Pipeline mode under Production mode.
The DLT pipeline executes in Continuous Pipeline mode under Development mode.
The DLT pipeline executes in Triggered Pipeline mode under Production mode.
Overall explanation
Triggered pipelines update each table with whatever data is currently available and
then they shut down.
In Development mode, the Delta Live Tables system eases the development process by:
Reusing a cluster to avoid the overhead of restarts. The cluster runs for two hours when development mode is enabled.
Disabling pipeline retries, so you can immediately detect and fix errors.
Reference:
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-
concepts.html
Hands-on
Domain
Incremental Data Processing
Question 30
Correct
Which of the following will utilize Gold tables as their source?
Silver tables
Auto loader
Bronze tables
Streaming jobs
Overall explanation
Gold tables provide business level aggregates often used for reporting and
dashboarding, or even for Machine learning
Reference:
https://www.databricks.com/glossary/medallion-architecture
Lecture
Domain
Incremental Data Processing
Question 31
Correct
Which of the following code blocks can a data engineer use to query the existing
streaming table events?
spark.readStream("events")
spark.read
.table("events")
spark.readStream
.table("events")
spark.readStream()
.table("events")
spark.stream
.read("events")
Overall explanation
Delta Lake is deeply integrated with Spark Structured Streaming. You can load
tables as a stream using:
spark.readStream.table(table_name)
Reference: https://docs.databricks.com/structured-streaming/delta-lake.html
Lecture
Hands-on
Domain
Incremental Data Processing
Question 32
Correct
In multi-hop architecture, which of the following statements best describes the
Bronze layer?
Overall explanation
Bronze tables contain data in its rawest format, ingested from various sources (e.g., JSON files, operational databases, Kafka streams, ...)
Reference:
https://www.databricks.com/glossary/medallion-architecture
Lecture
Hands-on
Domain
Incremental Data Processing
Question 33
Correct
Given the following Structured Streaming query
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(ordersLocation)
.writeStream
.option("checkpointLocation", checkpointPath)
.table("uncleanedOrders")
)
Which of the following best describes the purpose of this query in a multi-hop architecture?
The query is performing data transfer from a Gold table into a production
application
Overall explanation
The query here is using Auto Loader (cloudFiles) to load raw JSON data from ordersLocation into the Bronze table uncleanedOrders.
References:
https://www.databricks.com/glossary/medallion-architecture
https://docs.databricks.com/ingestion/auto-loader/index.html
Lecture
Hands-on
Domain
Incremental Data Processing
Question 34
Correct
A data engineer has the following query in a Delta Live Tables pipeline:
Which of the following changes should be made to this query to successfully start
the DLT pipeline?
AS
FROM LIVE.cleaned_sales
GROUP BY store_id
Reference: https://docs.databricks.com/workflows/delta-live-tables/delta-live-
tables-sql-ref.html
Hands-on
Domain
Incremental Data Processing
Question 35
Correct
A data engineer has defined the following data quality constraint in a Delta Live
Tables pipeline:
Fill in the blank above so that records violating this constraint will be dropped and reported in metrics.
Overall explanation
With ON VIOLATION DROP ROW, records that violate the expectation are dropped, and
violations are reported in the event log
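For reference, the equivalent in a Python DLT pipeline is the @dlt.expect_or_drop decorator (the dataset and constraint names are assumptions):
import dlt

@dlt.table
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")   # violating records are dropped and reported
def cleaned_orders():
    return dlt.read("raw_orders")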
Reference:
https://learn.microsoft.com/en-us/azure/databricks/workflows/delta-live-tables/
delta-live-tables-expectations
Hands-on
Domain
Incremental Data Processing
Question 36
Correct
Which of the following compute resources is available in Databricks SQL?
Single-node clusters
Multi-nodes clusters
On-premises clusters
SQL engines
Overall explanation
Compute resources are infrastructure resources that provide processing capabilities
in the cloud. A SQL warehouse is a compute resource that lets you run SQL commands
on data objects within Databricks SQL.
Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html
Hands-on
Domain
Production Pipelines
Question 37
Correct
Which of the following is the benefit of using the Auto Stop feature of Databricks
SQL warehouses?
Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html
Hands-on
Domain
Production Pipelines
Question 38
Correct
Which of the following alert destinations is Not supported in Databricks SQL?
Slack
Webhook
Microsoft Teams
Overall explanation
SMS is not supported as an alert destination in Databricks SQL, while email, webhook, Slack, and Microsoft Teams are supported alert destinations in Databricks SQL.
Reference: https://docs.databricks.com/sql/admin/alert-destinations.html
Hands-on
Domain
Production Pipelines
Question 39
Correct
A data engineering team has a long-running multi-task job. The team members need to be notified when the run of this job completes.
Which of the following approaches can be used to send emails to the team members when the job completes?
They can use Job API to programmatically send emails according to each task status
Correct answer
They can configure email notifications settings in the job page
Only Job owner can be configured to be notified when the job completes
They can configure email notifications settings per notebook in the task page
Overall explanation
Databricks Jobs supports email notifications to be notified in the case of job
start, success, or failure. Simply, click Edit email notifications from the details
panel in the Job page. From there, you can add one or more email addresses.
Reference: https://docs.databricks.com/workflows/jobs/jobs.html#alerts-job
Hands-on
Domain
Production Pipelines
Question 40
Correct
A data engineer wants to increase the cluster size of an existing Databricks SQL
warehouse.
Which of the following is the benefit of increasing the cluster size of Databricks
SQL warehouses?
The cluster size of SQL warehouses is not configurable. Instead, they can increase
the number of clusters
The cluster size can not be changed for existing SQL warehouses. Instead, they can
enable the auto-scaling option.
Overall explanation
Cluster Size represents the number of cluster workers and size of compute resources
available to run your queries and dashboards. To reduce query latency, you can
increase the cluster size.
Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html#cluster-size-1
Hands-on
Domain
Production Pipelines
Question 41
Correct
Which of the following describes Cron syntax in Databricks Jobs?
Overall explanation
To define a schedule for a Databricks job, you can either interactively specify the period and starting time, or write a cron syntax expression. Cron syntax allows you to represent complex job schedules programmatically (for example, the Quartz expression 0 0 6 * * ? triggers a job every day at 6:00 AM).
Reference: https://docs.databricks.com/workflows/jobs/jobs.html#schedule-a-job
Hands-on
Domain
Production Pipelines
Question 42
Incorrect
The data engineering team has a DLT pipeline that updates all the tables at defined intervals until manually stopped. The compute resources terminate when the pipeline is stopped.
Which of the following best describes the execution modes of this DLT pipeline?
Correct answer
The DLT pipeline executes in Continuous Pipeline mode under Production mode.
The DLT pipeline executes in Triggered Pipeline mode under Production mode.
The DLT pipeline executes in Triggered Pipeline mode under Development mode.
Overall explanation
Continuous pipelines update tables continuously as input data changes. Once an update is started, it continues to run until the pipeline is shut down.
In Production mode, the Delta Live Tables system restarts the cluster for recoverable errors (e.g., memory leaks or stale credentials).
Reference:
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-
concepts.html
Hands-on
Domain
Production Pipelines
Question 43
Correct
Which part of the Databricks Platform can a data engineer use to grant permissions
on tables to users?
Data Studio
Workflows
DBFS
Overall explanation
Data Explorer in Databricks SQL allows you to manage data object permissions. This
includes granting privileges on tables and databases to users or groups of users.
Reference: https://docs.databricks.com/security/access-control/data-acl.html#data-
explorer
Hands-on
Domain
Data Governance
Question 44
Correct
Which of the following commands can a data engineer use to grant full permissions
to the HR team on the table employees?
Overall explanation
ALL PRIVILEGES is used to grant full permissions on an object to a user or group of users. It translates into all of the privileges below:
SELECT
CREATE
MODIFY
USAGE
READ_METADATA
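For example (the group name hr_team is an assumption):
spark.sql("GRANT ALL PRIVILEGES ON TABLE employees TO `hr_team`")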
Reference: https://docs.databricks.com/security/access-control/table-acls/object-
privileges.html#privileges
Lecture
Hands-on
Domain
Data Governance
Question 45
Correct
A data engineer uses the following SQL query:
Which of the following describes the ability given by the MODIFY privilege?
None of these options correctly describe the ability given by the MODIFY privilege
Overall explanation
The MODIFY privilege gives the ability to add, delete, and modify data to or from
an object.
Reference: https://docs.databricks.com/security/access-control/table-acls/object-
privileges.html#privileges
Lecture
Hands-on
Domain
Data Governance