Databricks Tutorial


The Databricks Intelligence Platform provides:

• Unified Data Analytics – Combines data engineering, data science, and business
analytics in one platform.

• Lakehouse Architecture – Merges data lakes and data warehouses for structured and
unstructured data processing.

• AI & Machine Learning – Supports ML workflows with AutoML, Feature Store, and
MLflow for experiment tracking.

• Data Engineering – Optimized ETL workflows with Delta Live Tables and Apache
Spark.

• Business Intelligence (BI) – Integration with visualization tools like Power BI, Tableau,
and Databricks SQL.

• Data Governance & Security – Unity Catalog for access control, data lineage, and
compliance.

• Collaborative Workspaces – Shared notebooks, real-time collaboration, and multi-language support (Python, SQL, Scala, R).

• Serverless & Scalable Compute – Auto-scaling clusters and serverless computing for
cost efficiency.

• Streaming & Real-time Analytics – Apache Spark Structured Streaming for real-time
data processing.

• Integration with Cloud Providers – Works on AWS, Azure, and Google Cloud.

1. Relation Between Databricks Account & Workspace

Databricks Account

• The Databricks account is the highest-level entity in Databricks.

• It manages billing, authentication, and workspace provisioning.

• A single account can have multiple workspaces across different cloud providers
(AWS, Azure, GCP).

Databricks Workspace

• A workspace is an isolated environment within a Databricks account where users perform data analytics and machine learning tasks.

• Each workspace has its own notebooks, clusters, jobs, and security policies.

• Workspaces are created and managed at the account level.

Key Relationship

• A Databricks account can have multiple workspaces.

• Each workspace operates independently but is billed through the same Databricks
account.

• Workspaces can be configured differently based on security, compute needs, and data access policies.

2. What is Databricks Workspace?

Definition

Databricks Workspace is a collaborative, cloud-based environment for data analytics, engineering, and machine learning. It provides tools for:

• Data ingestion from multiple sources

• Data processing using Apache Spark

• Machine Learning & AI with MLflow

• Visualization with Databricks SQL

Key Features

• Notebooks: Interactive notebooks supporting Python, SQL, Scala, and R.

• Clusters: Managed Spark clusters for scalable computing.

• Jobs: Automated workflows for ETL and ML training.

• Data Governance: Unity Catalog for security and access control.

• Workspace API: Automate workspace operations via REST API.

3. Databricks High-Level Architecture

The Databricks platform is built on a Lakehouse Architecture, combining data lakes and
warehouses.

Key Components

1. Data Sources

o Cloud storage: AWS S3, Azure Data Lake, GCP Storage

o Databases: SQL, NoSQL

o Streaming: Kafka, IoT data

2. Data Ingestion & Processing

o Apache Spark for big data processing

o Delta Lake for structured and unstructured data

o Databricks Notebooks for ETL and ML tasks

3. Storage Layer (Delta Lake)

o ACID transactions for reliability

o Schema enforcement & governance

4. Compute & Processing

o Scalable, auto-managed Spark clusters

o Serverless compute for optimized performance

5. Machine Learning & AI

o MLflow for experiment tracking

o Feature Store for reusable ML features

6. BI & Visualization

o Databricks SQL, Power BI, Tableau integration

7. Security & Governance

o Unity Catalog for access control

o Role-based permissions

4. What Are the Control Plane and Data Plane in Databricks?

Control Plane

• Managed by Databricks (not within user’s cloud account).

• Handles workspace UI, authentication, job scheduling, and cluster management.

• Stores metadata, notebook code, and configurations.

• Hosted by Databricks on AWS, Azure, or GCP.

Data Plane

• Runs in the customer’s cloud environment (AWS, Azure, GCP).

• Contains compute resources (clusters, VMs) that process data.

• Stores actual data in cloud storage (S3, ADLS, GCS).

• Fully under customer control with network isolation.

Separation of Control & Data Plane

• Improves security and compliance by keeping data in the customer’s cloud.

• Allows Databricks to manage the platform without accessing customer data.

• Serverless Databricks has a shared Data Plane, but Enterprise deployments have full
data isolation.

5. Roles and Responsibilities in Databricks

1. Data Engineer

• Builds ETL pipelines using Spark & Delta Lake.

• Ensures data ingestion, transformation, and storage.

• Works with Delta Live Tables for real-time processing.

2. Data Scientist

• Develops machine learning models using MLflow.

• Uses Databricks notebooks for exploratory data analysis (EDA).

• Deploys models using Feature Store.

3. Data Analyst

• Writes SQL queries on Databricks SQL.

• Creates dashboards and reports for business insights.

• Works with BI tools like Power BI, Tableau.

4. Platform Administrator

• Manages workspace configurations, security policies, and clusters.

• Implements role-based access control (RBAC) via Unity Catalog.

• Monitors compute costs and performance optimization.

5. DevOps & Cloud Engineer

• Ensures cloud infrastructure integration with Databricks.

• Implements CI/CD pipelines for deployment automation.

• Manages network security and compliance.

Conclusion

Databricks is a powerful cloud platform enabling data analytics, AI, and data engineering at
scale. Its workspace-based architecture, separation of control & data planes, and
Lakehouse model provide high performance, security, and collaboration.

1. Setup Databricks with AWS and GCP

Step 1: Create a Databricks Account

1. Visit Databricks and sign up.

2. Select AWS or Google Cloud as your cloud provider.

3. Create a Databricks workspace from the Databricks account console.

Step 2: Configure AWS Integration

1. IAM Role Creation:

o Create an IAM role in AWS with S3, EC2, and KMS permissions.

o Attach the policy:

o Enable cross-account role access for Databricks.

2. Deploy Databricks on AWS:

o Go to the AWS Marketplace and search for Databricks.

o Choose Databricks E2 or Serverless based on pricing needs.

o Launch using CloudFormation.

Step 3: Configure Google Cloud Integration

1. Enable Dataproc API and create a GCP project.

2. Create a service account and grant:

o Storage Admin for GCS access.

o Compute Admin for VM management.

3. Deploy Databricks from GCP Console and configure network settings.

3. Setup Databricks with Azure

Step 1: Create an Azure Databricks Workspace

1. Go to Azure Portal → Create a Resource → Search for Azure Databricks.

2. Click Create, then choose:

o Subscription: Select your Azure subscription.

o Resource Group: Create a new or use an existing one.

o Workspace Name: Give it a unique name.

o Region: Choose closest to your users.

Step 2: Configure Networking

1. Select Virtual Network Injection (Optional for custom VNet).

2. Enable Managed VNet for auto-handling of network traffic.

Step 3: Assign User Permissions

1. Open Azure Active Directory (AAD).

2. Assign RBAC roles for Admins, Engineers, and Data Scientists.

3. Enable Unity Catalog for governance (if needed).

Step 4: Launch Databricks & Start Using Notebooks

1. Open the Databricks workspace from Azure.

2. Create a cluster (Auto-Scaling or Serverless).

3. Launch Databricks Notebooks for development.

4. Databricks Tiers and Pricing

Databricks offers different pricing tiers based on compute usage and features.

1. Databricks Pricing Tiers

2. Compute Pricing (DBU - Databricks Unit)

• Serverless Compute: Pay for execution time only.

• Interactive Clusters: Charged per DBU used.

• SQL Warehouse: Pay per query execution.

3. Estimated Cost on Cloud Platforms

• AWS: ~$0.07 – $0.55 per DBU

• Azure: ~$0.10 – $0.60 per DBU

• GCP: ~$0.09 – $0.50 per DBU

Tip: Use Databricks Cost Calculator to estimate pricing based on workloads.

• Databricks Account Console: https://accounts.cloud.databricks.com/


• Databricks Documentation: https://docs.databricks.com/
• Databricks Pricing: https://www.databricks.com/product/pricing

• Azure Databricks Setup Guide: https://learn.microsoft.com/en-us/azure/databricks/getting-started/
• Databricks Service Principal Guide: https://docs.databricks.com/en/dev-tools/service-principals.html

1. How to Use Databricks Notebooks?

Databricks Notebooks provide an interactive environment for data analysis, machine learning, and ETL processing. They support multiple languages like Python, Scala, SQL, and R within a single notebook.

Steps to Use a Databricks Notebook:

1. Create a Notebook:

o Navigate to Databricks Workspace → Click Create → Select Notebook.

o Name the notebook and choose a default language (Python, SQL, Scala, R).

2. Attach a Cluster:

o To execute code, attach the notebook to a running Databricks cluster.


o Click the Cluster dropdown in the notebook UI and select an available cluster.

3. Write & Execute Code:

o Enter code in cells and press Shift + Enter to execute.

o Use multiple languages in the same notebook with Magic Commands (covered later).

4. Save & Share Notebooks:

o Click File → Save or use Ctrl + S.

o Share with team members via email, links, or repositories.

2. What Are Different Types of Cells in a Notebook?

Databricks Notebooks support three main types of cells:

Additional Features in Cells:

✔ Visualization Support: Create charts and graphs from query results.


✔ Collapsible Sections: Organize large notebooks efficiently.

3. What Are Language Magic Commands in Databricks Notebooks?

Databricks Notebooks allow multiple programming languages in one notebook using Magic
Commands.

Common Magic Commands:

• %python, %sql, %scala, %r – Run a cell in a specific language.
• %md – Write Markdown documentation.
• %sh – Run shell commands on the driver node.
• %fs – Run Databricks file system (dbutils.fs) commands.
• %run – Run another notebook inline.

Tip: The default language of a notebook can be overridden cell by cell using these Magic Commands.

4. How Databricks Helps in Collaboration?

Databricks enhances collaboration by enabling teams to work together efficiently in notebooks and workflows.

Collaboration Features:

1. Real-Time Editing:

• Multiple users can edit a notebook simultaneously, similar to Google Docs.

• See real-time changes made by others.

2. Commenting System:

• Users can add inline comments to notebook cells.

• Use @mentions to notify teammates in discussions.

3. Notebook Sharing & Permissions:

• Share notebooks via URLs, Databricks workspace, or Git integration.

• Set permissions (View, Edit, Run, Manage) for team members.

4. Git Integration:

• Version control with GitHub, Azure DevOps, and Bitbucket.

• Use Databricks Repos for direct Git operations inside Databricks.

5. Databricks Workflows & Jobs:

• Automate data pipelines by scheduling jobs that execute notebooks.

• Enable team workflows for data engineering and ML models.

5. What Is Version History in Databricks?

Databricks automatically tracks changes made to notebooks, allowing users to restore previous versions.

Features of Version History:

✔ Automatic Checkpoints: Databricks saves autosnapshots when changes are made.


✔ Manual Version Saves: Users can create named versions manually.
✔ Compare Versions: View and restore previous notebook versions if needed.
✔ Revert Changes: Roll back to any previous version easily.

How to Access Version History?

1. Open a Databricks Notebook.

2. Click on “Revision History” (Clock icon at the top-right).

3. Browse through previous versions and select "Restore" if needed.

Databricks on Azure: A Deep Dive

Databricks seamlessly integrates with Azure, providing a fully managed, scalable, and
collaborative data analytics platform. Let’s break down how it works in the background, how
clusters are created, and how Databricks manages compute and storage in Azure.

1. How Databricks Works with Azure in the Background?

When you create an Azure Databricks workspace, it is deployed as a first-party service within Azure, deeply integrated with Azure Active Directory (AAD), Azure Storage, and Azure Networking.

Key Behind-the-Scenes Components:

• Azure Databricks Control Plane (Managed by Databricks)

• Azure Databricks Data Plane (Runs in your Azure Subscription)

• Azure Integration (ADLS, AAD, Key Vault, etc.)

Workflow Behind the Scenes:

1. Workspace Creation: When you create a Databricks workspace in Azure, it provisions a Managed Resource Group in your Azure subscription.

2. Cluster Spin-Up: When a Databricks job runs, Azure provisions Virtual Machines
(VMs) in the background.

3. Networking & Security: Databricks communicates with Azure Storage, AAD, and Key
Vault securely through private endpoints.

4. Billing & Monitoring: Costs are managed through Azure Billing, and usage is tracked
in Azure Monitor.

2. How Are Databricks Clusters Spun Up Using Azure VMs?

A Databricks cluster consists of multiple Azure Virtual Machines (VMs) that run Apache
Spark workloads.

Cluster Lifecycle in Azure:

1. User Creates a Cluster:

o A cluster is requested via the Databricks UI, API, or Jobs.

2. Azure Spins Up Virtual Machines:

o Based on the VM type and size selected, Azure provisions Virtual Machines
(VMs) in your subscription.

o VMs are deployed in Azure Kubernetes Service (AKS) or Virtual Machine Scale Sets (VMSS).

3. Databricks Installs Apache Spark:

o The required Spark binaries and dependencies are installed on the nodes.

4. Cluster Execution & Auto-Scaling:

o Databricks dynamically scales the cluster (adds/removes worker nodes) based on workload demand.

5. Cluster Termination:

o When not in use, clusters auto-terminate to save costs.

Key Cluster Components in Azure:

• Driver Node → Manages cluster execution & distributes tasks.

• Worker Nodes → Execute Spark computations in parallel.

• Databricks Runtime (DBR) → Optimized version of Apache Spark.

3. What is Databricks Managed Resource Group?

When you create an Azure Databricks workspace, Azure automatically creates a dedicated
Managed Resource Group in your subscription.

Purpose of Managed Resource Group:

• Contains all the Azure infrastructure needed for Databricks.
• Manages Networking, Storage, and Compute resources for Databricks.
• Azure manages this group; users shouldn’t modify or delete it manually.

Key Resources Inside Managed Resource Group:

• Azure Virtual Machines (for Databricks Clusters)

• Databricks Virtual Network (VNet) (Handles cluster networking)

• Public IPs & Network Interfaces (For cluster communication)

• Databricks Storage Container (For DBFS - Databricks File System)

4. How Databricks Manages Compute in Azure?

Databricks manages compute by dynamically provisioning and managing Azure Virtual Machines (VMs) for data processing.

How Compute is Managed?

Elastic Clusters: Databricks automatically provisions, scales, and terminates VMs based
on workload demand.
Spot Instances (Low-Cost VMs): Uses Azure Spot VMs to reduce compute costs.
Databricks Auto-scaling: Dynamically adds/removes worker nodes to handle varying
workloads.
High Availability: Distributes workloads across multiple Azure Availability Zones for fault
tolerance.

Compute Modes in Azure Databricks:

• Standard Clusters – Manual or auto-scaling clusters. Use case: ad-hoc queries, ETL jobs.
• High Concurrency Clusters – Optimized for multiple users. Use case: shared analytics, BI tools.
• Job Clusters – Created for a single job, auto-terminates. Use case: automated ETL jobs, ML training.
• Serverless Pools (Preview) – Fully managed, instant scaling. Use case: real-time analytics.

5. What is the Managed Storage Container Used For?

Databricks uses Azure Storage Containers (ADLS Gen2 or Blob Storage) to store and manage
files, tables, and logs.

Purpose of Managed Storage Containers:

Stores Data Files: Databricks automatically stores tables and logs in a dedicated Azure
Blob Storage container.
Supports Databricks File System (DBFS): Provides a unified file management system in
Databricks.
Used for Temporary and Persistent Storage:

• DBFS /mnt/ mounts → Connect to external storage (ADLS, Blob, S3).

• DBFS /dbfs/ directory → Stores workspace and cluster logs.

Types of Storage in Databricks:

• DBFS (Databricks File System) – Managed storage in Databricks. Usage: stores scripts, libraries, notebooks.
• Azure ADLS (Azure Data Lake Storage) – External, scalable data storage. Usage: stores large datasets for analytics.
• Blob Storage – Unstructured object storage. Usage: stores logs, images, videos.
• Delta Lake Tables – Optimized storage layer for structured data. Usage: ensures ACID transactions and performance.

Conclusion

• Databricks on Azure seamlessly integrates with Azure services for compute, storage,
and security.

• Clusters are spun up using Azure Virtual Machines (VMs), with automatic scaling
and termination to optimize costs.

• A Managed Resource Group is created to handle all Azure resources securely.

• Compute is managed efficiently through elastic clusters, Spot VMs, and high
availability features.

• Managed Storage Containers store data, logs, and tables, providing a unified file
system for analytics.

Databricks Unity Catalog: A Complete Guide

Databricks Unity Catalog is a unified data governance solution that provides fine-grained
access control, metadata management, and lineage tracking across multiple clouds and
data sources. It simplifies governance for structured, semi-structured, and unstructured
data.

1. What is Unity Catalog?

Unity Catalog is Databricks’ centralized data governance layer that enables:


Fine-grained access control: Row/column-level security for users and groups.
Data Lineage: Tracks end-to-end lineage for all assets in Databricks.
Multi-cloud support: Works across AWS, Azure, and Google Cloud.
Cross-workspace data sharing: Securely share data across multiple workspaces.
Three-level namespace: Organizes data into Catalog → Schema → Tables.

Why Use Unity Catalog?

• Ensures data security with centralized policy enforcement.

• Reduces compliance risks with audit logs and lineage tracking.

• Provides a unified interface for managing data across cloud platforms.

2. What is Metastore? How Databricks Governance Works?

Metastore in Unity Catalog

• The Metastore is a top-level governance layer in Unity Catalog.

• It acts as a central metadata repository that stores information about catalogs, schemas, tables, and permissions.

• A single Metastore can be shared across multiple Databricks workspaces.

Databricks Governance Model

Access Control via Unity Catalog

• Users, groups, and service principals are assigned roles & permissions (Owner,
Editor, Viewer).

• Fine-grained controls allow table, column, or row-level access policies.

Data Lineage & Audit Logs

• Tracks lineage across ETL jobs, queries, and notebooks.

• Logs user activity for compliance & auditing.

Secure Data Sharing

• Enables cross-workspace and cross-cloud data sharing without data duplication.

3. What is a Catalog in Databricks?

A Catalog is the top-level container for organizing data within Unity Catalog.

• It acts as a collection of schemas and tables.

• A Catalog provides governance policies that define access at schema and table
levels.

Example (SQL):

CREATE CATALOG sales_data;
USE CATALOG sales_data;

• Here, sales_data is a catalog that can contain multiple schemas (like transactions,
customers).

Hierarchy in Unity Catalog

Metastore → Catalog → Schema (Database) → Tables & Views

4. What is Unity Catalog Data Governance Object Model?

The Unity Catalog Data Governance Model defines how objects are structured, governed,
and secured in Databricks.

Key Objects in Unity Catalog:

• Metastore – Central repository for metadata and governance policies.
• Catalog – Top-level container grouping schemas & tables.
• Schema (Database) – Groups tables, views, and functions within a catalog.
• Table – Stores structured data.
• View – Virtual table based on SQL queries.
• External Location – Defines access to cloud storage (S3, ADLS, GCS).
• Storage Credential – Manages access to cloud storage.

Security in Unity Catalog:

• Identity-based access control using Azure AD, IAM roles, or SCIM.

• Attribute-based access control (ABAC) for dynamic security policies.

• Data masking & encryption for sensitive columns.

5. What is Three-Level Namespace in Unity Catalog?

Unity Catalog follows a three-level namespace to organize data efficiently:

<catalog>.<schema>.<table>

Breakdown of the Namespace:

Example Usage in SQL:

• sales_data → Catalog

• transactions → Schema

• order_details → Table
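The SQL usage example above was lost in this copy, so here is a minimal Python sketch run in a Databricks notebook (where spark is predefined); sales_data, transactions, and order_details are the hypothetical names from the bullets above.

# Fully qualified three-level name: <catalog>.<schema>.<table>
spark.sql("USE CATALOG sales_data")    # set the current catalog
spark.sql("USE SCHEMA transactions")   # set the current schema
spark.sql("SELECT * FROM sales_data.transactions.order_details LIMIT 10").show()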

Conclusion

• Unity Catalog provides centralized governance, security, and data lineage in Databricks.

• Metastore is the top-level metadata repository, ensuring secure access to data.

• Catalogs organize schemas and tables, following a three-level namespace.

• Governance in Unity Catalog enforces fine-grained security policies for structured and unstructured data.

Databricks: Hive Metastore, Tables, Views, and DBFS

Databricks integrates with the Hive Metastore to manage metadata for databases, tables,
and views. It provides managed and external tables for storing structured data, with
different storage and governance models. Let's explore these concepts in a structured way.

1. What is Hive Metastore Catalog in Databricks?

Hive Metastore in Databricks

• The Hive Metastore is a centralized metadata repository that keeps track of databases, tables, schemas, and views.

• It stores information about where data is stored (storage location), table schema
(columns, data types), and permissions.

• In Databricks, the Hive Metastore is used by default to manage tables when Unity
Catalog is not enabled.

Key Features of Hive Metastore:


Supports both managed and external tables.
Stores metadata in a MySQL, PostgreSQL, or other relational database.
Allows SQL-based metadata queries (SHOW TABLES, DESCRIBE TABLE).

Example: Checking Metadata in Hive Metastore

2. What is a Managed Table?

A Managed Table (also called an Internal Table) is a table fully controlled by Databricks.

• Databricks manages both metadata and data storage.

• When a Managed Table is dropped, both the table and underlying data are deleted.

Creating a Managed Table:

• Data is stored in the default location in DBFS (Databricks File System).

• Deleting the table removes both the metadata and data.

Best Used When:


✔ You want Databricks to manage data storage automatically.
✔ You don’t need external storage management (like S3, ADLS, or GCS).

3. What is an External Table?

An External Table in Databricks stores metadata in the Hive Metastore, but the actual data
remains in an external storage system (like S3, ADLS, or GCS).

Key Characteristics:
Metadata is managed by Databricks, but data remains in external storage.
Dropping the table only removes metadata; the data remains untouched.
Commonly used for data lakes, external data sources, or cross-platform access.

Creating an External Table in Databricks:

• The table's metadata is stored in Hive Metastore.

• The data is stored in Amazon S3, Azure ADLS, or Google Cloud Storage.

• If you DROP the table, the data remains intact in storage.

Best Used When:


✔ You need to store data outside Databricks in cloud storage.
✔ You want data to persist even after dropping the table.
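The original CREATE TABLE statements were not preserved here, so the following is a hedged sketch of both patterns from a Databricks notebook (spark is predefined); the table names and the abfss:// path are hypothetical placeholders.

# Managed table: Databricks controls both metadata and data (stored under DBFS).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)
    USING DELTA
""")

# External table: metadata in the Hive Metastore, data stays in external cloud storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/sales/'
""")

# Dropping sales_external removes only the metadata; the files at LOCATION remain.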

4. Difference Between Managed and External Tables

• Storage: Managed Table – managed by Databricks (DBFS); External Table – stored in external storage (S3, ADLS, GCS).
• Metadata: both are stored in the Hive Metastore.
• Data Control: Managed Table – Databricks fully controls the data; External Table – data remains in external storage.
• Data Deletion: dropping a Managed Table deletes the data; dropping an External Table keeps the data.
• Best Use Case: Managed Table – fully managed internal storage; External Table – external data lake integration.

5. What are Views?

A View in Databricks is a virtual table based on a SQL query.

• Views do not store actual data but reference data from existing tables.

• Useful for query abstraction, security, and simplified analytics.

Creating a View in Databricks:

Users can query the view like a table:

• Views help enforce security by restricting access to specific columns or rows.

Types of Views:

• Regular Views: Computed at runtime when queried.

• Materialized Views: Precomputed and stored for performance optimization.

6. What is the Default Location for Managed Hive Metastore Tables?

When you create a Managed Table, the data is stored in DBFS (Databricks File System) under the default warehouse directory, dbfs:/user/hive/warehouse/.

Example for a table in the "sales_data" database: dbfs:/user/hive/warehouse/sales_data.db/<table_name>

This is an internal storage location managed by Databricks.

External Tables do not use this location since they point to external cloud storage.

7. What is DBFS (Databricks File System)?

Databricks File System (DBFS) is an abstraction layer over cloud storage that provides a
unified file management interface for Databricks users.

Key Features of DBFS:


Persistent storage for managed tables and workspace files.
Can mount external storage (S3, ADLS, GCS) for seamless integration.
Supports structured, semi-structured, and unstructured data formats.

DBFS Storage Structure:

Example: Creating a Mount Point in DBFS for External Storage (Azure ADLS)
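The mount command itself was not preserved, so below is a hedged Python sketch using dbutils.fs.mount() with a service principal; the container, storage account, secret scope, and key names are hypothetical placeholders.

# OAuth settings for ADLS Gen2 via a service principal (all values are placeholders).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS container so it appears under /mnt/sales in DBFS.
dbutils.fs.mount(
    source="abfss://container@storageaccount.dfs.core.windows.net/",
    mount_point="/mnt/sales",
    extra_configs=configs,
)

# List files through the new mount point.
display(dbutils.fs.ls("/mnt/sales"))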

• This links external ADLS storage to Databricks, allowing users to access it like a local
filesystem.

Listing Files in DBFS:

DBFS Best Practices:


✔ Use DBFS for managed storage, but mount external storage for large datasets.
✔ Organize data in separate directories for structured & unstructured formats.

Conclusion

• Hive Metastore manages metadata for Databricks tables and integrates with SQL-
based queries.

• Managed Tables store both metadata & data inside Databricks (DBFS), while
External Tables store only metadata in Databricks but keep data in external storage.

• Views provide virtual tables for abstraction & security without storing actual data.

• DBFS is a unified file system for mounting external storage & managing internal
storage efficiently.

Setting Up Unity Catalog in Databricks: Step-by-Step Guide

Unity Catalog in Databricks provides centralized governance for data and access
management across multiple workspaces. This guide will cover the setup process step by
step.

1. How to Set Up Unity Catalog for Databricks Workspace?

To enable Unity Catalog in Databricks, follow these high-level steps:

1. Create a Metastore (centralized metadata storage).
2. Configure a Cloud Storage Account for storing Unity Catalog-managed data.
3. Assign the Metastore to one or more Databricks Workspaces.
4. Configure Identity & Access Management (IAM) roles.
5. Enable Unity Catalog for the workspace.

2. How to Create a Metastore in Databricks?

What is a Metastore?

A Metastore is a metadata repository that stores information about databases, tables, and
permissions across multiple workspaces.

Steps to Create a Metastore in Databricks:


1. Go to Databricks Account Console:

• Navigate to https://accounts.cloud.databricks.com

• Click on "Data" → "Metastore"

2. Click on "Create Metastore"

• Enter Metastore Name (e.g., my-unity-metastore)

• Select the cloud provider (AWS, Azure, or GCP)

• Choose the region (must match the Databricks workspace region)

3. Set Up Storage Credentials:

• The Metastore requires cloud storage (S3, ADLS, or GCS) to store Unity Catalog-
managed data.

• Grant Databricks permissions to access the storage.

4. Click "Create" to finalize the Metastore.

✔ Metastore is now created and ready to be assigned to a workspace.

3. How to Create a Storage Account for Metastore?

Unity Catalog requires a cloud storage account to store data.

For Azure: Create an Azure Data Lake Storage (ADLS) Account

1. Go to the Azure Portal → Search for Storage accounts.


2. Click "Create" → Fill in:

• Subscription: Choose the right subscription.

• Resource Group: Create a new group or use an existing one.

• Storage Account Name: e.g., unitycatalogstorage.

• Region: Must match the Databricks workspace region.

• Performance Tier: Choose Standard or Premium.

3. Enable Hierarchical Namespace (HNS) (Required for ADLS Gen2).


4. Click "Review + Create" and wait for deployment.

For AWS: Create an S3 Bucket

1. Go to AWS Console → Navigate to S3.


2. Click "Create Bucket", enter:

• Bucket Name: e.g., databricks-unity-storage.

• Region: Must match the Databricks workspace.

• Block Public Access: Keep enabled.

3. Click "Create".

For GCP: Create a Google Cloud Storage (GCS) Bucket

1. Go to Google Cloud Console → Navigate to Cloud Storage.


2. Click "Create Bucket", enter:

• Bucket Name: e.g., unitycatalog-bucket.

• Location: Match Databricks region.

3. Click "Create".

✔ Storage account is now ready for use with Unity Catalog.

4. How to Assign Metastore to Databricks Workspace?

Steps to Assign Metastore to a Workspace:

1. Go to Databricks Account Console → Navigate to "Workspaces".


2. Select the Workspace to which you want to assign the Metastore.
3. Click "Assign Metastore" and select the Metastore created earlier.
4. Click "Confirm".

✔ The Metastore is now linked to the Databricks Workspace.

5. How to Enable Unity Catalog for a Databricks Workspace?

Once the Metastore is assigned, enable Unity Catalog to start using it.

Steps to Enable Unity Catalog:

1. Enable IAM Permissions for Databricks:

• Grant Databricks Account Admin role permissions to access the Metastore.

• Configure Storage Access Policies in AWS IAM, Azure RBAC, or GCP IAM.

2. Enable Unity Catalog in the Databricks UI:

• Open Databricks Workspace.

• Navigate to Admin Console → Unity Catalog.

• Click "Enable Unity Catalog".

3. Verify Unity Catalog is Enabled:

• Run this SQL query in a Databricks Notebook:

• You should see your Unity Catalog Metastore in the list.

✔ Unity Catalog is now enabled, and your workspace can manage data across multiple
workspaces securely.

Summary: Complete Setup Workflow

Databricks Unity Catalog: Creating Catalogs with External Locations

This guide will explain step-by-step how to set up external locations and storage credentials
in Unity Catalog using SQL.
1. How to Create a Catalog Using an External Location in Databricks?

What is an External Location?

An External Location in Databricks maps cloud storage (AWS S3, Azure ADLS, or Google GCS)
to a Unity Catalog so that data can be read and written without being fully managed by
Databricks.

Steps to Create a Catalog Using an External Location:

1. Create a Storage Credential (Grants access to cloud storage).


2. Create an External Location (Maps the storage location).
3. Create a Unity Catalog and Link to the External Location.

2. How to Create Unity Catalog with External Location Using SQL?

You can create a catalog in Unity Catalog using SQL commands in a Databricks notebook.

Step 1: Create a Storage Credential

A storage credential is required to authenticate Databricks access to the external cloud storage.

For Azure ADLS Gen2

For AWS S3

For Google Cloud Storage (GCS)

✔ This credential allows Unity Catalog to access the cloud storage.

Step 2: Create an External Location

An external location is a mapping between Databricks and a specific storage path.

✔ This links Databricks to Azure ADLS storage (for AWS or GCS, use the relevant URL format).

Step 3: Create a Catalog and Link it to External Location

Now, create a catalog using the external location.

✔ The catalog is now created and linked to the external storage.
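The SQL bodies for Steps 2 and 3 were not preserved, so here is a hedged sketch of the same flow using spark.sql() in a Databricks notebook; the credential, location, and catalog names and the abfss:// URL are hypothetical, and the storage credential is assumed to already exist.

# Step 2 (sketch): map a cloud storage path to Unity Catalog as an external location.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS sales_ext_loc
    URL 'abfss://data@storageaccount.dfs.core.windows.net/sales'
    WITH (STORAGE CREDENTIAL adls_cred)
""")

# Step 3 (sketch): create a catalog whose managed data is stored at that location.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS sales_catalog
    MANAGED LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/sales'
""")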

3. How to Create an External Location in Databricks Unity Catalog?

An External Location is a metadata object in Unity Catalog that represents cloud storage.

Steps to Create an External Location:

1. Ensure that a Storage Credential is created.


2. Run the following SQL command:

✔ This allows Databricks to read/write data from external storage.

4. How to Create Storage Credential in Databricks?

A Storage Credential is used to authenticate Unity Catalog with external storage.

Using SQL:

For Azure ADLS Gen2

For AWS S3

For Google Cloud Storage (GCS)

✔ Now, Databricks can authenticate with external storage.

5. How to Connect Databricks to Storage Accounts?

For Azure (ADLS)

Steps:
1. Go to Azure Portal → Create Storage Account
2. Enable Hierarchical Namespace (Required for ADLS Gen2)
3. Assign IAM Role (Storage Blob Data Contributor) to Databricks

Databricks Configuration:
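The configuration snippet was not preserved; one common pattern is to set per-storage-account OAuth properties on the Spark session, sketched below for a Databricks notebook. The storage account, secret scope, key names, and tenant ID are hypothetical placeholders.

# Direct (non-mounted) access to ADLS Gen2 with a service principal.
account = "storageaccount"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read directly from the container once the configuration is in place.
df = spark.read.csv(f"abfss://data@{account}.dfs.core.windows.net/raw/sales.csv", header=True)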

For AWS (S3 Buckets)

Steps:
Go to AWS IAM → Create an IAM Role
Attach S3 Permissions (e.g., AmazonS3FullAccess)
Attach Role to Databricks

Databricks Configuration:

For Google Cloud Storage (GCS)

Steps:
1. Go to Google Cloud Console → Create a Service Account
2. Grant Storage Admin Role to the Service Account
3. Generate a Key File (JSON)

Databricks Configuration:

Summary: End-to-End Process

Databricks Unity Catalog: Working with External Locations and Storage

This guide explains how to create schemas, tables, and catalogs using external locations in
Unity Catalog.

1. How to Create a Schema Using an External Location in Databricks?

What is a Schema in Unity Catalog?

A Schema (also called a database) is a logical grouping of tables and views within a Catalog.

Steps to Create a Schema with an External Location:

1. Ensure that an External Location is created


2. Run the following SQL command:

✔ This creates a schema that stores tables in Azure ADLS Gen2.
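The SQL for Step 2 was not preserved; a hedged sketch follows, run via spark.sql() in a Databricks notebook, with hypothetical catalog, schema, and path names.

# Sketch: create a schema whose managed tables are stored at a specific ADLS path.
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS sales_catalog.transactions
    MANAGED LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/sales/transactions'
""")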

2. How is Managed Table Data Stored with External Locations?

Databricks Unity Catalog Object Model

Unity Catalog follows a hierarchical structure:

Catalog → Schema → Tables

Example: Table with External Location at Different Levels

✔ Each level can have a different external storage path.

3. How to Create Unity Catalog with External Location Using SQL?

Steps to Create a Unity Catalog with External Storage:

1. Ensure that a Storage Credential is created.


2. Create an External Location.
3. Create a Catalog using the External Location.

SQL Command:

✔ The catalog is now mapped to an external storage location.

4. How to Create an External Location in Databricks Unity Catalog?

What is an External Location?

An External Location links Databricks to cloud storage (AWS S3, Azure ADLS, or GCS).

Steps to Create an External Location:

1. Ensure that a Storage Credential is created.


2. Run the SQL command:

✔ This connects Unity Catalog to cloud storage.

5. How to Create a Storage Credential in Databricks?

What is a Storage Credential?

A Storage Credential is required to authenticate Databricks to cloud storage.

SQL Commands:

For Azure ADLS Gen2

For AWS S3

For Google Cloud Storage (GCS)

✔ This allows Databricks to authenticate with external storage.

6. How to Connect Databricks to Storage Accounts?

For Azure (ADLS Gen2)

Steps:
1. Go to Azure Portal → Create Storage Account
2. Enable Hierarchical Namespace (Required for ADLS Gen2)
3. Assign IAM Role (Storage Blob Data Contributor) to Databricks

Databricks Configuration:

For AWS (S3 Buckets)

Steps:
1. Go to AWS IAM → Create an IAM Role
2. Attach S3 Permissions (e.g., AmazonS3FullAccess)
3. Attach Role to Databricks

Databricks Configuration:

For Google Cloud Storage (GCS)

Steps:
1. Go to Google Cloud Console → Create a Service Account
2. Grant Storage Admin Role to the Service Account
3. Generate a Key File (JSON)

Databricks Configuration:

Summary: End-to-End Process

Databricks Unity Catalog: Managed Tables, External Tables, and UNDROP

1. Difference Between Managed Table and External Table in Unity Catalog

Databricks Unity Catalog supports two types of tables:


Managed Tables – Databricks manages the storage.
External Tables – Users manage the storage in external cloud locations.

Key Differences

Example: Creating a Managed Table

Data is stored in Databricks-managed storage.

Example: Creating an External Table

Data is stored in an external Azure ADLS container.
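Since the two example statements above were not preserved, here is a hedged sketch mirroring them for Unity Catalog tables; the three-level names and the abfss:// path are hypothetical, and the external path is assumed to be covered by an external location.

# Managed Unity Catalog table: storage is handled by Databricks.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.transactions.orders_managed (id INT, amount DOUBLE)
""")

# External Unity Catalog table: data stays at the external storage path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.transactions.orders_external (id INT, amount DOUBLE)
    LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/sales/orders'
""")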

2. How is a Managed Table in Unity Catalog Different from Legacy Hive Metastore?

Key Takeaways
Unity Catalog provides better governance, security, and performance.
Hive Metastore lacks centralized governance and cross-workspace support.

3. What is UNDROP in Databricks?

UNDROP is a Databricks feature that restores dropped tables, schemas, and catalogs.

Use Cases of UNDROP

Accidentally deleted a table? UNDROP restores it.


Need to recover a deleted schema or catalog? UNDROP can bring it back.

How to Use UNDROP?

Restore a Table

Brings back a dropped table.
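The command itself was not preserved; for the table case, a hedged sketch with a hypothetical name (UNDROP applies to Unity Catalog managed tables within the retention window):

# Restore a recently dropped Unity Catalog managed table.
spark.sql("UNDROP TABLE sales_catalog.transactions.order_details")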

Restore a Schema

Recovers a deleted schema with all tables.

Restore a Catalog

Restores an entire deleted catalog.

Limitations:

• Only works if Databricks has retained metadata.

• Data might be lost for external tables (if storage is deleted).

Summary

Databricks: Delta Table Cloning, Views, and Metadata Listing

1. How to Clone Delta Tables in Databricks?

Databricks allows cloning of Delta Tables to create a new copy of a table without manually
copying the data.

Types of Cloning in Delta Tables:

Deep Clone – Full copy of metadata + data.
Shallow Clone – Only metadata reference; data is not copied.

SQL Command to Clone a Delta Table

Create a Deep Clone

Creates a complete copy (data + metadata).

Create a Shallow Clone

Copies metadata but references the original data.
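The clone statements were not preserved, so here is a hedged sketch with hypothetical table names, run via spark.sql() in a Databricks notebook.

# Deep clone: copies data and metadata into an independent table.
spark.sql("CREATE OR REPLACE TABLE sales_backup DEEP CLONE sales")

# Shallow clone: copies only metadata; data files are still referenced from the source.
spark.sql("CREATE OR REPLACE TABLE sales_test SHALLOW CLONE sales")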

2. What is Deep Clone of Delta Table?

A Deep Clone creates a full copy of a Delta table, including data and metadata.
✔ The cloned table is independent of the source table.
✔ Even if the source table is deleted, the cloned table still exists.

Example: Deep Clone with Path Specification

Use Cases of Deep Clone:

Creating backups for disaster recovery.


Creating test environments without modifying production data.

3. What is Shallow Clone of Delta Table?

A Shallow Clone copies only metadata, NOT data.


✔ The cloned table references the source table’s data.
✔ If the source table is deleted, the shallow clone becomes unusable.
Example: Shallow Clone

Use Cases of Shallow Clone:

Creating temporary test environments with minimal storage.


Quickly making replica tables for testing without data duplication.

4. How is CTAS Different from Deep Clone of Delta Table?

CTAS (CREATE TABLE AS SELECT) vs. Deep Clone

Example of CTAS (Does NOT retain metadata or history)

Creates a new table with data but NO metadata history.

Example of Deep Clone (Retains history & schema changes)

Creates an identical table, including history.

5. Difference Between Deep and Shallow Clone of Delta Table

6. When to Use Deep and Shallow Clone of Delta Table?

7. What Are Temporary Views and Permanent Views in Databricks?

Temporary Views
✔ Session-based: Exists only during the user’s session.
✔ Not stored permanently.
✔ Useful for quick transformations or queries.

Example: Creating a Temporary View

Permanent Views
✔ Stored permanently in a schema.
✔ Accessible across multiple sessions.
✔ Useful for sharing results across users.

Example: Creating a Permanent View

Key Difference: Temporary views disappear after the session ends, while permanent
views persist.
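The two view examples above were not preserved; the following hedged sketch uses hypothetical table, view, and column names in a Databricks notebook.

# Temporary view: visible only within the current session.
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW recent_orders_tmp AS
    SELECT * FROM sales_catalog.transactions.order_details
    WHERE order_date >= '2024-01-01'
""")

# Permanent view: stored in a schema and available across sessions and users.
spark.sql("""
    CREATE OR REPLACE VIEW sales_catalog.transactions.recent_orders AS
    SELECT order_id, amount FROM sales_catalog.transactions.order_details
    WHERE order_date >= '2024-01-01'
""")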

8. Difference Between a View and a Table

Example of a View (No storage, only query results)

Example of a Table (Stores actual data)

9. How to List Catalogs, Schemas, and Tables in Databricks?

List All Catalogs (Databases)

List All Schemas (Databases) in a Catalog

List All Tables in a Schema

List All Views in a Schema

Describe a Table’s Structure

Summary: Key Takeaways

Delta Tables: Merging, Upserts, and SCD1 in Databricks

1. How to Use MERGE Logic in Delta Tables?

Merging in Delta Tables allows you to update, insert, or delete records efficiently. The
MERGE statement is used for handling incremental changes like:
Upserts (Insert or Update)
Slowly Changing Dimensions (SCD1, SCD2, etc.)
Soft Deletes

Basic Syntax of MERGE

Checks for a match between target_table and source_table based on id.
Updates matching records and inserts new records if no match is found.
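The MERGE statement itself was not preserved; a hedged sketch matching that description follows, with hypothetical table and column names.

# Upsert: update matching rows, insert the rest.
spark.sql("""
    MERGE INTO target_table AS t
    USING source_table AS s
    ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET t.name = s.name, t.amount = s.amount
    WHEN NOT MATCHED THEN
      INSERT (id, name, amount) VALUES (s.id, s.name, s.amount)
""")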

2. What are Merge Conditions? How to Do Upserts in Delta Tables Using Merge?

Merge Conditions define how records in target and source tables are compared.

Types of Conditions in MERGE:

Upsert (Update + Insert) Using MERGE

Existing records are updated.


New records are inserted.

Use Case: Ideal for incremental data ingestion where some records need updates and
some need inserts.

3. How to Do SCD1 Using MERGE in Delta Tables?

What is SCD1 (Slowly Changing Dimension Type 1)?


Overwrites old values with the latest data (No history tracking).

SCD1 Implementation Using MERGE

Existing customer details are updated.

New customers are inserted.

No history is maintained (latest data overwrites old values).

4. How to Do Soft Delete of Incremental Data Using Merge Statement?

What is Soft Delete?

Instead of deleting rows physically, we mark them as inactive using a status column
(is_deleted).

Soft Delete Using MERGE

Instead of deleting records, we mark them as deleted.

Ensures data remains in the table but is flagged as inactive.

Use Case: Useful for audit logs, compliance tracking, and preventing accidental
deletions.
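The soft-delete MERGE was not preserved; the sketch below follows the description above, with hypothetical table names and an is_deleted flag column.

# Soft delete: rows present in the deletes feed are flagged, not physically removed.
spark.sql("""
    MERGE INTO customers AS t
    USING deleted_customers AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
      UPDATE SET t.is_deleted = true
""")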

Summary: Key Takeaways

Delta Tables: Deletion Vectors, Liquid Clustering, Optimization & File Reading in Databricks

1. What Are Deletion Vectors in Delta Tables?

Definition:

Deletion Vectors logically mark rows as deleted without physically removing them. This
improves query performance and avoids data rewriting.

How They Work?

✔ Instead of rewriting the entire file when deleting a row, Databricks stores deletion
markers.
✔ Query engines skip deleted rows when reading.
✔ Physical deletion happens later during OPTIMIZE operations.

Example: Soft Delete Using Deletion Vectors

Without Deletion Vectors: The entire Parquet file is rewritten.


With Deletion Vectors: Only the deleted row is tracked, reducing cost & improving
efficiency.

Use Case: Improves DELETE, MERGE, and UPDATE performance for large datasets.
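As a hedged illustration (the table and column names are hypothetical), deletion vectors are controlled through a Delta table property; deletes then write deletion markers instead of rewriting whole Parquet files.

# Enable deletion vectors on an existing Delta table, then delete rows.
spark.sql("ALTER TABLE sales SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)")
spark.sql("DELETE FROM sales WHERE status = 'cancelled'")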

2. What Is Liquid Clustering in Delta Tables?

Definition:

Liquid Clustering is an adaptive data clustering mechanism that replaces static Z-
ORDERING. It dynamically organizes data for faster queries without rigid file structure
constraints.

Benefits of Liquid Clustering:

No need for manual re-clustering (Z-ORDER)


Automatically adapts to data distribution changes
Better performance for high-cardinality columns
Minimizes shuffle costs during queries

Enabling Liquid Clustering

Use Case: Improves query performance for large datasets with high-cardinality
columns.

3. How Liquid Clustering Improves Performance in Delta Tables?

Liquid Clustering dynamically adjusts data layout, leading to:


✔ Better data pruning → Queries scan fewer files.
✔ Reduced shuffle operations → Faster joins & aggregations.
✔ Efficient metadata handling → Less strain on query planning.

Example: Queries on customer_id (high-cardinality) run faster when Liquid Clustering


groups similar values together without over-compacting files.

4. How to Optimize a Delta Table with High-Cardinality Columns?

Why Is Optimization Needed?

• High-cardinality columns (e.g., user_id, transaction_id) cause data fragmentation.

• Queries become slow due to poor data locality.

Steps to Optimize a High-Cardinality Delta Table:

Step 1: Enable Liquid Clustering

Step 2: Optimize the Table Periodically

Use Case: Improves query efficiency for tables with millions of unique values.

5. How to Read a File Using SQL in Databricks?

Reading a CSV File:

Reading a JSON File:

Reading a Parquet File:

Use Case: Enables ad-hoc analysis of raw files in SQL without creating tables.
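The query bodies above were not preserved; a hedged sketch follows, where the file paths are hypothetical and the format prefix (csv, json, parquet) selects the reader.

# Query files directly by path without creating tables.
spark.sql("SELECT * FROM csv.`/Volumes/main/raw/files/sales.csv`").show()
spark.sql("SELECT * FROM json.`/Volumes/main/raw/files/events.json`").show()
spark.sql("SELECT * FROM parquet.`/Volumes/main/raw/files/orders.parquet`").show()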

6. How to Enable Liquid Clustering on a Delta Table?

✔ Step 1: Check if Liquid Clustering is enabled:

✔ Step 2: Enable Liquid Clustering:

✔ Step 3: Optimize for best performance:

Use Case: Boosts query speed without excessive file reorganization.
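The statements for the steps above were not preserved; a hedged sketch with a hypothetical table and clustering column:

# Enable liquid clustering on an existing Delta table, then optimize it.
spark.sql("ALTER TABLE sales CLUSTER BY (customer_id)")   # choose the clustering column(s)
spark.sql("OPTIMIZE sales")                               # incrementally clusters the data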

7. What Is Delta Clustering?

Definition:

Delta Clustering is a data layout optimization technique that organizes files for efficient
data access.

Types of Delta Clustering:

Use Case: Choose Liquid Clustering for high-cardinality data & Z-ORDER for low-
cardinality data.

Summary of Key Concepts

Databricks Volumes: Managed & External Storage Explained

1. What Are Volumes in Databricks?

Definition:

Volumes in Databricks provide a structured way to store and manage unstructured data,
such as images, videos, PDFs, and JSON files. They allow users to manage data within
Unity Catalog using standard SQL commands.

Key Features of Volumes:

✔ Supports both structured and unstructured data


✔ Manages access control via Unity Catalog
✔ Can be Managed or External
✔ Supports file operations with SQL & Python

Use Case: Ideal for storing logs, images, ML model outputs, and semi-structured data.

2. How to Create a Managed Volume in Databricks?

A Managed Volume stores data inside Databricks-managed storage.

Steps to Create a Managed Volume:

✔ Step 1: Create a catalog and schema (if not already created).

✔ Step 2: Create a managed volume inside the schema.

✔ Step 3: List volumes inside a schema.

Storage Location: Databricks manages the storage internally.

3. How to Create an External Volume in Databricks?

An External Volume stores data outside Databricks-managed storage, typically in cloud
storage (S3, ADLS, GCS).

Steps to Create an External Volume:

✔ Step 1: Create a Unity Catalog external location (if not already created).

✔ Step 2: Create an external volume using the location.

✔ Step 3: Verify external volumes.

Storage Location: Data remains in S3, ADLS, or GCS but is accessible via Unity Catalog.
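The volume commands in the two sections above were not preserved; below is a hedged sketch covering both the managed and the external case, with hypothetical catalog, schema, volume, and path names (the external path is assumed to be covered by an existing external location).

# Managed volume: files live in Databricks-managed storage under the schema.
spark.sql("CREATE VOLUME IF NOT EXISTS sales_catalog.transactions.raw_files")

# External volume: files stay at a path covered by an external location.
spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS sales_catalog.transactions.landing_zone
    LOCATION 'abfss://data@storageaccount.dfs.core.windows.net/landing'
""")

# Volumes are addressed with /Volumes/<catalog>/<schema>/<volume>/ paths.
spark.sql("SHOW VOLUMES IN sales_catalog.transactions").show()
dbutils.fs.ls("/Volumes/sales_catalog/transactions/raw_files")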

4. How to Store Unstructured Data in Databricks?

Storing Unstructured Data in Volumes

Databricks Volumes support images, videos, PDFs, and other unstructured formats.

Steps to Store Files in a Managed Volume:

✔ Step 1: Upload a file using Python.

✔ Step 2: Check stored files.

Use Case: Store ML model artifacts, logs, images, and more in a managed location.

5. How to Read Files from Databricks Volumes?

Reading Files from a Managed Volume

Use SQL or Python to access files stored in a volume.

✔ Read a CSV File Using SQL:

✔ Read a JSON File Using Python:

✔ List All Files in a Volume:

Use Case: Read logs, model outputs, and training datasets directly from Volumes.

Summary of Key Concepts

Databricks Utilities (dbutils) - A Comprehensive Guide

1. What is Databricks Utility (dbutils)?

dbutils is a built-in Databricks utility library that provides file system access, widgets,
secrets, and notebook operations to enhance workflow automation.

Key Features of dbutils:
✔ File System Management (dbutils.fs) – Copy, move, list, or delete files in DBFS.
✔ Widgets (dbutils.widgets) – Create input fields to make notebooks interactive.
✔ Secrets (dbutils.secrets) – Securely manage credentials (keys, passwords).
✔ Notebook Operations (dbutils.notebook) – Pass parameters, chain notebooks.

Use Case: Automate tasks, handle storage, manage configurations in Databricks.

2. How to Use dbutils Commands in Databricks?

General Syntax:

Example:

This lists all directories in Databricks File System (DBFS).

Modules Available in dbutils:

3. How to Copy Files from Local File System to DBFS?

Databricks File System (DBFS) is a distributed file system built into Databricks.

Steps to Copy Files from Local to DBFS:

✔ Step 1: Upload a file manually to DBFS via the UI or use Python.

✔ Step 2: Verify the file is copied.

✔ Step 3: Read the file in a Spark DataFrame.

Use Case: Move datasets, logs, or configurations from local to Databricks storage.

4. How to Use dbutils.fs Commands?

Common dbutils.fs Commands:

Example: List Files in a Directory

Use Case: Automate file management within DBFS.
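The command table and example above were not preserved; the hedged sketch below shows the most common dbutils.fs calls with hypothetical paths (it also covers copying a local driver file into DBFS, as described in the previous section).

dbutils.fs.ls("dbfs:/")                                                  # list a directory
dbutils.fs.mkdirs("dbfs:/tmp/demo")                                      # create a directory
dbutils.fs.cp("file:/tmp/local_data.csv", "dbfs:/tmp/demo/data.csv")     # copy a local driver file into DBFS
dbutils.fs.mv("dbfs:/tmp/demo/data.csv", "dbfs:/tmp/demo/data_v2.csv")   # move / rename
dbutils.fs.head("dbfs:/tmp/demo/data_v2.csv")                            # preview file contents
dbutils.fs.rm("dbfs:/tmp/demo", recurse=True)                            # delete recursively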

5. How to Create Widgets in Databricks?

Widgets in Databricks allow users to pass parameters dynamically.

Types of Widgets:

✔ Text Widgets (for free-text input)


✔ Dropdown Widgets (for predefined options)
✔ Multiselect Widgets

Example: Creating a Widget

Example: Reading Widget Values

Use Case: Parameterize notebooks for dynamic queries and configurations.
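The widget examples above were not preserved; a hedged sketch with hypothetical widget names and values:

# Create widgets (name, default value, [choices,] label), then read their values.
dbutils.widgets.text("run_date", "2024-01-01", "Run Date")
dbutils.widgets.dropdown("region", "us", ["us", "eu", "apac"], "Region")
dbutils.widgets.multiselect("channels", "web", ["web", "store", "mobile"], "Channels")

run_date = dbutils.widgets.get("run_date")
region = dbutils.widgets.get("region")
print(run_date, region)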

6. How to Pass Parameters to a Notebook in Databricks?

Passing parameters to a notebook is useful for reusability and automation.

Calling Another Notebook with Parameters

Accessing Parameters in the Child Notebook

Use Case: Pass date filters, environment settings, or custom inputs.

7. How to Use Databricks Secrets Utility?

dbutils.secrets allows you to store and retrieve secrets securely.

Steps to Use Databricks Secrets:

✔ Step 1: Create a Secret Scope.

✔ Step 2: Add a Secret (CLI).

✔ Step 3: Access the Secret in Databricks Notebook.

Use Case: Store API keys, database credentials, or sensitive tokens.
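The commands for the steps above were not preserved; the sketch below assumes a secret scope and key already exist (their names, and the JDBC details, are hypothetical).

# Read a secret at runtime; values are redacted if displayed, but can be passed to connectors.
db_password = dbutils.secrets.get(scope="my-scope", key="db-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://<server>:1433;database=sales")
      .option("dbtable", "dbo.orders")
      .option("user", "etl_user")
      .option("password", db_password)
      .load())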

8. How to Use Databricks Notebook Utility (dbutils.notebook)?

dbutils.notebook helps in running and managing notebooks programmatically.

Run a Notebook and Capture Output

Exit a Notebook with a Return Value

Use Case: Build modular, reusable notebooks for pipeline orchestration.
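The run/exit examples above were not preserved; a hedged sketch with a hypothetical child notebook path and parameter name:

# Run another notebook with parameters and capture its return value.
result = dbutils.notebook.run("/Workspace/Shared/etl_child", 600, {"run_date": "2024-01-01"})
print(result)

# Inside the child notebook, read the parameter and return a value to the caller:
# run_date = dbutils.widgets.get("run_date")
# dbutils.notebook.exit("SUCCESS")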

Summary of Key dbutils Features

Databricks Notebooks - Parameterization, Execution, and Scheduling

1. How to Parameterize Notebooks in Databricks?

Notebook parameterization allows you to pass values dynamically, making notebooks reusable for different inputs.

Steps to Parameterize a Notebook

✔ Step 1: Create a widget in the notebook to accept parameters.

✔ Step 2: Retrieve the value inside the notebook.

✔ Step 3: Use it dynamically in SQL or Python logic.

Use Case: Reuse a notebook for different date ranges, regions, or configurations.

2. How to Run One Notebook from Another Notebook?

You can call one Databricks notebook from another using dbutils.notebook.run().

Syntax

Passing Parameters to Another Notebook

Use Case: Modularize your code by breaking down logic into multiple notebooks.

3. How to Trigger a Notebook with Different Parameters from a Notebook?

You can trigger a notebook dynamically with different parameter values.

Example: Running a Notebook with Different Parameters in a Loop

Use Case: Run the same notebook multiple times for different inputs dynamically.

4. How to Create Notebook Jobs?

A Databricks Job allows you to schedule and automate notebook execution.

Steps to Create a Job in Databricks UI:

1. Go to Workflows → Jobs
2. Click Create Job
3. Provide Job Name
4. Select Notebook to run
5. Configure Cluster, Parameters, and Schedule
6. Click Create & Run

Creating a Job Using API

Use Case: Automate notebook execution using Jobs API or UI.
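The API request body above was not preserved; the following is a hedged sketch against the Jobs API (2.1 jobs/create endpoint), where the workspace URL, token scope, cluster ID, and notebook path are hypothetical placeholders.

import requests

host = "https://<your-workspace>.azuredatabricks.net"
token = dbutils.secrets.get(scope="my-scope", key="databricks-pat")

job_spec = {
    "name": "daily-etl",
    "tasks": [
        {
            "task_key": "run_notebook",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Workspace/Shared/etl_child"},
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())   # returns the new job_id on success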

5. How to Orchestrate Notebooks?

Notebook orchestration helps manage dependencies between multiple notebooks.

Approaches to Orchestrate Notebooks

✔ Using dbutils.notebook.run() in a Master Notebook

✔ Using Databricks Workflows for Multi-Task Jobs


1. Create a Job
2. Add multiple tasks (Task 1 → Task 2 → Task 3)
3. Set dependencies between tasks
4. Run the job

Use Case: Create an ETL Pipeline where data ingestion → transformation → model
training happens sequentially.

6. How to Schedule Databricks Notebooks?

You can schedule Databricks notebooks using Jobs UI or the API.

Scheduling a Job Using Databricks UI

1. Go to Workflows → Jobs
2. Select an existing Job or Create New Job
3. Under Schedule, set a Cron Expression (e.g., every hour, daily, weekly)
4. Click Save

Scheduling Using API

Use Case: Automate daily data processing, reporting, and machine learning workflows.

Summary of Key Databricks Notebook Operations


Databricks Compute - A Detailed Overview

1. What is Databricks Compute?

Databricks Compute refers to the computational resources (clusters) used to run workloads like data processing, machine learning, and analytics on Databricks.

Key Features of Databricks Compute:

• Auto-scaling: Clusters can dynamically scale up/down based on workload.

• Optimized for Apache Spark: Runs distributed processing workloads efficiently.

• Multiple Access Modes: Supports different levels of data security.

• Supports Multi-Cloud: Available on AWS, Azure, and GCP.

• Cost Optimization: Automated termination and resource policies reduce costs.

2. What are Different Access Modes Available with Databricks Compute?

Access Modes determine how users and notebooks interact with the cluster.

Which one to use?

• Use Single User Mode for security-sensitive tasks.

• Use Shared Mode for collaborative projects.

• Use No Isolation Shared Mode for faster execution during development.

3. How to Create an All-Purpose Cluster in Databricks?

All-Purpose Clusters allow users to run multiple notebooks and jobs interactively.

Steps to Create an All-Purpose Cluster (Databricks UI)

1. Go to Compute → Click Create Cluster
2. Enter a Cluster Name
3. Select a Databricks Runtime Version
4. Choose Worker Type (e.g., Standard_DS3_v2 for Azure)
5. Select Autoscaling (Optional)
6. Set Access Mode (Single User, Shared, No Isolation)
7. Click Create Cluster

4. Difference Between All-Purpose and Job Compute in Databricks?

Databricks provides two types of compute environments: All-Purpose Clusters and Job
Compute (Job Clusters).

Which one to use?

• Use All-Purpose Clusters for interactive workloads.

• Use Job Clusters to reduce cost for scheduled jobs.

5. What are Different Cluster Permissions in Databricks Compute?

Cluster permissions control who can access, modify, and manage clusters.

Best Practice:

• Grant "Can Attach To" to most users.

• Reserve "Can Manage" for administrators.

6. What are Cluster/Compute Policies in Databricks?

Cluster Policies enforce governance and cost control for Databricks clusters.

Types of Cluster Policies

Example of a Cluster Policy (JSON Format)

Use Case: Enforce cost-efficient, secure, and optimized cluster configurations across
teams.

Summary - Databricks Compute Overview


Databricks Cluster Policies and Instance Pools - A Well-Curated Guide

1. How to Create a Custom Cluster Policy in Databricks?

A Custom Cluster Policy allows administrators to enforce governance, optimize costs, and standardize cluster configurations in Databricks.

Steps to Create a Custom Cluster Policy

1. Navigate to:

• Databricks UI → Click Compute → Cluster Policies

2. Click on "Create Cluster Policy"

3. Define the Policy Name

4. Add JSON Policy Rules (Example Below)

5. Save & Assign Policy to Users/Groups

Example JSON for a Custom Cluster Policy
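A minimal sketch of what such a policy definition might look like, matching the explanation below (the runtime version string and worker bounds are illustrative):

{
  "spark_version": {
    "type": "fixed",
    "value": "10.4.x-scala2.12"
  },
  "autoscale.min_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 2
  },
  "autoscale.max_workers": {
    "type": "range",
    "minValue": 2,
    "maxValue": 8
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 30,
    "hidden": false
  }
}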


Policy Explanation:

• Uses Databricks Runtime 10.4

• Limits min/max worker nodes

• Enforces auto-termination after 30 minutes

2. How to Enforce a Policy on Existing Clusters in Databricks?

To ensure compliance, existing clusters must adhere to defined policies.

Steps to Apply a Policy to Existing Clusters

1. Go to Compute → Click on an Existing Cluster


2. Click Edit → Select a Cluster Policy from the dropdown
3. Save Changes

Note: Some configurations (like worker limits) will be enforced immediately, while
others may require restarting the cluster.

3. How to Maintain Cluster Compliance in Databricks?

Cluster compliance ensures that all teams use cost-efficient, secure, and optimized
clusters.

Best Practices for Compliance:

✔ Use Cluster Policies: Restrict configurations to approved limits.
✔ Enable Auto-Termination: Prevent idle clusters from incurring costs.
✔ Audit Cluster Logs: Regularly check logs for violations.
✔ Restrict Compute Access: Limit who can create/modify clusters.
✔ Monitor Usage: Use Databricks Admin Console for tracking usage.

4. What are Pools in Databricks?

Instance Pools in Databricks help reduce cluster startup time by maintaining a set of
pre-warmed instances.

Key Benefits of Pools:

✔ Faster Cluster Startup (avoids VM provisioning delays)


✔ Cost Optimization (reuses idle instances)
✔ Efficient Resource Allocation

5. How to Create an Instance Pool in Databricks?

Steps to Create an Instance Pool:

1. Navigate to Compute → Click Instance Pools


2. Click Create Pool
3. Enter Pool Name
4. Select Worker Type (e.g., Standard_DS3_v2 for Azure)
5. Set Min & Max Idle Instances
6. Enable Preloaded Databricks Runtime (Optional)
7. Click Create Pool

6. What are Warm Pools in Databricks?

Warm Pools are a subset of Instance Pools that keep pre-initialized Spark driver and
worker instances for ultra-fast cluster spin-up.

Key Differences Between Instance Pools & Warm Pools


7. How to Create a Warm Pool in Databricks?

Steps to Create a Warm Pool:

1. Go to Compute → Click Instance Pools


2. Click Create Pool
3. Enter Pool Name
4. Choose "Preload Databricks Runtime" → Enable it
5. Select a Databricks Runtime Version
6. Set Minimum Idle Instances
7. Click Create Pool

Warm Pools ensure instant cluster startup for mission-critical workloads.

Summary - Databricks Cluster Policies & Pools

Databricks Workflow Jobs - A Well-Curated Guide

1. How to Create Jobs in Databricks Workflows?

Databricks Workflows allow you to schedule and automate data processing using Jobs.

Steps to Create a Job in Databricks Workflows:

1. Go to the Databricks UI → Click Workflows


2. Click Create Job
3. Enter Job Name
4. Click Add Task → Choose:

• Notebook (Execute a Databricks Notebook)

• JAR/Python File (Run a script)

• SQL Query

• Delta Live Table Pipeline


5. Select Cluster Type (Existing / New)
6. Configure Job Schedule (Daily, Hourly, On-Demand)
7. Click Create

Your Databricks job is now scheduled!

2. How to Pass Values from One Task to Another in Databricks Workflow Jobs?

Task Values API allows passing data between tasks in a workflow.

Example: Passing Output from Task A to Task B

Step 1: Define Output in Task A (Python Notebook)

Step 2: Retrieve Output in Task B

Task B reads Task A’s value using taskValues.get()
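A minimal sketch of both steps (the task key "task_a" and the value name are illustrative and must match the task names defined in the job):

# Task A: publish a value for downstream tasks
row_count = 1250  # e.g., number of rows processed in this task
dbutils.jobs.taskValues.set(key="row_count", value=row_count)

# Task B: read the value published by Task A
row_count = dbutils.jobs.taskValues.get(taskKey="task_a", key="row_count",
                                        default=0, debugValue=0)
print(f"Rows processed upstream: {row_count}")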

3. How to Create If-Else Conditional Jobs in Databricks?

Databricks allows conditional workflows using "Run If Condition" in Workflows.

Steps to Create If-Else Conditional Jobs:

1. Go to Workflows → Select Your Job


2. Click Add Task → Choose Notebook/Python
3. Click on Task Triggers → Set Depends On Condition
4. Choose "Run If" Condition

• Succeeded → Run only if previous task succeeds

• Failed → Run only if previous task fails

• Skipped → Run only if previous task is skipped

This method enables conditional execution based on success/failure.

4. How to Create For-Each Loop in Databricks?

Loops in Workflows allow running tasks multiple times with different inputs.

Example: Running a Notebook for Each Item in a List

Step 1: Define List & Loop Over It in Python

Step 2: Read Parameters in Notebook

This runs the notebook for each value in items.
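A minimal sketch of both steps (the worker notebook name "process_item", the timeout, and the items list are illustrative):

# Step 1 (driver notebook): run the same worker notebook once per item
items = ["sales", "inventory", "customers"]
for item in items:
    result = dbutils.notebook.run("process_item", 600, {"dataset": item})
    print(f"{item}: {result}")

# Step 2 (worker notebook "process_item"): read the parameter passed in
dbutils.widgets.text("dataset", "")
dataset = dbutils.widgets.get("dataset")
print(f"Processing dataset: {dataset}")
dbutils.notebook.exit(f"done:{dataset}")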

5. How to Re-Run Failed Databricks Workflow Jobs?

Failed jobs can be retried automatically using retry policies.

Steps to Re-Run Failed Jobs:

1. Go to Workflows → Click on Job Runs


2. Identify Failed Runs
3. Click "Retry Failed Tasks"
4. Modify Retry Count (Default: 3 retries)
5. Save & Execute

This ensures automatic job recovery.

6. How to Override Parameters in Databricks Workflow Job Runs?

You can override parameters dynamically while running a job.

Example: Overriding Notebook Parameters in a Job

Step 1: Add Widgets in Notebook

Step 2: Override Parameter in Job Run

This method allows dynamic parameter changes per execution.
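A minimal sketch (the widget name and dates are illustrative):

# Step 1 (notebook): declare a widget with a default value
dbutils.widgets.text("run_date", "2024-01-01")
run_date = dbutils.widgets.get("run_date")
print(f"Running for {run_date}")

# Step 2: when launching the job, override the default, e.g. via
# "Run now with different parameters" in the Jobs UI, or notebook_params
# in the Jobs API run-now call:
# {"job_id": 123, "notebook_params": {"run_date": "2024-06-30"}}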

Summary - Databricks Workflow Automation

Databricks Data Ingestion & COPY INTO Command - A Well-Curated Guide

1. How to Use COPY INTO Command to Ingest Data in Lakehouse?

The COPY INTO command in Databricks is used to efficiently ingest data from cloud
storage (Azure Blob, AWS S3, Google Cloud Storage) into Delta tables in the Lakehouse.

Syntax of COPY INTO
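A minimal sketch run from a notebook with spark.sql (the table name, storage path, and options are illustrative):

spark.sql("""
    COPY INTO bronze.sales
    FROM 'abfss://raw@mystorageaccount.dfs.core.windows.net/sales/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")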


Steps to Use COPY INTO:

1. Ensure the Delta Table Exists

• If the table doesn't exist, create one before ingestion.


2. Use COPY INTO to Load Data

• Ingests data efficiently while maintaining idempotency.


3. Supports Various File Formats

• CSV, JSON, Parquet, Avro, ORC


4. Provides Schema Evolution

• Supports schema merging (mergeSchema = true)


5. Incremental Data Load

• Only processes new files, avoiding duplication.

2. How Does the COPY INTO Command Maintain Idempotent Behaviour?

Idempotency ensures that multiple executions of the COPY INTO command will not
cause duplicate data loading.

How COPY INTO Ensures Idempotency?

1. Tracks Processed Files in Delta Lake Transaction Log


2. Automatically Ignores Previously Processed Files
3. Allows Partial Loads Without Data Duplication
4. Provides PATTERN for Controlled Ingestion
5. Supports Schema Evolution Without Duplicates

3. How Does COPY INTO Process Files Exactly Once in Databricks?

Databricks ensures exactly-once file ingestion by leveraging Delta Lake's transactional ACID properties.

Mechanisms to Process Files Exactly Once

1. Maintains Metadata of Processed Files

• COPY INTO maintains a manifest of files already loaded.


2. Uses Delta Lake's Transaction Log

• Each file ingestion is logged in _delta_log to prevent reprocessing.


3. Supports Incremental Data Loads

• By default, COPY INTO only picks new files.


4. Can Specify File Patterns

• Controls which files are included.


5. Ensures Data Consistency with MERGE

• Updates existing records when schema changes.

4. How to Create Placeholder Tables in Databricks?

Placeholder tables are used in Databricks when you need to define a table structure
before inserting data.

Creating a Placeholder Table

This creates an empty table in Delta Lake.

Using a Placeholder Table with COPY INTO

Once data arrives, it is ingested automatically.
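A minimal sketch of both steps (the schema, table name, and storage path are illustrative):

# 1. Create an empty (placeholder) Delta table that defines only the schema
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.sales (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE,
        order_date DATE
    ) USING DELTA
""")

# 2. Ingest files into the placeholder table as they land in cloud storage
spark.sql("""
    COPY INTO bronze.sales
    FROM 'abfss://raw@mystorageaccount.dfs.core.windows.net/sales/'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true')
""")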


Summary - COPY INTO & Data Management in Databricks

Databricks Auto Loader - A Well-Curated Guide

1. How to Use Auto Loader in Databricks?

Auto Loader is a Databricks feature that enables incremental ingestion of data from
cloud storage into Delta Lake tables. It automatically detects new files and loads them in
real time.

Key Features of Auto Loader

• Supports streaming & batch modes

• Handles schema drift & evolution

• Minimizes operational overhead

• Processes millions of files efficiently

Basic Auto Loader Syntax
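A minimal sketch (bucket, paths, and table name are illustrative):

# Read JSON files incrementally from S3 with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/schema_tracking/")
      .load("s3://my-bucket/raw/events/"))

# Write the stream to a Delta table
(df.writeStream
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/events/")
   .outputMode("append")
   .table("bronze_events"))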

This reads JSON files from an S3 bucket and writes them to Delta Lake.

2. What are the Different File Detection Modes in Auto Loader?

Auto Loader supports two file detection modes for tracking new files:


File Notification Mode is faster & preferred for real-time streaming.

3. What is Schema Location in Auto Loader?

Schema Location is the storage path where Auto Loader stores inferred schemas for
structured ingestion.

Why Schema Location is Important?

✔ Prevents schema inference on every run
✔ Ensures consistency across streaming & batch jobs
✔ Used for Schema Evolution

Example Usage

This saves schema details in dbfs:/mnt/schema_tracking/.

4. What is Schema Hints in Auto Loader?

Schema Hints allow you to manually specify expected column types when Auto Loader
infers schema.

Why Use Schema Hints?

✔ Prevents schema inference errors
✔ Ensures correct data types
✔ Helps in handling semi-structured data

Example Usage
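A minimal sketch (column names and paths are illustrative):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/schema_tracking/")
      # Hint the expected types for selected columns
      .option("cloudFiles.schemaHints", "customer_id BIGINT, amount DOUBLE, event_date DATE")
      .load("dbfs:/mnt/raw_data/"))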

This ensures Auto Loader expects these specific data types.


5. What is Schema Evolution in Auto Loader?

Schema Evolution allows Auto Loader to automatically adjust when new columns
appear in incoming data.

Why Schema Evolution?

✔ Handles unexpected schema changes
✔ Automatically adds new columns to the table
✔ Avoids job failures due to schema drift

Example Usage
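A minimal sketch (paths are illustrative):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/schema_tracking/")
      # New columns in the source are added to the target schema automatically
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
      .load("dbfs:/mnt/raw_data/"))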

This enables automatic new column addition.

6. What are Different Schema Evolution Modes in Auto Loader?

Auto Loader provides two schema evolution modes to handle changes dynamically.

Use addNewColumns for dynamic schema adaptation.

7. What is RocksDB?

RocksDB is a high-performance key-value store used by Auto Loader for metadata tracking.

Why RocksDB in Auto Loader?

✔ Stores file ingestion metadata
✔ Enables efficient file deduplication
✔ Enhances streaming performance

It helps Auto Loader process large-scale data efficiently.

8. What is File Notification Mode in Auto Loader?

File Notification Mode is an advanced mechanism that pushes real-time file notifications for faster ingestion.

How It Works?

✔ Uses Azure Event Grid, AWS S3 Events, or GCS Pub/Sub
✔ Notifies Auto Loader when new files arrive
✔ Reduces API calls, making ingestion faster

Example: Enabling File Notification Mode
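A minimal sketch (storage path and schema location are illustrative; the required event subscription is assumed to be set up in the cloud account):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Switch from directory listing to event-based file detection
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/schema_tracking/")
      .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/"))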

This enables real-time file ingestion using event-based notifications.

Summary - Databricks Auto Loader

Medallion Architecture in Databricks - A Well-Curated Guide

1. What is Medallion Architecture in Databricks?

Medallion Architecture is a multi-layered data processing approach in Databricks that organizes data into three structured layers:

• Bronze (Raw Data)

• Silver (Cleansed & Enriched Data)

• Gold (Business-Aggregated Data)

Key Benefits of Medallion Architecture:

✔ Improves Data Quality by processing it in stages
✔ Enhances Data Governance with structured layers
✔ Optimizes Performance for queries and analytics
✔ Supports Real-time & Batch Processing

This architecture enables a structured approach to data transformation and storage.

2. What is Lakehouse Medallion Architecture?

Lakehouse Medallion Architecture is the combination of:

✔ Lakehouse Architecture (Unified storage & compute)
✔ Medallion Architecture (Multi-layer data processing)

Key Features of Lakehouse Medallion Architecture:

• Built on Delta Lake (ACID-compliant transactions)

• Supports both structured & unstructured data

• Combines Data Warehouse & Data Lake benefits

• Enhances Data Governance & Security

It allows businesses to process raw data efficiently and turn it into valuable insights.

3. What is the Use of Bronze, Silver, and Gold Layer in Medallion Architecture?

The Medallion Architecture consists of three main layers, each serving a specific
purpose:

Bronze Layer (Raw Data Layer)

✔ Stores raw, unprocessed data (ingested from various sources)
✔ Includes logs, JSON, CSV, Parquet, and unstructured data
✔ Supports data lineage & historical tracking

Example: Storing raw event logs from an application

Silver Layer (Cleansed & Enriched Data Layer)


✔ Performs cleaning, deduplication, and transformation
✔ Standardizes formats and applies data quality rules
✔ Joins multiple sources and adds business logic

Example: Removing duplicates & standardizing dates

Gold Layer (Business Aggregation Layer)

✔ Optimized for reporting & analytics
✔ Contains business-ready, aggregated data
✔ Used by BI tools like Power BI, Tableau, or ML models

Example: Aggregating sales by region

Summary - Medallion Architecture in Databricks

Medallion Architecture ensures scalable, structured, and high-quality data processing!


Delta Live Tables (DLT) in Databricks - A Well-Curated Guide

1. What are Delta Live Tables in Databricks?

Delta Live Tables (DLT) is an ETL framework in Databricks that simplifies data
engineering by enabling declarative data pipelines using SQL and Python.

Key Features of Delta Live Tables:

• Automated Data Processing – Manages ETL pipelines efficiently.

• Data Quality Enforcement – Built-in expectations & monitoring.

• Optimized for Streaming & Batch Workloads – Unified data ingestion.

• Change Data Capture (CDC) Support – Track updates in tables.

• Uses Delta Lake as Storage – ACID transactions & versioning.

Example of a Delta Live Table (DLT) Query:
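A minimal Python sketch (source path and table name are illustrative):

import dlt

@dlt.table(comment="Raw sales data loaded into the Bronze layer")
def sales_bronze():
    return spark.read.format("json").load("dbfs:/mnt/raw_data/sales/")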

2. What is a DLT Pipeline?

A DLT Pipeline is a fully managed data pipeline in Databricks that processes and
transforms data using Delta Live Tables.

Key Components of a DLT Pipeline:

• Source Data: Reads data from cloud storage, Kafka, etc.

• Transformations: Defines LIVE tables and views.

• Data Quality Rules: Ensures valid, consistent data.

• Execution Mode: Supports batch & streaming workloads.

• Monitoring & Lineage: Tracks pipeline execution.

Example of a DLT Pipeline:


1. Define a table:

2. Deploy as a DLT pipeline in Databricks UI.

3. What is a Streaming Table in DLT Pipeline?

Streaming Tables in DLT process continuous, real-time data streams.

Key Features:

• Processes incremental data instead of full table refreshes.

• Optimized for event-driven architectures.

• Ensures low-latency data ingestion.

Example of a Streaming Table in DLT:
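A minimal sketch of a streaming table defined with Auto Loader (paths and the table name are illustrative):

import dlt

@dlt.table(comment="Streaming ingestion of clickstream events")
def events_stream():
    # Returning a streaming DataFrame makes this a streaming table
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "dbfs:/mnt/schema_tracking/events/")
            .load("dbfs:/mnt/raw_data/events/"))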

Used for real-time analytics, fraud detection, and log processing.

4. What is a Materialized View in DLT Pipeline?

A Materialized View in DLT is a precomputed table that refreshes automatically based on source data changes.

Key Features:

• Optimized for BI & reporting.

• Recomputes only when data changes (efficient).

• Faster query performance compared to standard views.

Example of a Materialized View in DLT:
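A minimal sketch (assumes an upstream DLT table named sales_cleaned; column names are illustrative):

import dlt
from pyspark.sql.functions import sum as sum_

@dlt.table(comment="Daily sales aggregated for reporting")
def daily_sales_summary():
    # A non-streaming DLT table over another pipeline table is materialized
    # and refreshed whenever the pipeline updates
    return (dlt.read("sales_cleaned")
            .groupBy("order_date")
            .agg(sum_("amount").alias("total_amount")))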

Used for business dashboards, aggregated reports, and analytical workloads.

5. How to Create a DLT Pipeline?

Steps to create a Delta Live Table (DLT) pipeline:

1. Go to Databricks UI → Click on Workflows → Delta Live Tables


2. Click "Create Pipeline"
3. Select a Notebook or SQL file
4. Define LIVE tables using SQL or Python
5. Configure Pipeline Settings (Storage, Compute, Scheduling)
6. Click "Start" to Deploy the Pipeline

DLT will automatically process and monitor the pipeline execution!

6. What is the LIVE Keyword in DLT Pipeline?

The LIVE keyword is used in DLT SQL queries to reference a table within the pipeline.

Key Purpose:

• Ensures that tables are managed within DLT.

• Automatically handles dependencies between tables.

• Supports incremental processing of new data.

Example:

LIVE ensures that final_sales table is recomputed whenever sales_cleaned is updated!


7. Difference Between DLT Streaming Table & Materialized View

Streaming Table → Real-time updates.


Materialized View → Precomputed & optimized queries.

Summary - Delta Live Tables in Databricks

✔ DLT automates ETL workflows in Databricks.
✔ Supports batch & streaming processing.
✔ Enforces data quality checks.
✔ Optimized for real-time data pipelines.

https://learn.microsoft.com/en-us/azure/databricks/delta-live-tables/

Well-Curated Guide on Delta Live Tables (DLT) & Unity Catalog

1. How to Process Incremental Data in DLT Pipelines?

Processing incremental data in Delta Live Tables (DLT) allows handling only new or
updated records instead of reprocessing the entire dataset.

Methods to Process Incremental Data in DLT:

1. Using STREAMING LIVE TABLE

• For real-time or near real-time processing.

2. Using APPLY CHANGES (Change Data Capture - CDC)

• Tracks changes (INSERT, UPDATE, DELETE) efficiently.

Benefits:

• Processes only new or modified data.

• Improves performance & cost-efficiency.

• Supports schema evolution & CDC.

2. How to Rename a Table in DLT?

DLT does not directly support renaming tables. Instead, follow these steps:

1. Create a New Table with the Desired Name

2. Drop the Old Table

Alternative Approach: Use Databricks SQL to rename tables in Delta Lake (outside
DLT).

3. How to Add New Columns in DLT Tables?

DLT supports schema evolution, so you can dynamically add columns.

Method 1: Using ALTER TABLE


Method 2: Using SELECT with a Default Value

Key Points:

• Auto-merges new columns when schema evolution is enabled.

• Backfills data for new columns with default values.

4. How to Modify an Existing Column in DLT?

Column modifications (changing data type or renaming) are restricted in DLT.

Workaround:
1. Create a new table with the modified column.
2. Migrate data from the old table.
3. Drop the old table (if necessary).

Databricks best practice: Use Deep Clone instead of modifying tables directly.

5. What is Data Lineage in Unity Catalog?

Data Lineage in Unity Catalog provides visibility into data flow, showing how tables,
columns, and queries interact.

Key Features:

• Tracks column-level lineage (Where does the data come from?)

• Supports SQL, Python, and Delta Live Tables.

• Helps in debugging, auditing, and compliance.

• Automatically captures transformations.

How to View Lineage in Databricks UI?


1. Open Unity Catalog → Click on a Table.
2. Select "Lineage" tab to view upstream & downstream dependencies.

6. Internals of Delta Live Tables (DLT)

DLT uses Databricks' Delta Engine to manage ETL pipelines efficiently.

Key Internals of DLT:

1. Pipeline Execution Engine

• Optimized DAG (Directed Acyclic Graph) execution.

• Tracks dependencies between tables.

2. Schema Enforcement & Evolution

• Auto-adapts to schema changes in source data.

3. Checkpointing & Incremental Processing

• Uses WAL (Write-Ahead Logging) to track changes.

• Supports Change Data Capture (CDC) for updates.

4. Storage & Compute Optimization

• Uses Delta Lake for ACID transactions.

• Auto-compaction & file optimization for performance.

Best Practices:

• Enable Enhanced Auto-Scaling for clusters.

• Use Auto-Optimize & Auto-Compaction to improve performance.

Summary

✔ DLT supports incremental data processing with STREAM & CDC.
✔ Schema evolution allows adding new columns dynamically.
✔ Data Lineage in Unity Catalog helps track transformations.
✔ DLT pipelines run efficiently with auto-scaling & optimization.

Well-Curated Guide on Advanced Delta Live Tables (DLT) Concepts

1. How to Add AutoLoader in DLT Pipeline?

Databricks AutoLoader enables efficient ingestion of streaming or batch data into Delta Live Tables (DLT).

Steps to Integrate AutoLoader in a DLT Pipeline:

1. Use AutoLoader to Ingest Data from a Cloud Storage Location:

2. Enable Schema Evolution for Dynamic Data Handling:

Key Benefits:

• Handles schema drift & late-arriving data.

• Efficiently ingests data from cloud storage.

• Supports structured & semi-structured formats.

2. What is the Use of Append Flow in DLT Pipeline?

Append Flow in DLT ensures new data is added without modifying existing records.

Use Case: Capturing new records in an event-based system.

1. Create an Append-only Table in DLT:

Why Use Append Flow?

• Prevents accidental data overwrites.

• Ensures proper historical data tracking.

• Ideal for event-based & CDC (Change Data Capture) use cases.

3. How to Union Data in DLT Pipeline?

Combining multiple datasets into a single unified dataset is done using UNION.

Example: Merging two static tables

Best Practices:

• Ensure both datasets have the same schema.

• Use DISTINCT to remove duplicate records.

4. How to Union Streaming Tables in DLT Pipelines?

Merging multiple streaming tables efficiently.

Example: Combining two live streaming sources

Key Considerations:

• Ensure schema consistency between tables.

• Use WINDOW functions if event timestamps differ.

• Monitor ingestion rate to avoid processing delays.

5. How to Pass Parameters in DLT Pipelines?

DLT allows parameterization using Databricks Widgets or Pipeline Settings.

Method 1: Using Pipeline Configuration Parameters

• Navigate to Databricks UI → Workflows → DLT Pipeline → Edit Pipeline → Add Configuration Parameters.

• Example Parameter: table_name = "customer_data"

Method 2: Using Parameters in SQL DLT Pipelines

Method 3: Using Python with dbutils

Why Pass Parameters?

• Makes pipelines reusable & dynamic.

• Supports environment-specific configurations.

• Helps in customizing table creation.

6. How to Generate DLT Tables Dynamically?

DLT supports dynamic table creation using Python & SQL queries.

Method 1: Using Python to Create Tables Based on Metadata

Method 2: Using SQL with Dynamic Naming


Use Cases for Dynamic Table Generation:

• Handling multiple datasets without writing separate queries.

• Automating ETL pipeline scaling.

• Creating tables on-the-fly based on metadata configurations.

Summary

✔ AutoLoader efficiently ingests data into DLT pipelines.
✔ Append Flow prevents overwriting existing records.
✔ Union in DLT combines datasets, supporting both static & streaming tables.
✔ DLT supports parameterization for reusable & dynamic pipelines.
✔ Dynamic table generation allows automation in data processing.

Well-Curated Guide on Slowly Changing Dimensions (SCD) & CDC in Databricks Delta Live
Tables (DLT)

1. How to Create an SCD Type 2 Table in DLT?

Slowly Changing Dimension Type 2 (SCD2) maintains historical records by adding new
rows with start and end timestamps.

Steps to Create an SCD2 Table in Delta Live Tables (DLT):

SQL Approach


PySpark Implementation:
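A minimal sketch using DLT's APPLY CHANGES API, which handles the MERGE logic described below (the CDC source table customers_cdc_feed, key column, and sequencing column are illustrative assumptions; older runtimes expose create_streaming_live_table instead of create_streaming_table):

import dlt

# Target streaming table that will hold the SCD2 history
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc_feed",    # upstream streaming table carrying CDC records
    keys=["customer_id"],
    sequence_by="event_timestamp",  # column that orders the changes
    stored_as_scd_type=2            # keep history; DLT adds __START_AT / __END_AT columns
)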


Key Features of SCD2:

• Keeps historical versions of data.

• Adds new records instead of updating old ones.

• Uses MERGE INTO to manage updates.

2. How to Create an SCD Type 1 Table in DLT?

Slowly Changing Dimension Type 1 (SCD1) updates records in place, without maintaining historical data.

Steps to Create an SCD1 Table in DLT:

SQL Approach

PySpark Implementation:


Key Features of SCD1:

• Overwrites old data without keeping history.

• Simpler and more space-efficient than SCD2.

• Useful when only the latest data is required.

3. How to Backfill or Backload Missing Data in SCD2 Table in DLT?

Backfilling ensures past records are included in SCD2 tables.

Steps to Backfill Data in SCD2 Table:


1. Identify missing records using a LEFT JOIN.
2. Use MERGE INTO to insert missing records.

Why Backfill Data?

• Ensures historical completeness.

• Useful for reloading lost data after failures.

PySpark Implementation:

4. How to Delete Data from SCD Tables in DLT?

Deleting records in SCD tables should be handled carefully to maintain history.

Steps to Delete Data from an SCD Table:

PySpark Implementation:

Best Practices:

• Use soft deletes by adding an is_deleted flag instead of hard deletes.

• Ensure business rules allow data deletion.

5. How to Truncate SCD Tables in DLT?

Truncating removes all data from a table but retains its structure.

Steps to Truncate an SCD Table:

PySpark Implementation:

Key Considerations:

• All data will be removed permanently.

• Best for development/test environments, not production.

6. What is Change Data Capture (CDC) in DLT?

CDC captures and tracks changes (INSERTS, UPDATES, and DELETES) in a source table.

Why Use CDC?

• Enables real-time data synchronization.

• Reduces data processing costs by processing only changes.

• Essential for incremental data processing in Lakehouse architectures.

Example CDC Table in DLT:

PySpark Implementation for CDC Table in DLT:

7. How to Design CDC Tables in DLT?

CDC tables store historical changes with operation types (INSERT, UPDATE, DELETE).

Steps to Design a CDC Table in DLT:

1. Create a CDC Staging Table with Metadata:

In PySpark:

2. Merge CDC Changes into the Target Table:


In PySpark:

Key Benefits of CDC Tables:

• Efficiently handles incremental updates.

• Supports real-time & batch processing.

• Optimized for Lakehouse & Delta architecture.

Summary

✔ SCD Type 1 updates records in place without history.
✔ SCD Type 2 maintains historical data with start & end timestamps.
✔ Backfilling ensures completeness of historical data.
✔ CDC tracks changes (INSERT, UPDATE, DELETE) efficiently.
✔ CDC tables improve real-time data sync & reduce processing costs.


Data Quality in Delta Live Tables (DLT) – Well Curated Guide

Delta Live Tables (DLT) provides built-in data quality enforcement using Expectations to
ensure clean and reliable data. You can define rules for data validation, apply actions
when rules are violated, and monitor the pipeline's performance.

1. How to Use Data Quality in DLT Pipelines?

Why Data Quality Matters?

• Ensures data integrity and reliability before processing.

• Reduces bad data propagation in downstream processes.

• Enables automatic monitoring and alerts for data quality violations.

Steps to Implement Data Quality in DLT:

1. Define Expectations in DLT tables.

2. Apply Actions when an expectation fails.

3. Monitor data quality through UI and SQL queries.

Example: Defining Data Quality in DLT Table (Python)
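A minimal sketch matching the explanation below (source and target table names are illustrative):

import dlt

@dlt.table(comment="Customers that pass basic quality checks")
@dlt.expect_all_or_drop({
    "valid_customer_id": "customer_id IS NOT NULL",
    "valid_age": "age > 18"
})
def customers_clean():
    return dlt.read("customers_raw")   # assumes an upstream table named customers_raw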

Explanation:

• Ensures customer_id is not null and age is above 18.

• If data violates these rules, those rows are dropped (expect_all_or_drop).


2. How to Use Expectations in DLT?

What Are Expectations?

Expectations define rules for validating data within Delta Live Tables.

Types of Expectations

Example: Expectations in DLT

Explanation:

• If order_amount is ≤ 0, the entire pipeline fails (expect_all_or_fail).

• If currency is invalid, those rows are dropped (expect_all_or_drop).

3. What Are Different Actions in Expectations?

Actions for Failed Data Quality Checks

Example: Using Different Expectation Actions
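A minimal sketch matching the explanation below (table and column names are illustrative):

import dlt

@dlt.table(comment="Sales records validated with three different expectation actions")
@dlt.expect_or_fail("positive_price", "price > 0")               # violation fails the update
@dlt.expect_or_drop("positive_quantity", "quantity > 0")         # violating rows are dropped
@dlt.expect("known_region", "region IN ('NA', 'EMEA', 'APAC')")  # violation is only logged
def sales_checked():
    return dlt.read("sales_raw")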


Explanation:

• Fails the pipeline if price is ≤ 0.

• Drops rows where quantity is ≤ 0.

• Logs a warning if region is invalid but still includes the data.

4. How to Monitor a DLT Pipeline?

Methods to Monitor Data Quality in DLT

1. DLT UI Monitoring – Check logs, errors, and expectations in Databricks.

2. SQL Queries – Query system tables to analyze data quality.

3. Alerts & Notifications – Set up alerts based on pipeline failures.

Monitoring via Databricks UI

• Go to Databricks → Workflows → Delta Live Tables.

• Select your DLT pipeline to view logs, failures, and expectations.

Using SQL to Monitor Failures

Explanation:

• Shows all failed expectations and their counts.

5. How to Monitor a DLT Pipeline Using SQL Queries?

Query Failed Records in DLT


Explanation:

• Filters failed expectation checks in DLT pipelines.

Count Dropped Records

Explanation:

• Shows how many rows were dropped per table due to failed expectations.

• https://learn.microsoft.com/en-us/azure/databricks/delta-live-tables/observability

6. How to Define Data Quality Rules in DLT Pipelines?

Best Practices for Data Quality in DLT

✔ Use multiple expectations – Validate different rules at once.
✔ Fail only when necessary – Use warn or drop for non-critical checks.
✔ Monitor frequently – Query system tables for failures.

Example: Applying Multiple Data Quality Rules

Explanation:

• Drops records with invalid email or age.

• Logs a warning if phone number is not 10 digits but continues processing.

Summary – Data Quality in DLT Pipelines

https://learn.microsoft.com/en-us/azure/databricks/delta-live-tables/expectations

By implementing these best practices, you can ensure high data quality in Delta Live
Tables (DLT) with automated monitoring and enforcement!

Delta Live Tables (DLT) – Advanced Features and Best Practices

This guide covers truncate load tables, full refresh strategies, streaming table
optimizations, and file-based triggers in Databricks Delta Live Tables (DLT).

1. How to Use Truncate Load Table as Source in DLT Pipelines?

What is a Truncate Load Table?

• A truncate load table is a table where data is fully replaced with each run.

• Used when incremental logic is not applicable (e.g., full data refresh).

Example: Using Truncate Load in DLT
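A minimal sketch (source path and table name are illustrative):

import dlt

@dlt.table(
    name="customers",
    comment="Recomputed from the source on every pipeline update (truncate-and-load)"
)
def customers():
    # A non-streaming DLT table is fully recomputed on each run, so the previous
    # contents are effectively replaced with the latest snapshot of the source
    return spark.read.format("parquet").load("dbfs:/mnt/raw_data/customers/")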


Explanation:

• This replaces the entire customers table on every run.

• No need for incremental logic—ensures a full refresh each time.

2. What is the Use of skipChangeCommits Feature?

Why Use skipChangeCommits?

• Avoids unnecessary change commits in tables, reducing write overhead.

• Useful when using external systems (CDC tools, replication) that react to each
commit.

• Helps optimize performance by skipping redundant metadata updates.

Enabling skipChangeCommits in DLT
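A minimal sketch (assumes an existing Delta table bronze.orders as the streaming source):

import dlt

@dlt.table(comment="Streams appended rows only; commits that update or delete rows are skipped")
def orders_stream():
    return (spark.readStream
            .option("skipChangeCommits", "true")
            .table("bronze.orders"))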

Explanation:

• This prevents unnecessary metadata updates, improving streaming table performance.


3. How to Full Refresh a DLT Pipeline?

When to Use a Full Refresh?

• When data changes completely between runs.

• When fixing corrupt or inconsistent data.

• When schema changes require a full reload.

Methods for Full Refresh in DLT

Method 1: Using drop_existing = true

Explanation:

• Allows full reset of the table during pipeline updates.

Method 2: Manually Reset the Pipeline

Run the following command in a Databricks notebook:

Then restart the pipeline.

4. How to Avoid Streaming Tables from Getting Fully Refreshed?

Why Avoid Full Refresh on Streaming Tables?

• Streaming tables process incremental data; full refreshes can break streaming state.

• Helps in optimizing costs and reducing processing time.

Best Practices to Avoid Full Refresh

1. Use apply_changes for Incremental Loads

Explanation:

• Uses apply_changes to update rows incrementally.

• Avoids full refresh in every pipeline execution.

2. Enable Checkpointing

• Helps retain state without reprocessing old data.

5. What are File Arrival Triggers in Databricks Workflows?

What is a File Arrival Trigger?

• A trigger-based mechanism that runs a job when a file arrives in a specified location.

• Used to automate pipeline execution when new data is available.

Use Cases

✔ Automate DLT pipelines when new files arrive.
✔ Reduce unnecessary processing by triggering jobs only when needed.
✔ Avoid polling overhead and improve efficiency.

Example: Using File Arrival Trigger in Databricks Workflows

1. Creating a File Arrival Trigger in Databricks

• Go to Databricks Workflows → Create Job.

• Select "Trigger type" → "File arrival".

• Define the storage location to monitor.

2. Example: Triggering a DLT Pipeline on New File Arrival

Explanation:

• Monitors dbfs:/mnt/raw_data/ for new .csv files.

• Triggers the DLT pipeline whenever a new file arrives.

6. How to Use File-Based Trigger in Databricks?

File-Based Trigger in Auto Loader

• File-based triggers ensure jobs execute only when data is available.

• Works well with Delta Live Tables and Auto Loader.

Example: Using File-Based Trigger in Auto Loader
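A minimal sketch matching the explanation below (schema and checkpoint locations are illustrative):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/schema_tracking/")
      .load("dbfs:/mnt/raw_data/"))

(df.writeStream
   .trigger(once=True)   # process the files that are available, then stop
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/bronze_table/")
   .table("bronze_table"))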


Explanation:

• Listens for new files in dbfs:/mnt/raw_data/.

• Processes them once using .trigger(once=True).

• Stores data in a Delta Table (bronze_table).

Summary – Key Takeaways

By using these strategies, you can optimize DLT pipelines for efficiency, scalability, and
automation!

Databricks Secret Management – A Complete Guide

Databricks Secret Management allows storing and accessing sensitive data securely, such
as API keys, database credentials, and authentication tokens. This guide covers secret
scopes, Azure Key Vault integration, CLI installation, and authentication.

1. What is Databricks Secret Management?

Databricks Secret Management provides a secure way to store and retrieve sensitive information. Secrets are encrypted and stored in Secret Scopes.

Supports two types of secret scopes:

• Databricks-backed scopes (managed by Databricks).

• Azure Key Vault-backed scopes (integrates with Azure Key Vault).

Use Cases:

✔ Store database credentials securely.
✔ Manage API tokens for external services.
✔ Secure access keys for cloud storage.

2. What are Secret Scopes in Databricks?

Secret Scopes define the boundary for storing and accessing secrets.
Each Secret Scope contains multiple secrets (key-value pairs).
Two types of Secret Scopes:

1. Databricks-Backed Secret Scope

• Managed inside Databricks.

• Stored internally within Databricks.

• Supports ACL (Access Control Lists) for permission control.

2. Azure Key Vault-Backed Secret Scope

• Integrates with Azure Key Vault.

• Secret values are stored in Azure Key Vault instead of Databricks.

• Requires Azure Key Vault setup.

3. How to Save and Use Secrets in Databricks?

Step 1: Create a Secret Scope

Method 1: Using Databricks UI

1. Go to Databricks Workspace → Settings → Admin Console.


2. Select Secret Scopes → Create.
3. Choose Databricks-Backed or Azure Key Vault-Backed scope.
4. Provide a name and create the scope.
Method 2: Using Databricks CLI

Step 2: Save a Secret in Databricks

Using Databricks CLI

Step 3: Use Secrets in a Databricks Notebook
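A minimal sketch (scope and key names are illustrative):

# Read a secret inside a notebook; the value is redacted if printed to notebook output
db_password = dbutils.secrets.get(scope="my_scope", key="db_password")

# Use the secret when building connections instead of hard-coding credentials
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales"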

Returns the secret value securely without exposing it in plaintext.

4. How to Create and Use Azure Key Vault to Save Secrets in Databricks?

Step 1: Create an Azure Key Vault

1. Go to Azure Portal → Create a Resource.


2. Search for Key Vault and create it.
3. Set up Access Policies to allow Databricks access.

Step 2: Create a Key Vault-Backed Secret Scope in Databricks

Run the following CLI command:

databricks secrets create-scope --scope my_akv_scope --scope-backend-type AZURE_KEYVAULT --resource-id <key_vault_resource_id> --dns-name <key_vault_dns_name>

Replace <key_vault_resource_id> and <key_vault_dns_name> with your Azure Key Vault details.

Step 3: Use Secrets from Azure Key Vault in Databricks


Securely retrieves the secret stored in Azure Key Vault.

5. What is a Databricks-Backed Secret Scope?

Databricks-backed secret scopes are fully managed within Databricks. They do not require external cloud services (like Azure Key Vault). Secrets are encrypted and stored within the Databricks control plane.

Limitations:

• Cannot access secrets outside Databricks.

• No integration with Azure Key Vault.

6. How to Install Databricks CLI?

The Databricks CLI allows managing secret scopes, clusters, jobs, and workspaces from
the command line.

Step 1: Install Databricks CLI

On macOS or Linux

On Windows

7. How to Authenticate Databricks CLI?

Step 1: Configure Authentication

Run the following command:

Enter Databricks workspace URL and personal access token when prompted.

Step 2: Verify Authentication

Check if Databricks CLI is authenticated:

If it lists workspace directories, authentication is successful.

Summary – Key Takeaways

By leveraging Secret Management in Databricks, you can securely store and access
sensitive data without exposing credentials in notebooks!

User and Identity Management in Databricks

This guide covers user management, service principals, groups, SCIM, and auto-
provisioning from Microsoft Entra ID (Azure AD) in Databricks.

1. How to Add Users in Databricks?

Method 1: Using Databricks UI

1. Go to Databricks Workspace
2. Click on Admin Settings → Select User Management
3. Click on Add User → Enter Email ID

4. Assign roles (Workspace Admin, User, etc.)
5. Click Invite – The user receives an invitation via email

Users can log in once they accept the invitation.

Method 2: Using Databricks CLI

Adds a new user via CLI.

2. How to Create Service Principals in Databricks?

A Service Principal (SPN) is an identity used by applications or automation scripts to access Databricks securely.

Step 1: Create a Service Principal in Databricks

Method 1: Using Databricks UI

1. Go to Admin Settings → Service Principals


2. Click Add Service Principal → Enter a name
3. Click Save

Method 2: Using Databricks CLI

Creates a Service Principal in Databricks.

3. How to Create a Service Principal in Azure?

A Service Principal in Azure is required for secure authentication to Databricks.

Steps to Create a Service Principal in Azure Portal

1. Go to Azure Portal → Search for Azure Active Directory


2. Click App Registrations → New Registration
3. Enter Name → Choose Single Tenant → Click Register
4. Copy the Application (Client) ID

Now, generate a client secret:


1. Go to Certificates & Secrets → New Client Secret
2. Copy the Client Secret (it won’t be visible later)

4. How to Use Service Principal in Databricks?

Once a Service Principal is created, authenticate it to Databricks.

Step 1: Assign Permissions to the Service Principal

Assigns permissions to run jobs, access clusters, etc.

Step 2: Authenticate Databricks CLI Using Service Principal

databricks configure --host <databricks-instance-url> --client-id <service-principal-client-id> --client-secret <service-principal-secret>

Uses Service Principal credentials for authentication.

5. How to Create and Use Groups in Databricks?

Step 1: Create a Group in Databricks

Using Databricks UI

1. Go to Admin Settings → Groups
2. Click Create Group → Enter Group Name
3. Click Save

Using Databricks CLI

Creates a group named DataEngineers.

Step 2: Add Users to a Group

Adds a user to the DataEngineers group.

6. What is SCIM in Databricks?

SCIM (System for Cross-domain Identity Management) automates user provisioning and group management in Databricks.

SCIM Benefits:

✔ Auto-sync users & groups from Azure AD (Microsoft Entra ID)
✔ Reduces manual user onboarding & offboarding
✔ Ensures role-based access control (RBAC) compliance

7. How to Auto-Provision Users from Microsoft Entra ID (Azure AD) in Databricks?

To enable automatic user provisioning, configure SCIM with Microsoft Entra ID.

Step 1: Enable SCIM in Databricks

1. Go to Admin Settings → User Management


2. Click Enable SCIM

Step 2: Configure SCIM in Azure AD

1. Go to Azure Portal → Open Azure Active Directory


2. Click Enterprise Applications → New Application
3. Search for Databricks → Click Create

Once created:
4. Navigate to Provisioning → Click Start Provisioning
5. Choose Automatic → Add SCIM Token from Databricks
6. Click Save

Step 3: Verify User Sync in Databricks

Go to Admin Settings → Users


Ensure users are auto-provisioned from Azure AD

Summary – Key Takeaways

By using these features, you can efficiently manage users, authentication, and access
control in Databricks!

https://learn.microsoft.com/en-us/azure/databricks/admin/users-groups/scim/aad

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/

Securing Data in Databricks Unity Catalog

This guide covers table security, permissions, access control, privilege management, and
data security in Databricks Unity Catalog.

1. How to Secure Tables in Unity Catalog?

Unity Catalog provides fine-grained access control to secure tables.
Security is managed at catalog, schema, and table levels using GRANT statements.

Steps to Secure a Table in Unity Catalog

Grant Access to a Table

Grants read-only (SELECT) access to a user.

Revoke Access to a Table

Removes access from the user.

Restrict All Access to a Table

Blocks all access to the table.

2. What are Different Permissions Available in Unity Catalog?

Unity Catalog has role-based access controls (RBAC) with different levels of permissions.

Object-Level Permissions in Unity Catalog

The MANAGE permission grants full control over the object.

3. How to Hide Tables in Unity Catalog?

Tables in Unity Catalog cannot be hidden explicitly, but you can restrict access to make
them invisible to users.

Restrict Access to Hide a Table

Users without SELECT permission will not see the table.

Alternative: Use views to control column visibility.

The view exposes only selected columns, effectively hiding the table's full structure.

4. What is MANAGE Permission in Unity Catalog?

The MANAGE permission grants full control over an object.

Example: Granting MANAGE Permission on a Table

This user can modify, delete, and grant/revoke access to others.

5. What is Data Access Control in Databricks?

Data Access Control ensures only authorized users can access or modify data.
Databricks enforces security via Unity Catalog with Role-Based Access Control (RBAC).

Data access control includes:

• Identity & Access Management (IAM) for authentication

• Table & schema-level permissions via SQL

• Row & column-level security

• Encryption & masking sensitive data

6. How to Manage Privileges on Objects in Unity Catalog?

Privileges on objects (tables, schemas, catalogs) are managed via GRANT, REVOKE, and
SHOW GRANTS statements.

Grant a Privilege

Grants SELECT and INSERT permissions to a user group.

Revoke a Privilege

Removes the INSERT permission while keeping SELECT access.

View Assigned Privileges

Displays who has access and what permissions they have.

7. What are Different Ways to Provide Privileges in Unity Catalog?

Unity Catalog supports multiple ways to assign privileges:

Method 1: Grant Permissions Using SQL

Directly assigns permissions to a user.

Method 2: Use Groups for Access Control

Assigns permissions to a group, simplifying access control.

Method 3: Assign Permissions at Catalog Level

Grants access to all schemas & tables inside the catalog.

8. How to Secure Data in Databricks?

Databricks provides multiple layers of security to protect data.

1. Secure Data with Unity Catalog Permissions

Use GRANT & REVOKE to control access to tables, views, and functions.

2. Row-Level Security (RLS) using Views

Users can only access rows matching specific criteria.

3. Column-Level Security Using Masking Functions


Masks sensitive columns from unauthorized users.

4. Data Encryption at Rest and in Transit

Data at rest is encrypted using AES-256


Data in transit is secured with TLS encryption

5. Secure External Storage Using Access Controls

Controls who can access external storage.

6. Audit Logging for Compliance

Databricks logs all user actions in Unity Catalog for compliance.

Tracks who accessed what data and when.

Summary – Key Takeaways


By implementing these security measures, you can ensure controlled, compliant, and
secure data access in Databricks!

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-
catalog/manage-privileges/

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/

User-Defined Functions (UDFs) in Unity Catalog – Databricks SQL

This guide covers creating functions, scalar vs. table UDFs, writing Python functions, and
registering UDFs in Unity Catalog.

1. How to Create Functions in Unity Catalog in Databricks SQL?

Unity Catalog supports User-Defined Functions (UDFs) to encapsulate reusable logic. Functions can be created using SQL or Python and registered in Unity Catalog.

Example: Creating a Scalar SQL Function
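A minimal sketch run from a notebook with spark.sql (catalog and schema names are illustrative):

spark.sql("""
    CREATE OR REPLACE FUNCTION main.default.add_numbers(a INT, b INT)
    RETURNS INT
    RETURN a + b
""")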

Creates a function that adds two numbers.

Using the Function in SQL Queries
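For example, calling the function with 5 and 10 (using the same illustrative three-part name):

spark.sql("SELECT main.default.add_numbers(5, 10) AS result").show()
# +------+
# |result|
# +------+
# |    15|
# +------+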

Returns: 15

2. What are SCALAR and TABLE User-Defined Functions (UDFs) in Databricks?

Databricks SQL supports two main types of UDFs:

3. What is a User-Defined Function (UDF) in Unity Catalog?

A User-Defined Function (UDF) is a custom function that extends SQL’s built-in capabilities. UDFs can be written in SQL or Python and stored in Unity Catalog. They provide modular, reusable, and centrally managed logic.

Example: Creating a Scalar Python UDF in Unity Catalog

Function Usage:

4. What is a User-Defined Table Function (UDTF) in Databricks SQL?

A User-Defined Table Function (UDTF) returns a table instead of a single value. UDTFs allow processing multiple rows per input row (e.g., splitting a string into multiple rows).

Example: Creating a Table Function (UDTF) to Split a String


Usage of Table Function:

Returns:

5. How to Write a Python Function in Databricks SQL?

Python functions can be used directly in SQL using the LANGUAGE PYTHON keyword.

Example: Writing a Python Function in Databricks SQL

Function Usage:

6. How to Register a Function in Unity Catalog?


Functions are registered in Unity Catalog so they can be used across workspaces.

Steps to Register a Function in Unity Catalog

1. Define the Function using CREATE FUNCTION.


2. Specify the Catalog & Schema where it will be stored.
3. Grant Access so other users can use the function.

Example: Registering a Function in Unity Catalog

Grant Permission to Other Users

Function Usage in SQL Queries

Summary – Key Takeaways

By using UDFs in Unity Catalog, you can create reusable logic, simplify queries, and
improve efficiency in Databricks SQL.

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-
functions-builtin

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-
ddl-create-sql-function

Data Security and Row-Level Filtering in Databricks Unity Catalog

This guide covers filtering sensitive data, applying row-level security (RLS), and using
dynamic views/queries in Databricks Unity Catalog.

1. How to Filter Sensitive Data in Tables in Unity Catalog?

Unity Catalog provides fine-grained access control to filter sensitive data.


Use Column Masking, Row-Level Security (RLS), and Dynamic Views to control access.

Example: Masking Sensitive Data (Email Masking)
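A minimal sketch using a dynamic view (the view, table, and '<admin_email>' placeholder are illustrative; the original example references a specific user):

spark.sql("""
    CREATE OR REPLACE VIEW secure.customers_masked AS
    SELECT
        customer_id,
        CASE
            WHEN current_user() = '<admin_email>' THEN email
            ELSE 'MASKED'
        END AS email
    FROM secure.customers
""")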

Only [email protected] can see full emails, others see "MASKED".

2. How to Apply Row-Level Filters in Databricks Unity Catalog?

Row-Level Filters restrict data visibility based on user roles, attributes, or dynamic
conditions.
Use Dynamic Views to implement row filtering.

Example: Filtering Orders Based on User's Region


Function current_user_region() should return the region associated with the logged-in
user.
Each user sees only the rows where their region matches.

3. How to Apply Row-Level Security (RLS) in Databricks Unity Catalog?

Row-Level Security (RLS) ensures users see only permitted data based on roles, user
attributes, or business logic.
Use Dynamic Views to enforce RLS.

Example: Row-Level Security Based on User Role

HR Managers see salaries, others see NULL.

4. What are Dynamic Views in Databricks Unity Catalog?

Dynamic Views allow fine-grained security by applying user-based or attribute-based filtering. They dynamically adjust based on the current user or their roles.

Example: Creating a Dynamic View for Customer Data


Only [email protected] can see full credit card details, others see
"MASKED".

5. What are Dynamic Queries in Databricks?

Dynamic Queries adjust execution based on current user, roles, or query context.
They can be implemented using Databricks SQL, Python, or Spark SQL.

Example: Dynamic Query Based on User Role

User's role determines which region's data they see.

Summary – Key Takeaways

By implementing Row-Level Security and Dynamic Views, you can enforce fine-grained
access control in Databricks Unity Catalog efficiently!

https://learn.microsoft.com/en-us/azure/databricks/tables/row-and-column-filters

Data Masking and Column-Level Security in Databricks Unity Catalog

This guide covers masking sensitive data, applying column-level security, and handling
Personally Identifiable Information (PII) in Databricks Unity Catalog.

1. How to Mask Sensitive Column Data in Unity Catalog?

Data Masking hides sensitive data such as emails, phone numbers, or financial details
from unauthorized users.
Use CASE statements, Dynamic Views, or Column-Level Security (CLS) to implement
masking.

Example: Masking Email Addresses

Only [email protected] sees full emails, others see "MASKED".

2. How to Apply Column-Level Masking in Databricks Unity Catalog?

Column-Level Masking ensures that specific columns are visible only to authorized
users.
Use Dynamic Views and CASE statements to apply column-level masking.

Example: Masking Credit Card Numbers


Only [email protected] can see full credit card numbers.

3. How to Apply Column-Level Security in Databricks Unity Catalog?

Column-Level Security (CLS) controls which users or groups can access specific columns.
Use GRANT SELECT ON COLUMNS to restrict access at the column level.

Example: Granting Access to Specific Columns

Only [email protected] can see salary and department columns.

4. How to Mask PII (Personally Identifiable Information) Data in Databricks?

PII Masking ensures compliance with GDPR, CCPA, and other regulations.
Use Dynamic Views, Hashing, or Partial Masking.

Example: Partially Masking Social Security Numbers (SSNs)

Only the first 3 digits of SSN are visible; the rest are masked.


5. How to Mask PII Column Data Using Unity Catalog in Databricks?

Use Unity Catalog with Dynamic Views to mask PII columns based on user roles.

Example: Masking Full Names Based on User Role

Only [email protected] sees full names. Others see "MASKED".

Summary – Key Takeaways

By implementing Column-Level Security and PII Masking, you can protect sensitive
data and ensure compliance in Databricks Unity Catalog!

https://learn.microsoft.com/en-us/azure/databricks/tables/row-and-column-filters
