databricks-dr-examples

A collection of minimal example scripts for setting up Disaster Recovery for Databricks.

This code is provided as-is and is meant to serve as a set of baseline examples. You may need to alter these scripts to work in your environment.

Notes on cross-workspace connectivity

These scripts generally assume that they will be run in a notebook in the primary workspace, and that the workspace can reach the secondary workspace directly via the SDK; this may not always be true in your environment. If connectivity issues prevent the scripts from running, you have two options:

  • Alter the workspace networking to allow connectivity; this may involve adjusting firewalls, adding peering, etc.
  • Run the scripts remotely using Databricks Connect

In the latter option, the following adjustments need to be made:

  • Set up Databricks Connect in your environment
  • In all scripts, add an import statement for Databricks Connect, i.e., from databricks.connect import DatabricksSession
  • In all scripts, instantiate a Spark Session, i.e., spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()

These changes will allow the code to run remotely on a local machine or cloud VM.
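
Putting these adjustments together, a minimal sketch of a remotely-run script might begin as follows; the profile name "secondary" is a placeholder for a profile defined in your .databrickscfg:

# Minimal sketch: running a script remotely via Databricks Connect.
# "secondary" is a placeholder profile name from ~/.databrickscfg.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("secondary").getOrCreate()

# Sanity check that the session is attached to the intended workspace.
spark.sql("SELECT current_metastore()").show()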

Repo Contents

Snippets that demonstrate basic functionality (located in /examples/):

  • clone_to_secondary.py: performs a DEEP CLONE of a set of catalogs in the primary workspace to a storage location in the secondary region (see the sketch after this list).
  • clone_to_secondary_par.py: parallelized version of clone_to_secondary.py.
  • create_tables_simple.py: simple script that must be run in the secondary region to register managed/external tables based on the output of clone_to_secondary.py.
  • sync_views.py: simple script to sync views; this will need to be adapted to your environment.
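
At their core, the clone scripts issue a Delta DEEP CLONE per table into storage in the secondary region. A minimal sketch of that operation, in which the source table name and target storage URL are placeholders:

# Minimal sketch of the per-table operation performed by clone_to_secondary.py.
# The source table name and target storage URL are placeholders.
spark.sql("""
    CREATE OR REPLACE TABLE delta.`abfss://landing@drstorage.dfs.core.windows.net/sales/orders`
    DEEP CLONE main.sales.orders
""")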

Code samples that show more comprehensive end-to-end functionality:

  • sync_creds_and_locs.py: script to sync storage credentials and external locations between the primary and secondary metastores. Run locally or on either the primary or secondary workspace.
  • sync_catalogs_and_schemas.py: script to sync all catalogs and schemas from a primary metastore to a secondary metastore. Run locally or on either the primary or secondary workspace.
  • sync_tables.py: performs a deep clone of all managed and external tables, and registers those tables in the secondary region.
  • sync_grs_ext.py: sync metadata only for external tables that have already been replicated via cloud provider geo-replication. No data is copied, and storage URLs will be the same in both workspaces.
  • sync_ext_volumes.py: sync metadata only for external volumes that have already been replicated via cloud provider geo-replication.
  • sync_perms.py: sync all permissions related to UC tables, volumes, schemas, and catalogs from the primary to the secondary metastore (see the sketch after this list).
  • sync_shared_tables.py: sync tables using Delta Sharing. All tables will be imported to the secondary region as managed tables.
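
For context, permission syncing follows a read-then-apply pattern against both workspaces. Below is a hedged sketch of that pattern using the Databricks SDK; the hosts, tokens, and table name are placeholders, and the actual script may differ in its details:

# Sketch of the read-then-apply pattern behind sync_perms.py.
# Hosts, tokens, and the table name are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

primary = WorkspaceClient(host="https://primary.example.net", token="<source-pat>")
secondary = WorkspaceClient(host="https://secondary.example.net", token="<target-pat>")

table_name = "main.sales.orders"

# Read grants on the primary, then apply the same grants on the secondary.
grants = primary.grants.get(securable_type=catalog.SecurableType.TABLE,
                            full_name=table_name)
changes = [catalog.PermissionsChange(principal=g.principal, add=g.privileges)
           for g in (grants.privilege_assignments or [])]
secondary.grants.update(securable_type=catalog.SecurableType.TABLE,
                        full_name=table_name,
                        changes=changes)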

How to use this Repository

Prerequisites

Before running the scripts, make sure you have the following:

  • A Databricks workspace with admin privileges to access and manage catalogs and schemas. You will also need the following privileges on the metastore:
    • CREATE CATALOG
    • CREATE EXTERNAL LOCATION
    • CREATE STORAGE CREDENTIAL
  • Databricks CLI installed and configured for your workspace. Follow the Databricks CLI installation guide for setup instructions.
    • If using the CLI in a notebook, make sure the latest version is installed.
  • The requests library installed for making API calls to Databricks.
  • Python 3.6+ and pip installed on your local machine.

Clone this repository to your local machine:

git clone https://github.com/gregwood-db/databricks-dr-examples.git
cd databricks-dr-examples

Setting up variables and parameters

Set the following variable/parameter values in common.py; these will be used throughout the other scripts.

  • cloud_type: Cloud provider where workspaces exist (azure, aws, or gcp)
  • cred_mapping_file: The mapping file for credentials, e.g., data/azure_cred_mapping.csv
  • loc_mapping_file: The location mapping file, e.g., data/ext_location_mapping.csv
  • catalog_mapping_file: The catalog mapping file, e.g., data/catalog_mapping.csv
  • schema_mapping_file: The schema mapping file, e.g., data/schema_mapping.csv
  • source_host: Source/Primary Workspace URL, including leading https://
  • target_host: Target/Secondary Workspace URL, including leading https://
  • source_pat: Personal Access Token (PAT) for the Source/Primary Workspace
  • target_pat: Personal Access Token (PAT) for the Target/Secondary Workspace
  • catalogs_to_copy: A list of strings, containing names of catalogs to replicate
  • metastore_id: The global unique metastore ID of the secondary/target metastore
  • landing_zone_url: ADLS/S3/GCS location used to land intermediate data in the secondary region
  • num_exec: Number of parallel threads to execute (when parallelism is used)
  • warehouse_size: The size of the serverless SQL warehouse used in the secondary workspace
  • response_backoff: The polling backoff for checking query state when creating tables/views
  • manifest_name: The name of the table manifest Delta file, if using sync_tables.py
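
For example, a filled-in common.py might look like the following; every value is illustrative and must be replaced with your own environment details:

# common.py -- illustrative values only; replace with your own settings.
cloud_type = "azure"                                  # "azure", "aws", or "gcp"
cred_mapping_file = "data/azure_cred_mapping.csv"
loc_mapping_file = "data/ext_location_mapping.csv"
catalog_mapping_file = "data/catalog_mapping.csv"
schema_mapping_file = "data/schema_mapping.csv"
source_host = "https://adb-1111111111111111.1.azuredatabricks.net"
target_host = "https://adb-2222222222222222.2.azuredatabricks.net"
source_pat = "<primary-workspace-pat>"
target_pat = "<secondary-workspace-pat>"
catalogs_to_copy = ["main", "finance"]
metastore_id = "<secondary-metastore-id>"
landing_zone_url = "abfss://landing@drstorage.dfs.core.windows.net/dr"
num_exec = 8                                          # parallel threads
warehouse_size = "Small"                              # serverless SQL warehouse size
response_backoff = 5                                  # seconds between status polls
manifest_name = "table_manifest"                      # only used by sync_tables.py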

Syncing External Locations and Credentials

  1. Make sure common.py is updated with all relevant parameters

  2. Update the credential and external location mapping files:

    • the _cred_mapping.csv file for your cloud (set via cred_mapping_file, e.g., data/azure_cred_mapping.csv):
      • source_cred_name should contain the storage credential name in the source metastore
      • for AWS, target_iam_role should contain the ARN of the IAM role to be used in the target metastore
      • for Azure, only ONE of the following should be used:
        • target_mgd_id_connector: used for standard access connectors (i.e., Azure managed identities)
        • target_mgd_id_identity: used for user-assigned managed identities
        • target_sp_directory, target_sp_appid, and target_sp_secret: if a service principal is used for access (uncommon)
    • ext_location_mapping.csv:
      • source_loc_name should contain the external location name in the source metastore
      • target_url should contain the storage URL to be used for the location in the secondary metastore
      • target_access_pt should contain the S3 access point to be used (optional, only for AWS)
  3. Once you have updated the configuration, you can run the script with the following command:

python sync_creds_and_locs.py
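
For reference, minimal versions of the two mapping files might look like the following; the credential name, ARN, location name, and URL are all placeholders, and an AWS-style credential row is shown:

source_cred_name,target_iam_role
primary-storage-cred,arn:aws:iam::123456789012:role/dr-secondary-access

And a matching ext_location_mapping.csv, where the trailing comma leaves the optional target_access_pt column empty:

source_loc_name,target_url,target_access_pt
primary-landing,s3://dr-secondary-bucket/landing,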

Syncing Catalogs and Schemas

  1. Make sure common.py is updated with all relevant parameters

  2. Update the catalog and schema mapping CSVs:

    • catalog_mapping.csv: source_catalog should contain a list of the source catalog names to be migrated, and target_storage_root should contain the storage root location for each catalog in the target metastore
    • schema_mapping.csv: source_catalog and source_schema should contain the catalogs and schemas in the source metastore, and target_storage_root should contain the storage root location for each schema in the target metastore
  3. Once you have updated the configuration, you can run the script with the following command:

python sync_catalogs_and_schemas.py
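
Minimal versions of these two files might look like the following; the catalog, schema, and storage root values are placeholders. First, catalog_mapping.csv:

source_catalog,target_storage_root
main,abfss://catalogs@drstorage.dfs.core.windows.net/main

Then schema_mapping.csv:

source_catalog,source_schema,target_storage_root
main,sales,abfss://catalogs@drstorage.dfs.core.windows.net/main/sales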

Syncing Tables

Below are three options for syncing tables. Option 1 leverages Delta Sharing to copy the tables; Option 2 requires you to Delta DEEP CLONE the tables from the source/primary region into an intermediary cloud storage bucket before re-creating them as managed tables in the target/secondary metastore; Option 3 syncs metadata only for external tables that have already been replicated via cloud provider geo-replication.

Option 1: Syncing Managed Tables via Delta Sharing

  1. Make sure common.py is updated with all relevant parameters

  2. Once you have updated the configuration, you can run the script with the following command:

python sync_shared_tables.py
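
For context, once a share from the primary metastore has been mounted as a catalog in the secondary workspace, importing each table reduces to a create-table-as-select. A hedged sketch of that pattern, in which the shared catalog and target table names are placeholders:

# Sketch of the per-table import pattern used with Delta Sharing.
# "primary_share" is a placeholder for a catalog created from the share;
# the target catalog, schema, and table names are also placeholders.
spark.sql("""
    CREATE OR REPLACE TABLE main.sales.orders AS
    SELECT * FROM primary_share.sales.orders
""")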

Option 2: Syncing Managed Tables with an Intermediary Storage Account

  1. Make sure common.py is updated with all relevant parameters

  2. Once you have updated the configuration, you can run the script with the following command:

python sync_tables.py

Option 3: Syncing External Tables

  1. Make sure common.py is updated with all relevant parameters

  2. Once you have updated the configuration, you can run the script with the following command:

python sync_grs_ext.py

Syncing External Volumes

  1. Make sure common.py is updated with all relevant parameters

  2. Once you have updated the configuration, you can run the script with the following command:

python sync_ext_volumes.py

Syncing Views

  1. Make sure common.py is updated with all relevant parameters

  2. Once you have updated the configuration, you can run the script with the following command:

python sync_views.py
