A collection of minimal example scripts for setting up Disaster Recovery for Databricks.
This code is provided as-is and is meant to serve as a set of baseline examples. You may need to alter these scripts to work in your environment.
These scripts generally assume that they will be run in a notebook in the primary workspace, and that the workspace can directly access the secondary workspace via the SDK; this may not always be true in your environment. If connectivity issues prevent the scripts from running, you have two options:
- Alter the workspace networking to allow connectivity; this may involve adjusting firewalls, adding peering, etc.
- Run the scripts remotely using Databricks Connect
If you choose the latter option, the following adjustments need to be made:
- Set up Databricks Connect in your environment.
- In all scripts, add an import statement for Databricks Connect, i.e., `from databricks.connect import DatabricksSession`.
- In all scripts, instantiate a Spark session, i.e., `spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()`.
These changes will allow the code to run remotely on a local machine or cloud VM.
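For example, the top of a script adjusted to run over Databricks Connect might look like the following sketch; the profile name is a placeholder for whatever profile you configured with the Databricks CLI:

```python
# Minimal sketch: running a DR script remotely via Databricks Connect.
# "dr-primary" is a placeholder profile name, not something defined by these scripts.
from databricks.connect import DatabricksSession

# Create a remote Spark session against the primary workspace; the rest of the
# script can then use `spark` exactly as it would inside a notebook.
spark = DatabricksSession.builder.profile("dr-primary").getOrCreate()

# Quick sanity check that the remote session works.
spark.sql("SELECT 1").collect()
```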
Snippets that demonstrate basic functionality (located in /examples/):
- clone_to_secondary.py: performs `DEEP CLONE` on a set of catalogs in the primary to a storage location in the secondary region.
- clone_to_secondary_par.py: parallelized version of clone_to_secondary.py.
- create_tables_simple.py: simple script that must be run in the secondary region to register managed/external tables based on the output of clone_to_secondary.py.
- sync_views.py: simple script to sync views; this will need to be updated per your environment.
Code samples that show more comprehensive end-to-end functionality:
- sync_creds_and_locs.py: script to sync storage credentials and external locations between primary and secondary metastores. Run locally or on either primary/secondary.
- sync_catalogs_and_schemas.py: script to sync all catalogs and schemas from a primary metastore to a secondary metastore. Run locally or on either primary/secondary.
- sync_tables.py: performs a deep clone of all managed and external tables, and registers those tables in the secondary region.
- sync_grs_ext.py: sync metadata only for external tables that have already been replicated via cloud provider georeplication. No data is copied, and storage URLs on both workspaces will be the same.
- sync_ext_volumes.py: sync metadata only for external volumes that have already been replicated via cloud provider georeplication.
- sync_perms.py: sync all permissions related to UC tables, volumes, schemas and catalogs from primary to secondary metastore.
- sync_shared_tables.py: sync tables using Delta Sharing. All tables will be imported to the secondary region as managed tables.
Before running the scripts, make sure you have the following:
- A Databricks workspace with admin privileges to access and manage catalogs and schemas.
- CREATE CATALOG privileges on the metastore.
- CREATE EXTERNAL LOCATION privileges on the metastore.
- CREATE STORAGE CREDENTIAL privileges on the metastore.
- Databricks CLI installed and configured with your workspace. Follow the Databricks CLI installation guide for setup instructions.
  - If running in a notebook, make sure the latest version of the CLI is installed.
- The Requests library installed, for making API calls to Databricks.
- Python 3.6+ and pip installed on your local machine.
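As a rough example, assuming Python and pip are already available, the remaining prerequisites might be satisfied like this (the profile names are arbitrary placeholders; see the Databricks CLI installation guide for installing the CLI itself):

```
pip install requests

databricks configure --profile primary
databricks configure --profile secondary
```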
Clone this repository to your local machine:
```
git clone https://github.com/gregwood-db/databricks-dr-examples.git
cd databricks-dr-examples
```
Set the following variable/parameter values in `common.py`; these will be used throughout the other scripts.
- `cloud_type`: Cloud provider where workspaces exist (`azure`, `aws`, or `gcp`)
- `cred_mapping_file`: The mapping file for credentials, i.e., `data/azure_cred_mapping.csv`
- `loc_mapping_file`: The location mapping file, i.e., `data/ext_location_mapping.csv`
- `catalog_mapping_file`: The catalog mapping file, i.e., `data/catalog_mapping.csv`
- `schema_mapping_file`: The schema mapping file, i.e., `data/schema_mapping.csv`
- `source_host`: Source/Primary workspace URL, including the leading `https://`
- `target_host`: Target/Secondary workspace URL, including the leading `https://`
- `source_pat`: Personal Access Token (PAT) for the Source/Primary workspace
- `target_pat`: Personal Access Token (PAT) for the Target/Secondary workspace
- `catalogs_to_copy`: A list of strings containing the names of catalogs to replicate
- `metastore_id`: The globally unique metastore ID of the secondary/target metastore
- `landing_zone_url`: ADLS/S3/GCS location used to land intermediate data in the secondary region
- `num_exec`: Number of parallel threads to execute (when parallelism is used)
- `warehouse_size`: The size of the serverless SQL warehouse used in the secondary workspace
- `response_backoff`: The polling backoff for checking query state when creating tables/views
- `manifest_name`: The name of the table manifest Delta file, if using sync_tables.py
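As an illustration, a filled-in `common.py` might look roughly like the sketch below. Every value shown is a placeholder (hosts, tokens, storage URLs, catalog names, sizes, and backoff), so replace them with values from your own environment and match the exact formats expected by the scripts.

```python
# Example common.py values -- every value below is a placeholder.
cloud_type = "azure"

cred_mapping_file = "data/azure_cred_mapping.csv"
loc_mapping_file = "data/ext_location_mapping.csv"
catalog_mapping_file = "data/catalog_mapping.csv"
schema_mapping_file = "data/schema_mapping.csv"

source_host = "https://adb-1111111111111111.11.azuredatabricks.net"
target_host = "https://adb-2222222222222222.22.azuredatabricks.net"
source_pat = "<primary-workspace-pat>"
target_pat = "<secondary-workspace-pat>"

catalogs_to_copy = ["sales", "marketing"]
metastore_id = "<secondary-metastore-id>"
landing_zone_url = "abfss://landing@secondarystorage.dfs.core.windows.net/dr"

num_exec = 8                 # parallel threads, where parallelism is used
warehouse_size = "Small"     # serverless SQL warehouse size in the secondary workspace
response_backoff = 0.5       # polling backoff (seconds) when checking query state
manifest_name = "table_manifest"
```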
To sync storage credentials and external locations with sync_creds_and_locs.py:
- Make sure `common.py` is updated with all relevant parameters.
- Update the credential and external location mapping files (sample rows are shown after these steps):
  - _cred_mapping.csv:
    - `source_cred_name` should contain the storage credential name in the source metastore
    - for AWS, `target_iam_role` should contain the ARN for the IAM role to be used in the target metastore
    - for Azure, only ONE of the following should be used:
      - `target_mgd_id_connector`: used for standard access connectors (i.e., Azure managed identities)
      - `target_mgd_id_identity`: used for user-assigned managed identities
      - `target_sp_directory`, `target_sp_appid`, and `target_sp_secret`: if a service principal is used for access (uncommon)
  - ext_location_mapping.csv:
    - `source_loc_name` should contain the external location name in the source metastore
    - `target_url` should contain the storage URL to be used for the location in the secondary metastore
    - `target_access_pt` should contain the S3 access point to be used (optional, AWS only)
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_creds_and_locs.py
  ```
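For reference, hypothetical mapping-file rows might look like the following. Every value is a placeholder, the column order is illustrative (match the headers in the provided data/ templates), and for Azure only one of the identity columns should be populated. A credential mapping row (Azure example):

```
source_cred_name,target_mgd_id_connector,target_mgd_id_identity,target_sp_directory,target_sp_appid,target_sp_secret
primary-storage-cred,<target-access-connector-resource-id>,,,,
```

And an external location mapping row (the access point column is left empty since it only applies to AWS):

```
source_loc_name,target_url,target_access_pt
primary-landing-loc,abfss://extdata@drstorageacct.dfs.core.windows.net/locations,
```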
To sync catalogs and schemas with sync_catalogs_and_schemas.py:
- Make sure `common.py` is updated with all relevant parameters.
- Update the catalog and schema mapping CSVs (sample rows are shown after these steps):
  - catalog_mapping.csv: `source_catalog` should contain a list of the source catalog names to be migrated, and `target_storage_root` should contain the storage root location for each catalog in the target metastore
  - schema_mapping.csv: `source_catalog` and `source_schema` should contain the catalogs and schemas in the source metastore, and `target_storage_root` should contain the storage root location for each schema in the target metastore
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_catalogs_and_schemas.py
  ```
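Hypothetical rows for these two files are shown below; the catalog, schema, and storage root values are placeholders, and the headers should match the templates in data/. A catalog mapping row:

```
source_catalog,target_storage_root
sales,abfss://catalogs@drstorageacct.dfs.core.windows.net/sales
```

And a schema mapping row:

```
source_catalog,source_schema,target_storage_root
sales,transactions,abfss://catalogs@drstorageacct.dfs.core.windows.net/sales/transactions
```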
Below are two options for syncing tables. Option 1 (sync_shared_tables.py) leverages Delta Sharing to clone the tables, whereas Option 2 (sync_tables.py) requires you to deep clone the tables from the source/primary region to an intermediate cloud storage bucket before re-creating them as managed tables in the target/secondary metastore.
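Under the hood, Option 2 relies on Delta `DEEP CLONE` to copy each table into the intermediate landing zone, roughly along the lines of the sketch below. The table name and landing zone path are placeholders, not the exact statements the scripts issue:

```python
# Rough sketch of the kind of statement the deep-clone scripts run per table;
# the table name and landing zone path below are placeholders.
landing_zone_url = "abfss://landing@secondarystorage.dfs.core.windows.net/dr"

spark.sql(f"""
  CREATE OR REPLACE TABLE delta.`{landing_zone_url}/sales/transactions/orders`
  DEEP CLONE sales.transactions.orders
""")
```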
Option 1 (Delta Sharing, using sync_shared_tables.py):
- Make sure `common.py` is updated with all relevant parameters.
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_shared_tables.py
  ```
Option 2 (deep clone via intermediate storage, using sync_tables.py):
- Make sure `common.py` is updated with all relevant parameters.
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_tables.py
  ```
To sync metadata for external tables that have already been replicated via cloud provider georeplication, use sync_grs_ext.py:
- Make sure `common.py` is updated with all relevant parameters.
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_grs_ext.py
  ```
To sync metadata for georeplicated external volumes, use sync_ext_volumes.py:
- Make sure `common.py` is updated with all relevant parameters.
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_ext_volumes.py
  ```
To sync views, use sync_views.py (remember that this script will need to be updated for your environment):
- Make sure `common.py` is updated with all relevant parameters.
- Once you have updated the configuration, you can run the script with the following command:

  ```
  python sync_views.py
  ```