
15 Critical Databricks Mistakes Advanced Developers Make: Security, Workflows, Environment

Ganesh R
Azure Data Engineer
Uncover 15 critical Databricks mistakes that advanced developers often encounter. Learn essential tips on workflow optimization, security best practices, Git integration, environment management, effective cluster strategy, and cost savings.

I continue sharing practical information about common mistakes in Databricks and ways to fix them. I recommend reading the first part: "11 Common Databricks Mistakes Beginners Make: Best Practices for Data Management and Coding".
1. Ignoring the UTC time zone when calculating dates
Mistake: Many developers use current-date functions without considering that Databricks clusters run in UTC. This leads to incorrect date calculations when scripts run late at night or early in the morning in your local time zone, which can cause data to be processed for the wrong day.

-- Problematic approach (uses UTC) if you are using Spark SQL

%sql
SELECT * FROM TABLE
WHERE FILE_DATE >= current_date()

# Problematic approach (uses UTC) if you are using Python
from datetime import datetime

today = datetime.now().date()

How to fix it:

The first way is to set the default time zone for the Spark session:

# Set the default time zone for a Spark session
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

Here's how to convert to your local time zone in Python:

from datetime import datetime
import pytz

local_tz = pytz.timezone('America/New_York')  # or your local time zone

today = datetime.now(pytz.UTC).astimezone(local_tz).date()

Here's how to convert the time zone in Spark SQL:

%sql
SELECT * FROM TABLE
WHERE FILE_DATE >= date(from_utc_timestamp(current_timestamp(), 'America/New_York'))
2. Working in a single environment without proper separation of development and production.
Mistake: Working in a single Databricks environment without a clear separation between Dev and Prod is a common mistake that can lead to data loss. Ideally, you should create separate Databricks workspaces for different environments (e.g., Dev, QA, Prod). Due to inexperience, developers also sometimes switch table names between DEV and PROD manually in every query.

How to fix it: Creating multiple environments is not always feasible due to budget constraints or other limitations. When separate environments can't be set up, a practical compromise is logical separation within a single environment. One convenient approach is to use schema and table prefixes within your notebooks.

Divide the workspace notebooks into two folders, DEV and PROD. Alternatively, develop in personal folders and move every notebook that will run on a schedule into the shared PROD folder.

Start each notebook by explicitly defining environment parameters, such as a schema_prefix variable. In this example, we determine the notebook's path and check whether it contains the word (PROD):

# Defining the DEV or PROD environment from the notebook path
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()

if "(PROD)" in notebook_path.upper():
    schema_prefix = "catalog.prod_schema"
else:
    schema_prefix = "catalog.dev_schema"

Now, when the query below runs from a notebook that is not located in a folder named (PROD), it cannot touch the real production tables:

spark.sql(f"""
DELETE FROM {schema_prefix}.clients
WHERE
    DATE = current_date()
""")
3. Building long chains of processes using %run commands within notebooks, instead of proper Workflow orchestration.
Mistake: Many developers working with notebooks rely excessively on the %run magic command to execute external scripts or notebooks. While %run is convenient and well suited for quickly loading variables or libraries across notebooks, some developers extend this approach to build complex, lengthy processing chains. This habit results primarily from a lack of awareness of specialized workflow orchestration tools.

%run ./variables

Although it might seem practical at first, relying heavily on %run chains can quickly become
counterproductive. Such a strategy makes processes difficult to debug, track, and scale. Long %run
cascades also become fragile: if one step fails, the entire process may halt unexpectedly.

How to fix it: Instead of extensive use of %run commands, adopt Databricks Workflows, a robust orchestration solution built directly into the Databricks platform for seamless pipeline management. Databricks Workflows offers distinct benefits such as:

- Integrated environment: effortlessly orchestrate notebooks, jobs, and tasks within a unified Databricks workspace.
- Clear task dependencies: define and manage dependencies visually, simplifying complex process flows.
- Fault tolerance: handle errors easily by setting retries, conditional logic, and alerts, improving robustness.
- Enhanced monitoring and logging: access detailed logs, metrics, and performance tracking within the Databricks platform to quickly identify and address workflow issues.
- Flexible scheduling options: schedule workflows based on time intervals or trigger conditions, automating execution according to business requirements.
- Simplified scaling: scale and modify workflows to accommodate evolving needs without extensive refactoring.
By leveraging Databricks Workflows, teams significantly enhance flexibility, transparency, and stability,
eliminating the limitations of chaining notebook executions with %run.
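For illustration, here is a minimal sketch of defining a two-task workflow programmatically with the databricks-sdk Python package; the job name, notebook paths, Spark version, and node type are placeholder assumptions, and the same structure can be built in the Workflows UI without any code.

# Sketch: a two-task workflow with a job cluster, created via the Databricks Python SDK.
# Job name, notebook paths, Spark version, and node type below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()  # reads credentials from the environment or a config profile

job = w.jobs.create(
    name="daily_sales_pipeline",
    job_clusters=[
        jobs.JobCluster(
            job_cluster_key="etl_cluster",
            new_cluster=compute.ClusterSpec(
                spark_version="14.3.x-scala2.12",
                node_type_id="Standard_DS3_v2",
                num_workers=2,
            ),
        )
    ],
    tasks=[
        jobs.Task(
            task_key="ingest",
            job_cluster_key="etl_cluster",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/PROD/ingest_sales"),
        ),
        jobs.Task(
            task_key="transform",
            job_cluster_key="etl_cluster",
            depends_on=[jobs.TaskDependency(task_key="ingest")],  # runs only after "ingest" succeeds
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/PROD/transform_sales"),
        ),
    ],
)
print(f"Created job {job.job_id}")

Retries, schedules, and notifications can be added to the same definition, which keeps the whole pipeline visible and debuggable in one place.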
4. Failing to grant workflow permissions to other team members/groups.

Mistake: Failing to grant workflow permissions to other team members or groups. By default, workflow visibility is restricted to its creator.
How to fix it: After creating your workflow, navigate to the workflow settings or permissions. Explicitly add the team members or groups who require access, and define the appropriate permission level (Can Manage or Can View) according to their responsibilities.
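If you prefer to manage this in code rather than through the UI, here is a hedged sketch using the Permissions REST API from a Databricks notebook (where dbutils is available); the workspace URL, secret scope, job ID, and group names are placeholder assumptions.

# Sketch: grant groups access to an existing workflow (job) via the Permissions REST API.
# Workspace URL, token secret, job ID, and group names are placeholders.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = dbutils.secrets.get(scope="my-keyvault-scope", key="databricks-pat")
job_id = 123456

resp = requests.patch(  # PATCH adds or updates entries without replacing the whole ACL
    f"{host}/api/2.0/permissions/jobs/{job_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "access_control_list": [
            {"group_name": "data-engineering-team", "permission_level": "CAN_MANAGE"},
            {"group_name": "analytics-users", "permission_level": "CAN_VIEW"},
        ]
    },
)
resp.raise_for_status()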

5. Forgetting to set threshold alerts or monitoring on Databricks Workflows.

Mistake: Forgetting to set threshold alerts or monitoring parameters in Databricks workflows.

Without proper alert thresholds, workflow failures or unexpected performance degradation can go
unnoticed for days, resulting in inefficient resource usage and unnecessary budget consumption.

How to fix it: Review the historical execution durations of your notebook or workflow tasks under typical operating conditions. Determine the normal run duration, then set your threshold alerts to approximately double that duration. Effective threshold monitoring promptly alerts teams to anomalies or stalled workflows, ensuring quick issue resolution and significant cost savings by reducing prolonged, unnecessary resource use.
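As one possible implementation (a hedged sketch against the Jobs API 2.1, run from a notebook where dbutils is available; the workspace URL, job ID, and 30-minute baseline are placeholder assumptions), you can attach a run-duration warning at roughly double the typical run time:

# Sketch: set a run-duration warning on an existing job at about 2x its normal duration.
# Workspace URL, token secret, job ID, and the baseline duration are placeholders.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = dbutils.secrets.get(scope="my-keyvault-scope", key="databricks-pat")

typical_duration_sec = 30 * 60            # observed normal run time for this workflow
threshold_sec = 2 * typical_duration_sec  # alert when a run takes roughly twice as long

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123456,
        "new_settings": {
            "health": {
                "rules": [
                    {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": threshold_sec}
                ]
            }
        },
    },
)
resp.raise_for_status()

The same duration threshold can also be set directly in the workflow's UI settings.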

6. Ensure Your Team Receives Error Notifications in Databricks Workflows.

Mistake: In Databricks Workflows, error notifications go by default only to the workflow creator (author). Sometimes the workflow developer forgets to change this default and assign the notifications to a working group, or worse, removes even their own contact from the notification settings. As a result, when workflows fail, nobody is notified promptly and important issues go unnoticed.
How to fix it:
- Always assign a shared working group as the recipient of workflow error notifications rather than individual developers.
- Regularly audit Databricks workflows for proper notification settings to ensure no workflows are left without alert recipients.
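For reference, a hedged sketch of the relevant block of a job's settings (the distribution-list address is a placeholder assumption); the same fields appear in the workflow's Notifications UI:

# Sketch: the notification portion of a job definition, pointing at a shared mailbox
# rather than an individual developer. The address below is a placeholder.
new_settings = {
    "email_notifications": {
        "on_failure": ["data-eng-alerts@yourcompany.com"],  # shared working-group mailbox
    }
}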
7. Not using Job Clusters for workflows (instead mistakenly relying on interactive clusters).
Mistake: Using an interactive cluster rather than a dedicated Job cluster for automated or scheduled tasks.

Databricks offers two main types of clusters:

- Interactive cluster: used early in a project or during script development. Ideal for interactive, exploratory tasks where quick modifications and manual execution are required. These clusters typically keep running until they are manually terminated.
- Job cluster: designed specifically for scheduled workflows or automated script execution. Job clusters automatically spin up when scheduled tasks start and shut down immediately upon completion, preventing unnecessary resource and budget waste.

How to fix it:

- Always choose a Job cluster for scheduled scripts or workflows. Configure the Job cluster specifically for your script's resource needs.
- Regularly review existing workflows to ensure they're using Job clusters, not interactive clusters.

Using Job clusters instead of interactive clusters allows efficient resource management, reduces costs, ensures smoother operation of your scheduled processes, and prevents potential conflicts.
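To make the difference concrete, here is a hedged sketch of the relevant fragment of a job definition (cluster key, runtime version, node type, and notebook path are placeholder assumptions): the task references a job cluster declared under job_clusters, which exists only for the duration of the run, instead of pointing at an always-on interactive cluster.

# Sketch: a job definition fragment that uses a per-run job cluster
# instead of referencing an existing interactive cluster.
job_settings = {
    "name": "nightly_etl",
    "job_clusters": [
        {
            "job_cluster_key": "nightly_etl_cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",  # placeholder runtime
                "node_type_id": "Standard_DS3_v2",    # placeholder node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "load",
            "job_cluster_key": "nightly_etl_cluster",  # spins up for the run, terminates afterwards
            # "existing_cluster_id": "...",  # avoid: ties the job to an interactive cluster
            "notebook_task": {"notebook_path": "/Workspace/PROD/load_data"},
        }
    ],
}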

8. Setting up interactive Databricks clusters without auto-termination enabled.

Mistake: Setting up interactive Databricks clusters without enabling auto-termination results in clusters running continuously, even when nobody is actively using them. This oversight can quickly consume your Databricks budget unnecessarily.

How to fix it: When creating an interactive cluster, always enable the auto-termination feature in the cluster configuration. Set the recommended inactivity period (commonly 30–60 minutes), which ensures the cluster automatically shuts down after being idle for the specified timeframe.
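The same setting is available when creating clusters programmatically; a minimal sketch with the databricks-sdk, assuming placeholder cluster name, runtime version, and node type:

# Sketch: create an interactive cluster that auto-terminates after 60 idle minutes.
# Cluster name, Spark version, and node type are placeholder assumptions.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.clusters.create(
    cluster_name="dev-exploration",
    spark_version="14.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    num_workers=1,
    autotermination_minutes=60,  # shut down after one hour of inactivity
)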
9. Forgetting to run regular VACUUM on Delta tables.
Mistake: Forgetting to regularly run VACUUM on Delta tables can lead to accumulating unused data files.
This wastes storage space and negatively impacts overall performance.

How to fix it: Regularly perform VACUUM operations to clean up old, unused data files. Or automate
VACUUM by creating scheduled Databricks jobs aligned with each table’s data update frequency:

VACUUM table_name RETAIN 168 HOURS; -- retain 7 days

VACUUM table_name RETAIN 1440 HOURS; -- retain 60 days

Regular VACUUM reduces storage costs, improves query efficiency, and ensures your Delta tables remain
organized and optimized.
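One simple way to automate this is a small maintenance notebook scheduled as a Databricks job; a minimal sketch, where the table names and retention windows are placeholder assumptions (keep retention at or above your time-travel needs):

# Sketch: vacuum a list of Delta tables with per-table retention periods.
# Table names and retention values are placeholders.
tables_to_vacuum = {
    "catalog.prod_schema.clients": 168,        # retain 7 days of history
    "catalog.prod_schema.transactions": 1440,  # retain 60 days of history
}

for table_name, retain_hours in tables_to_vacuum.items():
    spark.sql(f"VACUUM {table_name} RETAIN {retain_hours} HOURS")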
10. Storing passwords or secrets directly in code (instead of using Databricks
Secrets).
Mistake: Including passwords or sensitive keys directly inside your code is a common security issue.

my_password = "SecretPassword123"

How to fix it: I propose a solution for Microsoft Azure. Key Vault Secrets is a secure cloud service for storing
passwords, API keys, and other sensitive information. It helps you centralize, secure, and control access to
secrets.

Typically, your cloud administrator creates a secret (such as a database password) in Azure Key Vault, and
then you can securely access that secret from your Azure Databricks Notebook. This approach ensures that
you are following security best practices, and sensitive information is never stored directly in your code.
Cloud Admin: creates a secret (my-db-password) in Azure Key Vault (my-keyvault).

You (in Databricks notebook): Link Azure Key Vault to Databricks using Secret Scope and
access the secret from your Notebook:

my_password = dbutils.secrets.get(scope="my-keyvault-scope", key="my-db-password")

11. Connecting directly to on-prem databases from cloud Notebooks.

Mistake: In some cases, companies leave access open from cloud notebooks directly to their on-premises databases or file systems. Employees who know the connection details (host, port, and credentials) then connect directly. As mentioned before, employees might also mistakenly store passwords directly inside notebooks (covered previously in point 10).
Risks associated:
- Security vulnerabilities: unauthorized access or password leaks.
- Lack of audit trails and control: if everyone connects individually, it is difficult to track who accessed sensitive resources and when.
- Performance issues: direct queries might overload on-prem databases, causing disruptions for other users.
- Compliance issues: direct external connections might violate data regulations or policies.

How to fix it: Instead, companies should adopt centralized data loading frameworks or solutions. A
correct, secure, and robust way is to use specialized tools or frameworks (such as Azure Data
Factory pipelines, Apache Airflow, or securely managed ETL processes) to regularly and safely sync
on-prem data to cloud storage or databases. Employees then connect notebooks and analytic
applications only to pre-prepared and centrally managed datasets residing securely in the cloud.
12. Not utilizing Git version control integration.

Mistake: Your team is not using Git to keep track of script changes. That makes it hard to
see who changed what, creates challenges in backing up scripts, and makes finding specific
code very difficult.

How to fix it:

- Start using Git: keep all your scripts in a Git repository.
- Save changes regularly: frequently add and commit your updates to clearly track changes over time.
- Use branches: make separate branches when working on new features or testing, so your main version remains stable.
- Train your team: teach team members basic Git commands and explain why it's useful.
- Easy script search: use Git's built-in tools to quickly find needed scripts or changes.
13. Not encrypting sensitive data when migrating from on-premise to cloud environments.

Mistake: When migrating databases from on-premise to Databricks cloud environments, developers often transfer tables containing sensitive data without implementing proper encryption. This creates significant security vulnerabilities, as cloud-stored data requires stronger protection. Common issues include:

- Moving customer PII, financial data, or healthcare information in plaintext.
- Storing sensitive columns (SSNs, credit card numbers, addresses) without encryption.
- Failing to identify which table columns need encryption.
- Using actual PII data in development and testing environments.
- Disregarding company security policies for handling sensitive data.

How to fix it:

1. Implement Strong Encryption Standards:

Encrypt sensitive columns at rest and in transit using proven encryption standards like AES-256. Databricks, along with cloud providers (AWS KMS or Azure Key Vault), offers integration for centralized key management. Note that the examples below use sha2, which is a one-way hash (useful for pseudonymization) rather than reversible encryption; a reversible AES sketch follows them.

from pyspark.sql.functions import col, sha2

encrypted_df = customer_df.withColumn("SSN_encrypted", sha2(col("SSN"), 256))

%sql
SELECT
  *,
  sha2(SSN, 256) AS SSN_encrypted
FROM customer
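For reversible encryption, recent Databricks runtimes also expose Spark's built-in aes_encrypt and aes_decrypt functions; a minimal sketch, assuming a 16-, 24-, or 32-character key stored in a secret scope (the scope and key names are placeholders):

# Sketch: reversible AES encryption of a column using Spark's built-in aes_encrypt.
# The secret scope and key names are placeholders; the key must be 16, 24, or 32 bytes.
from pyspark.sql.functions import expr, lit

aes_key = dbutils.secrets.get(scope="my-keyvault-scope", key="pii-aes-key")

encrypted_df = (
    customer_df
    .withColumn("aes_key", lit(aes_key))
    .withColumn("SSN_encrypted", expr("base64(aes_encrypt(SSN, aes_key))"))
    .drop("aes_key")
)

# Authorized readers can reverse it with:
# expr("cast(aes_decrypt(unbase64(SSN_encrypted), aes_key) AS STRING)")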

2. Utilize Data Masking and Anonymization for Non-production Environments:

For testing, development, or analytics environments, mask or anonymize sensitive data to minimize exposure risk.

%sql
SELECT
  CONCAT('user', FLOOR(RAND()*10000), '@example.com') AS masked_email,
  CONCAT('**** **** **** ', RIGHT(credit_card_number, 4)) AS masked_credit_card
FROM customer_transactions

14. Leaving PII Data Exposed in Files During Data Transfers.

Mistake: While developers often focus on encrypting database tables, they frequently overlook that sensitive PII data also exists in files (CSV, JSON, Parquet, Excel, etc.) being transferred to Databricks. Common oversights include:

- Uploading unencrypted files containing customer information to DBFS or cloud storage.
- Transferring sensitive files over insecure protocols (FTP or HTTP instead of SFTP or HTTPS).
- Neglecting to delete sensitive files after processing.
- Failing to implement proper access controls for files containing PII.
- Sending files with PII data via email, which is highly insecure and violates most data protection policies.

How to fix it: To prevent unauthorized access and exposure of PII data within files during transfers, the following best practices should be implemented:

- File encryption (e.g., with GPG): always encrypt sensitive files such as CSV, JSON, Excel, or Parquet with an established encryption standard such as GPG (GNU Privacy Guard) before transferring them. This ensures that only authorized individuals or processes with the correct decryption keys can access the data, significantly reducing the risk of accidental disclosure.
- Secure protocols: use secure transfer protocols such as HTTPS, SFTP, or SCP instead of insecure options such as FTP or HTTP. These protocols provide built-in data encryption during transfer, thereby preventing eavesdropping and unauthorized access.
- Proper access control: implement strict role-based permissions and access controls for sensitive files on file systems, cloud storage, and DBFS. Ensure that only individuals who require access can read, download, or process these datasets.
- Automated cleanup: set up automated processes to securely delete or archive sensitive files after processing. Avoid leaving unnecessary copies of files in storage locations (cloud storage, DBFS, etc.).
- Avoid sending data via email: communicate strict policies prohibiting the use of email to send sensitive PII. Provide clear instructions and secure alternatives (e.g., links to restricted cloud storage or secure file transfer portals).

15. Manually saving files received by email for processing by the script.

Mistake: Some employees manually download files received via email and then upload
them individually to storage locations (such as a shared folder or cloud storage) for
processing by Databricks notebooks or scripts. This manual process can lead to delays,
mistakes, and missing data, and can reduce workflow efficiency.

How to fix it: Automate this process using Power Automate in combination with Databricks Volumes:

- Set up Power Automate to automatically detect incoming emails with attachments.
- Automatically save the attachments directly to Databricks Volumes (the new built-in file storage in Databricks).

Databricks notebooks can efficiently and directly access these files in Volumes, ensuring simple and secure processing. Be sure to grant the appropriate permissions to the Databricks workgroup and workspace for the Volume storage.
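Once the attachments land in a Volume, notebooks read them with a plain path-based load; a minimal sketch, where the catalog, schema, volume, and file names are placeholder assumptions:

# Sketch: read a CSV file that Power Automate saved into a Unity Catalog Volume.
# Catalog, schema, volume, and file names are placeholders.
volume_path = "/Volumes/main/finance/email_attachments/daily_report.csv"

df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(volume_path)
)
display(df)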

Do you recognize yourself in any of these mistakes?

If you find yourself making any of these advanced Databricks mistakes, it's worth reviewing the basics. Check out the previous article, "11 Common Databricks Mistakes Beginners Make: Best Practices for Data Management and Coding," to make sure you're confident in the fundamentals.

Follow for more content like this: Azure Cloud for Data Engineering.

Ganesh R
Azure Data Engineer

https://www.linkedin.com/in/rganesh0203/
