feat - add comprehensive HLS synthetic data notebook (#356) #434

Open
Vsatyam013 wants to merge 1 commit into databrickslabs:master from Vsatyam013:feat/356-hls-demo

Conversation

@Vsatyam013

Changes

Adds a comprehensive Healthcare and Life Sciences (HLS) industry demo to the examples directory. The healthcare_datagen_demo.py script utilizes dbldatagen to generate highly realistic, interconnected synthetic data for healthcare systems, including:

  • 500K Patients with demographics, insurance, and blood types
  • 50K Providers with specialties and NPIs
  • 5M Encounters (hospital visits, ER, telehealth)
  • 15M Diagnoses (ICD-10 codes) and 10M Medications (NDC codes)
  • 20M Lab Results with LOINC codes and abnormal reference flags
  • 3M Insurance Claims with CPT codes, billed amounts, and denial reasons
  • Built-in analytical queries demonstrating business value (readmission risk, medication non-adherence, and claim denial rates)

Linked issues

Resolves #356

Tests

  • manually tested local script execution

Documentation and Demos

  • added/updated demos

@Vsatyam013 Vsatyam013 marked this pull request as ready for review May 31, 2026 08:18
@Vsatyam013 Vsatyam013 requested review from a team as code owners May 31, 2026 08:18
@Vsatyam013 Vsatyam013 requested review from suryasaitura-db and removed request for a team May 31, 2026 08:18
@Vsatyam013
Author

Hi @suryasaitura-db, I wanted to follow up on this PR. The HLS synthetic data notebook is complete and resolves #356 — it generates realistic interconnected healthcare data including 500K patients, 5M encounters, 15M ICD-10 diagnoses, 20M lab results with LOINC codes, and 3M insurance claims, along with built-in analytical queries for readmission risk, medication non-adherence, and claim denial rates.

Would love to get your feedback or any change requests so I can get this ready to merge. Happy to make any adjustments you need!

Copilot AI left a comment

Pull request overview

Adds a new Healthcare & Life Sciences (HLS) synthetic data generation demo under examples, using dbldatagen to generate multiple healthcare datasets and run sample analytical queries.

Changes:

  • Added a large Databricks-style notebook (exported as .py) that generates patients, providers, encounters, diagnoses, medications, labs, and claims plus example analytics.
  • Updated CHANGELOG.md to note the new HLS demo.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| examples/healthcare_datagen_demo.py | New end-to-end HLS synthetic data generation demo with multiple datasets and analytical queries. |
| CHANGELOG.md | Adds an unreleased changelog entry for the new HLS example. |

Comment on lines +26 to +33
# COMMAND ----------
# MAGIC %pip install dbldatagen
dbutils.library.restartPython()

# COMMAND ----------
import dbldatagen as dg
from dbldatagen import DataGenerator
from pyspark.sql import functions as F
Comment on lines +198 to +202
.withColumn("patient_id", "string",
expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))")
.withColumn("provider_id", "string",
expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))")
.withColumn("facility_id", "string", template=r"FAC-ddddd", random=True)
Comment on lines +314 to +318
.withColumn("encounter_id", "string",
expr="concat('ENC-', cast(cast(rand() * 4999999 as int) + 1 as string))")
.withColumn("patient_id", "string",
expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))")
.withColumn("icd10_code", "string", values=icd10_codes, random=True)
Comment on lines +318 to +320
.withColumn("icd10_code", "string", values=icd10_codes, random=True)
.withColumn("icd10_description", "string", values=icd10_descriptions, random=True)
.withColumn("diagnosis_type", "string",
Comment on lines +459 to +461
.withColumn("loinc_code", "string", values=loinc_codes, random=True)
.withColumn("test_name", "string", values=loinc_names, random=True)
.withColumn("result_value", "double",
Comment on lines +569 to +570
.withColumn("cpt_code", "string", values=cpt_codes, random=True)
.withColumn("cpt_description", "string", values=cpt_descriptions, random=True)
Comment on lines +420 to +423
print(f"Prescriptions generated: {medications_df.count():,}")
non_adherent = medications_df.filter("NOT is_adherent").count()
total_rx = medications_df.count()
print(f"Non-adherence rate: {non_adherent/total_rx:.1%}")
Collaborator

Good suggestion

Comment on lines +499 to +502
print(f"Lab results generated: {labs_df.count():,}")
abnormal_pct = labs_df.filter("is_abnormal").count() / labs_df.count()
critical_pct = labs_df.filter("critical_flag").count() / labs_df.count()
print(f"Abnormal rate: {abnormal_pct:.1%} | Critical rate: {critical_pct:.1%}")
Collaborator

This is a good suggestion
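
Each `count()` in the snippet above is a separate Spark action, so the 20M-row DataFrame is scanned three times. A minimal sketch of collecting the same figures in one pass, assuming `is_abnormal` and `critical_flag` are boolean columns as the filters imply; `rate_exprs` and `total_rows` are hypothetical names, not part of the PR:

```python
from typing import Dict, List


def rate_exprs(flags: Dict[str, str]) -> List[str]:
    """Build selectExpr strings: a row count plus the share of rows matching each predicate."""
    exprs = ["count(*) as total_rows"]
    for alias, predicate in flags.items():
        # The average of a 0/1 indicator is the fraction of rows where the predicate holds.
        exprs.append(f"avg(case when {predicate} then 1 else 0 end) as {alias}")
    return exprs


lab_stats_exprs = rate_exprs(
    {"abnormal_pct": "is_abnormal", "critical_pct": "critical_flag"}
)

# Intended use in the notebook (needs a Spark session, so not executed here):
#   stats = labs_df.selectExpr(*lab_stats_exprs).first()
#   print(f"Lab results generated: {stats['total_rows']:,}")
#   print(f"Abnormal rate: {stats['abnormal_pct']:.1%} | "
#         f"Critical rate: {stats['critical_pct']:.1%}")
```

The same helper would cover the medications cell with `rate_exprs({"non_adherence_rate": "NOT is_adherent"})`.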

Collaborator

@ghanse ghanse left a comment

Very nice contribution. Left some minor feedback. Looking good overall.

Comment thread CHANGELOG.md
Collaborator

We generate updates to CHANGELOG with a tagged release PR. Can you revert this change?


# COMMAND ----------
import dbldatagen as dg
from dbldatagen import DataGenerator
Collaborator

We import the full namespace so probably best to just use dg.DataGenerator?

Comment on lines +104 to +105
.withColumn("registration_date", "date",
begin="2000-01-01", end="2024-06-01", random=True)
Collaborator

Maybe use expr here to demo expressions and ensure registration_date is after date_of_birth.
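
A minimal sketch of what that could look like, assuming the patients spec already defines a `date_of_birth` date column; `registration_date_expr` is a hypothetical helper, and the cutoff reuses the `end` value from the snippet above:

```python
def registration_date_expr(dob_col: str = "date_of_birth",
                           end: str = "2024-06-01") -> str:
    """Spark SQL expression for a random date on or after dob_col and before end."""
    # Days between the birth date and the cutoff, clamped at 0 so a birth date
    # after the cutoff cannot push the registration date backwards.
    span = f"greatest(datediff(to_date('{end}'), {dob_col}), 0)"
    # rand() is in [0, 1), so the offset is a whole number of days from 0 to span - 1.
    return f"date_add({dob_col}, cast(rand() * {span} as int))"


registration_expr = registration_date_expr()

# Intended use in the patients spec (not executed here). baseColumn is passed
# so the dependency on date_of_birth is explicit:
#   .withColumn("registration_date", "date",
#               expr=registration_expr, baseColumn="date_of_birth")
```

Because the offset is never negative, `registration_date` cannot precede `date_of_birth`.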

Comment on lines +151 to +157
.withColumn("credential", "string",
values=["MD","DO","NP","PA","RN","PhD","DDS"],
weights=[50, 15, 15, 10, 5, 3, 2])
.withColumn("specialty", "string", values=specialties, random=True)
.withColumn("sub_specialty", "string",
values=["General","Interventional","Pediatric","Geriatric","Surgical","None"],
weights=[40, 15, 15, 10, 10, 10])
Collaborator

Would be nice to show how to realistically generate the credential and/or sub_specialty from the specialty using something like CASE WHEN expressions.

Shrinking the list of specialties or creating a few special cases and letting others fall into the default case should reduce the number of conditional statements required.
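
A rough sketch of the special-cases-plus-default idea. The specialty names and the mapping below are illustrative assumptions (the PR's `specialties` list is not shown in this thread), and `case_when_expr` is a hypothetical helper:

```python
from typing import Dict


def case_when_expr(column: str, mapping: Dict[str, str], default: str) -> str:
    """Build a Spark SQL CASE WHEN expression that maps column values to labels."""
    if not mapping:
        raise ValueError("mapping must contain at least one special case")
    for text in (*mapping.keys(), *mapping.values(), default):
        # Keep the sketch simple: reject values that would need SQL escaping.
        if "'" in text:
            raise ValueError(f"single quotes are not supported: {text!r}")
    branches = " ".join(
        f"WHEN {column} = '{key}' THEN '{value}'" for key, value in mapping.items()
    )
    return f"CASE {branches} ELSE '{default}' END"


# Hypothetical special cases; every other specialty falls through to 'MD'.
CREDENTIAL_BY_SPECIALTY = {"Dentistry": "DDS", "Psychology": "PhD"}
credential_expr = case_when_expr("specialty", CREDENTIAL_BY_SPECIALTY, default="MD")

# Intended use in the providers spec, after "specialty" is defined (not executed here):
#   .withColumn("credential", "string", expr=credential_expr, baseColumn="specialty")
```

Only the special cases need a branch, so the expression stays short even when the specialty list is long.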

Comment on lines +198 to +201
.withColumn("patient_id", "string",
expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))")
.withColumn("provider_id", "string",
expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))")
Collaborator

Would be nice to reuse PATIENT_COUNT and PROVIDER_COUNT instead of hard-coding so the foreign keys stay valid if the user changes the counts.
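
One way this might look, as a sketch: the constant values are taken from the PR description (500K patients, 50K providers), `fk_expr` is a hypothetical helper, and the sketch assumes the parent tables number their keys from 1 to the row count.

```python
PATIENT_COUNT = 500_000   # row counts as stated in the PR description
PROVIDER_COUNT = 50_000


def fk_expr(prefix: str, count: int) -> str:
    """Spark SQL expression yielding '<prefix>-<n>' with n drawn from 1..count."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # rand() is in [0, 1), so cast(rand() * count as int) falls in 0..count-1.
    return f"concat('{prefix}-', cast(cast(rand() * {count} as int) + 1 as string))"


patient_fk = fk_expr("PAT", PATIENT_COUNT)
provider_fk = fk_expr("PRV", PROVIDER_COUNT)

# Intended use in the encounters spec (not executed here):
#   .withColumn("patient_id", "string", expr=patient_fk)
#   .withColumn("provider_id", "string", expr=provider_fk)
```

Multiplying by the count itself covers 1 through `count`, whereas the hard-coded `rand() * 499999` in the snippet above can never yield `PAT-500000`.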

expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))")
.withColumn("drug_name", "string", values=medications, random=True)
.withColumn("ndc_code", "string", values=ndc_prefixes, random=True)
.withColumn("dose_mg", "double",
Collaborator

Maybe rename to "dose" since the next column specifies the unit of measure?

Comment on lines +459 to +460
.withColumn("loinc_code", "string", values=loinc_codes, random=True)
.withColumn("test_name", "string", values=loinc_names, random=True)
Collaborator

These can also be correlated.
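
One possible approach, sketched under assumptions: draw a single hidden index column and look both values up by position, so the code and the name always come from the same pair. The `LABS` pairs are illustrative stand-ins for the PR's `loinc_codes` and `loinc_names` lists (not shown here), and `array_lookup_expr` and `lab_idx` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def array_lookup_expr(values: Sequence[str], index_col: str) -> str:
    """Spark SQL expression that picks values[index_col] from an array literal."""
    if not values:
        raise ValueError("values must not be empty")
    if any("'" in value for value in values):
        raise ValueError("single quotes are not supported in this sketch")
    literals = ", ".join(f"'{value}'" for value in values)
    # element_at uses 1-based positions, so shift the 0-based index by one.
    return f"element_at(array({literals}), {index_col} + 1)"


# Illustrative (code, name) pairs; keeping them together prevents the lists drifting apart.
LABS: List[Tuple[str, str]] = [
    ("2345-7", "Glucose"),
    ("718-7", "Hemoglobin"),
    ("2160-0", "Creatinine"),
]
code_expr = array_lookup_expr([code for code, _ in LABS], "lab_idx")
name_expr = array_lookup_expr([name for _, name in LABS], "lab_idx")

# Intended use in the labs spec (not executed here):
#   .withColumn("lab_idx", "int", minValue=0, maxValue=len(LABS) - 1,
#               random=True, omit=True)
#   .withColumn("loinc_code", "string", expr=code_expr, baseColumn="lab_idx")
#   .withColumn("test_name", "string", expr=name_expr, baseColumn="lab_idx")
```

The same pattern would apply to other paired lists in the notebook, such as ICD-10 or CPT codes and their descriptions.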

Comment on lines +467 to +470
.withColumn("reference_range_low", "double",
values=ref_range_low, random=True)
.withColumn("reference_range_high","double",
values=ref_range_high, random=True)
Collaborator

reference_range_low might be greater than reference_range_high in this case. I suppose we need to either correlate these values or use an expression with some randomness.
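
A sketch of the expression-with-randomness option: derive the high bound from the low bound plus a strictly positive random width. `range_high_expr` and the width values are hypothetical and not clinically meaningful.

```python
def range_high_expr(low_col: str, min_width: float, max_width: float) -> str:
    """Spark SQL expression for a value strictly greater than low_col."""
    if min_width <= 0:
        raise ValueError("min_width must be positive so that high > low")
    if max_width < min_width:
        raise ValueError("max_width must be at least min_width")
    spread = max_width - min_width
    # rand() is in [0, 1), so the added width lies in [min_width, max_width).
    return f"{low_col} + {min_width} + rand() * {spread}"


high_expr = range_high_expr("reference_range_low", min_width=1.0, max_width=5.0)

# Intended use in the labs spec, after reference_range_low is defined (not executed here):
#   .withColumn("reference_range_high", "double",
#               expr=high_expr, baseColumn="reference_range_low")
```

This guarantees ordering but not realism per test; pairing each test with its own low and high bounds through a shared index would be the correlated alternative.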

Comment on lines +569 to +570
.withColumn("cpt_code", "string", values=cpt_codes, random=True)
.withColumn("cpt_description", "string", values=cpt_descriptions, random=True)
Collaborator

Correlate the values

Comment on lines +706 to +732
# COMMAND ----------
# MAGIC %md
# MAGIC ## Optional — Persist to Delta Tables

# COMMAND ----------

# Uncomment to write all datasets to Delta (requires a catalog / schema to exist)
#
# TARGET_CATALOG = "main"
# TARGET_SCHEMA = "healthcare_synthetic"
#
# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {TARGET_CATALOG}.{TARGET_SCHEMA}")
#
# datasets = {
# "patients": patients_df,
# "providers": providers_df,
# "encounters": encounters_df,
# "diagnoses": diagnoses_df,
# "medications": medications_df,
# "lab_results": labs_df,
# "insurance_claims": claims_df,
# }
#
# for table_name, df in datasets.items():
# full_name = f"{TARGET_CATALOG}.{TARGET_SCHEMA}.{table_name}"
# df.write.format("delta").mode("overwrite").saveAsTable(full_name)
# print(f"Saved {full_name}")
Collaborator

Let's remove this part. It's nice for the demos to simply generate the data without writing it.

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Industry related demos - Healthcare

3 participants