
Vsatyam013 · 2026-05-30T05:32:40Z

Changes

Adds a Healthcare and Life Sciences (HLS) industry demo to the examples directory. The healthcare_datagen_demo.py script uses dbldatagen to generate realistic, interconnected synthetic data for healthcare systems, including:

500K Patients with demographics, insurance, and blood types
50K Providers with specialties and NPIs
5M Encounters (hospital visits, ER, telehealth)
15M Diagnoses (ICD-10 codes) and 10M Medications (NDC codes)
20M Lab Results with LOINC codes and abnormal reference flags
3M Insurance Claims with CPT codes, billed amounts, and denial reasons
Built-in analytical queries demonstrating business value (readmission risk, medication non-adherence, and claim denial rates)

Linked issues

Resolves #356

Tests

manually tested local script execution

Documentation and Demos

added/updated demos


Vsatyam013 · 2026-05-31T10:20:14Z

Hi @suryasaitura-db, I wanted to follow up on this PR. The HLS synthetic data notebook is complete and resolves #356. It generates realistic, interconnected healthcare data, including 500K patients, 5M encounters, 15M ICD-10 diagnoses, 20M lab results with LOINC codes, and 3M insurance claims, along with built-in analytical queries for readmission risk, medication non-adherence, and claim denial rates.

Would love to get your feedback or any change requests so I can get this ready to merge. Happy to make any adjustments you need!

Copilot

Pull request overview

Adds a new Healthcare & Life Sciences (HLS) synthetic data generation demo under examples, using dbldatagen to generate multiple healthcare datasets and run sample analytical queries.

Changes:

Added a large Databricks-style notebook (exported as .py) that generates patients, providers, encounters, diagnoses, medications, labs, and claims plus example analytics.
Updated CHANGELOG.md to note the new HLS demo.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File	Description
examples/healthcare_datagen_demo.py	New end-to-end HLS synthetic data generation demo with multiple datasets and analytical queries.
CHANGELOG.md	Adds an unreleased changelog entry for the new HLS example.


+# COMMAND ----------
+# MAGIC %pip install dbldatagen
+dbutils.library.restartPython()
+
+# COMMAND ----------
+import dbldatagen as dg
+from dbldatagen import DataGenerator
+from pyspark.sql import functions as F


+    .withColumn("patient_id",          "string",
+                expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))")
+    .withColumn("provider_id",         "string",
+                expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))")
+    .withColumn("facility_id",         "string", template=r"FAC-ddddd", random=True)


+    .withColumn("encounter_id",        "string",
+                expr="concat('ENC-', cast(cast(rand() * 4999999 as int) + 1 as string))")
+    .withColumn("patient_id",          "string",
+                expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))")
+    .withColumn("icd10_code",          "string", values=icd10_codes, random=True)


+    .withColumn("icd10_code",          "string", values=icd10_codes, random=True)
+    .withColumn("icd10_description",   "string", values=icd10_descriptions, random=True)
+    .withColumn("diagnosis_type",      "string",


+    .withColumn("loinc_code",          "string", values=loinc_codes, random=True)
+    .withColumn("test_name",           "string", values=loinc_names, random=True)
+    .withColumn("result_value",        "double",


+    .withColumn("cpt_code",            "string", values=cpt_codes, random=True)
+    .withColumn("cpt_description",     "string", values=cpt_descriptions, random=True)


ghanse · 2026-06-03T14:24:51Z

+print(f"Prescriptions generated: {medications_df.count():,}")
+non_adherent = medications_df.filter("NOT is_adherent").count()
+total_rx = medications_df.count()
+print(f"Non-adherence rate: {non_adherent/total_rx:.1%}")


Good suggestion

ghanse · 2026-06-03T14:24:36Z

+print(f"Lab results generated: {labs_df.count():,}")
+abnormal_pct = labs_df.filter("is_abnormal").count() / labs_df.count()
+critical_pct = labs_df.filter("critical_flag").count() / labs_df.count()
+print(f"Abnormal rate: {abnormal_pct:.1%}  |  Critical rate: {critical_pct:.1%}")


This is a good suggestion

ghanse

Very nice contribution. Left some minor feedback. Looking good overall.

ghanse · 2026-06-02T23:05:35Z

We generate updates to CHANGELOG with a tagged release PR. Can you revert this change?

ghanse · 2026-06-02T23:07:51Z

+
+# COMMAND ----------
+import dbldatagen as dg
+from dbldatagen import DataGenerator


We already import the full namespace as dg, so it's probably best to just use dg.DataGenerator?

ghanse · 2026-06-02T23:21:37Z

+    .withColumn("registration_date",   "date",
+                begin="2000-01-01", end="2024-06-01", random=True)


Maybe use expr here to demo expressions and ensure registration_date is after date_of_birth.
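
A minimal sketch of the expression-based approach, assuming the patients spec already defines a date_of_birth column that never falls after the end date. The helper name date_after_expr is hypothetical; it only builds the Spark SQL string that would be passed as expr. Because rand() can return 0, the result is on or after date_of_birth rather than strictly after it.

```python
# Hypothetical helper (not part of the PR or of dbldatagen): builds a Spark SQL
# expression for a date between an existing date column and a fixed end date.
def date_after_expr(start_col: str, end_date: str) -> str:
    # datediff(end, start) is the number of days available after start_col;
    # scaling it by rand() picks a random day offset inside that window.
    span = f"datediff(to_date('{end_date}'), {start_col})"
    return f"date_add({start_col}, cast(rand() * {span} as int))"


registration_expr = date_after_expr("date_of_birth", "2024-06-01")
print(registration_expr)

# Assumed usage inside the patients spec (not executed here):
#   .withColumn("registration_date", "date",
#               expr=registration_expr, baseColumn="date_of_birth")
```

Passing baseColumn alongside expr is meant to make the dependency on date_of_birth explicit to the generator.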

ghanse · 2026-06-02T23:28:41Z

+    .withColumn("credential",          "string",
+                values=["MD","DO","NP","PA","RN","PhD","DDS"],
+                weights=[50, 15, 15, 10, 5, 3, 2])
+    .withColumn("specialty",           "string", values=specialties, random=True)
+    .withColumn("sub_specialty",       "string",
+                values=["General","Interventional","Pediatric","Geriatric","Surgical","None"],
+                weights=[40, 15, 15, 10, 10, 10])


Would be nice to show how to realistically generate the credential and/or sub_specialty from the specialty using something like CASE WHEN expressions.

Shrinking the list of specialties or creating a few special cases and letting others fall into the default case should reduce the number of conditional statements required.
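
One way this could be sketched: keep a small mapping of special cases and let everything else fall through to a default. The helper case_when_expr and the specialty names below are hypothetical, since the PR's specialties list is not shown here.

```python
# Hypothetical helper: turns a {specialty: value} mapping into a Spark SQL
# CASE WHEN expression; every unlisted specialty takes the default value.
def case_when_expr(column: str, special_cases: dict[str, str], default: str) -> str:
    branches = " ".join(
        f"WHEN {column} = '{key}' THEN '{value}'"
        for key, value in special_cases.items()
    )
    return f"CASE {branches} ELSE '{default}' END"


# Illustrative specialty names only (assumption, not taken from the PR).
sub_specialty_expr = case_when_expr(
    "specialty",
    {"Cardiology": "Interventional", "Pediatrics": "Pediatric"},
    default="General",
)
print(sub_specialty_expr)

# Assumed usage inside the providers spec (not executed here):
#   .withColumn("sub_specialty", "string",
#               expr=sub_specialty_expr, baseColumn="specialty")
```

Only the special cases need explicit branches, so the expression stays short even with a long specialties list.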

ghanse · 2026-06-02T23:30:27Z

+    .withColumn("patient_id",          "string",
+                expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))")
+    .withColumn("provider_id",         "string",
+                expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))")


Would be nice to reuse PATIENT_COUNT and PROVIDER_COUNT instead of hard-coding so the foreign keys stay valid if the user changes the counts.
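
A minimal sketch of that idea, assuming the parent tables use ids of the form PAT-1 .. PAT-N and PRV-1 .. PRV-N (the parent id definitions are not shown in this thread). The helper fk_expr is hypothetical; the counts are taken from the PR description.

```python
PATIENT_COUNT = 500_000   # assumed to be the row count of the patients spec
PROVIDER_COUNT = 50_000   # assumed to be the row count of the providers spec


# Hypothetical helper: builds the foreign-key expression from the parent count
# so the key range follows the constant instead of a hard-coded literal.
def fk_expr(prefix: str, parent_count: int) -> str:
    # cast(rand() * N as int) falls in 0..N-1, so adding 1 yields ids 1..N.
    return f"concat('{prefix}', cast(cast(rand() * {parent_count} as int) + 1 as string))"


patient_fk = fk_expr("PAT-", PATIENT_COUNT)
provider_fk = fk_expr("PRV-", PROVIDER_COUNT)
print(patient_fk)
print(provider_fk)

# Assumed usage inside the encounters spec (not executed here):
#   .withColumn("patient_id", "string", expr=patient_fk)
#   .withColumn("provider_id", "string", expr=provider_fk)
```

Note that multiplying by the full count, rather than count minus one as in the diff, lets the highest parent id be drawn as well.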

ghanse · 2026-06-02T23:39:21Z

+                expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))")
+    .withColumn("drug_name",           "string", values=medications, random=True)
+    .withColumn("ndc_code",            "string", values=ndc_prefixes, random=True)
+    .withColumn("dose_mg",             "double",


Maybe rename to "dose" since the next column specifies the unit of measure?

ghanse · 2026-06-02T23:40:55Z

+    .withColumn("loinc_code",          "string", values=loinc_codes, random=True)
+    .withColumn("test_name",           "string", values=loinc_names, random=True)


These can also be correlated.
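
A possible sketch of the correlation: draw one random index per row and look up both columns from it, so the code and the name always come from the same pair. The helper lookup_expr, the loinc_idx column, and the three pairs below are illustrative assumptions; the PR's loinc_codes and loinc_names lists are not shown.

```python
# Illustrative (code, name) pairs only; assumed, not taken from the PR.
LOINC_TESTS = [("2345-7", "Glucose"), ("718-7", "Hemoglobin"), ("2160-0", "Creatinine")]


# Hypothetical helper: builds a Spark SQL array lookup keyed by an index column.
def lookup_expr(values: list[str], index_col: str) -> str:
    # element_at on an array is 1-based in Spark SQL, hence the "+ 1".
    items = ", ".join(f"'{v}'" for v in values)
    return f"element_at(array({items}), {index_col} + 1)"


loinc_code_expr = lookup_expr([code for code, _ in LOINC_TESTS], "loinc_idx")
test_name_expr = lookup_expr([name for _, name in LOINC_TESTS], "loinc_idx")
print(loinc_code_expr)
print(test_name_expr)

# Assumed usage inside the labs spec (not executed here):
#   .withColumn("loinc_idx", "int", minValue=0, maxValue=len(LOINC_TESTS) - 1,
#               random=True, omit=True)
#   .withColumn("loinc_code", "string", expr=loinc_code_expr, baseColumn="loinc_idx")
#   .withColumn("test_name", "string", expr=test_name_expr, baseColumn="loinc_idx")
```

With omit=True the helper index is used during generation but left out of the final output.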

ghanse · 2026-06-02T23:42:45Z

+    .withColumn("reference_range_low", "double",
+                values=ref_range_low, random=True)
+    .withColumn("reference_range_high","double",
+                values=ref_range_high, random=True)


Because both columns are sampled independently, reference_range_low might be greater than reference_range_high in this case. I suppose we need to either correlate these values or use an expression with some randomness.
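
A sketch of the expression option, assuming reference_range_high is derived from reference_range_low by adding a strictly positive random width. The helper range_high_expr and the width bounds are hypothetical and carry no clinical meaning.

```python
# Hypothetical helper: builds a Spark SQL expression for an upper bound that is
# always above the lower-bound column by a random, strictly positive width.
def range_high_expr(low_col: str, min_width: float, max_width: float) -> str:
    if not 0 < min_width <= max_width:
        raise ValueError("widths must satisfy 0 < min_width <= max_width")
    spread = max_width - min_width
    # low + min_width + rand() * spread stays in [low + min_width, low + max_width)
    return f"{low_col} + {min_width} + rand() * {spread}"


high_expr = range_high_expr("reference_range_low", 1.0, 20.0)
print(high_expr)

# Assumed usage inside the labs spec (not executed here):
#   .withColumn("reference_range_high", "double",
#               expr=high_expr, baseColumn="reference_range_low")
```

Rejecting a non-positive minimum width up front keeps the generated high value from ever equalling or dropping below the low value.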

ghanse · 2026-06-02T23:43:46Z

+    .withColumn("cpt_code",            "string", values=cpt_codes, random=True)
+    .withColumn("cpt_description",     "string", values=cpt_descriptions, random=True)


These values should be correlated too, so that each cpt_code is paired with its matching cpt_description.
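
One way to sketch this without a separate index column: sample a single combined "code::description" value and split it into the two output columns. The cpt_pair column, the "::" delimiter, and the pairs below are illustrative assumptions; the PR's cpt_codes and cpt_descriptions lists are not shown.

```python
DELIMITER = "::"  # assumed not to occur in any code or description

# Illustrative (code, description) pairs only; assumed, not taken from the PR.
CPT_PAIRS = [
    ("99213", "Office visit, established patient"),
    ("99285", "Emergency department visit"),
    ("36415", "Routine venipuncture"),
]

# Each value carries both fields, so a single random draw keeps them together.
cpt_pair_values = [f"{code}{DELIMITER}{desc}" for code, desc in CPT_PAIRS]

# Spark SQL expressions that pull the two fields back apart.
cpt_code_expr = f"split(cpt_pair, '{DELIMITER}')[0]"
cpt_description_expr = f"split(cpt_pair, '{DELIMITER}')[1]"
print(cpt_pair_values[0])

# Assumed usage inside the claims spec (not executed here):
#   .withColumn("cpt_pair", "string", values=cpt_pair_values, random=True, omit=True)
#   .withColumn("cpt_code", "string", expr=cpt_code_expr, baseColumn="cpt_pair")
#   .withColumn("cpt_description", "string",
#               expr=cpt_description_expr, baseColumn="cpt_pair")
```

The delimiter is chosen to avoid regex metacharacters, since the second argument of Spark's split is treated as a pattern.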

ghanse · 2026-06-02T23:49:09Z

+# COMMAND ----------
+# MAGIC %md
+# MAGIC ## Optional — Persist to Delta Tables
+
+# COMMAND ----------
+
+# Uncomment to write all datasets to Delta (requires a catalog / schema to exist)
+#
+# TARGET_CATALOG = "main"
+# TARGET_SCHEMA  = "healthcare_synthetic"
+#
+# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {TARGET_CATALOG}.{TARGET_SCHEMA}")
+#
+# datasets = {
+#     "patients":         patients_df,
+#     "providers":        providers_df,
+#     "encounters":       encounters_df,
+#     "diagnoses":        diagnoses_df,
+#     "medications":      medications_df,
+#     "lab_results":      labs_df,
+#     "insurance_claims": claims_df,
+# }
+#
+# for table_name, df in datasets.items():
+#     full_name = f"{TARGET_CATALOG}.{TARGET_SCHEMA}.{table_name}"
+#     df.write.format("delta").mode("overwrite").saveAsTable(full_name)
+#     print(f"Saved {full_name}")


Let's remove this part. It's nice for the demos to simply generate the data without writing it.

feat - add comprehensive HLS synthetic data notebook (databrickslabs#356)

1ff0811

Vsatyam013 marked this pull request as ready for review May 31, 2026 08:18

Vsatyam013 requested review from a team as code owners May 31, 2026 08:18

Vsatyam013 requested review from suryasaitura-db and removed request for a team May 31, 2026 08:18

ghanse requested review from Copilot and ghanse June 2, 2026 23:04

Copilot started reviewing on behalf of ghanse June 2, 2026 23:05 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

ghanse requested changes Jun 2, 2026

View reviewed changes

ghanse added the under-review label Jun 4, 2026


3 participants