feat - add comprehensive HLS synthetic data notebook (#356) #434
Vsatyam013 wants to merge 1 commit into
Hi @suryasaitura-db, I wanted to follow up on this PR. The HLS synthetic data notebook is complete and resolves #356: it generates realistic interconnected healthcare data including 500K patients, 5M encounters, 15M ICD-10 diagnoses, 20M lab results with LOINC codes, and 3M insurance claims, along with built-in analytical queries for readmission risk, medication non-adherence, and claim denial rates. Would love to get your feedback or any change requests so I can get this ready to merge. Happy to make any adjustments you need!
Pull request overview
Adds a new Healthcare & Life Sciences (HLS) synthetic data generation demo under examples, using dbldatagen to generate multiple healthcare datasets and run sample analytical queries.
Changes:
- Added a large Databricks-style notebook (exported as .py) that generates patients, providers, encounters, diagnoses, medications, labs, and claims plus example analytics.
- Updated CHANGELOG.md to note the new HLS demo.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| examples/healthcare_datagen_demo.py | New end-to-end HLS synthetic data generation demo with multiple datasets and analytical queries. |
| CHANGELOG.md | Adds an unreleased changelog entry for the new HLS example. |
| # COMMAND ---------- | ||
| # MAGIC %pip install dbldatagen | ||
| dbutils.library.restartPython() | ||
| # COMMAND ---------- | ||
| import dbldatagen as dg | ||
| from dbldatagen import DataGenerator | ||
| from pyspark.sql import functions as F |
| .withColumn("patient_id", "string", | ||
| expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))") | ||
| .withColumn("provider_id", "string", | ||
| expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))") | ||
| .withColumn("facility_id", "string", template=r"FAC-ddddd", random=True) |
| .withColumn("encounter_id", "string", | ||
| expr="concat('ENC-', cast(cast(rand() * 4999999 as int) + 1 as string))") | ||
| .withColumn("patient_id", "string", | ||
| expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))") | ||
| .withColumn("icd10_code", "string", values=icd10_codes, random=True) |
| .withColumn("icd10_code", "string", values=icd10_codes, random=True) | ||
| .withColumn("icd10_description", "string", values=icd10_descriptions, random=True) | ||
| .withColumn("diagnosis_type", "string", |
| .withColumn("loinc_code", "string", values=loinc_codes, random=True) | ||
| .withColumn("test_name", "string", values=loinc_names, random=True) | ||
| .withColumn("result_value", "double", |
| .withColumn("cpt_code", "string", values=cpt_codes, random=True) | ||
| .withColumn("cpt_description", "string", values=cpt_descriptions, random=True) |
| print(f"Prescriptions generated: {medications_df.count():,}") | ||
| non_adherent = medications_df.filter("NOT is_adherent").count() | ||
| total_rx = medications_df.count() | ||
| print(f"Non-adherence rate: {non_adherent/total_rx:.1%}") |
| print(f"Lab results generated: {labs_df.count():,}") | ||
| abnormal_pct = labs_df.filter("is_abnormal").count() / labs_df.count() | ||
| critical_pct = labs_df.filter("critical_flag").count() / labs_df.count() | ||
| print(f"Abnormal rate: {abnormal_pct:.1%} | Critical rate: {critical_pct:.1%}") |
ghanse left a comment
Very nice contribution. Left some minor feedback. Looking good overall.
We generate updates to CHANGELOG with a tagged release PR. Can you revert this change?
| # COMMAND ---------- | ||
| import dbldatagen as dg | ||
| from dbldatagen import DataGenerator |
We import the full namespace so probably best to just use dg.DataGenerator?
| .withColumn("registration_date", "date", | ||
| begin="2000-01-01", end="2024-06-01", random=True) |
Maybe use expr here to demo expressions and ensure registration_date is after date_of_birth.
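One way this could look, as a rough sketch rather than the notebook's actual code: derive registration_date from date_of_birth by adding a random number of days, so it can never fall before the birth date. The 2024-06-01 upper bound comes from the snippet above. The helper registration_date is hypothetical and only mirrors the SQL arithmetic so the logic can be checked without Spark. The sketch assumes date_of_birth is never later than the upper bound.

```python
from datetime import date, timedelta

# Upper bound taken from the end= value in the reviewed snippet.
REGISTRATION_END = date(2024, 6, 1)

# Spark SQL expression: date_of_birth plus a random number of days between
# 0 and the gap from date_of_birth to REGISTRATION_END.
registration_expr = (
    "date_add(date_of_birth, "
    f"cast(rand() * datediff(date '{REGISTRATION_END.isoformat()}', date_of_birth) as int))"
)


def registration_date(date_of_birth: date, u: float) -> date:
    """Hypothetical pure-Python mirror of registration_expr.

    u stands in for rand() and is expected to lie in [0, 1).
    """
    gap_days = (REGISTRATION_END - date_of_birth).days
    return date_of_birth + timedelta(days=int(u * gap_days))


# Assumed usage in the notebook (not executed here):
# .withColumn("registration_date", "date",
#             expr=registration_expr, baseColumn="date_of_birth")
```

This version drops the original 2000-01-01 lower bound. If that bound matters, wrapping the expression in greatest(..., date '2000-01-01') should restore it.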
| .withColumn("credential", "string", | ||
| values=["MD","DO","NP","PA","RN","PhD","DDS"], | ||
| weights=[50, 15, 15, 10, 5, 3, 2]) | ||
| .withColumn("specialty", "string", values=specialties, random=True) | ||
| .withColumn("sub_specialty", "string", | ||
| values=["General","Interventional","Pediatric","Geriatric","Surgical","None"], | ||
| weights=[40, 15, 15, 10, 10, 10]) |
Would be nice to show how to realistically generate the credential and/or sub_specialty from the specialty using something like CASE WHEN expressions.
Shrinking the list of specialties or creating a few special cases and letting others fall into the default case should reduce the number of conditional statements required.
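A minimal sketch of the "few special cases plus a default" idea, assuming hypothetical specialty names (the notebook's actual specialties list is not shown here). The helper case_when is illustrative, not part of dbldatagen; it only builds the SQL string that would be handed to expr.

```python
def case_when(column: str, mapping: dict, default: str) -> str:
    """Build a Spark SQL CASE WHEN expression from a small lookup.

    Assumes keys and values contain no single quotes.
    """
    branches = " ".join(
        f"WHEN {column} = '{key}' THEN '{value}'" for key, value in mapping.items()
    )
    return f"CASE {branches} ELSE '{default}' END"


# Hypothetical special cases; every other specialty falls into the default.
SUB_SPECIALTY_BY_SPECIALTY = {
    "Cardiology": "Interventional",
    "Pediatrics": "Pediatric",
    "Orthopedics": "Surgical",
}

sub_specialty_expr = case_when("specialty", SUB_SPECIALTY_BY_SPECIALTY, "General")

# Assumed usage in the notebook (not executed here):
# .withColumn("sub_specialty", "string",
#             expr=sub_specialty_expr, baseColumn="specialty")
```

Keeping the special cases in a dict means adding or removing a specialty only touches the lookup, not the expression text.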
| .withColumn("patient_id", "string", | ||
| expr="concat('PAT-', cast(cast(rand() * 499999 as int) + 1 as string))") | ||
| .withColumn("provider_id", "string", | ||
| expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))") |
Would be nice to reuse PATIENT_COUNT and PROVIDER_COUNT instead of hard-coding so the foreign keys stay valid if the user changes the counts.
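A sketch of how the counts could drive the foreign-key expressions. PATIENT_COUNT and PROVIDER_COUNT are the names from the comment above; the values used here are assumptions based on the PR description and the hard-coded literals, and foreign_key_expr is a hypothetical helper. It also assumes the parent tables number their IDs from 1 to the row count with the same prefix.

```python
PATIENT_COUNT = 500_000   # assumed value, per the PR description
PROVIDER_COUNT = 50_000   # assumed value, inferred from the 49999 literal


def foreign_key_expr(prefix: str, row_count: int) -> str:
    """Build a Spark SQL expression that picks a random ID in 1..row_count.

    rand() is in [0, 1), so cast(rand() * row_count as int) is in
    0..row_count-1, and adding 1 shifts it to 1..row_count.
    """
    return f"concat('{prefix}-', cast(cast(rand() * {row_count} as int) + 1 as string))"


patient_fk_expr = foreign_key_expr("PAT", PATIENT_COUNT)
provider_fk_expr = foreign_key_expr("PRV", PROVIDER_COUNT)

# Assumed usage in the notebook (not executed here):
# .withColumn("patient_id", "string", expr=patient_fk_expr)
# .withColumn("provider_id", "string", expr=provider_fk_expr)
```

Multiplying by the count itself rather than the count minus one means the highest ID can also be referenced; with the hard-coded 499999 the generated range stops one short of 500000.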
| expr="concat('PRV-', cast(cast(rand() * 49999 as int) + 1 as string))") | ||
| .withColumn("drug_name", "string", values=medications, random=True) | ||
| .withColumn("ndc_code", "string", values=ndc_prefixes, random=True) | ||
| .withColumn("dose_mg", "double", |
Maybe rename to "dose" since the next column specifies the unit of measure?
| .withColumn("loinc_code", "string", values=loinc_codes, random=True) | ||
| .withColumn("test_name", "string", values=loinc_names, random=True) |
These can also be correlated.
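One possible way to correlate them, sketched under assumptions: keep the code and name as pairs, generate loinc_code from the codes, and derive test_name with a map lookup on that column. The three pairs below are illustrative examples, not the notebook's actual loinc_codes and loinc_names lists, and lookup_expr is a hypothetical helper.

```python
# Illustrative (code, name) pairs; the notebook's real lists may differ.
LOINC_TESTS = [
    ("2345-7", "Glucose"),
    ("718-7", "Hemoglobin"),
    ("2160-0", "Creatinine"),
]

# Codes alone, for the randomly generated loinc_code column.
loinc_codes = [code for code, _ in LOINC_TESTS]


def lookup_expr(key_column: str, pairs) -> str:
    """Build a Spark SQL map lookup so the derived value always matches the key.

    Assumes keys and values contain no single quotes.
    """
    args = ", ".join(f"'{key}', '{value}'" for key, value in pairs)
    return f"element_at(map({args}), {key_column})"


test_name_expr = lookup_expr("loinc_code", LOINC_TESTS)

# Assumed usage in the notebook (not executed here):
# .withColumn("loinc_code", "string", values=loinc_codes, random=True)
# .withColumn("test_name", "string", expr=test_name_expr, baseColumn="loinc_code")
```

The same pattern would presumably apply to other code and description pairs that are currently drawn independently.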
| .withColumn("reference_range_low", "double", | ||
| values=ref_range_low, random=True) | ||
| .withColumn("reference_range_high","double", | ||
| values=ref_range_high, random=True) |
reference_range_low might be greater than reference_range_high in this case. I suppose we need to either correlate these values or use an expression with some randomness.
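A sketch of the "expression with some randomness" option: keep reference_range_low as generated and define reference_range_high as the low value plus a strictly positive random width, so high is always greater than low. MIN_WIDTH and MAX_WIDTH are hypothetical tuning values, and reference_range_high is a pure-Python mirror of the SQL arithmetic for checking the logic without Spark.

```python
MIN_WIDTH = 5.0    # hypothetical smallest gap between low and high
MAX_WIDTH = 50.0   # hypothetical largest gap between low and high

# Spark SQL expression: low plus a random width in [MIN_WIDTH, MAX_WIDTH).
reference_range_high_expr = (
    f"reference_range_low + {MIN_WIDTH} + rand() * {MAX_WIDTH - MIN_WIDTH}"
)


def reference_range_high(low: float, u: float) -> float:
    """Hypothetical mirror of reference_range_high_expr; u stands in for rand()."""
    return low + MIN_WIDTH + u * (MAX_WIDTH - MIN_WIDTH)


# Assumed usage in the notebook (not executed here):
# .withColumn("reference_range_high", "double",
#             expr=reference_range_high_expr, baseColumn="reference_range_low")
```

This guarantees ordering but not clinical realism. Tying both bounds to loinc_code would be the correlated alternative mentioned in the comment.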
| .withColumn("cpt_code", "string", values=cpt_codes, random=True) | ||
| .withColumn("cpt_description", "string", values=cpt_descriptions, random=True) |
| # COMMAND ---------- | ||
| # MAGIC %md | ||
| # MAGIC ## Optional — Persist to Delta Tables | ||
| # COMMAND ---------- | ||
| # Uncomment to write all datasets to Delta (requires a catalog / schema to exist) | ||
| # | ||
| # TARGET_CATALOG = "main" | ||
| # TARGET_SCHEMA = "healthcare_synthetic" | ||
| # | ||
| # spark.sql(f"CREATE SCHEMA IF NOT EXISTS {TARGET_CATALOG}.{TARGET_SCHEMA}") | ||
| # | ||
| # datasets = { | ||
| # "patients": patients_df, | ||
| # "providers": providers_df, | ||
| # "encounters": encounters_df, | ||
| # "diagnoses": diagnoses_df, | ||
| # "medications": medications_df, | ||
| # "lab_results": labs_df, | ||
| # "insurance_claims": claims_df, | ||
| # } | ||
| # | ||
| # for table_name, df in datasets.items(): | ||
| # full_name = f"{TARGET_CATALOG}.{TARGET_SCHEMA}.{table_name}" | ||
| # df.write.format("delta").mode("overwrite").saveAsTable(full_name) | ||
| # print(f"Saved {full_name}") |
Let's remove this part. It's nice for the demos to simply generate the data without writing it.
Changes
Adds a comprehensive Healthcare and Life Sciences (HLS) industry demo to the examples directory. The healthcare_datagen_demo.py script utilizes dbldatagen to generate highly realistic, interconnected synthetic data for healthcare systems, including:
Linked issues
Resolves #356
Tests
Documentation and Demos