
data-engineering-assessment

Table of Contents

  1. Description
  2. Usage
  3. Scheduling

Description

Given the CMS provider data metastore, write a script that downloads all data sets related to the theme "Hospitals".

The column names in the CSV headers are currently in mixed case with spaces and special characters. Convert all column names to snake_case. (Example: "Patients' rating of the facility linear mean score" becomes "patients_rating_of_the_facility_linear_mean_score".)

The CSV files should be downloaded and processed in parallel, and the job should be designed to run every day, downloading only files that have been modified since the previous run (this requires tracking run metadata).

https://data.cms.gov/provider-data/api/1/metastore/schemas/dataset/items
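
For orientation, here is a minimal sketch (separate from the PySpark job described below) of querying the metastore and downloading the matching CSVs in parallel with requests and a thread pool. The `theme`, `distribution`, `downloadURL`, and `identifier` fields are assumed from the metastore's response format, and the helper names are illustrative, not part of the repository:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    API = "https://data.cms.gov/provider-data/api/1/metastore/schemas/dataset/items"

    def hospital_datasets():
      """ return metastore items whose theme includes "Hospitals" (field names assumed) """
      items = requests.get(API, timeout=60).json()
      return [d for d in items if "Hospitals" in d.get("theme", [])]

    def download(item):
      """ fetch the first distribution of one dataset item """
      url = item["distribution"][0]["downloadURL"]
      resp = requests.get(url, timeout=300)
      resp.raise_for_status()
      return item["identifier"], resp.content

    # download the matching CSVs in parallel
    with ThreadPoolExecutor(max_workers=8) as pool:
      results = list(pool.map(download, hospital_datasets()))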

Usage

This program was developed in Google Colab.

Install dependencies:

    $ pip install -r ./requirements.txt

    $ make clean

    $ make init

Then, run the "make" command to execute main.py:

    $ make

Scheduling

    import time

    import schedule
    from pytz import timezone

    def main(data_location="metadata.parquet"):
      # run job() once a day at 20:43 US Central time
      schedule.every().day.at("20:43:00", timezone("America/Chicago")).do(job)
      while True:
        schedule.run_pending()
        time.sleep(1)
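
Because schedule only fires while the process stays alive, main() is expected to be left running. A minimal entry point, assuming main.py is executed directly:

    if __name__ == "__main__":
      # keep the process alive so schedule.run_pending() keeps firing
      main()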

Job

    from pyspark.sql import SparkSession

    def job():
      """ Author: Gabriel Hofer """
      spark = SparkSession.builder.getOrCreate()
      tgt_df = read_tgt_df(spark)                    # previously tracked metadata
      src_df = get_data(schema_camel)                # fresh pull from the metastore
      filtered = filter_by_hospitals_theme(src_df)
      case_converted = cols_to_snake_case(filtered)
      new_tgt_df = upsert(tgt_df, case_converted)
      write_tgt_df(new_tgt_df)
      spark.stop()
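
get_data and filter_by_hospitals_theme are defined elsewhere in the repository; a rough sketch of what they could look like, assuming the metastore items are loaded with requests and that theme is an array column in schema_camel:

    import requests
    from pyspark.sql import SparkSession, functions as F

    API = "https://data.cms.gov/provider-data/api/1/metastore/schemas/dataset/items"

    def get_data(schema):
      """ load the metastore items into a Spark DataFrame with the given schema """
      items = requests.get(API, timeout=60).json()
      spark = SparkSession.builder.getOrCreate()
      return spark.createDataFrame(items, schema=schema)

    def filter_by_hospitals_theme(df):
      """ keep only datasets whose theme array contains "Hospitals" """
      return df.filter(F.array_contains(F.col("theme"), "Hospitals"))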

Reading & Writing DataFrames

    def read_tgt_df(spark, data_location="metadata.parquet"):
      return spark.read.schema(schema_snake).parquet(data_location)
    
    def write_tgt_df(tgt_df, data_location="metadata.parquet"):
      tgt_df.write.parquet(data_location, mode="overwrite", compression="snappy")
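
The upsert step referenced in job() is not shown above. One possible sketch, assuming each dataset row carries identifier and modified columns, unions the new pull with the tracked metadata and keeps the most recently modified row per identifier:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    def upsert(tgt_df, src_df):
      """ merge the new pull into the tracked metadata; newest row wins per identifier """
      merged = tgt_df.unionByName(src_df)
      w = Window.partitionBy("identifier").orderBy(F.col("modified").desc())
      return (merged
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))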

Convert cols to Snake Case

    import re

    def cols_to_snake_case(df):
      """ convert column names to snake case """
      for col in df.columns:
        # insert an underscore before each interior uppercase letter, then lowercase
        new_col = re.sub(r"(?<!^)(?=[A-Z])", "_", col).lower()
        df = df.withColumnRenamed(col, new_col)
      return df
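
The regex above splits camelCase names, in line with the schema_camel naming used upstream. For raw CSV headers with spaces and special characters, as in the Description's example, a variant along these lines would apply; header_to_snake_case is an illustrative helper, not part of the repository:

    import re

    def header_to_snake_case(name: str) -> str:
      """ lowercase, drop punctuation, and join words with underscores """
      words = re.findall(r"[A-Za-z0-9]+", name)
      return "_".join(w.lower() for w in words)

    header_to_snake_case("Patients' rating of the facility linear mean score")
    # -> "patients_rating_of_the_facility_linear_mean_score"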

About

Practice with the PySpark and requests libraries.
