A Python package for running data quality checks on any Common Data Model.
It provides utilities to validate data integrity rules, such as constraints and relationships, with structured logging and result summaries. The package is designed to be extensible, allowing additional data quality tests to be added as needed.
-
Clone the repository:
git clone https://github.com/PEDSnet/infomodels-duckdb.git cd infomodels-duckdb -
Edit the configuration:
Copy the template configuration file and update it as needed:
cp config.yml.docker_template config.yml
Then, edit
config.ymlto match your submission and site file format. -
Build the Docker image:
docker build -t infomodels-duckdb . -
Run the main script in a container:
docker run --rm -it \ -v PATH_TO_YOUR_CDM_DIR:/data \ -v PATH_TO_YOUR_RESULT_DIR:/result \ infomodels-duckdb
-
Clone the repository:
git clone https://github.com/PEDSnet/infomodels-duckdb.git cd infomodels-duckdb -
Activate virtual environment and install dependencies (optional, but recommended):
This isolates your Python environment for the project. Alternatively, you may install the package and its dependencies system-wide if preferred.python -m venv .venv source .venv/bin/activate pip install -r requirements.txt -
Edit the configuration:
Copy the template configuration file and update it as needed:
cp config.yml.standalone_template config.yml
Then, edit
config.ymlto match your submission and site file format. -
Run the main script:
python -m src.main
The following data quality checks are currently supported:
- Missing Submission File: Detects required files that are missing from the submission.
- Extra Submission File: Detects unexpected files present in the submission.
- Duplicated Column in CSV: Identifies duplicate column names in CSV headers.
- Extra Column in CSV: Flags columns in CSV files that are not defined in the data model.
- Missing Column in CSV: Flags columns defined in the data model that are missing from the CSV file.
- Data Type: The data types in the CSV files conform to the column definitions specified in the CDM.
- NOT NULL Violation: Ensures specified columns do not contain NULL values.
- Distinct Violation: Ensures specified columns (or combinations) contain only unique values.
- Primary Key Violation: Checks that primary key columns are both NOT NULL and unique.
- Foreign Key Violation: Checks that values in a main table reference valid values in a related table.
More checks will be added.