A Python library for validating files in the .pb (Pabulib) format, ensuring compliance with the standards described at pabulib.org/format.
pip install git+https://github.com/pabulib/checker.git
- pycountry should be installed
- tests should be run before deployment - CI/CD
- Should add correct order (to change it) not the actual one
The Checker is a utility for processing and validating .pb files. It runs a range of checks to ensure data consistency across the meta, projects, and votes sections. Code suggestions and changes are welcome.
- Budget Validation: Ensures that project costs align with the defined budget and checks for overages.
- Vote and Project Count Validation: Cross-verifies counts in metadata against actual data.
- Vote Length Validation: Validates that each voter’s submissions comply with minimum and maximum limits.
- Duplicate Votes Detection: Identifies repeated votes within individual submissions.
- Project Selection Validation: Ensures compliance with defined selection rules, such as Poznań or greedy algorithms.
- Field Structure Validation: Verifies field presence, order, types, and constraints in metadata, projects, and votes.
- Date Range Validation: Checks that metadata contains a valid date range.
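To make two of the checks above concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function names and the data layout (a dict of projects with `cost` and `selected` fields, a vote as a list of project ids) are assumptions made for illustration.

```python
def budget_exceeded(budget, projects):
    """Return True if the selected projects cost more than the budget."""
    selected_cost = sum(p["cost"] for p in projects.values() if p["selected"])
    return selected_cost > budget


def duplicated_projects(vote):
    """Return project ids that occur more than once in a single vote."""
    seen, dupes = set(), []
    for pid in vote:
        if pid in seen and pid not in dupes:
            dupes.append(pid)
        seen.add(pid)
    return dupes


# Hypothetical data, only to exercise the two helpers.
projects = {
    "1": {"cost": 600, "selected": True},
    "2": {"cost": 500, "selected": True},
    "3": {"cost": 200, "selected": False},
}
print(budget_exceeded(1000, projects))        # True: 600 + 500 > 1000
print(duplicated_projects(["1", "3", "1"]))   # ['1']
```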
The results from the validation process include three main sections:
metadata: Tracks the overall processing statistics:
- processed: Total number of files processed.
- valid: Count of valid files.
- invalid: Count of invalid files.
summary: Provides aggregated error and warning counts by type for all processed files. Example:
{
"empty lines": 3,
"comma in float!": 2,
"budget exceeded": 1
}
The per-file entries detail the outcomes for each processed file. Each includes:
- webpage_name: Generated name based on metadata.
- results: "File looks correct!" if there are no errors or warnings; otherwise, detailed errors and warnings.
Example output for a valid file:
{
"metadata": {
"processed": 1,
"valid": 1,
"invalid": 0
},
"summary": {},
"file1": {
"webpage_name": "Country_Unit_Instance_Subunit",
"results": "File looks correct!"
}
}
Example output for an invalid file:
{
"metadata": {
"processed": 1,
"valid": 0,
"invalid": 1
},
"summary": {
"empty lines": 1,
"comma in float!": 1
},
"file1": {
"webpage_name": "Country_Unit_Instance_Subunit",
"results": {
"errors": {
"empty lines": {
1: "contains empty lines at: [10, 20]"
},
"comma in float!": {
1: "in budget"
}
},
"warnings": {
"wrong projects fields order": {
1: "projects wrong fields order: ['name', 'cost', 'selected']."
}
}
}
}
}
Errors are critical issues that need to be fixed:
- Empty Lines: contains empty lines at: [line_numbers]
- Comma in Float: comma in float value at {field}
- Project with No Cost: project: {project_id} has no cost!
- Single Project Exceeded Whole Budget: project {project_id} has exceeded the whole budget!
- Budget Exceeded: Budget exceeded by selected projects
- Fully Funded Flag Discrepancy: fully_funded flag different than 1!
- Unused Budget: Unused budget could fund project: {project_id}
- Different Number of Votes: votes number in META: {meta_votes} vs counted from file: {file_votes}
- Different Number of Projects: projects number in META: {meta_projects} vs counted from file: {file_projects}
- Vote with Duplicated Projects: duplicated projects in a vote: {voter_id}
- Vote Length Exceeded: Voter ID: {voter_id}, max vote length exceeded
- Vote Length Too Short: Voter ID: {voter_id}, min vote length not met
- Different Values in Votes: file votes vs counted votes mismatch for project: {project_id}
- Different Values in Scores: file scores vs counted scores mismatch for project: {project_id}
- No Votes or Scores in Projects: No votes or scores found in PROJECTS section
- Invalid Field Value: field '{field_name}' has invalid value
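As an illustration of how two of these errors could be detected, here is a rough sketch in plain Python. The helper names and the record layout are assumptions, and the library's own logic may differ (for example in how it treats the fully_funded flag).

```python
def fields_with_comma(record, float_fields):
    """Return names of numeric fields whose raw text contains a comma."""
    return [name for name in float_fields if "," in str(record.get(name, ""))]


def fundable_with_unused_budget(budget, projects):
    """Return ids of unselected projects the leftover budget could pay for."""
    spent = sum(p["cost"] for p in projects.values() if p["selected"])
    remaining = budget - spent
    return [
        pid
        for pid, p in projects.items()
        if not p["selected"] and p["cost"] <= remaining
    ]


# Hypothetical raw metadata: the budget uses a decimal comma.
meta = {"budget": "1000,50", "num_votes": "120"}
print(fields_with_comma(meta, ["budget"]))  # ['budget']

# Hypothetical projects: 400 is left after funding project "1".
projects = {
    "1": {"cost": 600, "selected": True},
    "2": {"cost": 500, "selected": False},
    "3": {"cost": 300, "selected": False},
}
print(fundable_with_unused_budget(1000, projects))  # ['3']
```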
Warnings are non-critical issues that should be reviewed:
- Wrong Field Order: {section_name} contains fields in wrong order: {fields_list}
- Poznań Rule Not Followed: Projects not selected but should be: {project_ids}
- Greedy Rule Not Followed: Projects selected but should not: {project_ids}
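The error and warning names appear as keys under each file's results. A minimal sketch of collecting them per file follows. It assumes the result layout shown in the example output, where every top-level key other than metadata and summary is a file entry. `issue_types` is a hypothetical helper, not part of the library.

```python
RESERVED_KEYS = {"metadata", "summary"}


def issue_types(results):
    """Map each file name to the error and warning types reported for it."""
    out = {}
    for name, entry in results.items():
        if name in RESERVED_KEYS:
            continue
        details = entry["results"]
        if isinstance(details, str):
            # A plain string such as "File looks correct!" means no issues.
            out[name] = {"errors": [], "warnings": []}
        else:
            out[name] = {
                "errors": sorted(details.get("errors", {})),
                "warnings": sorted(details.get("warnings", {})),
            }
    return out


# Sample copied from the invalid-file example output.
sample = {
    "metadata": {"processed": 1, "valid": 0, "invalid": 1},
    "summary": {"empty lines": 1, "comma in float!": 1},
    "file1": {
        "webpage_name": "Country_Unit_Instance_Subunit",
        "results": {
            "errors": {
                "empty lines": {1: "contains empty lines at: [10, 20]"},
                "comma in float!": {1: "in budget"},
            },
            "warnings": {
                "wrong projects fields order": {
                    1: "projects wrong fields order: ['name', 'cost', 'selected']."
                }
            },
        },
    },
}
print(issue_types(sample))
```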
- Ensure all dependencies are installed:
  - Python 3.8+
  - Required modules: pycountry
    pip install -r requirements.txt
- Install as a Python package directly from GitHub:
    pip install git+https://github.com/pabulib/checker.git
- Import the Checker class:
    from pabulib.checker import Checker
- Instantiate the Checker class:
    checker = Checker()
- Process files: use the process_files method, which takes a list of file paths or raw file contents.
    files = ["path/to/file1.pb", "raw content of file2"]
    results = checker.process_files(files)
- Get the results: currently, results is a Python dict.
    import json
    # summary of errors across all files
    print(json.dumps(results["summary"], indent=4))
    # processing metadata: how many files were processed, etc.
    print(json.dumps(results["metadata"], indent=4))
    # full details for every file
    print(results)
    # details for a single file, for example:
    # print(results[<file_name>])
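Because the results are a plain dict, they are easy to use in automation. The sketch below turns the metadata counts into a process exit code, which could be useful in a CI job. It assumes only the metadata layout shown in the example output, and `exit_code` is a hypothetical helper, not part of the library.

```python
import json


def exit_code(results):
    """Return 0 when no file was reported invalid, 1 otherwise."""
    return 1 if results["metadata"]["invalid"] > 0 else 0


# Stand-in for the dict returned by checker.process_files(files).
sample = {
    "metadata": {"processed": 1, "valid": 0, "invalid": 1},
    "summary": {"empty lines": 1, "comma in float!": 1},
}
print(json.dumps(sample["summary"], indent=4))
print(exit_code(sample))  # 1, because one file is invalid
```

In a script, `sys.exit(exit_code(results))` would make the job fail whenever any file is invalid.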
You can process example .pb files using the script examples/run_examples.py. This script demonstrates how to use the Checker to validate files.
- Example files are located in the examples/ directory:
  - example_valid.pb: A valid .pb file.
  - example_invalid.pb: A .pb file containing errors.
- Run the script:
    python examples/run_examples.py
  The results for both valid and invalid files will be printed in JSON format.
To add new validation rules or checks:
- Define a new method in the Checker class.
- Integrate it into the run_checks method for sequential execution.
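As a rough idea of what the body of a new check might look like, here is a standalone sketch of a date range check. Everything in it is an assumption for illustration: the function name, the `date_begin` and `date_end` field names, the `dd.mm.yyyy` format, and returning a message instead of recording an error. How a check is wired into Checker and run_checks depends on the class internals, so consult the source before adapting it.

```python
from datetime import datetime


def check_date_range(meta, date_format="%d.%m.%Y"):
    """Return an error message, or None when date_begin <= date_end."""
    try:
        begin = datetime.strptime(meta["date_begin"], date_format)
        end = datetime.strptime(meta["date_end"], date_format)
    except (KeyError, ValueError) as exc:
        # Missing field or text that does not match the assumed format.
        return f"invalid or missing date: {exc}"
    if begin > end:
        return "date_begin is later than date_end"
    return None


# Hypothetical metadata with the two dates in the wrong order.
print(check_date_range({"date_begin": "16.03.2023", "date_end": "15.03.2023"}))
```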
For detailed examples or advanced usage, refer to the comments in the source code.