A brisk, no-nonsense tool for anonymising AWS Cost and Usage Report (CUR) Parquet files. Remove or mask sensitive details, preserve data utility, and share your reports without breaking a sweat—or a compliance rule.
CUR2:
- Generate config: `python python/cur2anonymiser.py --input rawcur2.parquet --create-config --config config_cur2.json`
- Anonymise: `python python/cur2anonymiser.py --input rawcur2.parquet --output anonymisedcur2.parquet --config config_cur2.json`
- Anonymise to CSV: `python python/cur2anonymiser.py --input rawcur2.parquet --output anonymisedcur2.csv --config config_cur2.json`
Legacy CUR:
- Generate config: `python python/curanonymiser_legacy.py --input rawcur_legacy.parquet --create-config --config config_legacy.json`
- Anonymise: `python python/curanonymiser_legacy.py --input rawcur_legacy.parquet --output anonymisedcur_legacy.parquet --config config_legacy.json`
- Anonymise to CSV: `python python/curanonymiser_legacy.py --input rawcur_legacy.parquet --output anonymisedcur_legacy.csv --config config_legacy.json`
Focus (Generic - works for any table format):
- Generate config: `python python/focusanonymiser.py --input rawdata.parquet --create-config --config config_focus.json`
- Anonymise: `python python/focusanonymiser.py --input rawdata.parquet --output anonymised.parquet --config config_focus.json`
- Anonymise to CSV: `python python/focusanonymiser.py --input rawdata.parquet --output anonymised.csv --config config_focus.json`
- Supports legacy AWS CUR, CUR2, and generic Focus (Azure-style) formats
- Reads and writes Parquet (and CSV) via DuckDB; no Spark, no Java, no drama
- Anonymises AWS Account IDs and ARNs, ensuring consistency across the dataset
- Lets you hash columns, drop columns, or keep them as you fancy
- Simple, human-editable JSON config, easy to share, easy to tweak
- Auto-generates a config file from your Parquet columns
- CLI with helpful flags and no unnecessary faff
- MIT licensed, open source
- In CUR2, the `resource_tags` column is a single column containing JSON (not split into multiple columns); by default, it is hashed for anonymisation
- In Focus format, columns like `BillingAccountId`, `BillingAccountName`, `SubAccountId`, `SubAccountIdName`, `InvoiceId`, and `tag` are hashed by default; all others are kept unless changed in the config
```bash
pip install -r requirements.txt
```

- CUR2: All-lowercase, underscore-separated columns (e.g., `line_item_blended_cost`)
- Legacy CUR: Mixed case, slashes, or camelCase columns (e.g., `lineItem/UsageAccountId`)
- Focus: Generic tabular format (often Azure), e.g., columns like `BillingAccountId`, `ServiceName`, `Tags`, etc.
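Not sure which one you have? Here is a small, optional sketch that uses DuckDB's Python package to peek at the column names and guess the format; the file path is a placeholder and the heuristic is just a rule of thumb, not part of the tool:

```python
import duckdb

PARQUET_FILE = "yourfile.parquet"  # placeholder: point at your own export

# DESCRIBE lists one row per column; the first field is the column name.
columns = [row[0] for row in duckdb.sql(
    f"DESCRIBE SELECT * FROM '{PARQUET_FILE}'"
).fetchall()]

if any("/" in name for name in columns):
    print("Slashes in column names: looks like legacy CUR -> curanonymiser_legacy.py")
elif all(name == name.lower() for name in columns):
    print("All lowercase with underscores: looks like CUR2 -> cur2anonymiser.py")
else:
    print("Mixed-case names without slashes: probably Focus -> focusanonymiser.py")
```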
For CUR2:
```bash
python python/cur2anonymiser.py --input rawcur2.parquet --create-config --config config_cur2.json
```

For Legacy CUR:

```bash
python python/curanonymiser_legacy.py --input rawcur_legacy.parquet --create-config --config config_legacy.json
```

For Focus:

```bash
python python/focusanonymiser.py --input rawdata.parquet --create-config --config config_focus.json
```

This produces a config listing all columns and their suggested actions. Edit it to choose which columns to keep, remove, anonymise, or hash.
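If your report has hundreds of columns, editing the generated config by hand gets tedious. Here is an optional, minimal sketch for bulk-editing it with plain Python; the filename matches the CUR2 example above, and the rule applied is purely illustrative:

```python
import json

CONFIG_FILE = "config_cur2.json"  # generated by --create-config above

with open(CONFIG_FILE) as f:
    config = json.load(f)

# Illustrative rule: anonymise anything that looks like a resource ID,
# and leave every other column with the action the generator suggested.
for column, action in config["columns"].items():
    if "resource_id" in column:
        config["columns"][column] = "awsarn_anonymise"

with open(CONFIG_FILE, "w") as f:
    json.dump(config, f, indent=2)
```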
Each column can be set to one of:
- `keep` – leave the column untouched
- `remove` – drop the column entirely
- `awsid_anonymise` – swap for a fake, consistent 12-digit AWS account ID
- `awsarn_anonymise` – swap for a fake ARN, using the fake account ID
- `hash` – scramble the column with DuckDB's `md5_number_upper`, so the same value always produces the same hash, but there is no way back; perfect for secrets, not for magicians
- `uuid` – replace the column value with a deterministic UUID (same input = same output, not reversible); see the sketch after this list
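To make the two deterministic actions concrete, here is a minimal sketch. It uses DuckDB's `md5_number_upper` for `hash` and Python's name-based `uuid5` to stand in for `uuid`; the scripts' exact UUID derivation may differ, so treat this as an illustration of the behaviour (same input, same output, no way back), not the implementation:

```python
import uuid

import duckdb

# Identical inputs must hash to identical values; different inputs should not.
rows = duckdb.sql("""
    SELECT v, md5_number_upper(v) AS hashed
    FROM (VALUES ('123456789012'), ('123456789012'), ('210987654321')) t(v)
""").fetchall()
assert rows[0][1] == rows[1][1]   # same value -> same hash
assert rows[0][1] != rows[2][1]   # different value -> different hash

# A deterministic, name-based UUID: repeatable but not reversible.
print(uuid.uuid5(uuid.NAMESPACE_URL, "123456789012"))
print(uuid.uuid5(uuid.NAMESPACE_URL, "123456789012"))  # identical to the line above
```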
For CUR2:
```bash
python python/cur2anonymiser.py --input rawcur2.parquet --output anonymisedcur2.parquet --config config_cur2.json
```

For Legacy CUR:

```bash
python python/curanonymiser_legacy.py --input rawcur_legacy.parquet --output anonymisedcur_legacy.parquet --config config_legacy.json
```

For Focus:

```bash
python python/focusanonymiser.py --input rawdata.parquet --output anonymised.parquet --config config_focus.json
```

Or, if you prefer CSV:

```bash
python python/cur2anonymiser.py --input rawcur2.parquet --output anonymisedcur2.csv --config config_cur2.json
python python/curanonymiser_legacy.py --input rawcur_legacy.parquet --output anonymisedcur_legacy.csv --config config_legacy.json
python python/focusanonymiser.py --input rawdata.parquet --output anonymised.csv --config config_focus.json
```

Voilà! Your anonymised file is ready for sharing, analysis, or waving triumphantly at your compliance officer.
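Before handing the file over, a quick optional spot check does no harm. Here is a sketch using DuckDB from Python; the file and column names follow the CUR2 examples above, so adjust them to your own output and config:

```python
import duckdb

# Peek at a few rows of the anonymised output.
duckdb.sql("SELECT * FROM 'anonymisedcur2.parquet' LIMIT 5").show()

# Consistent anonymisation keeps the number of distinct account IDs stable,
# so group-bys and joins on the fake IDs behave like they did on the real ones.
print(duckdb.sql("""
    SELECT count(DISTINCT line_item_usage_account_id) AS distinct_accounts
    FROM 'anonymisedcur2.parquet'
""").fetchone())
```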
```json
{
  "_comment": "Column options: 'keep', 'remove', 'awsid_anonymise', 'awsarn_anonymise', 'hash', 'uuid'",
  "columns": {
    "line_item_usage_account_id": "awsid_anonymise",
    "bill_payer_account_id": "awsid_anonymise",
    "line_item_resource_id": "awsarn_anonymise",
    "product_instance_type": "remove",
    "product_region": "keep",
    "line_item_usage_type": "keep",
    "resource_tags": "hash",
    "column6": "uuid"
  }
}
```

Note: In CUR2, the `resource_tags` column contains all resource tags as a single JSON object. By default, the anonymiser hashes this column to protect tag values while preserving uniqueness for analysis.
```json
{
  "_comment": "Column options: 'keep', 'remove', 'hash', 'uuid'",
  "columns": {
    "BillingAccountId": "hash",
    "BillingAccountName": "hash",
    "SubAccountId": "hash",
    "SubAccountIdName": "hash",
    "InvoiceId": "hash",
    "tag": "hash",
    "ResourceId": "keep",
    "ServiceName": "keep"
  }
}
```

Note: In Focus format, columns like `BillingAccountId`, `BillingAccountName`, `SubAccountId`, `SubAccountIdName`, `InvoiceId`, and `tag` are hashed by default. All other columns are kept unless you change their action in the config.
See your Parquet columns:

```bash
duckdb -c "SELECT * FROM 'yourfile.parquet' LIMIT 0;"
```

Export the first 100 rows as CSV:

```bash
duckdb -c "COPY (SELECT * FROM 'yourfile.parquet' LIMIT 100) TO STDOUT (HEADER, DELIMITER ',');"
```

- The script reads your input Parquet file and applies the actions specified in the config.
- Account IDs are replaced with consistent, fake 12-digit numbers.
- ARNs are rebuilt using the fake account IDs, so relationships are preserved.
- Columns set to `hash` are hashed with DuckDB's `md5_number_upper` function: irreversible, but consistent (not cryptographically secure).
- Columns set to `remove` vanish without a trace.
- Columns set to `keep` are left alone, as nature intended.
- Output can be Parquet or CSV, depending on your mood.
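To illustrate why that consistency matters, here is a tiny conceptual sketch of the account-ID and ARN idea; it is not the scripts' actual implementation, just the principle that each real account ID maps to one fake 12-digit ID, and the same fake ID is reused inside every ARN that mentions it:

```python
import random

fake_ids: dict[str, str] = {}

def fake_account_id(real_id: str) -> str:
    """Return the same fake 12-digit ID every time the same real ID appears."""
    if real_id not in fake_ids:
        fake_ids[real_id] = "".join(random.choices("0123456789", k=12))
    return fake_ids[real_id]

def anonymise_arn(arn: str) -> str:
    """Rebuild an ARN with the fake account ID in its account field."""
    parts = arn.split(":")  # arn:partition:service:region:account-id:resource
    parts[4] = fake_account_id(parts[4])
    return ":".join(parts)

account = "123456789012"
arn = f"arn:aws:ec2:eu-west-1:{account}:instance/i-0abcd1234"
# The fake ID embedded in the ARN matches the one used for the account column.
assert fake_account_id(account) in anonymise_arn(arn)
```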
Flags:
- `--input` Path to the input Parquet file (required)
- `--output` Path to the output file (required, unless using `--create-config`)
- `--config` Path to the JSON config file (required, unless using `--create-config`)
- `--create-config` Generate a config file from the input Parquet file and exit
- `--help` Show help and exit
- Unit tests:
  - Run all tests: `pytest tests/`
- Integration tests:
- Provide a small sample CUR2 and legacy CUR file in your repo (or generate one) and run the full anonymisation flow.
- Config validation:
- (Planned) Add a command to validate your config file against your input file, warning about missing or misconfigured columns.
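As an example of the kind of integration check described above, here is a minimal pytest sketch: it runs the CUR2 anonymiser on a small fixture and asserts that the config's actions were applied. The fixture and config paths are placeholders, not files that already exist in the repo:

```python
import json
import subprocess

import duckdb

def test_anonymiser_applies_config(tmp_path):
    output = tmp_path / "anonymised.parquet"
    subprocess.run(
        [
            "python", "python/cur2anonymiser.py",
            "--input", "tests/fixtures/sample_cur2.parquet",  # placeholder fixture
            "--output", str(output),
            "--config", "tests/fixtures/config_cur2.json",    # placeholder config
        ],
        check=True,
    )

    with open("tests/fixtures/config_cur2.json") as f:
        config = json.load(f)
    columns = [row[0] for row in duckdb.sql(
        f"DESCRIBE SELECT * FROM '{output}'"
    ).fetchall()]

    # Removed columns must be gone; every other configured column must survive.
    for column, action in config["columns"].items():
        if action == "remove":
            assert column not in columns
        else:
            assert column in columns
```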
- Install dependencies: `pip install -r requirements.txt`
- (Planned) PyPI packaging for easy install and CLI entry points.
- The hashing function (`md5_number_upper`) is for anonymisation, not for cryptographic security. Do not use it for secrets that require strong protection.
MIT. Because life’s too short for restrictive licences.
Pull requests, bug reports, and witty comments are welcome. If you spot a bug, fix it, or at least laugh at it in the issues section.
Crafted by Frank Contrepois, with a little help from AI, caffeine, and the occasional existential crisis.