csvsmith is a small collection of CSV utilities.
Current focus:
- Duplicate value counting (`count_duplicates_sorted`)
- Row-level digest creation (`add_row_digest`; see the sketch after this list)
- Duplicate-row detection (`find_duplicate_rows`)
- Deduplication with full duplicate report (`dedupe_with_report`)
- Command-line interface (CLI) for quick operations
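Of these, `add_row_digest` is the only function without an example later in this README. The call below is a rough sketch of assumed usage; the signature and the shape of the result are assumptions, not documented API:

```python
import pandas as pd

from csvsmith import add_row_digest

df = pd.read_csv("input.csv")

# Assumption: add_row_digest returns a copy of the DataFrame with a
# per-row digest added; the exact signature and column name may differ.
df_with_digest = add_row_digest(df)
print(df_with_digest.head())
```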
From PyPI (future):
```bash
pip install csvsmith
```

For local development:

```bash
git clone https://github.com/yeiichi/csvsmith.git
cd csvsmith
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
```

Count duplicate values in a list:

```python
from csvsmith import count_duplicates_sorted

items = ["a", "b", "a", "c", "a", "b"]
print(count_duplicates_sorted(items))
# [('a', 3), ('b', 2)]
```
Find duplicate rows in a CSV:

```python
import pandas as pd

from csvsmith import find_duplicate_rows

df = pd.read_csv("input.csv")
dup_rows = find_duplicate_rows(df)
print(dup_rows)
```
Deduplicate a CSV and get a report of the duplicate rows:

```python
import pandas as pd

from csvsmith import dedupe_with_report

df = pd.read_csv("input.csv")

# Use all columns
deduped, report = dedupe_with_report(df)
deduped.to_csv("deduped.csv", index=False)
report.to_csv("duplicate_report.csv", index=False)

# Use all columns except an ID column
deduped_no_id, report_no_id = dedupe_with_report(df, exclude=["id"])
```

csvsmith includes a small command-line interface for duplicate detection and CSV deduplication.
Detect duplicate rows in a CSV:

```bash
csvsmith row-duplicates input.csv
```

Save only duplicate rows to a file:
```bash
csvsmith row-duplicates input.csv -o duplicates_only.csv
```

Use only a subset of columns to determine duplicates:
```bash
csvsmith row-duplicates input.csv --subset col1 col2 -o dup_rows_subset.csv
```

Exclude ID column(s) when looking for duplicates:
```bash
csvsmith row-duplicates input.csv --exclude id -o dup_rows_no_id.csv
```

Deduplicate using all columns:

```bash
csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv
```

Deduplicate using a subset of columns:

```bash
csvsmith dedupe input.csv --subset col1 col2 --deduped deduped_subset.csv --report duplicate_report_subset.csv
```

Deduplicate on a single column, with `--keep False`:

```bash
csvsmith dedupe input.csv --subset col1 --keep False --deduped deduped_no_dups.csv --report duplicate_report_col1.csv
```

Exclude “id” from duplicate logic:
```bash
csvsmith dedupe input.csv --exclude id --deduped deduped_no_id.csv --report duplicate_report_no_id.csv
```

csvsmith follows a few guiding principles:

- CSVs deserve tools that are simple, predictable, and transparent.
- A row has meaning only when its identity is stable and hashable.
- Collisions are sin; determinism is virtue.
- Let no delimiter sow ambiguity among fields.
- Love thy `\x1f`. The unseen separator, the quiet guardian of clean hashes. Chosen not for aesthetics, but for truth. (See the sketch after this list.)
- The pipeline should be silent unless something is wrong.
- Your data deserves respect — and your tools should help you give it.
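To make the `\x1f` principle concrete, here is a small standalone sketch. It is not csvsmith's implementation, just an illustration with pandas and hashlib of why joining fields with the unit separator avoids the collisions a comma join can produce:

```python
import hashlib

import pandas as pd


def row_digest(row: pd.Series) -> str:
    """Hash one row by joining its fields with the unit separator (\\x1f)."""
    joined = "\x1f".join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two different rows whose fields happen to concatenate identically with ",".
df = pd.DataFrame({"a": ["a,b", "a"], "b": ["c", "b,c"]})

print(df.astype(str).agg(",".join, axis=1))  # both rows flatten to "a,b,c" -> ambiguous
print(df.apply(row_digest, axis=1))          # distinct digests, row identity preserved
```

The same idea presumably underlies `add_row_digest`, though the actual implementation may differ.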
For more, see MANIFESTO.md.
MIT License.