
#disk #deduplication #duplicates

bin+lib bigfiles

Find what's eating your disk — a parallel directory scanner with category breakdown, stale-file flagging, and hardlink-aware duplicate detection

7 stable releases

new 1.3.0 May 15, 2026
1.1.2 May 11, 2026
1.0.2 May 11, 2026

#1443 in Filesystem

AGPL-3.0-or-later

115KB
3K SLoC

bigfiles


A small Rust CLI that walks a directory in parallel, groups files by type, flags stale ones, finds duplicates (hardlink-aware), and renders a color-coded summary in the terminal. Cross-platform: Linux, macOS, Windows.

What it does

  • Interactive TUI (bigfiles tui) — ncdu-style directory browser with arrow-key navigation, / filter, and o to reveal in OS file manager
  • Quick audit (bigfiles audit) — severity-coded "what's eating your disk" insights in one screen
  • Walks a directory tree in parallel and collects file sizes, extensions, and modified timestamps
  • Respects .gitignore and .ignore files by default (use --no-ignore to disable)
  • Skips symlinks (no double-counting, no follow-link footguns)
  • Groups files into categories: video, images, archives, audio, documents, code, junk, other
  • Flags files not modified in the last N years as stale
  • Renders a color-coded table with size bars, optionally with the largest files per category
  • Sortable category table (--sort size|count|stale-size|stale-count|name, --reverse)
  • Finds duplicate files by content hash with parallel BLAKE3 hashing, hardlink awareness, and a persistent on-disk cache so re-scans are near-instant
  • Interactively deletes stale files or duplicate copies with explicit confirmation
  • Emits JSON for piping into other tools
  • Colorized --help output via clap styles
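
The extension-to-category grouping listed above could look roughly like the sketch below. The category names come from the list; the specific extensions in each arm are illustrative assumptions, not the mapping bigfiles actually ships.

```rust
use std::path::Path;

/// The eight categories named in the feature list.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Category {
    Video,
    Images,
    Archives,
    Audio,
    Documents,
    Code,
    Junk,
    Other,
}

/// Map a path to a category by its (case-insensitive) final extension.
/// The extension lists are hypothetical examples chosen for illustration.
pub fn classify(path: &Path) -> Category {
    let ext = match path.extension().and_then(|e| e.to_str()) {
        Some(e) => e.to_ascii_lowercase(),
        None => return Category::Other, // no extension at all
    };
    match ext.as_str() {
        "mp4" | "mkv" | "mov" => Category::Video,
        "jpg" | "jpeg" | "png" | "gif" => Category::Images,
        "zip" | "tar" | "gz" | "7z" => Category::Archives,
        "mp3" | "flac" | "wav" => Category::Audio,
        "pdf" | "docx" | "txt" => Category::Documents,
        "rs" | "py" | "js" | "c" => Category::Code,
        "dmg" | "pkg" | "msi" => Category::Junk,
        _ => Category::Other,
    }
}
```

Only the last extension is inspected, so a name like archive.tar.gz is classified by gz.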

Install

crates.io (requires Rust via rustup)

cargo install bigfiles

To upgrade:

cargo install bigfiles --force

Pre-built binaries

Download from the releases page for Linux (x86_64, aarch64), macOS (Intel, Apple Silicon), and Windows (x86_64). Extract and move bigfiles (or bigfiles.exe) onto your $PATH.

From source:

git clone https://github.com/Par-python/bigfiles
cd bigfiles
cargo install --path .

Usage

# Scan current directory
bigfiles

# Scan a specific path
bigfiles ~/Downloads

# Skip hidden files and dirs, only descend 3 levels
bigfiles ~ --skip-hidden --depth 3

# Show the 5 largest files per category alongside the summary
bigfiles ~/Downloads --top 5

# Exclude paths via glob (repeatable)
bigfiles ~ --exclude 'node_modules' --exclude '*.log' --exclude 'target'

# Don't respect .gitignore / .ignore
bigfiles ~/some-project --no-ignore

# Treat anything not modified in 5+ years as stale (default: 2)
bigfiles ~/Documents --stale-years 5

# Pipe JSON into jq (envelope: { version, root, total_size, skipped, categories })
bigfiles ~/Movies --json | jq '.categories[] | select(.stale_size > 1000000000)'

# Sort the breakdown by file count instead of size; reverse for smallest-first
bigfiles ~ --sort count
bigfiles ~ --sort size --reverse

# Quick "what's eating my disk" insights view
bigfiles audit ~

.gitignore awareness

By default bigfiles uses the same ignore crate that ripgrep uses, so .gitignore, .ignore, and global git excludes are respected automatically. Scanning a Rust project? target/ is skipped. Node project? node_modules is skipped. No flag needed.

Use --no-ignore to walk everything regardless.

Interactive TUI

bigfiles tui <path> opens a full-screen ncdu-style directory browser. Sizes are aggregated per directory; the largest entries float to the top.

bigfiles tui ~

Keys: ↑/↓ (or j/k) move • Enter/→ descend into directory • ←/Backspace go up • / filter children by substring (Esc cancels, Enter keeps) • o reveal selected entry in your OS file manager (open -R on macOS, explorer /select, on Windows, xdg-open on Linux) • q/Esc quit • ? toggle help.

Quick audit

bigfiles audit <path> runs a normal scan, then prints a short list of severity-coded insights about what's eating your disk — heaviest category, top extensions, installer-junk total (.dmg/.pkg/.iso/.exe/.msi/.deb/.rpm), top-N-file concentration, and the share of stale data. Useful as a first-run "where do I start?" view.

bigfiles audit ~

Insights are bulleted by severity: red ! for heavy (≥40% of total), yellow for notable (≥20%), dimmed · for informational. Respects --stale-years and all global filters (--skip-hidden, --exclude, --depth, etc.).
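
As a minimal sketch of the thresholds described above (heavy at 40% or more of the total, notable at 20% or more, informational otherwise), assuming the share is computed from byte counts. The type and function names are hypothetical, and the real audit logic may differ.

```rust
/// Severity levels matching the three bullet styles described above.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Severity {
    Heavy,   // >= 40% of total
    Notable, // >= 20% of total
    Info,    // everything else
}

/// Classify a part's share of the total using integer math,
/// so the 40% and 20% boundaries are exact.
pub fn severity(part_bytes: u64, total_bytes: u64) -> Severity {
    if total_bytes == 0 {
        return Severity::Info; // nothing scanned, nothing to flag
    }
    let part = part_bytes as u128 * 100;
    let total = total_bytes as u128;
    if part >= total * 40 {
        Severity::Heavy
    } else if part >= total * 20 {
        Severity::Notable
    } else {
        Severity::Info
    }
}
```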

Find duplicate files

bigfiles dupes finds files with identical content. It uses a fast three-stage check, parallelized with rayon:

  1. Group by size
  2. Hash first/last 4 KB (partial_hash)
  3. Full BLAKE3 hash on remaining candidates
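
The three stages above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: it runs sequentially rather than with rayon, and it substitutes std's DefaultHasher for BLAKE3 so that it compiles without dependencies. A real tool should keep a cryptographic hash for the final stage. The function names other than partial_hash are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

const EDGE: u64 = 4096; // 4 KB window at each end of the file

/// Stage 2: hash the first and last 4 KB of a file.
fn partial_hash(path: &Path, len: u64) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hasher = DefaultHasher::new(); // stand-in for BLAKE3 (assumption)
    let mut buf = Vec::new();
    file.by_ref().take(EDGE).read_to_end(&mut buf)?;
    hasher.write(&buf);
    if len > EDGE {
        // Start the tail after the head so short files are not hashed twice.
        let tail_start = len.saturating_sub(EDGE).max(EDGE);
        file.seek(SeekFrom::Start(tail_start))?;
        buf.clear();
        file.read_to_end(&mut buf)?;
        hasher.write(&buf);
    }
    Ok(hasher.finish())
}

/// Stage 3: hash the whole file in 64 KB chunks.
fn full_hash(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

/// Split every group by `key`, keeping only sub-groups with 2+ members.
/// Paths whose key cannot be computed (e.g. unreadable files) are dropped.
fn regroup<K, F>(groups: Vec<Vec<PathBuf>>, mut key: F) -> Vec<Vec<PathBuf>>
where
    K: Hash + Eq,
    F: FnMut(&Path) -> Option<K>,
{
    let mut out = Vec::new();
    for group in groups {
        let mut buckets: HashMap<K, Vec<PathBuf>> = HashMap::new();
        for path in group {
            if let Some(k) = key(path.as_path()) {
                buckets.entry(k).or_default().push(path);
            }
        }
        out.extend(buckets.into_values().filter(|g| g.len() > 1));
    }
    out
}

/// Run size -> partial hash -> full hash, narrowing candidates each time.
pub fn find_duplicates(paths: Vec<PathBuf>) -> Vec<Vec<PathBuf>> {
    let by_size = regroup(vec![paths], |p| std::fs::metadata(p).ok().map(|m| m.len()));
    let by_partial = regroup(by_size, |p| {
        let len = std::fs::metadata(p).ok()?.len();
        partial_hash(p, len).ok()
    });
    regroup(by_partial, |p| full_hash(p).ok())
}
```

Each stage only sees files that survived the previous one, so the expensive full read is reserved for files that already match in size and at both ends.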

Hardlinks are collapsed by inode before hashing, so multiple paths pointing to the same on-disk file are reported as a single entry (and don't inflate "reclaimable" numbers). When a duplicate group includes hardlinks, the additional paths are shown indented under the primary path.
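
A rough sketch of the hardlink collapse described above, assuming a file's identity is its (device, inode) pair as exposed by std's Unix MetadataExt. The Entry type and function names are hypothetical. On non-Unix targets the identity lookup returns None, so every path stays a separate entry.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Identity of an on-disk file: (device id, inode number).
type FileId = (u64, u64);

/// One physical file plus any extra paths that hardlink to it.
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub primary: PathBuf,
    pub hardlinks: Vec<PathBuf>,
}

/// Read the (dev, ino) pair for a path on Unix.
#[cfg(unix)]
pub fn file_id(path: &Path) -> Option<FileId> {
    use std::os::unix::fs::MetadataExt;
    let meta = std::fs::symlink_metadata(path).ok()?;
    Some((meta.dev(), meta.ino()))
}

/// No stable equivalent is assumed elsewhere, so report "unknown".
#[cfg(not(unix))]
pub fn file_id(_path: &Path) -> Option<FileId> {
    None
}

/// Collapse paths that share a FileId into one entry, preserving input order.
/// Paths with an unknown id are never merged.
pub fn collapse_hardlinks(files: Vec<(Option<FileId>, PathBuf)>) -> Vec<Entry> {
    let mut index: HashMap<FileId, usize> = HashMap::new();
    let mut out: Vec<Entry> = Vec::new();
    for (id, path) in files {
        let existing = id.and_then(|id| index.get(&id).copied());
        match existing {
            // Same on-disk file seen before: record as an extra path.
            Some(i) => out[i].hardlinks.push(path),
            None => {
                if let Some(id) = id {
                    index.insert(id, out.len());
                }
                out.push(Entry { primary: path, hardlinks: Vec::new() });
            }
        }
    }
    out
}
```

Only the primary path of each entry would then be hashed, which is why hardlinked copies do not count toward reclaimable space.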

# Find dupes >= 1 MB in Downloads
bigfiles dupes ~/Downloads --min-size 1048576

# Default min-size is 1 KB (1024 bytes); lower it to include tiny files
bigfiles dupes ~/Documents --min-size 1

Delete duplicate copies (interactive)

bigfiles dupes --delete walks each duplicate group and lets you pick which copy to keep; the rest are queued for deletion. After all groups are processed, you get a red summary and a y/N confirm before any file is touched.

bigfiles dupes ~/Downloads --delete

Safety guarantees:

  • Per-group single-choice picker: you choose the one copy to keep, and only the unpicked copies are queued for deletion
  • Every group offers a "skip — keep all" option; Esc also skips
  • Always keeps ≥1 copy per group (it's structurally impossible to empty a group)
  • No deletion happens until the final y/N confirm; default is No
  • Files are re-stat'd immediately before removal; non-regular files (symlinks, sockets, devices) are refused
  • Files are removed permanently — they do not go to Trash

Note that dupes are only ever paired within the scan root. If two copies live in separate trees, scan a common parent.

Persistent hash cache

bigfiles dupes caches full-file BLAKE3 hashes to your OS cache directory so subsequent runs over the same tree are near-instant.

  • Location: ~/Library/Caches/bigfiles/hashes.json (macOS), $XDG_CACHE_HOME/bigfiles/hashes.json (Linux), %LOCALAPPDATA%\bigfiles\Cache\hashes.json (Windows)
  • Cache key: (path, mtime, size) — any change invalidates the entry, forcing a re-hash. Tagged with the hash algorithm so the cache survives version bumps.
  • Pruning: entries for paths that no longer exist are dropped on the next save.
  • --no-cache runs without touching the cache (no read, no write).
  • --clear-cache deletes the cache file before running.
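
The (path, mtime, size) invalidation rule above can be illustrated with an in-memory map. This is a sketch only: the struct layout, field names, and the "blake3" tag string are assumptions, and the JSON persistence is left out.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Algorithm tag stored alongside the entries (assumed value).
const ALGO: &str = "blake3";

/// What is remembered for one path.
struct CacheEntry {
    mtime_secs: u64,
    size: u64,
    hash: String,
}

/// In-memory model of the hash cache.
pub struct HashCache {
    algo: String,
    entries: HashMap<PathBuf, CacheEntry>,
}

impl HashCache {
    pub fn new() -> Self {
        HashCache { algo: ALGO.to_string(), entries: HashMap::new() }
    }

    /// Return the cached hash only if mtime and size both still match
    /// and the cache was written for the same algorithm.
    pub fn get(&self, path: &Path, mtime_secs: u64, size: u64) -> Option<&str> {
        if self.algo != ALGO {
            return None;
        }
        let e = self.entries.get(path)?;
        (e.mtime_secs == mtime_secs && e.size == size).then(|| e.hash.as_str())
    }

    /// Store or overwrite the hash for a path.
    pub fn insert(&mut self, path: PathBuf, mtime_secs: u64, size: u64, hash: String) {
        self.entries.insert(path, CacheEntry { mtime_secs, size, hash });
    }

    /// Drop entries whose path no longer exists, as judged by `exists`.
    pub fn prune(&mut self, exists: impl Fn(&Path) -> bool) {
        self.entries.retain(|p, _| exists(p.as_path()));
    }
}
```

Passing the existence check as a closure keeps the sketch testable; a real caller would likely pass something like `|p| p.exists()`.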

In local testing, the warm cache is roughly 40× faster than a cold first run on the same tree.

Delete stale files (interactive)

bigfiles delete shows you every file older than --stale-years (default 2) in an interactive checklist. You tick which ones to remove, see a confirmation summary, and only then are files deleted. Files are removed permanently — they do not go to Trash.

bigfiles delete ~/Downloads --stale-years 3

The flow: list → tick boxes (Space) → Enter → review summary → type y to confirm. Hit Ctrl-C any time to bail.

Flags (global)

Flag | Default | Description
<PATH> | . | Directory to scan
-s, --stale-years <N> | 2 | Flag files not modified in this many years as stale
-H, --skip-hidden | off | Skip dotfiles and dot-directories
-d, --depth <N> | unlimited | Limit traversal depth (1 = only files directly in root)
--no-ignore | off | Do not respect .gitignore / .ignore files
--no-pager | off | Don't auto-page output through $PAGER
-e, --exclude <GLOB> | none | Skip files/dirs matching this glob; repeatable
--units <STYLE> | default | Byte unit style: default (1024, KB/MB), iec (1024, KiB/MiB), si (1000, KB/MB)
--color <WHEN> | auto | Color output: auto, always, never. Also respects NO_COLOR.
-t, --top <N> | off | Show N largest files per category (default scan only)
-j, --json | off | Emit raw JSON (default scan only)
--sort <KEY> | size | Sort categories by size, count, stale-size, stale-count, or name (default scan only)
--reverse | off | Reverse the sort order (default scan only)
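
To make the three --units styles concrete, here is a small formatter sketch. The base and suffix pairs follow the flag description; the one-decimal rounding and the names are assumptions, and the real formatter in format.rs may round differently.

```rust
/// The three unit styles accepted by --units.
#[derive(Clone, Copy)]
pub enum Units {
    Default, // base 1024, KB/MB labels
    Iec,     // base 1024, KiB/MiB labels
    Si,      // base 1000, KB/MB labels
}

/// Format a byte count with one decimal place (rounding is an assumption).
pub fn format_size(bytes: u64, units: Units) -> String {
    let (base, suffixes): (f64, [&str; 5]) = match units {
        Units::Default => (1024.0, ["B", "KB", "MB", "GB", "TB"]),
        Units::Iec => (1024.0, ["B", "KiB", "MiB", "GiB", "TiB"]),
        Units::Si => (1000.0, ["B", "KB", "MB", "GB", "TB"]),
    };
    let mut value = bytes as f64;
    let mut idx = 0;
    // Divide down until the value fits the current unit or units run out.
    while value >= base && idx < suffixes.len() - 1 {
        value /= base;
        idx += 1;
    }
    if idx == 0 {
        format!("{} B", bytes) // whole bytes need no decimals
    } else {
        format!("{:.1} {}", value, suffixes[idx])
    }
}
```

The same byte count therefore prints differently per style: one million bytes is 976.6 KB under default, 976.6 KiB under iec, and 1.0 MB under si.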

Flags (dupes subcommand)

Flag | Default | Description
--min-size <BYTES> | 1024 | Minimum file size to consider
--delete | off | Interactively delete duplicate copies (keep one per group)
--no-cache | off | Skip the persistent hash cache for this run
--clear-cache | off | Delete the persistent hash cache before running

Pager

When stdout is a real terminal, bigfiles auto-pages output through $PAGER (default less -FRX), the same UX as git log. Short output passes through instantly thanks to -F; long output (e.g. bigfiles ~ --top 20) opens in a scrollable view. Use the arrow keys to scroll, / to search, and q to quit.

The pager is automatically skipped when:

  • output is piped (bigfiles ... | jq works as expected)
  • --json is set
  • the delete subcommand is running (interactive)
  • --no-pager is passed

Example output

  bigfiles 8.18 GB  /Users/you/Downloads

  category           size                            files    stale
  ────────────────────────────────────────────────────────────────────────
  video           3.30 GB  ██████████                   45
  archives        2.81 GB  ████████                     44
  documents       1.23 GB  ███                         362
  audio         410.3 MB                                29
  images        326.9 MB                               300    91.9 MB (12 files)
  other         115.5 MB                               358    2.5 MB (302 files)
  code          721.3 KB                                25    26.0 KB (14 files)

How "stale" is detected

bigfiles uses the file's modified time (mtime), not access time. Many systems limit or disable access-time updates (Linux defaults to relatime and is often mounted with noatime; modern macOS volumes), so atime is unreliable for staleness. mtime is updated whenever a file's contents change, which is a better signal for "this file is forgotten."
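
A minimal sketch of an mtime-based staleness check, assuming a year is counted as 365 days and that a file with an mtime in the future is treated as not stale. Both are assumptions, and the function names are hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// Seconds in one year, assuming 365-day years (an assumption of this sketch).
pub const SECS_PER_YEAR: u64 = 365 * 24 * 60 * 60;

/// A file is stale when its mtime is at least `stale_years` before `now`.
pub fn is_stale(mtime: SystemTime, now: SystemTime, stale_years: u64) -> bool {
    match now.duration_since(mtime) {
        Ok(age) => age >= Duration::from_secs(stale_years * SECS_PER_YEAR),
        Err(_) => false, // mtime is in the future: treat as fresh
    }
}

/// Convenience wrapper that reads mtime from the filesystem.
pub fn path_is_stale(path: &std::path::Path, stale_years: u64) -> std::io::Result<bool> {
    let mtime = std::fs::metadata(path)?.modified()?;
    Ok(is_stale(mtime, SystemTime::now(), stale_years))
}
```

Taking `now` as a parameter keeps the comparison deterministic, which makes the threshold easy to test.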

Project layout

src/
  main.rs        # CLI entry, subcommand dispatch, clap styles
  walker.rs      # Parallel directory traversal, file collection, inode capture
  classifier.rs  # Extension → category mapping
  analyzer.rs    # Grouping, sorting, stale detection
  renderer.rs    # Default scan output
  dupes.rs       # Duplicate detection (parallel, hardlink-aware) + interactive delete
  delete.rs      # Interactive stale-file deletion
  format.rs      # Shared byte-size formatter

Platform notes

  • Linux & macOS: full feature set, including hardlink-aware dupe detection and pager auto-launch.
  • Windows: builds and runs cleanly via cargo build --release (CI covers windows-latest). Two caveats:
    • Pager is disabled on Windows (there's no portable less). Output prints straight to stdout — pipe to more or use Windows Terminal's scrollback. The --no-pager flag is a no-op there.
    • Hardlink detection is currently inactive on Windows because the inode/file-index API in std is nightly-only. Dupe detection still works, but hardlinks are treated as separate entries instead of being collapsed.

Performance notes

bigfiles uses a parallel directory walker (the ignore crate, same engine as ripgrep) which is fast and portable across macOS, Linux, and Windows. It is not the theoretical fastest approach on every platform:

  • Windows / NTFS: Reading the Master File Table (MFT) directly enumerates every file on a volume in one sequential pass. Tools like everything.exe use this. bigfiles does not, yet. See #1.
  • macOS / APFS: The volume catalog can be read in bulk via getattrlistbulk, which is faster than readdir on large trees. bigfiles does not exploit this. See #2.
  • Linux (ext4 / btrfs / xfs): No comparable shortcut. Standard parallel walking is close to optimal.

If you have benchmarks against fd, dust, fclones, or other scanners, open an issue. Honest numbers are welcome.

Benchmarks: ignore vs jwalk

A common suggestion is to swap the ignore-based walker for jwalk for higher throughput. The numbers don't support it for bigfiles' workload. Run with cargo bench --bench walker_bench.

Tree shape | ignore (gitignore on) | ignore (gitignore off) | jwalk
Shallow-wide (10k files in one dir) | 11.0 ms | 7.2 ms | 7.4 ms
Deep-narrow (50 levels × 1 file) | 1.73 ms | 1.02 ms | 1.43 ms
Realistic (src + ignored node_modules) | 0.34 ms | 9.7 ms | 4.6 ms

Takeaways:

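  • With gitignore parsing disabled, ignore and jwalk are effectively tied on the shallow-wide tree (7.2 ms vs 7.4 ms), and ignore wins the deep-narrow case (1.02 ms vs 1.43 ms). On the realistic tree, jwalk is about 2× faster than ignore with gitignore off (4.6 ms vs 9.7 ms).
  • The realistic workload is where gitignore-aware ignore pulls ahead: a .gitignore excluding node_modules lets the walker skip ~5,000 files without ever calling stat, so it finishes in 0.34 ms. That is roughly 28× faster than the same walker with gitignore off (9.7 ms) and roughly 13× faster than jwalk (4.6 ms). jwalk doesn't have gitignore support out of the box, so it walks everything.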
  • For bigfiles' actual users (dev machines with node_modules, target/, .venv/), the gitignore-aware skip wins by a margin no raw-throughput improvement could close.

Measured on macOS (Apple Silicon) with Criterion, 100 samples each. Numbers are wall-clock per iteration.

Caveats

  • Deletion is permanent for both delete and dupes --delete — nothing goes to Trash. The interactive flow exists precisely to keep that decision explicit; there is no --force or non-interactive delete mode by design.
  • Dupe pairing is relative to the scan root. If two copies live in separate trees (e.g. ~/A/file and ~/B/file), running bigfiles dupes ~/A won't find them. Scan a common parent.
  • --top and --json only apply to the default scan. They're accepted but ignored under dupes/delete (a stderr note is printed).
  • Symlinks are skipped entirely. If you rely on symlink farms for organization, walking through them isn't supported — point bigfiles at the real paths.

Future ideas

  • Per-directory breakdown ("top 10 heaviest subdirectories")
  • --watch mode that re-scans on an interval
  • A full TUI with ratatui (expand/collapse categories, arrow-key navigation)
  • Persistent index in ~/.cache/bigfiles/ to diff scans over time
  • Replace dupes with hardlinks (--link mode) instead of deleting

Stability

Starting with 1.0, the CLI surface and JSON schema follow semver:

  • CLI flags: removing a flag, changing its short form, or changing default behavior requires a major version bump. New flags are minor.
  • JSON output: the "version": 1 envelope is stable. Breaking changes ship as "version": 2. Adding new fields is minor.
  • Exit codes: 0 success, 1 runtime error, 2 usage error.
  • Internal Rust API: not stable. Use the binary, not the library crate.

License

AGPL-3.0-or-later — see LICENSE.

Dependencies

~17–33MB
~506K SLoC