bigfiles

A small Rust CLI that walks a directory in parallel, groups files by type, flags stale ones, finds duplicates (hardlink-aware), and renders a color-coded summary in the terminal. Cross-platform: Linux, macOS, Windows.

What it does

  • Interactive TUI (bigfiles tui) — ncdu-style directory browser with arrow-key navigation, / filter, o to reveal in OS file manager, d to send to Trash, D for dupes in the current subtree, r to re-scan
  • Quick audit (bigfiles audit) — severity-coded "what's eating your disk" insights in one screen
  • bigfiles top — flat list of the N largest files, no category grouping
  • Safe by default: delete and dupes --delete send to Trash by default; --force opts into permanent deletion
  • Walks a directory tree in parallel and collects file sizes, extensions, and modified timestamps
  • Respects .gitignore and .ignore files by default (use --no-ignore to disable)
  • Skips symlinks (no double-counting, no follow-link footguns)
  • Groups files into categories: video, images, archives, audio, documents, code, junk, other
  • Flags files not modified in the last N years as stale
  • Renders a color-coded table with size bars, optionally with the largest files per category
  • Sortable category table (--sort size|count|stale-size|stale-count|name, --reverse)
  • Finds duplicate files by content hash with parallel BLAKE3 hashing, hardlink awareness, and a persistent on-disk cache so re-scans are near-instant
  • Interactively deletes stale files or duplicate copies with explicit confirmation
  • Emits JSON for piping into other tools
  • Colorized --help output via clap styles
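The category grouping in the list above comes down to an extension lookup. A minimal sketch, assuming a simple lowercase match: the category names come from the list, but the extension sets and the `classify` function are illustrative assumptions, not the tool's real mapping.

```rust
use std::path::Path;

/// Categories named in the feature list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Video,
    Images,
    Archives,
    Audio,
    Documents,
    Code,
    Junk,
    Other,
}

/// Map a path to a category by its lowercased extension.
/// The extension lists here are assumptions chosen for illustration.
pub fn classify(path: &Path) -> Category {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("mp4" | "mkv" | "mov") => Category::Video,
        Some("png" | "jpg" | "jpeg" | "gif") => Category::Images,
        Some("zip" | "tar" | "gz" | "7z") => Category::Archives,
        Some("mp3" | "flac" | "wav") => Category::Audio,
        Some("pdf" | "docx" | "txt") => Category::Documents,
        Some("rs" | "py" | "js" | "c") => Category::Code,
        Some("tmp" | "bak") => Category::Junk,
        // No extension, or one not listed above.
        _ => Category::Other,
    }
}
```

Anything without a recognized extension falls through to `Other`, which matches the catch-all category in the list.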

Install

crates.io (requires Rust via rustup)

cargo install bigfiles

To upgrade:

cargo install bigfiles --force

Pre-built binaries

Download from the releases page for Linux (x86_64, aarch64), macOS (Intel, Apple Silicon), and Windows (x86_64). Extract and move bigfiles (or bigfiles.exe) onto your $PATH.

From source:

git clone https://github.com/Par-python/bigfiles
cd bigfiles
cargo install --path .

Usage

# Scan current directory
bigfiles

# Scan a specific path
bigfiles ~/Downloads

# Skip hidden files and dirs, only descend 3 levels
bigfiles ~ --skip-hidden --depth 3

# Show the 5 largest files per category alongside the summary
bigfiles ~/Downloads --top 5

# Exclude paths via glob (repeatable)
bigfiles ~ --exclude 'node_modules' --exclude '*.log' --exclude 'target'

# Don't respect .gitignore / .ignore
bigfiles ~/some-project --no-ignore

# Treat anything not modified in 5+ years as stale (default: 2)
bigfiles ~/Documents --stale-years 5

# Pipe JSON into jq (envelope: { version, root, total_size, skipped, categories })
bigfiles ~/Movies --json | jq '.categories[] | select(.stale_size > 1000000000)'

# Sort the breakdown by file count instead of size; reverse for smallest-first
bigfiles ~ --sort count
bigfiles ~ --sort size --reverse

# Quick "what's eating my disk" insights view
bigfiles audit ~

# Show just the 10 largest files anywhere under the path
bigfiles top ~/Downloads -n 10

# Send stale files to Trash (default), or permanently delete with --force
bigfiles delete ~/Downloads
bigfiles delete ~/Downloads --force

.gitignore awareness

By default bigfiles uses the same ignore crate that ripgrep uses, so .gitignore, .ignore, and global git excludes are respected automatically. Scanning a Rust project whose .gitignore lists target/? It is skipped. A Node project that ignores node_modules? Skipped too. No flag needed.

Use --no-ignore to walk everything regardless.

Interactive TUI

bigfiles tui <path> opens a full-screen ncdu-style directory browser. Sizes are aggregated per directory; the largest entries float to the top.

bigfiles tui ~

Keys:

  • ↑/↓ or j/k — move
  • Enter — descend into directory
  • Backspace — go up
  • / — filter children by substring (Esc cancels, Enter keeps)
  • o — reveal selected entry in your OS file manager (open -R on macOS, explorer /select, on Windows, xdg-open on Linux)
  • d — send the selected file or directory to Trash (yellow confirm bar appears; y/Enter confirms, any other key cancels). Trash-only in the TUI for safety. For permanent delete, use bigfiles delete --force or bigfiles dupes --delete --force from the CLI.
  • D — open a duplicate-detection popup scoped to the currently-highlighted directory. Uses the persistent hash cache, so repeat runs over the same subtree are near-instant. j/k or PgUp/PgDn to scroll, Esc/q to close.
  • r — re-run the scan from disk (the TUI exits briefly, shows the spinner, then re-enters with fresh data)
  • q/Esc — quit
  • ? — toggle help
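The browser view relies on sizes being aggregated per directory and sorted largest first. A rough sketch of that idea, assuming the scan yields `(path, size)` pairs; `aggregate_dir_sizes` and `sorted_children` are hypothetical names, not functions from the crate.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Add each file's size to every ancestor directory at or below `root`.
pub fn aggregate_dir_sizes(root: &Path, files: &[(PathBuf, u64)]) -> HashMap<PathBuf, u64> {
    let mut totals: HashMap<PathBuf, u64> = HashMap::new();
    for (path, size) in files {
        let mut dir = path.parent();
        while let Some(d) = dir {
            if !d.starts_with(root) {
                break; // climbed above the scan root
            }
            *totals.entry(d.to_path_buf()).or_insert(0) += size;
            dir = d.parent();
        }
    }
    totals
}

/// Direct child directories of `dir`, largest first (ties broken by path).
pub fn sorted_children(totals: &HashMap<PathBuf, u64>, dir: &Path) -> Vec<(PathBuf, u64)> {
    let mut kids: Vec<(PathBuf, u64)> = totals
        .iter()
        .filter(|(p, _)| p.parent() == Some(dir))
        .map(|(p, s)| (p.clone(), *s))
        .collect();
    kids.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    kids
}
```

This sketch lists only subdirectories; a real browser would also interleave plain files in the same sorted view.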

Quick audit

bigfiles audit <path> runs a normal scan, then prints a short list of severity-coded insights about what's eating your disk — heaviest category, top extensions, installer-junk total (.dmg/.pkg/.iso/.exe/.msi/.deb/.rpm), top-N-file concentration, and the share of stale data. Useful as a first-run "where do I start?" view.

bigfiles audit ~

Insights are bulleted by severity: red ! for heavy (≥40% of total), yellow for notable (≥20%), dimmed · for informational. Respects --stale-years and all global filters (--skip-hidden, --exclude, --depth, etc.).
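The severity thresholds above can be expressed as a small pure function. This is a sketch under the assumption that severity depends only on an item's share of the total; the `Severity` enum and `severity` function are illustrative names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Heavy,   // at least 40% of total
    Notable, // at least 20% of total
    Info,    // everything else
}

/// Classify `part` by its share of `total`, using integer math to avoid
/// floating-point edge cases exactly at the thresholds.
pub fn severity(part: u64, total: u64) -> Severity {
    if total == 0 {
        return Severity::Info;
    }
    let scaled = part as u128 * 100;
    let total = total as u128;
    if scaled >= total * 40 {
        Severity::Heavy
    } else if scaled >= total * 20 {
        Severity::Notable
    } else {
        Severity::Info
    }
}
```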

Top N largest files

bigfiles top <path> prints a flat list of the N largest files under the given path, sorted by size descending. No category grouping, no bars, no stale flags — just the biggest files. Pairs well with | head, | grep, or pipelines.

# Default: top 20
bigfiles top ~/Downloads

# Top 5
bigfiles top ~/Movies -n 5

# Pipe into other tools
bigfiles top ~ -n 100 | grep '\.mp4'

Respects all global filters (--skip-hidden, --exclude, --depth, --no-ignore).

Find duplicate files

bigfiles dupes finds files with identical content. It uses a fast three-stage check, parallelized with rayon:

  1. Group by size
  2. Hash first/last 4 KB (partial_hash)
  3. Full BLAKE3 hash on remaining candidates
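The three stages can be sketched on in-memory buffers. Loud assumptions: this uses the standard library's `DefaultHasher` as a stand-in for BLAKE3 so the example needs no external crates, it runs sequentially rather than with rayon, and `find_dupes`, `refine`, and the `File` alias are hypothetical names. Only the 4 KB window and the stage order come from the list above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

type File<'a> = (&'a str, &'a [u8]); // (name, contents)
const EDGE: usize = 4096; // 4 KB window for the partial hash

/// Stand-in digest; the real tool uses BLAKE3.
fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Hash only the first and last 4 KB (whole buffer if it is small).
fn partial_hash(data: &[u8]) -> u64 {
    if data.len() <= 2 * EDGE {
        return digest(data);
    }
    let mut h = DefaultHasher::new();
    data[..EDGE].hash(&mut h);
    data[data.len() - EDGE..].hash(&mut h);
    h.finish()
}

/// Split each group by `key`, dropping groups left with a single member.
fn refine<'a, K: Hash + Eq>(
    groups: Vec<Vec<File<'a>>>,
    key: impl Fn(&[u8]) -> K,
) -> Vec<Vec<File<'a>>> {
    let mut out = Vec::new();
    for group in groups {
        let mut buckets: HashMap<K, Vec<File<'a>>> = HashMap::new();
        for item in group {
            buckets.entry(key(item.1)).or_default().push(item);
        }
        out.extend(buckets.into_values().filter(|b| b.len() > 1));
    }
    out
}

/// Size, then partial hash, then full hash. Returns sorted groups of names.
pub fn find_dupes<'a>(files: &[File<'a>]) -> Vec<Vec<&'a str>> {
    let by_size = refine(vec![files.to_vec()], |d| d.len());
    let by_partial = refine(by_size, partial_hash);
    let by_full = refine(by_partial, digest);
    let mut result: Vec<Vec<&'a str>> = by_full
        .into_iter()
        .map(|g| {
            let mut names: Vec<&'a str> = g.into_iter().map(|(n, _)| n).collect();
            names.sort();
            names
        })
        .collect();
    result.sort();
    result
}
```

Each stage only looks at survivors of the previous one, so the expensive full hash runs on a small fraction of files. A 64-bit stand-in digest is not collision-resistant enough for real deletion decisions.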

Hardlinks are collapsed by inode before hashing, so multiple paths pointing to the same on-disk file are reported as a single entry (and don't inflate "reclaimable" numbers). When a duplicate group includes hardlinks, the additional paths are shown indented under the primary path.
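A minimal sketch of the hardlink collapse, assuming each scanned entry already carries a device and inode number (on Unix these are available from `std::os::unix::fs::MetadataExt::dev()` and `ino()`). The `Entry` and `Collapsed` types are hypothetical, and keying on the device as well as the inode is an assumption of this sketch, since inode numbers are only unique within one filesystem.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub path: String,
    pub dev: u64,
    pub ino: u64,
    pub size: u64,
}

/// One record per on-disk file: a primary path plus extra hardlink paths.
#[derive(Debug, PartialEq, Eq)]
pub struct Collapsed {
    pub primary: String,
    pub links: Vec<String>,
    pub size: u64,
}

/// Collapse entries that share (dev, ino) so each file is counted once.
pub fn collapse_hardlinks(entries: &[Entry]) -> Vec<Collapsed> {
    let mut by_id: BTreeMap<(u64, u64), Collapsed> = BTreeMap::new();
    for e in entries {
        by_id
            .entry((e.dev, e.ino))
            // Later paths to the same inode become indented "links".
            .and_modify(|c| c.links.push(e.path.clone()))
            .or_insert_with(|| Collapsed {
                primary: e.path.clone(),
                links: Vec::new(),
                size: e.size,
            });
    }
    by_id.into_values().collect()
}
```

Summing `size` over the collapsed records gives a total that does not double-count hardlinked paths.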

# Find dupes >= 1 MB in Downloads
bigfiles dupes ~/Downloads --min-size 1048576

# Default min-size is 1 KB; tune as needed
bigfiles dupes ~/Documents --min-size 1

Remove duplicate copies (interactive)

bigfiles dupes --delete walks each duplicate group and lets you pick which copy to keep; the rest are queued for removal. After all groups are processed, you get a summary and a y/N confirm before any file is touched.

By default, removed copies are moved to your OS Trash (recoverable). Pair with --force to delete permanently.

# Default: moves duplicate copies to Trash
bigfiles dupes ~/Downloads --delete

# Permanent removal (not recoverable)
bigfiles dupes ~/Downloads --delete --force

Safety guarantees:

  • Per-group single-choice picker: you choose the one copy to keep, and only the copies you did not pick are queued for removal
  • Every group offers a "skip — keep all" option; Esc also skips
  • Always keeps ≥1 copy per group (it's structurally impossible to empty a group)
  • No removal happens until the final y/N confirm; default is No
  • Files are re-stat'd immediately before removal; non-regular files (symlinks, sockets, devices) are refused
  • Trash by default: moved copies can be restored from your OS Trash unless you pass --force
  • If Trash is unavailable (e.g. on some network mounts), the operation refuses and points you at --force

Note that dupes are only ever paired within the scan root. If two copies live in separate trees, scan a common parent.

Persistent hash cache

bigfiles dupes caches full-file BLAKE3 hashes to your OS cache directory so subsequent runs over the same tree are near-instant.

  • Location: ~/Library/Caches/bigfiles/hashes.json (macOS), $XDG_CACHE_HOME/bigfiles/hashes.json (Linux), %LOCALAPPDATA%\bigfiles\Cache\hashes.json (Windows)
  • Cache key: (path, mtime, size) — any change invalidates the entry, forcing a re-hash. Tagged with the hash algorithm so the cache survives version bumps.
  • Pruning: entries for paths that no longer exist are dropped on the next save.
  • --no-cache runs without touching the cache (no read, no write).
  • --clear-cache deletes the cache file before running.
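The invalidation and pruning rules above can be sketched as an in-memory map. Assumptions: `HashCache`, `get_or_hash`, and `prune` are hypothetical names, mtime is stored as whole seconds, and serialization to hashes.json is left out.

```rust
use std::collections::HashMap;

struct CachedHash {
    mtime_secs: u64,
    size: u64,
    hash: String,
}

#[derive(Default)]
pub struct HashCache {
    entries: HashMap<String, CachedHash>,
}

impl HashCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return the cached hash only if path, mtime, and size all match;
    /// otherwise run `compute` and store the fresh result.
    pub fn get_or_hash(
        &mut self,
        path: &str,
        mtime_secs: u64,
        size: u64,
        compute: impl FnOnce() -> String,
    ) -> String {
        if let Some(c) = self.entries.get(path) {
            if c.mtime_secs == mtime_secs && c.size == size {
                return c.hash.clone();
            }
        }
        let hash = compute();
        let fresh = CachedHash { mtime_secs, size, hash: hash.clone() };
        self.entries.insert(path.to_string(), fresh);
        hash
    }

    /// Drop entries whose path no longer exists, per the `exists` check.
    pub fn prune(&mut self, exists: impl Fn(&str) -> bool) {
        self.entries.retain(|p, _| exists(p.as_str()));
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

In real use, `exists` could be `|p| std::path::Path::new(p).exists()`, and `compute` would hash the file contents.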

In local testing, the warm cache is roughly 40× faster than a cold first run on the same tree.

Remove stale files (interactive)

bigfiles delete shows you every file older than --stale-years (default 2) in an interactive checklist. You tick which ones to remove, see a confirmation summary, and only then are files touched.

By default, selected files are moved to your OS Trash (recoverable). Pair with --force to delete permanently.

# Default: moves stale files to Trash
bigfiles delete ~/Downloads --stale-years 3

# Permanent removal
bigfiles delete ~/Downloads --stale-years 3 --force

The flow: list → tick boxes (Space) → Enter → review summary → type y to confirm. Hit Ctrl-C any time to bail. If Trash is unavailable, the operation refuses and points you at --force.

Flags (global)

| Flag | Default | Description |
| --- | --- | --- |
| <PATH> | . | Directory to scan |
| -s, --stale-years <N> | 2 | Flag files not modified in this many years as stale |
| -H, --skip-hidden | off | Skip dotfiles and dot-directories |
| -d, --depth <N> | unlimited | Limit traversal depth (1 = only files directly in root) |
| --no-ignore | off | Do not respect .gitignore / .ignore files |
| --no-pager | off | Don't auto-page output through $PAGER |
| -e, --exclude <GLOB> | none | Skip files/dirs matching this glob; repeatable |
| --units <STYLE> | default | Byte unit style: default (1024, KB/MB), iec (1024, KiB/MiB), si (1000, KB/MB) |
| --color <WHEN> | auto | Color output: auto, always, never. Also respects NO_COLOR. |
| -t, --top <N> | off | Show N largest files per category (default scan only) |
| -j, --json | off | Emit raw JSON (default scan only) |
| --sort <KEY> | size | Sort categories by size, count, stale-size, stale-count, or name (default scan only) |
| --reverse | off | Reverse the sort order (default scan only) |
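The three --units styles differ only in divisor and label set. A sketch of a formatter with that behavior: the `Units` enum and `format_size` are hypothetical names, and the fixed one-decimal precision is an assumption (the example output in this README varies precision by unit).

```rust
#[derive(Debug, Clone, Copy)]
pub enum Units {
    Default, // divide by 1024, label KB/MB
    Iec,     // divide by 1024, label KiB/MiB
    Si,      // divide by 1000, label KB/MB
}

/// Format a byte count in the given unit style with one decimal place.
pub fn format_size(bytes: u64, units: Units) -> String {
    let (base, labels): (f64, [&str; 4]) = match units {
        Units::Default => (1024.0, ["KB", "MB", "GB", "TB"]),
        Units::Iec => (1024.0, ["KiB", "MiB", "GiB", "TiB"]),
        Units::Si => (1000.0, ["KB", "MB", "GB", "TB"]),
    };
    let mut value = bytes as f64;
    if value < base {
        return format!("{bytes} B");
    }
    value /= base;
    let mut idx = 0;
    // Step up through the labels until the value fits or labels run out.
    while value >= base && idx < labels.len() - 1 {
        value /= base;
        idx += 1;
    }
    format!("{value:.1} {}", labels[idx])
}
```

Note that default and si share labels but not divisors, so the same byte count can print differently under each.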

Flags (dupes subcommand)

| Flag | Default | Description |
| --- | --- | --- |
| --min-size <BYTES> | 1024 | Minimum file size to consider |
| --delete | off | Interactively remove duplicate copies (keep one per group). Moves to Trash by default. |
| --force | off | When paired with --delete, permanently deletes instead of moving to Trash. Cannot be undone. |
| --no-cache | off | Skip the persistent hash cache for this run |
| --clear-cache | off | Delete the persistent hash cache before running |

Flags (delete subcommand)

| Flag | Default | Description |
| --- | --- | --- |
| --force | off | Permanently delete selected files instead of moving them to Trash. Cannot be undone. |

Flags (top subcommand)

| Flag | Default | Description |
| --- | --- | --- |
| -n, --n <N> | 20 | Number of largest files to show |

Pager

When stdout is a real terminal, bigfiles auto-pages output through $PAGER (default less -FRX), the same UX as git log. Short output passes through instantly thanks to -F; long output (e.g. bigfiles ~ --top 20) opens in a scrollable view. Use the arrow keys to scroll, / to search, and q to quit.

The pager is automatically skipped when:

  • output is piped (bigfiles ... | jq works as expected)
  • --json is set
  • the delete subcommand is running (interactive)
  • --no-pager is passed
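The skip rules above reduce to a single boolean decision. A sketch, assuming the inputs have already been gathered from the CLI and the environment; `PagerInputs`, `should_page`, and `pager_command` are hypothetical names. The terminal check uses the standard library's `IsTerminal` trait.

```rust
use std::io::IsTerminal;

pub struct PagerInputs {
    pub stdout_is_tty: bool,      // false when output is piped
    pub json: bool,               // --json
    pub interactive_delete: bool, // delete subcommand running
    pub no_pager: bool,           // --no-pager
}

/// Page only when stdout is a terminal and no skip condition applies.
pub fn should_page(i: &PagerInputs) -> bool {
    i.stdout_is_tty && !i.json && !i.interactive_delete && !i.no_pager
}

/// Use $PAGER if set and non-empty, otherwise the documented default.
pub fn pager_command(pager_env: Option<&str>) -> String {
    match pager_env {
        Some(p) if !p.trim().is_empty() => p.to_string(),
        _ => "less -FRX".to_string(),
    }
}

/// How the tty input could be obtained at runtime.
pub fn stdout_is_tty() -> bool {
    std::io::stdout().is_terminal()
}
```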

Example output

  bigfiles 8.18 GB  /Users/you/Downloads

  category           size                            files    stale
  ────────────────────────────────────────────────────────────────────────
  video           3.30 GB  ██████████                   45
  archives        2.81 GB  ████████                     44
  documents       1.23 GB  ███                         362
  audio         410.3 MB   █                            29
  images        326.9 MB                               300    ⚠ 91.9 MB (12 files)
  other         115.5 MB                               358    ⚠ 2.5 MB (302 files)
  code          721.3 KB                                25    ⚠ 26.0 KB (14 files)

How "stale" is detected

bigfiles uses the file's modified time (mtime), not access time. Many systems limit or disable access-time updates (Linux mounts default to relatime and are often set to noatime; modern macOS volumes behave similarly), so atime is unreliable for staleness. mtime is updated whenever a file's contents change, which is a better signal for "this file is forgotten."
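A minimal sketch of the mtime comparison, assuming a year is treated as 365 days (the tool's exact year arithmetic is not documented here) and that `is_stale` is a hypothetical helper name.

```rust
use std::time::{Duration, SystemTime};

// Assumption: a "year" is 365 days; leap days are ignored.
const SECS_PER_YEAR: u64 = 365 * 24 * 60 * 60;

/// True when the file was last modified at least `stale_years` ago.
pub fn is_stale(mtime: SystemTime, now: SystemTime, stale_years: u64) -> bool {
    match now.duration_since(mtime) {
        Ok(age) => age >= Duration::from_secs(stale_years * SECS_PER_YEAR),
        // mtime is in the future (clock skew); treat as not stale.
        Err(_) => false,
    }
}
```

In practice `mtime` would come from `std::fs::metadata(path)?.modified()?` and `now` from `SystemTime::now()`; passing `now` in keeps the function easy to test.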

Project layout

src/
  main.rs        # CLI entry, subcommand dispatch, clap styles
  walker.rs      # Parallel directory traversal, file collection, inode capture
  classifier.rs  # Extension → category mapping
  analyzer.rs    # Grouping, sorting, stale detection
  renderer.rs    # Default scan output
  dupes.rs       # Duplicate detection (parallel, hardlink-aware) + interactive delete
  delete.rs      # Interactive stale-file deletion
  format.rs      # Shared byte-size formatter

Platform notes

  • Linux & macOS: full feature set, including hardlink-aware dupe detection and pager auto-launch.
  • Windows: builds and runs cleanly via cargo build --release (CI covers windows-latest). Two caveats:
    • Pager is disabled on Windows (there's no portable less). Output prints straight to stdout — pipe to more or use Windows Terminal's scrollback. The --no-pager flag is a no-op there.
    • Hardlink detection is currently inactive — the inode/file-index API is nightly-only on std. Dupe detection still works, but hardlinks are treated as separate entries instead of being collapsed.

Performance notes

bigfiles uses a parallel directory walker (the ignore crate, same engine as ripgrep) which is fast and portable across macOS, Linux, and Windows. It is not the theoretical fastest approach on every platform:

  • Windows / NTFS: Reading the Master File Table (MFT) directly enumerates every file on a volume in one sequential pass. Tools like everything.exe use this. bigfiles does not, yet. See #1.
  • macOS / APFS: The volume catalog can be read in bulk via getattrlistbulk, which is faster than readdir on large trees. bigfiles does not exploit this. See #2.
  • Linux (ext4 / btrfs / xfs): No comparable shortcut. Standard parallel walking is close to optimal.

If you have benchmarks against fd, dust, fclones, or other scanners, open an issue. Honest numbers are welcome.

Benchmarks: ignore vs jwalk

A common suggestion is to swap the ignore-based walker for jwalk for higher throughput. The numbers don't support it for bigfiles' workload. Run with cargo bench --bench walker_bench.

| Tree shape | ignore (gitignore on) | ignore (gitignore off) | jwalk |
| --- | --- | --- | --- |
| Shallow-wide (10k files in one dir) | 11.0 ms | 7.2 ms | 7.4 ms |
| Deep-narrow (50 levels × 1 file) | 1.73 ms | 1.02 ms | 1.43 ms |
| Realistic (src + ignored node_modules) | 0.34 ms | 9.7 ms | 4.6 ms |

Takeaways:

  • With gitignore parsing disabled, ignore and jwalk are effectively tied. ignore actually wins the deep-narrow case.
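  • The realistic workload is where gitignore awareness pays off: with a .gitignore excluding node_modules, ignore finishes in 0.34 ms, roughly 28× faster than the same walker with gitignore disabled (9.7 ms) and roughly 13× faster than jwalk (4.6 ms). The walker skips ~5,000 files without ever calling stat. jwalk doesn't have gitignore support out of the box, so it walks everything.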
  • For bigfiles' actual users (dev machines with node_modules, target/, .venv/), the gitignore-aware skip wins by a margin no raw-throughput improvement could close.

Measured on macOS (Apple Silicon) with Criterion, 100 samples each. Numbers are wall-clock per iteration.

Caveats

  • Removal is to Trash by default for both delete and dupes --delete. Pair with --force for permanent deletion (cannot be undone). The TUI is Trash-only — use the CLI with --force if you need permanent removal.
  • If Trash is unavailable (some network mounts, certain restricted environments), the operation refuses cleanly and asks you to re-run with --force. bigfiles will not silently fall back to permanent deletion.
  • Dupe pairing is relative to the scan root. If two copies live in separate trees (e.g. ~/A/file and ~/B/file), running bigfiles dupes ~/A won't find them. Scan a common parent.
  • --top, --json, --sort, --reverse only apply to the default scan. They're accepted but ignored under dupes/delete/audit/top/tui (a stderr note is printed).
  • Symlinks are skipped entirely. If you rely on symlink farms for organization, walking through them isn't supported — point bigfiles at the real paths.

Future ideas

  • Per-directory breakdown ("top 10 heaviest subdirectories")
  • --watch mode that re-scans on an interval
  • A full TUI with ratatui (expand/collapse categories, arrow-key navigation)
  • Persistent index in ~/.cache/bigfiles/ to diff scans over time
  • Replace dupes with hardlinks (--link mode) instead of deleting

Stability

Starting with 1.0, the CLI surface and JSON schema follow semver:

  • CLI flags: removing a flag, changing its short form, or changing default behavior requires a major version bump. New flags are minor.
  • JSON output: the "version": 1 envelope is stable. Breaking changes ship as "version": 2. Adding new fields is minor.
  • Exit codes: 0 success, 1 runtime error, 2 usage error.
  • Internal Rust API: not stable. Use the binary, not the library crate.
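For scripts that wrap the binary, the exit-code contract above maps onto three outcomes. A sketch using the standard library's `ExitCode`; the `Outcome` enum and both function names are hypothetical and only illustrate the documented 0/1/2 values.

```rust
use std::process::ExitCode;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    RuntimeError,
    UsageError,
}

/// Numeric status for each outcome, per the Stability section.
pub fn status_byte(outcome: Outcome) -> u8 {
    match outcome {
        Outcome::Success => 0,
        Outcome::RuntimeError => 1,
        Outcome::UsageError => 2,
    }
}

/// Convert to a value that `fn main() -> ExitCode` can return.
pub fn to_exit_code(outcome: Outcome) -> ExitCode {
    ExitCode::from(status_byte(outcome))
}
```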

License

AGPL-3.0-or-later — see LICENSE.
