TLDR: A tool for dumping codebase information for LLMs efficiently and effectively.
It analyzes a codebase and generates a structured representation that can be fed to large language models (LLMs). It supports local directories, individual files, and remote Git repositories, including scans scoped to a specific subdirectory.
Status: It's safe to say it's ready for daily use; I've been using it for a while now.
```sh
cargo install dumpfs
```

```sh
npm install @kkharji/dumpfs
```

```sh
# Basic usage (current directory, output to stdout)
dumpfs gen
# Scan a specific directory with output to a file
dumpfs gen /path/to/project -o project_dump.md
# Scan with specific output format
dumpfs gen . -o output.xml -f xml
# Copy the generated content to the clipboard
dumpfs gen . --clip
# Filter files using ignore patterns
dumpfs gen . -i "*.log,*.tmp,node_modules/*"
# Include only specific files
dumpfs gen . -I "*.rs,*.toml"
# Show additional metadata in output (size, modified time, permissions)
dumpfs gen . -smp
# Skip file contents, show only structure
dumpfs gen . --skip-content
# Scan a remote Git repository
dumpfs gen https://github.com/username/repo -o repo_dump.md
# Generate shell completions (supports bash, zsh)
dumpfs completion zsh ~/.config/zsh/completions/_dumpfs
```

```ts
import { scan } from '@kkharji/dumpfs';
// Basic usage - scan current directory
const result = await scan('.');
const llmText = await result.llmText();
console.log(llmText);
// With options - scan with custom settings
const projectResult = await scan('/path/to/project', {
maxDepth: 3,
ignorePatterns: ['node_modules/**', '*.log'],
includePatterns: ['*.js', '*.ts', '*.json'],
skipContent: false,
model: 'gpt4' // Enable token counting
});
// Generate different output formats
const markdownOutput = await projectResult.llmText();
// Customize output options
const customOutput = await projectResult.llmText({
showPermissions: true,
showSize: true,
showModified: true,
includeTreeOutline: true,
omitFileContents: false
});
```

### 🚀 Node.js Bindings (NAPI)
- Full JavaScript/TypeScript library with async support
- Cross-platform native modules for optimal performance
- Type definitions included for better development experience
- Available on npm as `@kkharji/dumpfs`
### 🧠 Token Counting & LLM Integration
- Built-in token counting for popular LLM models (GPT-4, Claude Sonnet, Llama, Mistral)
- Model-aware content analysis and optimization
- Caching system for efficient repeated tokenization
- Support for content-based token estimation
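As a rough sketch of how token budgeting might look through the Node bindings (only `scan`, `llmText`, and the `model` option appear in the examples above; the ~4-characters-per-token fallback is an assumption, not dumpfs's tokenizer):

```ts
import { scan } from '@kkharji/dumpfs';

// Sketch only: `model` is the documented option for enabling token
// counting; the ~4 chars/token estimate below is an assumption, not
// dumpfs's actual tokenizer.
async function dumpWithinBudget(path: string, maxTokens: number): Promise<string> {
  const result = await scan(path, { model: 'gpt4' });
  const text = await result.llmText();
  const approxTokens = Math.ceil(text.length / 4); // rough estimate
  if (approxTokens > maxTokens) {
    console.warn(`dump is ~${approxTokens} tokens, over the ${maxTokens} budget`);
  }
  return text;
}
```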
### ⚡ Enhanced CLI
- Output to stdout for better shell integration
- Clipboard support for seamless workflow
- Improved progress reporting and error handling
- Better filtering and ignore patterns
### 🔧 Performance Improvements
- Optimized parallel processing with configurable thread counts
- Enhanced file type detection and text classification
- Better memory management for large codebases
- Improved handling of symlinks and permissions
The architecture supports several important features:
- Parallel Processing: Uses worker threads for efficient filesystem traversal and processing
- Flexible Input: Handles both local and remote code sources uniformly
- Smart Filtering: Provides multiple ways to filter content (see the sketch after this list):
  - File size limits
  - Modified dates
  - Permissions
  - Gitignore patterns
  - Custom include/exclude patterns
- Token Counting & LLM Integration:
  - Built-in tokenization for major LLM models (GPT-4, Claude, Llama, Mistral)
  - Caching for efficient repeated tokenization
  - Model-aware content analysis and optimization
- Performance Optimization:
  - Efficient buffered I/O
  - Progress tracking
  - Cancellation support
- Extensibility:
  - Modular design for adding new tokenizers
  - Support for multiple output formats
  - Pluggable formatter system
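For example, several of these filters surface directly as `scan` options in the Node bindings (a sketch using only the options shown in the earlier example):

```ts
import { scan } from '@kkharji/dumpfs';

// Combine gitignore-style excludes with an explicit allowlist; all
// four options below appear in the earlier API example.
const filtered = await scan('.', {
  maxDepth: 5,                                 // stop recursing past this depth
  ignorePatterns: ['target/**', '*.lock'],     // exclude patterns
  includePatterns: ['src/**/*.rs', '*.toml'],  // explicit allowlist
  skipContent: false,                          // keep file contents in the dump
});
console.log(await filtered.llmText());
```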
The processing flow:
1. User input (path/URL) → Subject
2. Subject initializes the appropriate source (local/remote)
3. Scanner traverses files with parallel workers
4. Files are processed according to type and options
5. Results are collected into a tree structure
6. Formatter converts the tree to the desired output format
7. Results are saved or displayed
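The same flow end to end, sketched with the Node bindings (assuming `scan` accepts remote URLs as the CLI does; the file write is ordinary Node I/O):

```ts
import { writeFile } from 'node:fs/promises';
import { scan } from '@kkharji/dumpfs';

// Steps 1–5: scan() picks the source (here a remote repo, assuming the
// bindings accept URLs like the CLI), traverses it, and builds the tree.
const repo = await scan('https://github.com/username/repo');
// Step 6: the formatter renders the tree (Markdown with a tree outline).
const text = await repo.llmText({ includeTreeOutline: true });
// Step 7: save the result.
await writeFile('repo_dump.md', text);
```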
dumpfs is organized into several key modules that work together to analyze and format codebase content for LLMs:
**Subject**
- Acts as the central coordinator for processing input sources
- Handles both local directories and remote Git repositories
- Provides a high-level API for scanning and formatting operations
**Scanner**
- Handles recursive directory traversal and file analysis
- Implements parallel processing via worker threads for performance
- Detects file types and extracts content & metadata
- Manages filtering based on various criteria (size, date, permissions)
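A simplified TypeScript analogue of that traversal loop (the real scanner uses Rust worker threads; this sketch only illustrates the shape: walk, filter, collect metadata):

```ts
import { readdir, stat } from 'node:fs/promises';
import { join } from 'node:path';

// Walk a directory tree, applying a filter predicate and collecting
// per-file metadata; each directory's entries are processed concurrently.
async function walk(
  dir: string,
  shouldInclude: (path: string) => boolean,
): Promise<{ path: string; size: number }[]> {
  const entries = await readdir(dir, { withFileTypes: true });
  const results: { path: string; size: number }[] = [];
  await Promise.all(entries.map(async (entry) => {
    const path = join(dir, entry.name);
    if (!shouldInclude(path)) return;
    if (entry.isDirectory()) {
      results.push(...await walk(path, shouldInclude));
    } else if (entry.isFile()) {
      const { size } = await stat(path);
      results.push({ path, size });
    }
  }));
  return results;
}
```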
**Remote Repository**
- Parses and validates remote repository URLs
- Extracts repository metadata (owner, name, branch)
- Manages cloning and updating of remote repositories
- Handles authentication and credentials
- Provides access to repository contents for scanning
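A sketch of the kind of URL parsing involved (the field names here are illustrative, not dumpfs's actual types):

```ts
// Illustrative shape for a parsed repository reference.
interface RepoRef { host: string; owner: string; name: string; branch?: string }

function parseRepoUrl(input: string): RepoRef {
  const url = new URL(input);
  // e.g. https://github.com/username/repo/tree/main → username/repo @ main
  const [owner, name, marker, branch] = url.pathname.replace(/^\//, '').split('/');
  if (!owner || !name) throw new Error(`not a repository URL: ${input}`);
  return {
    host: url.host,
    owner,
    name: name.replace(/\.git$/, ''),
    branch: marker === 'tree' ? branch : undefined,
  };
}
```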
**Tokenizer**
- Implements token counting for various LLM models
- Supports multiple providers (OpenAI, Anthropic, HuggingFace)
- Includes caching to avoid redundant tokenization
- Tracks statistics for optimization
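The caching idea, sketched in TypeScript (the content-hash key is an assumption about the strategy, and the counting function is a placeholder for a real model tokenizer):

```ts
import { createHash } from 'node:crypto';

// Key token counts by a hash of the content so unchanged files are
// never re-tokenized across repeated runs.
const tokenCache = new Map<string, number>();

function countTokensCached(
  content: string,
  countTokens: (s: string) => number, // placeholder for a real tokenizer
): number {
  const key = createHash('sha256').update(content).digest('hex');
  let count = tokenCache.get(key);
  if (count === undefined) {
    count = countTokens(content);
    tokenCache.set(key, count);
  }
  return count;
}
```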
**Formatter**
- Converts scanned filesystem data into LLM-friendly formats
- Supports multiple output formats (Markdown, XML, JSON)
- Handles metadata inclusion and content organization
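A minimal sketch of a pluggable formatter system (the node shape and formatter registry are illustrative, not dumpfs's API):

```ts
// Illustrative tree node and formatter interface; one tree, many renderers.
interface FileNode { path: string; content?: string; children?: FileNode[] }
interface Formatter { format(node: FileNode): string }

const markdown: Formatter = {
  format: (node) =>
    `## ${node.path}\n` +
    (node.content ? node.content.replace(/^/gm, '    ') + '\n' : '') + // indent file body
    (node.children ?? []).map((c) => markdown.format(c)).join(''),
};

const json: Formatter = {
  format: (node) => JSON.stringify(node, null, 2),
};

// A registry keyed by name mirrors the CLI's `-f` flag.
const formatters: Record<string, Formatter> = { markdown, json };
console.log(formatters.json.format({ path: 'src', children: [] }));
```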
**Error Handling**
- Provides a centralized error type system
- Implements custom error conversion and propagation
- Ensures consistent error handling across modules
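Sketched in TypeScript for illustration (dumpfs itself is Rust, and the variants here are hypothetical), a centralized error type looks like:

```ts
// One tagged union that every module converts into, so callers match
// on `kind` instead of parsing message strings.
type DumpfsError =
  | { kind: 'io'; path: string; cause: Error }
  | { kind: 'remote'; url: string; cause: Error }
  | { kind: 'tokenizer'; model: string; cause: Error };

function describe(err: DumpfsError): string {
  switch (err.kind) {
    case 'io': return `I/O error at ${err.path}: ${err.cause.message}`;
    case 'remote': return `remote error for ${err.url}: ${err.cause.message}`;
    case 'tokenizer': return `tokenizer error (${err.model}): ${err.cause.message}`;
  }
}
```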
**Cache**
- Manages persistent caching of tokenization results
- Provides cache location and naming utilities
**CLI**
- Implements the command-line interface using clap
- Processes user options and coordinates operations
- Provides progress feedback and reporting