
1 unstable release

new 0.3.0 Jan 20, 2026

#1093 in Text processing

MIT license

725KB
13K SLoC

colgrep

Semantic code search powered by ColBERT multi-vector embeddings and the PLAID algorithm.

Features

  • Semantic Search: Find code using natural language queries
  • Hybrid Search: Combine text matching (-e) with semantic ranking
  • Grep-like Flags: Familiar -r, -e, -E, --include, -l flags for filtering results
  • Selective Indexing: When using filters, only matching files are indexed
  • 5-Layer Code Analysis: Rich embeddings from AST, call graph, control flow, data flow, and dependencies
  • File Path Aware: Normalized file paths are included in embeddings for path-based semantic search
  • 18 Languages: Python, TypeScript, JavaScript, Go, Rust, Java, C, C++, Ruby, C#, Kotlin, Swift, Scala, PHP, Lua, Elixir, Haskell, OCaml
  • Config & Docs: Also indexes YAML, TOML, JSON, Markdown, Dockerfile, Makefile, shell scripts
  • Incremental Updates: Only re-indexes changed files using content hashing
  • Auto-Indexing: Automatically builds index on first search
  • Smart Size Limits: Skips files >512KB to avoid memory issues with large generated files
  • Fast: ColBERT late interaction with PLAID compression for sub-second queries

Installation

macOS / Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/lightonai/next-plaid/releases/latest/download/colgrep-installer.sh | sh

Windows (PowerShell):

powershell -c "irm https://github.com/lightonai/next-plaid/releases/latest/download/colgrep-installer.ps1 | iex"

Using Cargo

If you have Rust installed:

cargo install colgrep

Installing Rust

If you don't have Rust installed, install it first:

macOS / Linux:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Windows:

Download and run rustup-init.exe or use PowerShell:

winget install Rustlang.Rustup

After installation, restart your terminal and verify with rustc --version.

From Source

git clone https://github.com/lightonai/next-plaid.git
cd next-plaid/colgrep
cargo install --path .

ONNX Runtime (Automatic)

ONNX Runtime is automatically downloaded on first use if not found on your system. No manual installation required.

The CLI searches for ONNX Runtime in:

  1. ORT_DYLIB_PATH environment variable
  2. Python environments (pip/conda/venv)
  3. System paths

If not found, it downloads from GitHub releases to ~/.cache/onnxruntime/.
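
For reference, this lookup order can be sketched in a few lines of Rust (the helper logic is simplified and illustrative, not colgrep's actual code):

use std::env;
use std::path::PathBuf;

// Illustrative sketch of the documented lookup order.
fn resolve_ort_library() -> Option<PathBuf> {
    // 1. Explicit override via environment variable
    if let Ok(path) = env::var("ORT_DYLIB_PATH") {
        let p = PathBuf::from(path);
        if p.exists() {
            return Some(p);
        }
    }
    // 2. Active Python environment (e.g. conda exposes CONDA_PREFIX)
    if let Ok(prefix) = env::var("CONDA_PREFIX") {
        let p = PathBuf::from(prefix).join("lib").join("libonnxruntime.so");
        if p.exists() {
            return Some(p);
        }
    }
    // 3. Common system locations
    for dir in ["/usr/lib", "/usr/local/lib"] {
        let p = PathBuf::from(dir).join("libonnxruntime.so");
        if p.exists() {
            return Some(p);
        }
    }
    None // caller falls back to downloading into ~/.cache/onnxruntime/
}

fn main() {
    match resolve_ort_library() {
        Some(p) => println!("using ONNX Runtime at {}", p.display()),
        None => println!("not found locally; would download to ~/.cache/onnxruntime/"),
    }
}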

For GPU support, install manually:

pip install onnxruntime-gpu

Usage

# Search in current directory (auto-indexes if needed)
colgrep "error handling in API"

# Search in specific directory
colgrep "database connection" /path/to/project

# Limit results
colgrep "authentication" -k 5

# JSON output
colgrep "parse config" --json

# Explicit subcommand (same behavior)
colgrep search "query"

Grep-like Filtering

Filter search results using familiar grep-style flags:

# -r: Recursive search (default behavior, for grep compatibility)
colgrep -r "database" .

# --include: Filter by file pattern (can be used multiple times)
colgrep --include="*.py" "database connection" .
colgrep --include="*.rs" --include="*.go" "error handling" .

# -l: List files only (show unique filenames, not code details)
colgrep -l "authentication" .

# --code-only: Skip text/config files (md, txt, yaml, json, toml, etc.)
colgrep --code-only "authentication" .

# -n/--lines: Control context lines (default: 6)
colgrep -n 10 "database connection" .    # Show 10 lines per result

# Combine flags (like grep -rl)
colgrep -r -l --include="*.ts" "fetch API" .

Supported patterns for --include:

Pattern          Matches
*.py             Files with .py extension (in any directory)
**/*.py          Same as above (explicit recursive)
src/**/*.rs      .rs files under any src/ directory
**/.github/**/*  All files in .github/ directories
*test*           Files containing "test" in name
*_test.go        Go test files (suffix pattern)
*.spec.ts        Files ending with .spec.ts

The --include flag supports full glob patterns including ** for recursive directory matching. Multiple patterns can be combined (OR logic).
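
These semantics can be verified quickly with the glob crate (used here purely for illustration; colgrep's internal matcher may differ):

// Cargo.toml: glob = "0.3"
use glob::Pattern;

fn main() {
    // Bare patterns like *.py apply to the file name, in any directory.
    let by_name = Pattern::new("*.py").unwrap();
    assert!(by_name.matches("utils.py"));

    // ** crosses directory boundaries.
    let recursive = Pattern::new("src/**/*.rs").unwrap();
    assert!(recursive.matches("src/parser/mod.rs"));
    assert!(!recursive.matches("tests/parser.rs"));

    // Multiple --include patterns combine with OR logic.
    let includes: Vec<Pattern> = ["*_test.go", "*.spec.ts"]
        .iter()
        .map(|p| Pattern::new(p).unwrap())
        .collect();
    let keep = |name: &str| includes.iter().any(|p| p.matches(name));
    assert!(keep("api.spec.ts"));
    assert!(!keep("api.ts"));
}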

Hybrid Search: Text + Semantic

Use -e/--pattern to first filter files using grep (text match), then rank results with semantic search:

# Find files containing "TODO", then semantically search for "error handling"
colgrep -e "TODO" "error handling" .

# Combine with --include for precise filtering
colgrep -e "async" --include="*.ts" "promise handling" .

# List only files containing "deprecated" that match "migration"
colgrep -l -e "deprecated" "migration guide" .

Extended Regular Expressions (ERE):

Use -E/--extended-regexp to enable extended regex syntax for the -e pattern:

# Alternation: find files containing "fn" OR "struct"
colgrep -e "fn|struct" -E "rust definitions" .

# Quantifiers: one or more digits
colgrep -e "error[0-9]+" -E "error codes" .

# Optional: match "color" or "colour"
colgrep -e "colou?r" -E "color handling" .

# Grouping with alternation
colgrep -e "(get|set)Value" -E "accessor methods" .

How it works:

  1. grep -rl (or grep -rlE with -E) finds all files containing the text pattern
  2. Filtering retrieves code unit IDs from those files
  3. Semantic search ranks only those candidates
  4. Exact grep matches are shown at the end with context lines

This is useful when you know a specific term exists in the code but want semantic understanding of the context.
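
A minimal sketch of this two-stage pipeline (the grep stage is real; the ranking stage is stubbed, since the actual PLAID search happens inside colgrep):

use std::process::Command;

// Stage 1: grep -rl (or -rlE) produces the candidate file list.
fn grep_candidates(pattern: &str, dir: &str, extended: bool) -> Vec<String> {
    let mut cmd = Command::new("grep");
    cmd.arg("-rl");
    if extended {
        cmd.arg("-E"); // grep -rlE when colgrep is given -E
    }
    let out = cmd.arg(pattern).arg(dir).output().expect("failed to run grep");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .map(str::to_owned)
        .collect()
}

// Stage 2 stub: in colgrep this is a PLAID semantic search restricted
// to the code units that live in the candidate files.
fn rank_semantically(_query: &str, candidates: &[String]) -> Vec<(String, f32)> {
    candidates.iter().map(|f| (f.clone(), 0.0)).collect()
}

fn main() {
    let candidates = grep_candidates("TODO", ".", false);
    let ranked = rank_semantically("error handling", &candidates);
    for (file, score) in ranked.iter().take(5) {
        println!("{score:.3}  {file}");
    }
}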

Context lines (-n/--lines):

Control how many lines of code are shown per result:

# Default: 6 lines for semantic results, 3+3 for grep matches
colgrep -e "async" "error handling" .

# Custom: 10 lines for semantic, 5+5 for grep
colgrep -e "async" "error handling" -n 10 .

# Minimal: 2 lines for semantic, 1+1 for grep
colgrep -e "async" "error handling" -n 2 .

The -n value controls:

  • Semantic results: First N lines of each matched function
  • Grep matches: N/2 lines before and after each exact match

Selective Indexing

When using filters (--include or -e), only matching files are indexed. This makes searching in large codebases fast even without a pre-built index:

# Only indexes .py files, not the entire codebase
colgrep --include="*.py" "database query" /large/project

# Only indexes files containing "async", skips everything else
colgrep -e "async" "error handling" /large/project

# Intersection: only indexes .ts files that contain "fetch"
colgrep -e "fetch" --include="*.ts" "API call" /large/project

Indexing behavior by filter:

Filters           Files Indexed
None              All supported files
--include="*.py"  Only .py files
-e "pattern"      Only files containing pattern
Both              Intersection (files matching both)

Benefits:

  • Search immediately in large codebases without full indexing
  • Index grows incrementally as you search different file types
  • Already-indexed files are skipped (content hash check)
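
The content-hash check can be sketched like this (the hash function and state layout here are illustrative; colgrep keeps its real hashes in state.json):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};

// Hash a file's contents (illustrative; not necessarily colgrep's hash).
fn content_hash(path: &str) -> std::io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(path)?.hash(&mut hasher);
    Ok(hasher.finish())
}

// Re-index only files whose contents changed since the last run.
fn files_to_reindex(
    files: &[&str],
    state: &HashMap<String, u64>, // previous hashes, as kept in state.json
) -> std::io::Result<Vec<String>> {
    let mut changed = Vec::new();
    for &file in files {
        if state.get(file) != Some(&content_hash(file)?) {
            changed.push(file.to_owned());
        }
    }
    Ok(changed)
}

fn main() -> std::io::Result<()> {
    let state = HashMap::new(); // empty state: every file counts as changed
    let changed = files_to_reindex(&["src/main.rs"], &state)?;
    println!("{} file(s) need re-indexing", changed.len());
    Ok(())
}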

Code-Only Mode

Use --code-only to exclude text and configuration files from search results, focusing only on actual code:

# Search only code files, skip markdown, yaml, json, etc.
colgrep --code-only "authentication logic" .

# Combine with other flags
colgrep --code-only -k 20 "error handling" .
colgrep --code-only --include="*.py" "database" .

Files excluded by --code-only:

Category       File Types
Documentation  Markdown, Plain text, AsciiDoc, Org
Configuration  YAML, TOML, JSON, Dockerfile, Makefile
Shell scripts  Shell (.sh, .bash, .zsh), PowerShell

This is useful when searching for implementation details without results from documentation, config files, or scripts cluttering the output.

Status

colgrep status

Example Output

$ colgrep "encode documents with ColBERT"

1. encode_documents (score: 10.100)
   → src/lib.rs:680
   pub fn encode_documents(

2. Colbert (score: 10.067)
   → src/lib.rs:454
   pub struct Colbert {

3. encode_queries (score: 10.066)
   → src/lib.rs:718
   pub fn encode_queries(&self, queries: &[&str]) -> Result<Vec<Array2<f32>>> {

JSON Output

$ colgrep "control flow" -k 1 --json
[
  {
    "unit": {
      "name": "extract_control_flow",
      "file": "src/parser/mod.rs",
      "line": 449,
      "language": "rust",
      "unit_type": "function",
      "signature": "fn extract_control_flow(node: Node, lang: Language) -> (usize, bool, bool, bool)",
      "docstring": null,
      "calls": ["children", "kind", "visit", "walk"],
      "called_by": ["extract_function"],
      "complexity": 4,
      "has_loops": true,
      "has_branches": true,
      "has_error_handling": false,
      "variables": [
        "complexity",
        "has_branches",
        "has_error_handling",
        "has_loops"
      ],
      "imports": [],
      "code": "fn extract_control_flow(...) {\n    let mut complexity = 1;\n    ..."
    },
    "score": 5.44
  }
]
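
This shape makes the output easy to consume from other tools. A minimal Rust sketch that deserializes the fields shown above with serde (the struct names are mine; unknown fields are ignored by default):

// Cargo.toml: serde = { version = "1", features = ["derive"] }, serde_json = "1"
use serde::Deserialize;

#[derive(Deserialize)]
struct SearchResult {
    unit: Unit,
    score: f64,
}

#[derive(Deserialize)]
struct Unit {
    name: String,
    file: String,
    line: u32,
    // The remaining fields (signature, calls, called_by, variables, ...)
    // can be declared the same way as needed.
}

fn main() -> serde_json::Result<()> {
    let raw = r#"[{"unit":{"name":"extract_control_flow",
        "file":"src/parser/mod.rs","line":449},"score":5.44}]"#;
    let results: Vec<SearchResult> = serde_json::from_str(raw)?;
    for r in &results {
        println!("{:.2}  {}:{}  {}", r.score, r.unit.file, r.unit.line, r.unit.name);
    }
    Ok(())
}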

5-Layer Code Analysis

Each code unit (function, method, class) is analyzed across 5 layers:

Layer            Data Extracted                                      Example
1. AST           Signature, docstring, parameters, return type       fn foo(x: i32) -> String
2. Call Graph    Functions called, functions that call this          calls: [bar, baz], called_by: [main]
3. Control Flow  Complexity, loops, branches, error handling         complexity: 5, has_loops: true
4. Data Flow     Variables defined                                   variables: [result, temp, config]
5. Dependencies  Imports used                                        imports: [serde, tokio]
+ File Path      Normalized path for embedding + original filename   project / src / utils / parser parser.rs

This rich context enables semantic understanding beyond simple text matching.

Embedding Text Example

Here's an example of the text representation sent to the ColBERT model for encoding. This shows how all 5 layers are combined into a single searchable document:

Function: search
Signature: pub fn search(&self, query: &str, top_k: usize, subset: Option<&[i64]>) -> Result<Vec<SearchResult>>
Description: Search the index with an optional filtered subset
Parameters: self, query, top_k, subset
Returns: Result<Vec<SearchResult>>
Calls: encode_queries, search, get, to_vec, context, iter, zip, filter_map, collect
Called by: cmd_search
Control flow: complexity=3, has_branches
Variables: query_embeddings, query_emb, params, results, doc_ids, metadata, search_results
Uses: next_colgrep, serde_json, anyhow
Code:
pub fn search(&self, query: &str, top_k: usize, subset: Option<&[i64]>) -> Result<Vec<SearchResult>> {
    let query_embeddings = self.model.encode_queries(&[query])?;
    ...
}
File: next colgrep cli / src / index / mod mod.rs

This structured format allows the model to understand:

  • What the code does (signature, description)
  • How it works (control flow, variables)
  • Where it fits (calls, called_by, imports)
  • Location in the codebase (file path)

The file path is processed for better embedding quality:

  1. Shortened to include only the filename and up to 3 parent directories
  2. Path separators (/, \) are surrounded by spaces and normalized to /
  3. Underscores, hyphens, and dots are replaced with spaces
  4. CamelCase is split into separate words (e.g., MyClass → my class)
  5. The entire path is lowercased
  6. The original filename is appended at the end for exact matching

This normalization helps the embedding model better understand path components as separate semantic tokens.
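
The steps can be sketched as follows (illustrative, not colgrep's exact implementation; dropping the extension from the normalized filename component is inferred from the example above, where mod.rs becomes "mod"):

// Normalize a file path for embedding, per the steps above (illustrative).
fn normalize_path(path: &str) -> String {
    // 1. Keep only the filename and up to 3 parent directories.
    let parts: Vec<&str> = path.split(['/', '\\']).filter(|p| !p.is_empty()).collect();
    let keep = &parts[parts.len().saturating_sub(4)..];
    let filename = *keep.last().unwrap_or(&"");

    let mut normalized = String::new();
    for (i, part) in keep.iter().enumerate() {
        if i > 0 {
            normalized.push_str(" / "); // 2. separators surrounded by spaces
        }
        // Drop the filename's extension before word-splitting (assumption).
        let word = if i == keep.len() - 1 {
            part.rsplit_once('.').map_or(*part, |(stem, _)| stem)
        } else {
            *part
        };
        normalized.push_str(&split_words(word)); // steps 3-5
    }

    // 6. Append the original filename for exact matching.
    format!("{normalized} {filename}")
}

// Replace _ - . with spaces, split CamelCase (MyClass -> my class), lowercase.
fn split_words(s: &str) -> String {
    let mut out = String::new();
    let mut prev_lower = false;
    for c in s.chars() {
        match c {
            '_' | '-' | '.' => {
                out.push(' ');
                prev_lower = false;
            }
            c if c.is_uppercase() => {
                if prev_lower {
                    out.push(' '); // CamelCase boundary
                }
                out.extend(c.to_lowercase());
                prev_lower = false;
            }
            c => {
                out.push(c);
                prev_lower = c.is_lowercase();
            }
        }
    }
    out
}

fn main() {
    // Prints: next colgrep cli / src / index / mod mod.rs
    println!("{}", normalize_path("next-colgrep-cli/src/index/mod.rs"));
}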

Supported Languages

Code Languages (with tree-sitter parsing)

Language    Extensions
Python      .py
TypeScript  .ts, .tsx
JavaScript  .js, .jsx, .mjs
Go          .go
Rust        .rs
Java        .java
C           .c, .h
C++         .cpp, .cc, .cxx, .hpp, .hxx
Ruby        .rb
C#          .cs
Kotlin      .kt, .kts
Swift       .swift
Scala       .scala, .sc
PHP         .php
Lua         .lua
Elixir      .ex, .exs
Haskell     .hs
OCaml       .ml, .mli

Text & Documentation

Format      Extensions
Markdown    .md, .markdown
Plain Text  .txt, .text, .rst
AsciiDoc    .adoc, .asciidoc
Org         .org

Configuration Files

Format      Extensions / Files
YAML        .yaml, .yml
TOML        .toml
JSON        .json
Dockerfile  Dockerfile
Makefile    Makefile, GNUmakefile

Shell Scripts

Format      Extensions
Shell       .sh, .bash, .zsh
PowerShell  .ps1

Text, documentation, configuration files, and shell scripts are indexed as a single document per file.

Ignored Directories

The following directories are always ignored (even without .gitignore):

Category         Ignored
Version Control  .git, .svn, .hg
Dependencies     node_modules, vendor, third_party, external
Build Outputs    target, build, dist, out, bin, obj
Python           __pycache__, .venv, venv, .env, .tox, .pytest_cache, .mypy_cache, *.egg-info
JavaScript       .next, .nuxt, .cache, .parcel-cache, .turbo
Java             .gradle, .m2
IDE/Editor       .idea, .vscode, .vs, *.xcworkspace, *.xcodeproj
Coverage         coverage, .coverage, htmlcov, .nyc_output
Misc             .colgrep, tmp, temp, logs, .DS_Store

Additionally, all patterns in .gitignore are respected.

File Size Limit

Files larger than 512KB are automatically skipped during indexing. This prevents memory issues with very large generated files, minified bundles, or data files.

When files are skipped, the indexing output shows:

3 files skipped (too large, >512KB)

Common files that may be skipped:

  • Minified JavaScript bundles (bundle.min.js)
  • Large generated files
  • Data files accidentally given code extensions
  • Vendored dependencies
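
The size check itself is a cheap metadata lookup done before a file's contents are read; roughly:

use std::fs;
use std::path::Path;

const MAX_FILE_SIZE: u64 = 512 * 1024; // the 512KB indexing limit

// Returns true if the file should be skipped during indexing.
fn too_large(path: &Path) -> std::io::Result<bool> {
    Ok(fs::metadata(path)?.len() > MAX_FILE_SIZE)
}

fn main() -> std::io::Result<()> {
    let path = Path::new("bundle.min.js");
    if path.exists() && too_large(path)? {
        println!("skipping {} (too large, >512KB)", path.display());
    }
    Ok(())
}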

Model

By default, uses lightonai/GTE-ModernColBERT-v1-onnx with INT8 quantization for fast inference. The model is automatically downloaded on first use. Use colgrep config --fp32 to switch to full-precision mode (see Configuration).

Using a Different Model

Use a different model for a single query:

colgrep "query" --model path/to/local/model
colgrep "query" --model organization/model-name

Switching Default Model

Change the default model permanently:

# Set a new default model
colgrep set-model lightonai/another-colbert-model

# The new model is validated before switching
# Old indexes are automatically cleared (they're incompatible)

Your model preference is stored in ~/.config/colgrep/config.json.

Index Storage

Indexes are stored in a centralized location following the XDG Base Directory specification:

Platform  Location
Linux     ~/.local/share/colgrep/indices/
macOS     ~/Library/Application Support/colgrep/indices/
Windows   C:\Users\<user>\AppData\Roaming\colgrep\indices\

Each project gets its own subdirectory named {project-name}-{8-char-hash}:

{project-name}-{hash}/
├── index/          # PLAID vector index
│   └── metadata.json
├── state.json      # File hashes for incremental updates
└── project.json    # Project path and metadata

Parent Index Detection

When searching in a subdirectory of an already-indexed project, the CLI automatically uses the parent index instead of creating a new one:

# If /my/project is already indexed...
cd /my/project/src/utils
colgrep "helper function"   # Uses /my/project's index automatically

Clearing Indexes

# Clear index for current project
colgrep clear

# Clear all indexes
colgrep clear --all

How It Works

  1. Parse: Tree-sitter extracts functions, methods, and classes from source files
  2. Analyze: 5-layer analysis extracts rich structural information
  3. Embed: ColBERT encodes each unit as multiple vectors (one per token)
  4. Index: PLAID algorithm compresses and indexes the vectors
  5. Search: Query is encoded and matched using late interaction scoring
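
Late interaction (step 5) scores a query against a document with MaxSim: each query token vector takes its best dot product over all document token vectors, and those maxima are summed. A toy sketch with plain vectors (colgrep's real search runs over the compressed PLAID index, not raw vectors):

// MaxSim late interaction: sum over query tokens of the best dot
// product against any document token. Assumes embeddings are
// L2-normalized, so dot product equals cosine similarity.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    // Two query token vectors, three document token vectors (toy 2-dim).
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![0.9, 0.1], vec![0.0, 1.0], vec![0.5, 0.5]];
    // First query token best matches doc[0] (0.9), second matches doc[1] (1.0).
    assert!((maxsim(&query, &doc) - 1.9).abs() < 1e-6);
}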

Hardware Acceleration

Enable GPU support when building:

# NVIDIA CUDA
cargo install --path . --features cuda

# Apple CoreML
cargo install --path . --features coreml

Configuration

Config Command

View and modify configuration settings:

# Show current configuration
colgrep config

# Set default number of results
colgrep config --k 20

# Set default context lines
colgrep config --n 10

# Use full-precision (FP32) model instead of INT8 quantized
colgrep config --fp32

# Switch back to INT8 quantized model (default, faster)
colgrep config --int8

# Reset to defaults (use 0)
colgrep config --k 0 --n 0

Model Precision

By default, colgrep uses INT8 quantized models for faster inference with minimal quality loss. You can switch to full-precision (FP32) if needed:

Mode            Flag    Description
INT8 (default)  --int8  ~2x faster inference, smaller model size
FP32            --fp32  Full precision, slightly better accuracy

Note: When switching precision, clear existing indexes with colgrep clear --all since embeddings are generated with different model weights.

Config File

User preferences are stored in ~/.config/colgrep/config.json. Only non-default values are saved:

{
  "default_model": "lightonai/GTE-ModernColBERT-v1-onnx",
  "fp32": true,
  "default_k": 20,
  "default_n": 10
}

Defaults (when not specified): k=15, n=6, fp32=false (INT8)
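
Readers of this file fall back to the documented defaults for any missing key; roughly (the struct and crate choices here are illustrative):

// Cargo.toml: serde = { version = "1", features = ["derive"] }, serde_json = "1", dirs = "5"
use serde::Deserialize;

#[derive(Deserialize, Default)]
#[serde(default)]
struct Config {
    default_model: Option<String>,
    fp32: bool,               // false = INT8 (the default)
    default_k: Option<usize>,
    default_n: Option<usize>,
}

fn main() {
    // ~/.config/colgrep/config.json on Linux
    let path = dirs::config_dir().unwrap().join("colgrep/config.json");
    let raw = std::fs::read_to_string(path).unwrap_or_default();
    let cfg: Config = serde_json::from_str(&raw).unwrap_or_default();

    // Missing keys fall back to the documented defaults.
    let model = cfg
        .default_model
        .unwrap_or_else(|| "lightonai/GTE-ModernColBERT-v1-onnx".to_owned());
    let k = cfg.default_k.unwrap_or(15);
    let n = cfg.default_n.unwrap_or(6);
    println!("model={model}, k={k}, n={n}, fp32={}", cfg.fp32);
}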

Environment Variables

Variable        Description
ORT_DYLIB_PATH  Path to ONNX Runtime library (overrides auto-detection)
CONDA_PREFIX    Used for finding Python environments

Dependencies

~403MB
~11M SLoC