xf

xf - Ultra-fast CLI for searching your X data archive

License: MIT

Ultra-fast CLI for searching and querying your X data archive with sub-millisecond latency.

Quick Install

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | bash

Or via package managers:

# macOS/Linux (Homebrew)
brew install dicklesworthstone/tap/xf

# Windows (Scoop)
scoop bucket add dicklesworthstone https://github.com/Dicklesworthstone/scoop-bucket
scoop install dicklesworthstone/xf

Works on Linux, macOS, and Windows. Auto-detects your platform and downloads the right binary.


🤖 Agent Quickstart (JSON)

Use --format json in agent contexts. stdout = data, stderr = diagnostics, exit 0 = success.

# 1) Index once (required before search)
xf index ~/x-archive

# 2) Search (machine-readable)
xf search "machine learning" --format json --limit 20

# 3) Archive stats (machine-readable)
xf stats --format json

TL;DR

The Problem: X lets you download all your data, but actually finding anything in that archive is painful. The built-in HTML viewer is slow and clunky, there's no real search, and your data is scattered across separate files.

The Solution: xf indexes your X (formerly Twitter) data export and provides blazingly fast full-text search across tweets, likes, DMs, and Grok conversations—all from the command line.

Why Use xf?

| Feature | What It Does |
|---|---|
| Sub-Millisecond Search | Tantivy-powered full-text search with BM25 ranking |
| Vector Similarity (default: hash) | Finds content with overlapping vocabulary; best when queries share words with target content |
| True Semantic Search (optional ML) | Uses MiniLM embeddings when indexed with `--semantic` for synonym-level matching |
| Hybrid Search | Combines keyword + vector similarity (hash by default, ML when indexed with `--semantic`) |
| Search Everything | Tweets, likes, DMs, and Grok conversations in one place |
| Rich Query Syntax | Phrases, wildcards, boolean operators (AND, OR, NOT) |
| DM Context | View full conversation threads with search matches highlighted |
| Multiple Formats | JSON, CSV, compact, or colorized terminal output |
| Privacy-First | All data stays local on your machine; nothing sent anywhere |
| Fast Indexing | ~10,000 documents/second with parallel parsing |

Note: Semantic mode uses hash-based vocabulary similarity by default. Run xf index --semantic to build true semantic embeddings (MiniLM). If you switch modes, re-index so the vector index matches the embedder.

Quick Example

# Index your archive (default: hash-based embeddings)
$ xf index ~/x-archive

# Optional: true semantic embeddings (downloads ~80MB on first use)
$ xf index ~/x-archive --semantic

# Search across everything (hybrid mode by default)
$ xf search "machine learning"

# Semantic search (vector similarity; true semantic if indexed with --semantic)
$ xf search "feeling overwhelmed at work" --mode semantic

# Keyword-only search (classic BM25)
$ xf search "rust async" --mode lexical

# Search only your DMs with full conversation context
$ xf search "meeting tomorrow" --types dm --context

# Export results as JSON
$ xf search "rust async" --format json --limit 50

Prepared Blurb for AGENTS.md Files:

## xf — X Archive Search

Ultra-fast local search for X (Twitter) data archives. Parses `window.YTD.*` JavaScript format from X data exports. Hybrid search combining keyword (BM25) + vector similarity (hash by default; ML when indexed with `--semantic`) via RRF fusion.

### Core Workflow

```bash
# 1. Index archive (one-time, ~5-30 seconds)
xf index ~/x-archive
xf index ~/x-archive --force          # Rebuild from scratch
xf index ~/x-archive --only tweet,dm  # Index specific types
xf index ~/x-archive --skip grok      # Skip specific types
xf index ~/x-archive --semantic       # True semantic embeddings (MiniLM; slower)

# 2. Search
xf search "machine learning"          # Hybrid search (default)
xf search "feeling stressed" --mode semantic  # Vector similarity (hash default, ML if indexed with --semantic)
xf search "rust async" --mode lexical # Keyword-only (BM25)
xf search "meeting" --types dm        # DMs only
xf search "article" --types like      # Liked tweets only
```

### Search Modes

```bash
--mode hybrid   # Default: BM25 + vector similarity (hash default, ML with --semantic index)
--mode lexical  # Keyword-only (BM25), best for exact terms
--mode semantic # Vector similarity (hash default, ML with --semantic index)
```

### Search Syntax (lexical mode)

```bash
xf search "exact phrase"              # Phrase match (quotes matter)
xf search "rust AND async"            # Boolean AND
xf search "python OR javascript"      # Boolean OR
xf search "python NOT snake"          # Exclusion
xf search "rust*"                     # Wildcard prefix
```

### Key Flags

```bash
--format json                         # Machine-readable output (use this!)
--format csv                          # Spreadsheet export
--limit 50                            # Results count (default: 20)
--offset 20                           # Pagination
--context                             # Full DM conversation thread (--types dm only)
--since "2024-01-01"                  # Date filter (supports natural language)
--until "last week"                   # Date filter
--sort date|date_desc|relevance|engagement
```

### Other Commands

```bash
xf stats                              # Archive overview (counts, date range)
xf stats --detailed                   # Full analytics (temporal, engagement, content)
xf stats --format json                # Machine-readable stats
xf tweet <id>                         # Show specific tweet by ID
xf tweet <id> --engagement            # Include engagement metrics
xf list tweets --limit 20             # Browse indexed tweets
xf list dms                           # Browse DM conversations
xf doctor                             # Health checks (archive, DB, index)
xf shell                              # Interactive REPL
```

### Data Types

tweet (your posts), like (liked tweets), dm (direct messages), grok (AI chats), follower, following, block, mute

### Storage

- Database: ~/.local/share/xf/xf.db (override: XF_DB env)
- Index: ~/.local/share/xf/xf_index/ (override: XF_INDEX env)
- Archive format: Expects data/ directory with tweets.js, like.js, direct-messages.js, etc.

### Notes

- First search after restart may be slower (index loading). Subsequent searches <10ms.
- Semantic mode uses hash-based similarity by default. Run `xf index --semantic` for true semantic embeddings.
- --context only works with --types dm — shows full conversation around matches.
- All data stays local. No network access during search; optional model download only when you enable `--semantic`.

Design Philosophy

xf is built around several core principles that inform every design decision:

Local-First, Privacy-Always

Your social media history is deeply personal. xf processes everything locally:

  • No network calls during search: Zero telemetry, no analytics, no "phone home" (optional model download only if you enable --semantic)
  • No cloud dependencies: Works completely offline after installation
  • No API keys: Unlike tools that query X's API, xf works entirely from your downloaded archive
  • Your data stays yours: The SQLite database and search index live on your machine

Zero-Configuration Similarity

Getting started should take seconds, not hours:

  • Sensible defaults: Hybrid search, 20 results, colorized output—just works
  • Auto-detection: Finds archive structure automatically, handles format variations
  • No model downloads by default: The hash embedder means no waiting for ML model files (unless you opt into --semantic)
  • Platform detection: Install script handles OS/architecture differences

Composition Over Complexity

xf is designed to play well with Unix philosophy:

# Pipe to jq for custom JSON processing
xf search "machine learning" --format json | jq '.[] | .text'

# Count tweets by year
xf search "coffee" --format json --limit 1000 | jq -r '.[].created_at[:4]' | sort | uniq -c

# Export to clipboard (macOS)
xf tweet 1234567890 --format json | pbcopy

# Feed into other tools
xf search "interesting" --types like --format json | ./my-analysis-script.py

Speed as a Feature

Performance isn't an afterthought—it's a core feature:

  • Sub-millisecond lexical search: Faster than you can blink
  • Memory-mapped indices: OS-level caching, minimal RAM overhead
  • Parallel everything: Parsing, indexing, embedding generation
  • Lazy initialization: Pay only for what you use

How xf Compares

| Feature | xf | X's HTML Viewer | grep/ripgrep | Elasticsearch |
|---|---|---|---|---|
| Full-text search | ✅ BM25 + vector similarity (hash default; ML optional) | ❌ None | ⚠️ Basic regex | ✅ Full |
| Similarity search | ✅ Hash embedder | ❌ | ❌ | ⚠️ With plugins |
| Search speed | ✅ <10ms | ❌ Manual scrolling | ⚠️ Depends on size | ✅ Fast |
| Setup time | ✅ ~10 seconds | ✅ Just open HTML | ✅ None | ❌ Hours |
| Dependencies | ✅ Single binary | ✅ Browser | ✅ None | ❌ JVM, config |
| Offline use | ✅ Fully offline | ⚠️ Usually | ✅ | ✅ If self-hosted |
| Privacy | ✅ 100% local | ✅ | ✅ | ⚠️ Depends |
| DM search | ✅ With context | | ⚠️ Raw files | ✅ If indexed |
| Date filtering | ✅ Natural language | ❌ | ❌ | ✅ |
| Export formats | ✅ JSON/CSV/text | | ⚠️ Raw text | ✅ JSON |

When to use xf:

  • You want fast, comprehensive search across your entire archive
  • You value privacy and want everything local
  • You want similarity search without cloud APIs
  • You prefer CLI tools that compose with Unix pipelines

When xf might not be ideal:

  • You only need to find one specific tweet (just Ctrl+F in the HTML viewer)
  • You need real-time access to X (use the app/website)
  • You want collaborative features (xf is single-user by design)

Origins & Authors

This project was created by Jeffrey Emanuel after realizing that X's data export, while comprehensive, lacks any useful search functionality.

Getting Your X Data Archive

Before using xf, you need to download your data from X. Here's the complete process:

Step 1: Request Your Archive

  1. Log into X at x.com or twitter.com
  2. Navigate to Settings: More → Settings and privacy → Your account → Download an archive of your data
  3. Request your archive:
    • Under "Download an archive of your data", click "Request archive"
    • You may need to verify your identity (password, 2FA)
    • Select the data you want ("All data" is recommended for a complete archive)

Step 2: Wait for Processing

X needs time to compile your archive:

  • Typical wait time: 24-48 hours (can be longer for large accounts)
  • You'll receive an email notification when it's ready
  • You can also check the same settings page for status updates
  • The link expires after a few days, so download promptly!

Step 3: Download and Extract

  1. Download: Click the link in your email or on the settings page
    • File will be named something like twitter-2026-01-09-abc123.zip
    • Size varies: typically 50MB to several GB depending on your activity and media
  2. Extract: Unzip the archive to a folder
    unzip twitter-2026-01-09-abc123.zip -d ~/x-archive

What's Inside the Archive

Your extracted archive contains:

x-archive/
├── Your archive.html      # Browser viewer (open this to explore manually)
├── data/
│   ├── tweets.js          # All your tweets
│   ├── like.js            # Tweets you've liked
│   ├── direct-messages.js # DM conversations
│   ├── follower.js        # Your followers
│   ├── following.js       # Accounts you follow
│   ├── grok-chat-item.js   # Grok AI chats (if any)
│   ├── account.js         # Account info
│   ├── profile.js         # Profile data
│   └── ...                # Many other data files
└── assets/
    └── images/            # Media files (can be large!)

The data files use a JavaScript format like:

window.YTD.tweets.part0 = [
  { "tweet": { "id": "123...", "full_text": "Hello world!", ... } },
  ...
]

xf knows how to parse this format and extract all your content.
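The prefix-stripping step is simple enough to sketch. This is an illustrative helper (hypothetical name, not xf's actual parser): it slices off everything up to the first `=` and leaves plain JSON behind.

```rust
/// Strip the `window.YTD.<type>.part0 = ` prefix from an X export file,
/// returning the remaining JSON array text.
fn strip_ytd_prefix(contents: &str) -> Option<&str> {
    let eq = contents.find('=')?;         // locate the assignment
    let json = contents[eq + 1..].trim(); // everything after it should be JSON
    json.starts_with('[').then_some(json) // sanity-check it's an array
}

fn main() {
    let raw = r#"window.YTD.tweets.part0 = [ { "tweet": { "id": "123" } } ]"#;
    assert!(strip_ytd_prefix(raw).unwrap().starts_with('['));
}
```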

⚠️ Important: What's NOT in Your Archive

Your X data archive only contains your own data—content you created or directly interacted with. This is a limitation of X's export, not xf.

What IS included:

| Data Type | Description |
|---|---|
| Your tweets | Everything you posted (including replies you made to others) |
| Your likes | Tweets you liked (with full text if available) |
| Your DMs | Direct message conversations you participated in |
| Your Grok chats | Conversations with Grok AI |
| Followers/Following | Lists of accounts (usernames only, not their tweets) |

What is NOT included:

| Data Type | Why It's Missing |
|---|---|
| Replies to your tweets | Other people's replies are their data, not yours |
| Quote tweets of you | Same reason; they belong to whoever quoted you |
| Mentions of you | Tweets mentioning @you are owned by others |
| Others' tweets | You only get tweets you liked, not random tweets you viewed |
| Analytics/impressions | Detailed view counts aren't in the standard export |

Why this matters: If you're hoping to find "what did people say in response to my tweet about X?"—that data isn't in your archive. You'd need to use the X API or third-party tools to fetch replies in real-time.

What you CAN do:

  • Search your own replies to others: xf search "query" --replies-only
  • Find conversations in your DMs: xf search "topic" --types dm --context
  • See tweets you engaged with via likes: xf search "topic" --types like

Installation

Quick Install (Recommended)

Recommended: Homebrew (macOS/Linux)

brew install dicklesworthstone/tap/xf

Windows: Scoop

scoop bucket add dicklesworthstone https://github.com/Dicklesworthstone/scoop-bucket
scoop install dicklesworthstone/xf

Alternative: Install Script

The easiest way to install without a package manager is the install script, which downloads a prebuilt binary for your platform:

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | bash

With options:

Easy mode (auto-update PATH in shell rc files):

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | bash -s -- --easy-mode

Install specific version:

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | bash -s -- --version v0.1.0

Install to /usr/local/bin (system-wide, requires sudo):

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | sudo bash -s -- --system

Build from source instead of downloading binary:

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | bash -s -- --from-source

Note: If you have gum installed, the installer will use it for fancy terminal formatting.

The install script:

  • Automatically detects your OS and architecture
  • Downloads the appropriate prebuilt binary
  • Verifies SHA256 checksums for security
  • Falls back to building from source if no prebuilt is available
  • Offers to update your PATH

From Source (requires Rust nightly)

This project uses Rust Edition 2024 features and requires the nightly toolchain. The repository includes a rust-toolchain.toml that automatically selects the correct toolchain.

# Install Rust nightly if you don't have it
rustup install nightly

# Install directly from GitHub
cargo +nightly install --git https://github.com/Dicklesworthstone/xf.git

Manual Build

git clone https://github.com/Dicklesworthstone/xf.git
cd xf
# rust-toolchain.toml automatically selects nightly
cargo build --release
cp target/release/xf ~/.local/bin/

Prebuilt Binaries

Prebuilt binaries are available for:

  • Linux x86_64 (x86_64-unknown-linux-gnu)
  • Linux ARM64 (aarch64-unknown-linux-gnu)
  • macOS Intel (x86_64-apple-darwin)
  • macOS Apple Silicon (aarch64-apple-darwin)

Download from GitHub Releases and verify the SHA256 checksum.

Quick Start

1. Index your archive

xf index ~/x-archive

This parses all your data and builds a searchable index. On a typical archive, this takes 10-30 seconds.

2. Search!

# Basic search
xf search "machine learning"

# Search only tweets
xf search "python" --types tweet

# Search DMs
xf search "meeting" --types dm

# Search likes
xf search "interesting article" --types like

# JSON output
xf search "rust" --format json

# Limit results
xf search "AI" --limit 5

Commands

xf index <archive_path>

Index an X data archive.

xf index ~/Downloads/x-archive

# Force re-index (clear existing data)
xf index ~/Downloads/x-archive --force

# Build true semantic embeddings (MiniLM; downloads ~80MB on first use)
xf index ~/Downloads/x-archive --semantic

# Index only specific data types
xf index ~/Downloads/x-archive --only tweet,like

# Skip certain data types
xf index ~/Downloads/x-archive --skip dm,grok

xf search <query>

Search the indexed archive.

# Basic search (hybrid mode by default)
xf search "your query"

# Search modes
xf search "query" --mode hybrid    # Default: combines keyword + vector similarity (hash default; ML if indexed with --semantic)
xf search "query" --mode lexical   # Keyword-only (BM25)
xf search "query" --mode semantic  # Vector similarity (hash default; ML if indexed with --semantic)

# Filter by type
xf search "query" --types tweet,dm

# Pagination
xf search "query" --limit 20 --offset 40

# Output formats
xf search "query" --format json
xf search "query" --format csv
xf search "query" --format compact

# DM context: show full conversation with matches highlighted
xf search "meeting" --types dm --context
xf search "meeting" --types dm --context --format json

Search Modes:

| Mode | Best For | How It Works |
|---|---|---|
| hybrid | General use (default) | Combines keyword + vector similarity (hash default; ML with --semantic) |
| lexical | Exact terms, boolean queries | Classic BM25 keyword matching |
| semantic | Similar wording | Vector similarity (hash default; ML with --semantic) |

Query syntax:

  • Simple terms: machine learning
  • Phrases: "exact phrase"
  • Boolean: rust AND async
  • Exclusion: python NOT snake

xf stats

Show archive statistics.

xf stats

# JSON output
xf stats --format json

# Detailed breakdown
xf stats --detailed

xf tweet <id>

Show details for a specific tweet.

xf tweet 1234567890

# Show engagement metrics
xf tweet 1234567890 --engagement

xf config

Manage configuration.

# Show current config
xf config --show

xf update

Check for updates.

xf update

xf completions <shell>

Generate shell completions.

# Bash
xf completions bash > ~/.local/share/bash-completion/completions/xf

# Zsh
xf completions zsh > ~/.zfunc/_xf

# Fish
xf completions fish > ~/.config/fish/completions/xf.fish

Output Formats

| Format | Description |
|---|---|
| text | Human-readable with colors (default) |
| json | Compact JSON |
| json-pretty | Pretty-printed JSON |
| csv | Comma-separated values |
| compact | One result per line |

Data Types

| Type | Description |
|---|---|
| tweet | Your tweets |
| like | Tweets you've liked |
| dm | Direct messages |
| grok | Grok AI conversations |
| follower | Your followers |
| following | Accounts you follow |
| block | Blocked accounts |
| mute | Muted accounts |

Storage Locations

By default, xf stores data in:

| Platform | Location |
|---|---|
| macOS | ~/Library/Application Support/xf/ |
| Linux | ~/.local/share/xf/ |
| Windows | %LOCALAPPDATA%\xf\ |

Override with environment variables:

  • XF_DB: Path to SQLite database
  • XF_INDEX: Path to search index directory

Data Model

What Gets Indexed

Each document type has specific fields indexed for search:

Tweets

| Field | Indexed | Stored | Notes |
|---|---|---|---|
| id | ✅ Term | ✅ | Tweet ID for lookup |
| full_text | ✅ Full-text | ✅ | Main search content |
| created_at | ✅ Date | ✅ | For date filtering |
| favorite_count | | ✅ | Likes received |
| retweet_count | | ✅ | Retweets received |
| in_reply_to_status_id | ✅ Term | ✅ | For thread detection |
| hashtags | | ✅ | Extracted from text |
| mentions | | ✅ | @usernames mentioned |
| urls | | ✅ | Expanded URLs |
| media | | ✅ | Media attachments |

Likes

| Field | Indexed | Stored | Notes |
|---|---|---|---|
| tweet_id | ✅ Term | ✅ | Liked tweet's ID |
| full_text | ✅ Full-text | ✅ | If available in export |
| expanded_url | | ✅ | Link to original |

Direct Messages

| Field | Indexed | Stored | Notes |
|---|---|---|---|
| id | ✅ Term | ✅ | Message ID |
| conversation_id | ✅ Term | ✅ | For grouping context |
| text | ✅ Full-text | ✅ | Message content |
| sender_id | ✅ Term | ✅ | Who sent it |
| recipient_id | | ✅ | Who received it |
| created_at | ✅ Date | ✅ | Timestamp |

Grok Conversations

| Field | Indexed | Stored | Notes |
|---|---|---|---|
| chat_id | ✅ Term | ✅ | Conversation ID |
| message | ✅ Full-text | ✅ | Message content |
| sender | ✅ Term | ✅ | "user" or "grok" |
| created_at | ✅ Date | ✅ | Timestamp |

Embedding Strategy

All content is stored and indexed in full—nothing is truncated. For vector embeddings, text is canonicalized (Unicode normalization, markdown stripped, whitespace collapsed) before embedding (hash or ML).

| Type | Text Source | Notes |
|---|---|---|
| Tweet | full_text | Full content including long-form tweets |
| Like | full_text | If available from archive |
| DM | text | Full message text |
| Grok | message | Full response text |

Empty or trivial messages (e.g., "OK", "Thanks") are filtered from embeddings but still searchable via keyword search.

Security & Privacy

Your Data Never Leaves Your Machine

xf is designed with privacy as a non-negotiable requirement:

┌─────────────────────────────────────────────────────────────┐
│                     YOUR MACHINE                            │
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  X Archive  │───▶│  xf binary  │───▶│  Local DB   │      │
│  │  (input)    │    │  (process)  │    │  (output)   │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                             │
│  ❌ No network calls                                        │
│  ❌ No telemetry                                            │
│  ❌ No cloud sync                                           │
│  ❌ No API keys required                                    │
└─────────────────────────────────────────────────────────────┘

What's Stored Where

| Location | Contents | Sensitive? |
|---|---|---|
| ~/.local/share/xf/xf.db | Full tweet text, DMs, metadata | ⚠️ Yes |
| ~/.local/share/xf/xf_index/ | Tokenized search index | ⚠️ Yes (reversible) |
| Embeddings (in DB) | Numerical vectors | Low (hard to reverse) |

Recommendations:

  1. Encrypt your disk: Use full-disk encryption (FileVault, LUKS, BitLocker)
  2. Secure permissions: The database is created with user-only permissions (0600)
  3. Backup carefully: When backing up, treat xf's data directory as sensitive
  4. Delete when done: rm -rf ~/.local/share/xf/ removes all indexed data

No Network Access

xf makes zero network calls during normal search operations:

  • No update checks: Use xf update explicitly when you want to update
  • No telemetry: No usage stats, no error reporting, no analytics
  • No model downloads by default: The hash embedder is pure Rust (unless you opt into xf index --semantic)
  • No API calls: Works entirely from your local archive export

The only network access is during:

  1. Installation: Downloading the binary from GitHub Releases
  2. xf update: Checking for and downloading updates (user-initiated)
  3. Optional semantic indexing: Downloading the MiniLM model when you run xf index --semantic

Secure Deletion

To completely remove all xf data:

# Remove database and index
rm -rf ~/.local/share/xf/

# Or on macOS
rm -rf ~/Library/Application\ Support/xf/

# Remove the binary
rm ~/.local/bin/xf
# or
rm /usr/local/bin/xf

This permanently deletes all indexed content. The original archive is unaffected.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       X Data Archive                            │
│   (tweets.js, like.js, direct-messages.js, etc.)                │
└─────────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Parser (parser.rs)                           │
│   Handles window.YTD.* JavaScript format with rayon parallelism │
└─────────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ SQLite           │ │ Tantivy          │ │ Vector Index     │
│ (storage.rs)     │ │ (search.rs)      │ │ (vector.rs)      │
│ - Metadata       │ │ - Full-text      │ │ - Embeddings     │
│ - Statistics     │ │ - BM25 ranking   │ │ - SIMD search    │
│ - FTS5 fallback  │ │ - Phrase queries │ │ - F16 storage    │
│ - Tweet lookup   │ │ - Boolean ops    │ │ - Cosine sim     │
└──────────────────┘ └──────────────────┘ └──────────────────┘
        │                   │                   │
        │                   ▼                   │
        │         ┌──────────────────┐          │
        │         │ Hybrid Fusion    │◀─────────┘
        │         │ (hybrid.rs)      │
        │         │ - RRF algorithm  │
        │         │ - Score fusion   │
        │         └────────┬─────────┘
        │                  │
        └────────┬─────────┘
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      CLI (cli.rs)                               │
│   clap-based command parsing with rich colored output           │
└─────────────────────────────────────────────────────────────────┘

Processing Pipeline

Stage 1: Archive Parsing

  • Reads JavaScript files from the archive's data/ directory
  • Strips window.YTD.<type>.part0 = prefix to extract JSON
  • Uses rayon for parallel parsing of large files

Stage 2: Storage

  • Normalizes data into structured models (Tweet, Like, DirectMessage, etc.)
  • Stores in SQLite with FTS5 virtual tables for fallback search
  • Maintains statistics and metadata

Stage 3: Keyword Indexing

  • Feeds content to Tantivy search engine
  • Creates inverted index with BM25 scoring
  • Supports prefix queries via edge n-grams

Stage 4: Embedding Generation

  • Canonicalizes text (strips markdown, normalizes whitespace, filters noise)
  • Generates 384-dimensional embeddings via:
    • Default: FNV-1a hash embedder (fast, zero external dependencies)
    • Optional: MiniLM via FastEmbed when indexed with --semantic (true semantic, slower)
  • Stores embeddings with F16 quantization (50% size reduction)
  • Content hashing (SHA256) enables incremental re-indexing

Stage 5: Search

  • Lexical mode: Tantivy BM25 keyword matching
  • Semantic mode: Vector similarity via SIMD dot product (hash or ML embeddings)
  • Hybrid mode: RRF fusion of both result sets for optimal relevance
  • Joins with SQLite for full metadata retrieval

Search Algorithms

xf implements three distinct search strategies, each optimized for different use cases:

Lexical Search (BM25)

The classic information retrieval approach, powered by Tantivy:

  • Algorithm: BM25 (Best Match 25) with saturation term frequency
  • Strengths: Exact keyword matching, phrase queries, boolean operators
  • Use case: When you know the exact words you're looking for
xf search "async await" --mode lexical

Semantic Search (Hash or ML Embeddings)

xf supports two semantic embedding modes that share the same vector index format:

A) Default: Hash-Based Vector Similarity
Finds content with overlapping vocabulary rather than exact keyword matches:

  • Embedder: FNV-1a hash-based embeddings (zero external dependencies)
  • Dimensions: 384-dimensional vectors
  • Similarity: Cosine similarity via SIMD-accelerated dot product
  • Storage: F16 quantization reduces memory by 50%
# Hash-based similarity (default index)
xf search "feeling overwhelmed at work" --mode semantic

How the Hash Embedder Works:

Unlike neural network embedders (Word2Vec, BERT), xf uses a deterministic hash-based approach:

  1. Tokenize: Split text on word boundaries
  2. Hash: FNV-1a 64-bit hash for each token
  3. Project: Hash determines vector index (hash % 384) and sign (MSB)
  4. Normalize: L2 normalization for cosine similarity

This approach is:

  • Fast: ~0ms per embedding (no GPU needed)
  • Deterministic: Same input always produces same output
  • Zero dependencies: No model files to download

B) Optional: True Semantic (MiniLM via FastEmbed)
When you index with --semantic, xf builds MiniLM embeddings for synonym-level matching:

# Build ML embeddings (downloads ~80MB on first use)
xf index ~/x-archive --semantic

# True semantic similarity
xf search "feeling overwhelmed at work" --mode semantic

This mode is:

  • Semantic: "happy" and "joyful" can match
  • Slower to index: ~100 items/sec on CPU
  • Larger downloads: ~80MB model weights on first use

Hybrid Search (RRF Fusion)

Combines the best of both approaches using Reciprocal Rank Fusion:

                     User Query
                         │
           ┌─────────────┴─────────────┐
           ▼                           ▼
    ┌──────────────┐           ┌──────────────┐
    │   Tantivy    │           │   Vector     │
    │   (BM25)     │           │  (Cosine)    │
    └──────┬───────┘           └──────┬───────┘
           │ Rank 0,1,2...            │ Rank 0,1,2...
           │                          │
           └────────────┬─────────────┘
                        ▼
                ┌───────────────┐
                │  RRF Fusion   │
                │  K=60         │
                └───────┬───────┘
                        ▼
                  Final Results

RRF Algorithm:

Score(doc) = Σ 1/(K + rank + 1)

Where:

  • K = 60: Empirically optimal constant that balances score distribution
  • rank: 0-indexed position in each result list
  • Documents appearing in both lists get scores from both, naturally boosting multi-signal matches

Why RRF?

  1. Score normalization: BM25 scores (0-20+) and cosine similarity (0-1) are incompatible. RRF uses ranks, not scores.
  2. Robust fusion: Outperforms simple score averaging or max-pooling
  3. No tuning needed: K=60 works well across diverse datasets
  4. Deterministic: Tie-breaking by doc ID ensures consistent ordering
# Default mode—best of both worlds
xf search "productivity tips"
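The RRF fusion described above is small enough to show directly. A minimal sketch (illustrative names, not the actual hybrid.rs implementation):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: Score(doc) = sum over lists of 1/(K + rank + 1).
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (rank, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by fused score descending; break ties by doc id for determinism.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    fused
}

fn main() {
    let bm25 = vec!["a", "b", "c"];   // keyword ranking
    let vector = vec!["b", "d", "a"]; // similarity ranking
    let fused = rrf_fuse(&[bm25, vector], 60.0);
    // "b" (ranks 1 and 0) edges out "a" (ranks 0 and 2); both beat single-list docs.
    assert_eq!(fused[0].0, "b");
}
```

Note how documents found by both searchers accumulate two reciprocal-rank terms, which is exactly the multi-signal boost the section describes.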

Text Canonicalization

Before embedding, text passes through a normalization pipeline:

  1. Unicode NFC: Normalize composed characters
  2. Strip Markdown: Remove **bold**, *italic*, [links](url), headers
  3. Collapse Code Blocks: Keep first 20 + last 10 lines of code
  4. Normalize Whitespace: Collapse runs of spaces/newlines
  5. Filter Low-Signal: Skip trivial content ("OK", "Thanks", "Done")
  6. Truncate: Cap at 2000 characters for consistent embedding dimensions

This ensures semantically equivalent text produces identical embeddings.
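A hedged Rust sketch of steps 4-6 (whitespace collapse, low-signal filter, truncation); the NFC and markdown steps are omitted here because they need external crates, and the function name and trivial-word list are illustrative:

```rust
/// Illustrative canonicalizer covering steps 4-6 only.
fn canonicalize(text: &str) -> Option<String> {
    // Step 4: collapse runs of spaces/newlines into single spaces.
    let collapsed = text.split_whitespace().collect::<Vec<_>>().join(" ");

    // Step 5: skip trivial, low-signal content.
    const TRIVIAL: &[&str] = &["ok", "thanks", "done"];
    let bare = collapsed.to_lowercase();
    if TRIVIAL.contains(&bare.trim_end_matches(|c: char| c == '!' || c == '.')) {
        return None;
    }

    // Step 6: cap at 2000 characters.
    Some(collapsed.chars().take(2000).collect())
}

fn main() {
    assert_eq!(canonicalize("hello   \n\n world").as_deref(), Some("hello world"));
    assert_eq!(canonicalize("Thanks!"), None);
}
```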

Real-World Recipes

Here are practical examples for common tasks:

Finding That Tweet You Vaguely Remember

# You remember talking about "that one coffee shop in Brooklyn"
xf search "coffee brooklyn" --mode hybrid

# You remember the vibe but not the words
xf search "cozy morning routine" --mode semantic

# Combine with date if you remember roughly when
xf search "vacation" --since "2023-06" --until "2023-09"

Analyzing Your Posting Patterns

# Most engaged tweets (by likes + retweets)
xf search "" --types tweet --sort engagement --limit 20

# Your tweets from a specific era
xf search "" --since "2020-03" --until "2020-06" --types tweet

# Detailed stats about your archive
xf stats --detailed

Exporting Data for Analysis

# Export all tweets as JSON for external processing
xf search "" --types tweet --limit 100000 --format json > all_tweets.json

# Export to CSV for spreadsheets
xf search "project" --format csv > project_tweets.csv

# Get tweets as JSONL (one per line) for streaming processing
xf search "" --types tweet --format json | jq -c '.[]' > tweets.jsonl

Searching DM Conversations

# Find DMs about a topic with full conversation context
xf search "dinner plans" --types dm --context

# Export a specific conversation thread
xf search "project update" --types dm --context --format json > project_thread.json

Scripting and Automation

# Count tweets containing "rust" by year
xf search "rust" --format json --limit 10000 | \
  jq -r '.[].created_at[:4]' | sort | uniq -c

# Find all unique hashtags you've used
xf search "" --types tweet --format json --limit 100000 | \
  jq -r '.[].text' | grep -oE '#\w+' | sort | uniq -c | sort -rn | head -20

# Daily tweet count (requires jq)
xf search "" --types tweet --format json --limit 100000 | \
  jq -r '.[].created_at[:10]' | sort | uniq -c

# Backup your indexed data
tar -czvf xf-backup.tar.gz ~/.local/share/xf/

Shell Integration

# Add to your shell aliases (~/.bashrc or ~/.zshrc)
alias xs='xf search'
alias xst='xf search --types tweet'
alias xsd='xf search --types dm --context'
alias xsl='xf search --types like'

# Function to search and copy first result
xfirst() {
  xf search "$@" --limit 1 --format json | jq -r '.[0].text'
}

# Quick stats check
alias xinfo='xf stats --format json | jq'

Technical Deep Dives

Why BM25 Over TF-IDF?

Traditional TF-IDF (Term Frequency–Inverse Document Frequency) has a flaw: term frequency grows linearly forever. A document mentioning "rust" 100 times scores 10x higher than one mentioning it 10 times—but is it really 10x more relevant?

BM25 adds saturation: after a point, additional occurrences contribute diminishing returns.

BM25 score = IDF × (tf × (k₁ + 1)) / (tf + k₁ × (1 - b + b × (docLen/avgDocLen)))

Where:

  • k₁ = 1.2: Controls term frequency saturation
  • b = 0.75: Controls document length normalization

This means:

  • Short tweets aren't penalized for being short
  • Repetitive content doesn't dominate results
  • Relevance better matches human intuition
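
The saturation effect is easy to see in code. A sketch of the per-term formula above (function name and parameter order are ours, not xf's internals):

```rust
// BM25 score for a single term: saturating term frequency,
// length-normalized by docLen/avgDocLen. Constants from the doc.
fn bm25_term_score(tf: f64, idf: f64, doc_len: f64, avg_doc_len: f64) -> f64 {
    const K1: f64 = 1.2; // term frequency saturation
    const B: f64 = 0.75; // document length normalization
    idf * (tf * (K1 + 1.0)) / (tf + K1 * (1.0 - B + B * doc_len / avg_doc_len))
}
```

With `idf = 1` and an average-length document, a term appearing 100 times scores only slightly higher than one appearing 10 times, nowhere near 10x.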

Why FNV-1a for Hashing?

The embedder uses FNV-1a (Fowler–Noll–Vo) rather than cryptographic hashes:

| Property      | FNV-1a      | SHA256       | MurmurHash3       |
|---------------|-------------|--------------|-------------------|
| Speed         | ⚡ Fastest   | 🐢 Slow      | ⚡ Fast            |
| Distribution  | Good        | Excellent    | Excellent         |
| Deterministic | ✅ Yes      | ✅ Yes       | ⚠️ Seed-dependent |
| Simplicity    | ✅ ~10 lines | ❌ Complex   | ⚠️ Medium         |

FNV-1a's key advantage: simplicity with good distribution. For embedding purposes, we need consistent hashing that spreads tokens across dimensions—not cryptographic security.

// FNV-1a in ~5 lines
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME)
    })
}

Why 384 Dimensions?

The embedding dimension (384) is chosen to match common ML embedders:

  • MiniLM-L6: 384 dimensions
  • all-MiniLM-L6-v2: 384 dimensions
  • paraphrase-MiniLM-L6-v2: 384 dimensions

This means if you later want to swap in a neural embedder, the vector index structure remains compatible. It's also a sweet spot:

  • Large enough: Good representation capacity
  • Small enough: Fast dot products, reasonable storage
  • SIMD-friendly: 384 = 48 × 8, so 8-wide SIMD lanes cover the vector with no remainder

F16 Quantization Trade-offs

Embeddings are stored as 16-bit floats (F16) rather than 32-bit (F32):

| Format | Size per Vector | Precision          | Speed Impact |
|--------|-----------------|--------------------|--------------|
| F32    | 1,536 bytes     | Full               | Baseline     |
| F16    | 768 bytes       | ~3 decimal places  | ~Same        |
| INT8   | 384 bytes       | ~2 decimal places  | Faster       |

Why F16?

  • 50% storage reduction: 768 bytes vs 1,536 bytes per embedding
  • Negligible precision loss: Cosine similarity differences < 0.001
  • Fast conversion: Hardware F16↔F32 conversion on modern CPUs
  • Good enough: Personal archives don't need INT8's extra compression
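
The conversion itself can be sketched with raw bit manipulation, no external crate needed. A real implementation would likely use the `half` crate; this simplified version flushes subnormals to zero, truncates rather than rounds the mantissa, and maps NaN to infinity:

```rust
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32 - 127 + 15; // rebias exponent
    if exp <= 0 {
        return sign; // underflow: flush to signed zero
    }
    if exp >= 31 {
        return sign | 0x7c00; // overflow: infinity
    }
    let mant = ((bits >> 13) & 0x3ff) as u16; // keep top 10 mantissa bits
    sign | ((exp as u16) << 10) | mant
}

fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let mant = ((h & 0x3ff) as u32) << 13;
    if exp == 0 {
        return f32::from_bits(sign); // zero (subnormals were flushed)
    }
    f32::from_bits(sign | ((exp + 127 - 15) << 23) | mant)
}
```

A round trip through F16 storage (`u16`, 2 bytes per value) recovers the original to roughly three decimal places, which is the precision claim in the table above.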

SIMD Dot Product Optimization

Vector similarity uses SIMD (Single Instruction, Multiple Data) for parallel computation:

use wide::f32x8;

pub fn dot_product_simd(a: &[f32], b: &[f32]) -> f32 {
    let chunks = a.len() / 8;
    let mut sum = f32x8::ZERO;

    for i in 0..chunks {
        // f32x8 is built from a fixed-size array, so convert each
        // 8-float window before loading it into a SIMD register.
        let va: [f32; 8] = a[i * 8..i * 8 + 8].try_into().unwrap();
        let vb: [f32; 8] = b[i * 8..i * 8 + 8].try_into().unwrap();
        sum += f32x8::from(va) * f32x8::from(vb);
    }

    // Horizontal sum of the SIMD lanes, plus a scalar pass over the
    // remaining (< 8) elements.
    sum.reduce_add()
        + a[chunks * 8..]
            .iter()
            .zip(&b[chunks * 8..])
            .map(|(x, y)| x * y)
            .sum::<f32>()
}

This processes 8 floats per instruction, achieving:

  • ~8x throughput on supported CPUs
  • Portable: Uses wide crate for cross-platform SIMD
  • Fallback: Scalar loop for non-aligned remainders

SQLite Performance Tuning

The database uses aggressive performance settings:

PRAGMA journal_mode = WAL;      -- Write-Ahead Logging: concurrent reads
PRAGMA synchronous = NORMAL;    -- Balanced durability vs speed
PRAGMA foreign_keys = ON;       -- Referential integrity
PRAGMA cache_size = -64000;     -- 64MB page cache
PRAGMA temp_store = MEMORY;     -- Temp tables in RAM

Why WAL mode?

  • Readers don't block writers
  • Writers don't block readers
  • Better performance for read-heavy workloads (search is read-heavy)

Why -64000 cache?

  • Negative values = KB (so -64000 = 64MB)
  • Keeps hot pages in memory
  • Reduces disk I/O for repeated queries

Performance

xf is designed for speed:

  • Indexing (hash): ~10,000 documents/second
  • Indexing (semantic ML): ~100 documents/second (CPU, model-dependent)
  • Search: Sub-millisecond for most lexical queries; semantic adds embedding cost
  • Memory: Efficient memory-mapped index files
  • Parallelism: Multi-threaded parsing via rayon

Benchmarks

On a typical archive (12,000 tweets, 40,000 likes):

| Operation                   | Time                                            |
|-----------------------------|-------------------------------------------------|
| Index + embed (hash)        | ~8 seconds                                      |
| Index + embed (semantic ML) | ~100 items/sec (CPU, model-dependent)           |
| Lexical search              | <1ms                                            |
| Semantic search (hash)      | <5ms                                            |
| Semantic search (ML)        | higher latency (embedding cost; model-dependent) |
| Hybrid search               | <10ms (hash), higher with ML                    |

| Storage         | Size  |
|-----------------|-------|
| SQLite database | ~10MB |
| Tantivy index   | ~15MB |
| Embeddings (F16)| ~3MB  |

Performance Optimizations

1. Lazy Static Initialization

  • Regex patterns and search readers are compiled once on first use
  • Subsequent operations reuse compiled resources

2. Parallel Parsing

  • Uses rayon to parse archive files in parallel
  • Takes full advantage of multi-core CPUs
  • Automatically scales to available cores

3. Memory-Mapped Index

  • Tantivy uses memory-mapped files for the search index
  • OS manages caching automatically
  • Subsequent searches benefit from warm cache

4. SIMD Vector Operations

  • Dot products use wide crate for 8-float SIMD operations
  • 8x theoretical throughput improvement
  • Portable across x86_64 and ARM64

5. F16 Quantization

  • Embeddings stored as 16-bit floats
  • 50% memory reduction with negligible precision loss
  • Fast hardware conversion on modern CPUs

6. Content Hashing for Dedup

  • SHA256 hash of canonicalized text
  • Skip re-embedding unchanged content on re-index
  • Incremental updates are fast

7. Release Profile

[profile.release]
opt-level = "z"     # Optimize for size (lean binary)
lto = true          # Link-time optimization across crates
codegen-units = 1   # Single codegen unit for better optimization
panic = "abort"     # Smaller binary, no unwinding overhead
strip = true        # Remove debug symbols

Scaling Characteristics

| Archive Size | Index Time | Search Time | Memory (Runtime) |
|--------------|------------|-------------|------------------|
| 1K docs      | ~1s        | <1ms        | ~10MB            |
| 10K docs     | ~3s        | <1ms        | ~20MB            |
| 50K docs     | ~10s       | <5ms        | ~50MB            |
| 100K docs    | ~20s       | <10ms       | ~100MB           |

Tested on M2 MacBook Pro. Times vary by CPU and disk speed.

Building from Source

Requirements:

  • Rust nightly (automatically selected via rust-toolchain.toml)
  • Git

git clone https://github.com/Dicklesworthstone/xf.git
cd xf
cargo build --release

Running Tests

cargo test

Running Benchmarks

cargo bench

Performance Corpus & Golden Outputs

xf includes a deterministic performance corpus under tests/fixtures/perf_corpus/. To regenerate it locally:

python3 scripts/generate_perf_corpus.py --seed 42 --output-dir tests/fixtures/perf_corpus

Golden outputs for isomorphism checks live in tests/fixtures/golden_outputs/ and can be refreshed with:

./scripts/verify_isomorphism.sh --update

Troubleshooting

"No archive indexed yet"

You need to run xf index before searching:

xf index ~/path/to/your/x-archive

The archive should contain a data/ directory with files like tweets.js.

"Search index missing"

The Tantivy index got corrupted or deleted. Rebuild it:

xf index ~/path/to/your/x-archive --force

Slow first search after restart

This is normal. The first search loads the index into memory (~100-500ms). Subsequent searches are <10ms. The OS caches the memory-mapped files.

No results for a query I know should match

Try different search modes:

# If lexical finds nothing, try semantic
xf search "that thing about coffee" --mode semantic

# Check if the content type is indexed
xf stats  # Shows counts by type

# Try broader terms
xf search "coffee" --mode lexical

"Failed to parse archive"

The archive might be incomplete or from an unexpected format. Check:

# Verify the archive structure
ls ~/x-archive/data/

# Should see: tweets.js, like.js, direct-messages.js, etc.

# Try the doctor command
xf doctor --archive ~/x-archive

High memory usage

For very large archives (100K+ documents), memory usage during indexing can spike. After indexing completes, runtime memory is minimal since indices are memory-mapped.

If indexing runs out of memory:

  1. Close other applications
  2. Index only specific types: xf index ~/archive --only tweet,like
  3. Note that embedding generation is the most memory-intensive phase

Embeddings missing (semantic search returns nothing)

Re-index to generate embeddings:

xf index ~/x-archive --force

Check embedding count:

xf stats --format json | jq '.embeddings'

Limitations

What xf Doesn't Do

  • Real-time sync: xf works on static archive exports, not live data
  • Multi-archive: Only one archive at a time (re-index to switch)
  • Media search: Can't search image/video content (only text metadata)
  • True synonyms (hash mode): Hash embedder finds related words, not true synonyms ("car" won't find "automobile" unless they co-occur in your tweets). Use xf index --semantic to enable ML embeddings.
  • Incremental updates: Re-indexing processes the entire archive (fast enough that it rarely matters)

Known Limitations of the Hash Embedder

The hash-based embedder is fast and dependency-free, but has limitations compared to neural embedders (MiniLM is available via xf index --semantic):

| Capability            | Hash Embedder   | Neural (BERT/MiniLM) |
|-----------------------|-----------------|----------------------|
| Word co-occurrence    | ✅ Yes          | ✅ Yes               |
| Synonyms              | ❌ No           | ✅ Yes               |
| Typo tolerance        | ❌ No           | ⚠️ Sometimes         |
| Context understanding | ❌ No           | ✅ Yes               |
| Sentence meaning      | ⚠️ Bag-of-words | ✅ Full context      |
| Speed                 | ✅ ~0ms         | 🐢 ~10-100ms         |
| Dependencies          | ✅ None         | ❌ Model files       |

When this matters: If you search "automobile" hoping to find tweets about "cars", the hash embedder won't help. Use lexical search with explicit synonyms: xf search "car OR automobile OR vehicle".

When it doesn't matter: For personal archives, you typically remember some of the words you used. Hash-based similarity helps when your query shares vocabulary with the target text (e.g., "stressed deadlines" matches "deadline stress").

Archive Format Dependencies

xf expects the standard X data export format:

  • data/ directory structure
  • window.YTD.* JavaScript prefix
  • JSON arrays of tweet/DM/like objects

If X changes their export format significantly, xf may need updates to parse it correctly.

FAQ

Why "xf"?

xf stands for "x_find" - a fast way to find things in your X (formerly Twitter) data.

Is my data safe?

Yes! All data stays on your local machine. xf never sends data anywhere. The search index and database are stored locally.

Can I search old tweets?

Yes, if they're in your archive. X includes all your tweets in the data export.

What about deleted tweets?

X includes recently deleted tweets (within the last 30 days) in a separate file. xf can index these too.

How do I update?

curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/xf/main/install.sh?$(date +%s)" | bash

Or use the built-in command:

xf update

The search is slow. What's wrong?

First search after restart may be slower as the index loads. Subsequent searches should be sub-millisecond. If consistently slow, try rebuilding the index with xf index --force.

Can I search multiple archives?

Currently, xf supports one archive at a time. To switch archives, re-run xf index with the new path (use --force to clear the old data).

What query syntax is supported?

Tantivy's query parser supports:

  • Terms: word
  • Phrases: "multiple words"
  • Boolean: term1 AND term2, term1 OR term2
  • Exclusion: term1 NOT term2
  • Wildcards: rust*
  • Field-specific: type:tweet text:rust

When should I use semantic vs lexical search?

Use lexical (--mode lexical) when:

  • You know the exact words or phrases
  • You need boolean operators (AND, OR, NOT)
  • You're searching for specific names, hashtags, or technical terms

Use semantic (--mode semantic) when:

  • You want vector similarity instead of exact keyword matching
  • Default (hash): broader recall based on word overlap
  • With xf index --semantic: synonym-level matching (true semantic)

Use hybrid (default) when:

  • You're not sure which approach is best
  • You want the most comprehensive results
  • Hybrid combines both and uses RRF to rank results optimally

How does semantic search work?

xf supports two embedding modes:

Default (hash-based): no model downloads. Each word is hashed (FNV-1a) to deterministically select which dimensions to activate in a 384-dimensional vector. This approach:

  • Requires no model download (zero bytes of ML weights)
  • Runs in ~0ms (no GPU needed)
  • Produces deterministic results (same input = same output)
  • Works well for word overlap and topic similarity

Tradeoff: it won't understand pure synonyms (e.g., "car" vs "automobile").
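
The scheme described above can be sketched in a few lines. This is an illustrative toy, not xf's actual embedder (the real one may use n-grams, weighting, or multiple dimensions per token):

```rust
const DIM: usize = 384;

// FNV-1a, as shown in the deep-dive section.
fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ u64::from(b)).wrapping_mul(0x100000001b3)
    })
}

// Each token's hash deterministically picks one of 384 dimensions;
// the resulting count vector is L2-normalized.
fn hash_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; DIM];
    for token in text.split_whitespace() {
        let idx = (fnv1a(token.to_lowercase().as_bytes()) % DIM as u64) as usize;
        v[idx] += 1.0;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}

// With normalized vectors, cosine similarity is just the dot product.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Texts that share vocabulary ("coffee brooklyn" vs "coffee morning") get a positive similarity; texts with no words in common almost always land in disjoint dimensions and score near zero, which is exactly the synonym limitation.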

Optional (ML-based): run xf index --semantic to build MiniLM embeddings. This enables true semantic matching but is slower to index and requires a one-time model download (~80MB).

Why is hybrid search the default?

Hybrid search gives you the best of both worlds:

  1. Lexical catches exact matches — important for names, hashtags, URLs
  2. Semantic catches related content — via vector similarity (hash by default, ML when indexed with --semantic)
  3. RRF fusion prioritizes documents that score well in both — naturally surfacing the most relevant results

If a document ranks #1 in both lexical and semantic results, it's almost certainly what you're looking for.
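
RRF itself is simple: each document's fused score is the sum of 1/(k + rank) over every ranking it appears in. A sketch using the conventional k = 60 (xf's exact fusion parameters aren't specified here):

```rust
use std::collections::HashMap;

// Fuse multiple ranked lists; documents ranking well in several
// lists accumulate the highest combined score.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, doc) in ranking.iter().enumerate() {
            // rank is 0-based, so the top result contributes 1/(k + 1).
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut out: Vec<_> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}
```

With lexical order A, B, C and semantic order B, C, A, document B wins: it never ranks first, but it ranks high in both lists.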

Does semantic search require re-indexing?

Yes. Embeddings are generated automatically during xf index, but the embedder choice is fixed at index time:

  • Default: hash embeddings
  • Optional: ML embeddings via xf index --semantic

If you switch between hash and ML, re-run indexing so the vector index matches the embedder. Use:

xf index ~/x-archive --force            # rebuild with hash embeddings
xf index ~/x-archive --semantic --force # rebuild with ML embeddings

Contributing

About Contributions: Please don't take this the wrong way, but I do not accept outside contributions for any of my projects. I simply don't have the mental bandwidth to review anything, and it's my name on the thing, so I'm responsible for any problems it causes; thus, the risk-reward is highly asymmetric from my perspective. I'd also have to worry about other "stakeholders," which seems unwise for tools I mostly make for myself for free. Feel free to submit issues, and even PRs if you want to illustrate a proposed fix, but know I won't merge them directly. Instead, I'll have Claude or Codex review submissions via gh and independently decide whether and how to address them. Bug reports in particular are welcome. Sorry if this offends, but I want to avoid wasted time and hurt feelings. I understand this isn't in sync with the prevailing open-source ethos that seeks community contributions, but it's the only way I can move at this velocity and keep my sanity.

License

MIT - see LICENSE for details.


Built with Rust, Tantivy, and SQLite. Features hybrid search combining keyword matching with semantic similarity via RRF fusion. Inspired by the need to actually search through years of tweets.
