duckdb-hybrid-doc-search

A tool for hybrid indexing of internal documents managed in Markdown using DuckDB with full-text search (FTS) + vector search (VSS), and making them callable from AI coding agents as an MCP stdio server.

Features

Advanced splitting and indexing of Markdown documents
Hybrid search combining full-text search (FTS) and vector search (VSS)
High-precision search using Japanese morphological analysis
Search result re-ranking using CrossEncoder
Integration with AI agents via MCP stdio server

Usage

Creating an Index

duckdb-hybrid-doc-search index docs/ handbook/ \
    --db index.duckdb

To use a different model, re-run the index command with a different model specified using the --embedding-model parameter.

duckdb-hybrid-doc-search index docs/ handbook/ \
    --db index.duckdb \
    --embedding-model cl-nagoya/ruri-v3-310m

You can also trim a prefix from file paths during indexing, which is useful when using Docker:

duckdb-hybrid-doc-search index docs/ handbook/ \
    --db index.duckdb \
    --trim-path-prefix "/app/"

Starting the Server

By default, the server uses STDIO transport:

duckdb-hybrid-doc-search serve --db index.duckdb

You can customize the MCP tool name and description:

duckdb-hybrid-doc-search serve --db index.duckdb \
    --tool-name "my_search" \
    --tool-description "Search my documentation"

HTTP Transport Support

The server also supports HTTP-based transport:

duckdb-hybrid-doc-search serve --db index.duckdb \
    --transport streamable-http \
    --host 127.0.0.1 \
    --port 8765 \
    --path /mcp

Changing the Model

Docker

You can also use the Docker image:

docker pull ghcr.io/upamune/duckdb-hybrid-doc-search:latest

Creating an Index with Docker

# Mount document directories and create an index
docker run -v /path/to/docs:/docs -v /path/to/output:/output \
    ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
    index /docs --db /output/index.duckdb --embedding-model cl-nagoya/ruri-v3-310m

Starting the MCP Server with Docker

STDIO Transport (Default)

# Mount only the index.duckdb file and start the server with STDIO transport
docker run -v /path/to/index.duckdb:/app/index.duckdb \
    ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
    serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m

# With custom tool name and description
docker run -v /path/to/index.duckdb:/app/index.duckdb \
    ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
    serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \
    --tool-name "my_search" --tool-description "Search my documentation"

HTTP Transport

# Using Streamable HTTP transport
docker run -v /path/to/index.duckdb:/app/index.duckdb -p 8765:8765 \
    ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
    serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \
    --transport streamable-http --host 0.0.0.0 --port 8765 --path /mcp

Note: When running in Docker, use --host 0.0.0.0 to make the server accessible from outside the container.

Searching Documents with Docker

# Direct search with a specific query
docker run -v /path/to/index.duckdb:/app/index.duckdb -it \
    ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
    search --db /app/index.duckdb --query "your search query" --rerank-model cl-nagoya/ruri-v3-reranker-310m

# Interactive search mode (when --query is omitted)
docker run -v /path/to/index.duckdb:/app/index.duckdb -it \
    ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
    search --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m

Path Manipulation in Search Results

You can manipulate file paths in search results using the following flags:

# Remove a prefix from file paths in search results
duckdb-hybrid-doc-search search --query "example" --remove-path-prefix "/app/"

# Add a prefix to file paths in search results
duckdb-hybrid-doc-search search --query "example" --add-path-prefix "docs/"

# Combine both: first remove, then add prefix
duckdb-hybrid-doc-search search --query "example" \
    --remove-path-prefix "/app/" \
    --add-path-prefix "docs/"

These flags are also available for the MCP server:

# Start server with path manipulation
duckdb-hybrid-doc-search serve --db index.duckdb \
    --remove-path-prefix "/app/" \
    --add-path-prefix "docs/"

Using as an MCP Server with VS Code and Cursor

VS Code Configuration

To use as an MCP server with VS Code:

STDIO Transport (Default)

Create a .vscode/mcp.json file in your workspace:

{
  "servers": [
    {
      "name": "DuckDB Hybrid Doc Search",
      "description": "Document search server for Markdown files",
      "connection": {
        "type": "stdio",
        "command": "docker",
        "args": [
          "run",
          "--rm",
          "-i",
          "-v", "${workspaceFolder}/index.duckdb:/app/index.duckdb",
          "ghcr.io/upamune/duckdb-hybrid-doc-search:latest",
          "serve",
          "--db", "/app/index.duckdb",
          "--rerank-model", "cl-nagoya/ruri-v3-reranker-310m",
          "--tool-name", "search_documents",
          "--tool-description", "Search for documentation"
        ]
      }
    }
  ]
}

HTTP Transport

For HTTP-based transport, use the http connection type:

{
  "servers": [
    {
      "name": "DuckDB Hybrid Doc Search",
      "description": "Document search server with HTTP transport",
      "connection": {
        "type": "http",
        "url": "http://localhost:8765/mcp"
      }
    }
  ]
}

Then start the server separately with:

docker run -v ${workspaceFolder}/index.duckdb:/app/index.duckdb -p 8765:8765 \
  ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
  serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \
  --transport streamable-http --host 0.0.0.0 --port 8765 --path /mcp

Using a Pre-indexed Image

{
  "servers": [
    {
      "name": "DuckDB Hybrid Doc Search",
      "description": "Pre-indexed document search server",
      "connection": {
        "type": "stdio",
        "command": "docker",
        "args": [
          "run",
          "--rm",
          "-i",
          "your-org/doc-search-with-index:latest"
        ]
      }
    }
  ]
}

Cursor Configuration

To use as an MCP server with Cursor:

STDIO Transport (Default)

Create a mcp.json file in your workspace or add to your global configuration:

{
  "mcpServers": {
    "duckdb-doc-search": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-v", "${workspaceFolder}/index.duckdb:/app/index.duckdb",
        "ghcr.io/upamune/duckdb-hybrid-doc-search:latest",
        "serve",
        "--db", "/app/index.duckdb",
        "--rerank-model", "cl-nagoya/ruri-v3-reranker-310m",
        "--tool-name", "search_documents",
        "--tool-description", "Search for documentation"
      ]
    }
  }
}

HTTP Transport

For HTTP-based transport, use the url property:

{
  "mcpServers": {
    "duckdb-doc-search": {
      "url": "http://localhost:8765/mcp"
    }
  }
}

Then start the server separately with:

docker run -v ${workspaceFolder}/index.duckdb:/app/index.duckdb -p 8765:8765 \
  ghcr.io/upamune/duckdb-hybrid-doc-search:latest \
  serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \
  --transport streamable-http --host 0.0.0.0 --port 8765 --path /mcp

Using a Pre-indexed Image

{
  "mcpServers": {
    "duckdb-doc-search": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "your-org/doc-search-with-index:latest"
      ]
    }
  }
}

Practical Example: Creating and Distributing Docker Images with Pre-built Indexes

Here's a practical example for efficiently deploying document search within your organization by pre-building indexes and embedding them in Docker images.

Creating a Docker Image with Pre-built Index

When using Docker, file paths in the index often include the container path (like /app/docs/), but you might want search results to show just docs/. The --trim-path-prefix parameter solves this by removing the specified prefix from file paths during indexing.

Create a Dockerfile.with-index-args file:

# Use base image with build arguments
FROM ghcr.io/upamune/duckdb-hybrid-doc-search:latest AS builder

# Define build arguments with defaults
ARG DOCS_DIR=./docs
ARG MODEL=cl-nagoya/ruri-v3-310m

# Copy documents from specified directory
COPY ${DOCS_DIR} /docs

# Create index with specified model
RUN duckdb-hybrid-doc-search index /docs \
    --db /app/index.duckdb \
    --embedding-model ${MODEL} \
    --trim-path-prefix "/app/"

# Create final image
FROM ghcr.io/upamune/duckdb-hybrid-doc-search:latest

# Copy index file from builder
COPY --from=builder /app/index.duckdb /app/index.duckdb

# Set default command
CMD ["serve", "--db", "/app/index.duckdb", "--rerank-model", "cl-nagoya/ruri-v3-reranker-310m", "--tool-name", "search_documents", "--tool-description", "Search for documentation"]

Build and run:

# Build image for development documents
docker build -t your-org/doc-search-dev:latest \
  --build-arg DOCS_DIR=./docs-dev \
  --build-arg MODEL=cl-nagoya/ruri-v3-310m \
  -f Dockerfile.with-index-args .

# Build image for production documents
docker build -t your-org/doc-search-prod:latest \
  --build-arg DOCS_DIR=./docs-prod \
  --build-arg MODEL=cl-nagoya/ruri-v3-310m \
  -f Dockerfile.with-index-args .

# Push to your organization's registry
docker push your-org/doc-search-prod:latest

# Run with STDIO transport (default)
docker run your-org/doc-search-prod:latest

# Run with Streamable HTTP transport
docker run -p 8765:8765 -e TRANSPORT=streamable-http your-org/doc-search-prod:latest

# Run with custom HTTP settings
docker run -p 9000:9000 -e TRANSPORT=streamable-http -e PORT=9000 -e PATH=/api/mcp your-org/doc-search-prod:latest

This approach enables efficient deployment and management of document search systems within your organization.

Development

This project uses Task to manage build and development tasks.

Setting Up Development Environment

task dev:setup
source .venv/bin/activate

Running Tests

task test

Running Linters

task lint

Formatting Code

task format

Creating Document Index

task run:index DIRS="docs/ handbook/"

Starting the Server

task run:serve

Running Search

task run:search

Building and Running Docker Image

task docker:build
task docker:run CLI_ARGS="serve --db /app/index.duckdb"

Listing Available Tasks

task

Migration from Makefile

Previously, we used Makefile, but we've migrated to Task for more flexibility and features. You can replace any make command with the corresponding task command.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.github		.github
sample_docs		sample_docs
src/duckdb_hybrid_document_search		src/duckdb_hybrid_document_search
.dockerignore		.dockerignore
.gitignore		.gitignore
.tagpr		.tagpr
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
Dockerfile.with-index-args		Dockerfile.with-index-args
LICENSE		LICENSE
README.md		README.md
Taskfile.yml		Taskfile.yml
aqua-checksums.json		aqua-checksums.json
aqua.yaml		aqua.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

License

upamune/duckdb-hybrid-doc-search

Folders and files

Latest commit

History

Repository files navigation

duckdb-hybrid-doc-search

Features

Usage

Creating an Index

Starting the Server

HTTP Transport Support

Changing the Model

Docker

Creating an Index with Docker

Starting the MCP Server with Docker

STDIO Transport (Default)

HTTP Transport

Searching Documents with Docker

Path Manipulation in Search Results

Using as an MCP Server with VS Code and Cursor

VS Code Configuration

STDIO Transport (Default)

HTTP Transport

Using a Pre-indexed Image

Cursor Configuration

STDIO Transport (Default)

HTTP Transport

Using a Pre-indexed Image

Practical Example: Creating and Distributing Docker Images with Pre-built Indexes

Creating a Docker Image with Pre-built Index

Development

Setting Up Development Environment

Running Tests

Running Linters

Formatting Code

Creating Document Index

Starting the Server

Running Search

Building and Running Docker Image

Listing Available Tasks

Migration from Makefile

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 9

Packages 0

Uh oh!

Uh oh!

Contributors 2

Uh oh!

Languages

Packages