A Model Context Protocol (MCP) server that enables AI coding assistants and agentic tools to leverage local knowledge through semantic search.
Status: ✅ Fully Operational - All user stories implemented and verified
- ✅ Semantic Search: Natural language queries over your document collection
- ✅ Multi-Context Support: Organize documents into separate contexts for focused search
- ✅ Multi-Format Support: PDF, DOCX, PPTX, XLSX, HTML, and images (JPG, PNG, SVG)
- ✅ Smart OCR: Automatically detects scan-only PDFs and applies OCR when needed
- ✅ Async Processing: Background indexing with progress tracking
- ✅ Persistent Storage: ChromaDB vector store with reliable document removal
- ✅ HTTP & Stdio Transports: Streamable HTTP for GitHub Copilot CLI, stdio for Claude Desktop
- ✅ MCP Integration: Works with Claude Desktop, GitHub Copilot, and other MCP clients
- ✅ Local & Private: All processing happens locally, no data leaves your system
Group your documents into separate contexts for cleaner organization and more focused search results.
Contexts are isolated knowledge domains that let you:
- Organize by Topic: Separate AWS docs from healthcare docs from project-specific docs
- Search Efficiently: Search within a specific context for faster, more relevant results
- Multi-Domain Documents: Add the same document to multiple contexts
- Flexible Organization: Each context is a separate ChromaDB collection
```python
from pathlib import Path

from src.services.context_service import ContextService
from src.services.knowledge_service import KnowledgeService

# Create contexts
context_service = ContextService()
await context_service.create_context("aws-architecture", "AWS WAFR and architecture docs")
await context_service.create_context("healthcare", "Medical and compliance documents")

# Add documents to specific contexts
service = KnowledgeService()
doc_id = await service.add_document(
    Path("wafr.pdf"),
    contexts=["aws-architecture"]
)

# Add to multiple contexts
doc_id = await service.add_document(
    Path("fin-services-lens.pdf"),
    contexts=["aws-architecture", "healthcare"]
)

# Search within a context
results = await service.search("security pillar", context="aws-architecture")

# Search across all contexts
results = await service.search("best practices")  # No context = search all
```

```shell
# Create context
knowledge-context-create aws-docs --description "AWS documentation"

# List all contexts
knowledge-context-list

# Show context details
knowledge-context-show aws-docs

# Add document to context
knowledge-add /path/to/doc.pdf --contexts aws-docs

# Add to multiple contexts
knowledge-add /path/to/doc.pdf --contexts aws-docs,healthcare

# Search in specific context
knowledge-search "security" --context aws-docs

# Delete context
knowledge-context-delete test-context --confirm true
```

All documents without a specified context go to the "default" context automatically. This ensures backward compatibility with existing workflows.
The server includes intelligent OCR capabilities. The system analyzes extracted text quality and automatically applies OCR when:
- Extracted text is less than 100 characters (likely a scan)
- Text has less than 70% alphanumeric characters (gibberish/encoding issues)
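The two conditions above can be sketched roughly as follows. This is an illustrative heuristic, not the project's actual implementation: the helper name `needs_ocr` and the choice to measure the alphanumeric share over non-whitespace characters are assumptions.

```python
def needs_ocr(extracted_text: str,
              min_chars: int = 100,
              min_alnum_ratio: float = 0.70) -> bool:
    """Decide whether extracted text looks like a scan needing OCR.

    Heuristic: very short text suggests an image-only PDF, and a low
    share of alphanumeric characters suggests gibberish or encoding
    issues from a failed text-layer extraction.
    """
    text = extracted_text.strip()
    if len(text) < min_chars:
        return True  # likely a scan with no embedded text layer
    alnum = sum(ch.isalnum() for ch in text)
    # Measure the alphanumeric share over visible (non-whitespace) characters
    visible = sum(not ch.isspace() for ch in text)
    return visible > 0 and alnum / visible < min_alnum_ratio
```

Ordinary prose easily clears the 70% threshold, while punctuation-heavy extraction garbage falls well below it.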
You can force OCR processing even when text extraction is available:
```python
# Python API
doc_id = await service.add_document(
    Path("document.pdf"),
    force_ocr=True  # Force OCR regardless of text quality
)
```

```shell
# MCP Tool (via GitHub Copilot or Claude)
knowledge-add /path/to/document.pdf --force_ocr=true
```

For OCR functionality, install Tesseract OCR:
```shell
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

# Windows (via Chocolatey)
choco install tesseract poppler
```

Configure OCR behavior in `config.yaml`:
```yaml
ocr:
  enabled: true              # Enable/disable OCR
  language: eng              # OCR language (eng, fra, deu, spa, etc.)
  force_ocr: false           # Global force OCR setting
  confidence_threshold: 0.0  # Accept all OCR results
```

All documents include metadata showing how they were processed:

- `text_extraction`: Standard text extraction
- `ocr`: OCR processing was used
- `image_analysis`: Image-only documents
Check processing method in document metadata:
```python
documents = service.list_documents()
for doc in documents:
    print(f"{doc.filename}: {doc.processing_method}")
    if doc.metadata.get("ocr_used"):
        confidence = doc.metadata.get("ocr_confidence", 0)
        print(f"  OCR confidence: {confidence:.2f}")
```

- Python 3.11+ or Python 3.12
- Tesseract OCR (optional, for scanned documents)
```shell
# One-command setup and demo
./quickstart.sh
```

This will:
- ✅ Create virtual environment
- ✅ Install dependencies
- ✅ Download embedding model
- ✅ Run end-to-end demo
- ✅ Show next steps
```shell
# Clone repository
git clone https://github.com/yourusername/KnowledgeMCP.git
cd KnowledgeMCP

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download embedding model (first run, ~91MB)
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
```

```python
import asyncio
from pathlib import Path

from src.services.knowledge_service import KnowledgeService

async def main():
    # Initialize service
    service = KnowledgeService()

    # Add document to specific context
    doc_id = await service.add_document(
        Path("document.pdf"),
        metadata={"category": "technical"},
        contexts=["aws-docs"],  # Optional: organize by context
        async_processing=False
    )

    # Search within a context (faster, more focused)
    results = await service.search("neural networks", context="aws-docs", top_k=5)
    for result in results:
        print(f"{result['filename']}: {result['relevance_score']:.2f}")
        print(f"  Context: {result.get('context', 'default')}")
        print(f"  {result['chunk_text'][:100]}...")

    # Search across all contexts
    all_results = await service.search("neural networks", top_k=5)

    # Get statistics
    stats = service.get_statistics()
    print(f"\nDocuments: {stats['document_count']}")
    print(f"Chunks: {stats['total_chunks']}")
    print(f"Contexts: {stats['context_count']}")

asyncio.run(main())
```

```shell
# Using the management script (recommended)
./server.sh start    # Start server in background
./server.sh status   # Check if running
./server.sh logs     # View live logs
./server.sh stop     # Stop server
./server.sh restart  # Restart server

# Or run directly (foreground)
python -m src.mcp.server
```

The server script provides:
- ✅ Background process management
- ✅ PID file tracking
- ✅ Log file management
- ✅ Status checking
- ✅ Graceful shutdown
```shell
# Unit tests
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

# End-to-end demo
python tests/e2e_demo.py
```

The server exposes 11 MCP tools for AI assistants:
- knowledge-add: Add documents to knowledge base (with optional context assignment)
- knowledge-search: Semantic search with natural language queries (context-aware)
- knowledge-show: List all documents (filterable by context)
- knowledge-remove: Remove specific documents
- knowledge-clear: Clear entire knowledge base
- knowledge-status: Get statistics and health status
- knowledge-task-status: Check async processing task status
- knowledge-context-create: Create a new context for organizing documents
- knowledge-context-list: List all contexts with statistics
- knowledge-context-show: Show details of a specific context
- knowledge-context-delete: Delete a context (documents remain in other contexts)
The server is configured via `config.yaml` in the project root. A default configuration is provided.
```yaml
# config.yaml - Default configuration provided
storage:
  documents_path: ./data/documents
  vector_db_path: ./data/chromadb

embedding:
  model_name: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 32
  device: cpu

chunking:
  chunk_size: 500
  chunk_overlap: 50
  strategy: sentence

processing:
  max_concurrent_tasks: 3
  max_file_size_mb: 100

ocr:
  enabled: true
  language: eng
  force_ocr: false
  confidence_threshold: 0.0  # Accept all OCR results

logging:
  level: INFO
  format: text

search:
  default_top_k: 10
  max_top_k: 50
```

Create a custom configuration file:
```shell
# Copy template
cp config.yaml.template config.yaml.local

# Edit your settings
nano config.yaml.local

# The server will use config.yaml.local if it exists
```

Configuration can be overridden with environment variables (prefix with `KNOWLEDGE_`):
```shell
# Override storage path
export KNOWLEDGE_STORAGE__DOCUMENTS_PATH=/custom/path

# Increase batch size for faster processing
export KNOWLEDGE_EMBEDDING__BATCH_SIZE=64

# Enable debug logging
export KNOWLEDGE_LOGGING__LEVEL=DEBUG

# Increase search results
export KNOWLEDGE_SEARCH__DEFAULT_TOP_K=20

# Use GPU if available
export KNOWLEDGE_EMBEDDING__DEVICE=cuda
```

Settings are resolved in order of precedence:

1. Environment variables (highest priority)
2. `config.yaml.local` (if it exists)
3. `config.yaml` (default)
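The `KNOWLEDGE_` prefix with double-underscore nesting maps an environment variable like `KNOWLEDGE_EMBEDDING__BATCH_SIZE` onto the `embedding.batch_size` key of the YAML config. A small sketch of that mapping (the `env_overrides` helper is illustrative, not the server's actual loader):

```python
def env_overrides(environ: dict, prefix: str = "KNOWLEDGE_") -> dict:
    """Turn KNOWLEDGE_SECTION__KEY=value pairs into a nested dict.

    Example: KNOWLEDGE_EMBEDDING__BATCH_SIZE=64
             -> {"embedding": {"batch_size": "64"}}
    """
    result: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # Split the remainder on "__" to recover the YAML nesting
        path = name[len(prefix):].lower().split("__")
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return result

overrides = env_overrides({
    "KNOWLEDGE_EMBEDDING__BATCH_SIZE": "64",
    "KNOWLEDGE_LOGGING__LEVEL": "DEBUG",
    "HOME": "/home/user",  # unrelated variables are ignored
})
print(overrides)
# {'embedding': {'batch_size': '64'}, 'logging': {'level': 'DEBUG'}}
```

A dict produced this way can simply be deep-merged over the values loaded from `config.yaml.local` and `config.yaml`, giving the precedence order listed above.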
| Setting | Description | Default | Notes |
|---|---|---|---|
| `chunk_size` | Characters per chunk | 500 | Larger = more context |
| `batch_size` | Embeddings per batch | 32 | Higher = faster, more RAM |
| `device` | Computation device | cpu | Use `cuda` for GPU |
| `max_file_size_mb` | Max file size | 100 | Increase for large docs |
| `log_level` | Logging verbosity | INFO | Use DEBUG for development |
| `collection_prefix` | ChromaDB collection prefix | `knowledge_` | Used for context collections |
| `default_context` | Default context name | `default` | Backward compatibility |
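How `chunk_size` and `chunk_overlap` interact can be sketched with a simple character-based chunker. This is a rough illustration only: the project's `sentence` strategy splits on sentence boundaries rather than fixed character offsets, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlapping tails.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters
    of context. The overlap keeps sentences that straddle a boundary
    searchable from both sides.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))     # 3
print(len(chunks[0]))  # 500
```

Larger chunks carry more surrounding context per search hit; smaller chunks give finer-grained relevance scores.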
Add to claude_desktop_config.json:
```json
{
  "mcpServers": {
    "knowledge": {
      "command": "python",
      "args": ["-m", "src.mcp.server"],
      "cwd": "/path/to/KnowledgeMCP",
      "env": {
        "KNOWLEDGE_STORAGE__DOCUMENTS_PATH": "/path/to/docs"
      }
    }
  }
}
```

Note: Claude Desktop uses stdio transport. The server automatically detects the transport mode.
The server exposes an HTTP endpoint for Copilot CLI integration using MCP Streamable HTTP.
Step 1: Start the server
```shell
./server.sh start
```

Step 2: Configure Copilot CLI
Add to ~/.copilot/mcp-config.json:
```json
{
  "knowledge": {
    "type": "http",
    "url": "http://localhost:3000"
  }
}
```

Step 3: Verify integration
In Copilot CLI, the following tools will be available:
- `knowledge-add` - Add documents to knowledge base
- `knowledge-search` - Search with natural language queries
- `knowledge-show` - List all documents
- `knowledge-remove` - Remove documents
- `knowledge-clear` - Clear knowledge base
- `knowledge-status` - Get statistics
- `knowledge-task-status` - Check processing status
Example usage in Copilot CLI:
```shell
# Create organized contexts
> knowledge-context-create aws-docs --description "AWS architecture documents"

# Add documents to specific contexts
> knowledge-add /path/to/wafr.pdf --contexts aws-docs

# Search within a context for focused results
> knowledge-search "security pillar" --context aws-docs

# Ask Copilot to use the knowledge base
> What are the AWS WAFR security best practices?
```

- Vector Database: ChromaDB for semantic search with persistent storage
- Embedding Model: all-MiniLM-L6-v2 (384 dimensions, fast inference)
- OCR Engine: Tesseract for scanned documents
- Protocol: MCP over HTTP (Streamable HTTP) and stdio transports
- Server Framework: FastMCP for HTTP endpoint management
Verified performance on standard hardware (4-core CPU, 8GB RAM):
- Indexing: Documents processed in <1s (HTML), up to 30s (large PDFs)
- Search: <200ms for knowledge bases with dozens of documents
- Memory: <500MB baseline, scales with document count
- Embeddings: Batch processing, model cached locally
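At its core, semantic search with a sentence-embedding model reduces to cosine similarity between the query vector and the stored chunk vectors. A minimal pure-Python sketch of that ranking step (the toy 3-dimensional vectors and the `rank` helper are illustrative; the server delegates this work to ChromaDB, and real all-MiniLM-L6-v2 embeddings have 384 dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product normalized by vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank(query_vec: list[float], chunk_vecs: dict, top_k: int = 2) -> list[str]:
    """Return chunk ids sorted by similarity to the query vector."""
    scored = [(cosine(query_vec, v), cid) for cid, v in chunk_vecs.items()]
    return [cid for _, cid in sorted(scored, reverse=True)[:top_k]]

# Toy "embeddings": each chunk is a point in a shared vector space
chunks = {
    "security-pillar": [0.9, 0.1, 0.0],
    "cost-pillar":     [0.1, 0.9, 0.0],
    "glossary":        [0.0, 0.1, 0.9],
}
print(rank([1.0, 0.0, 0.1], chunks))  # ['security-pillar', 'cost-pillar']
```

Because every document chunk is embedded into the same space at indexing time, search at query time is just one embedding call plus a nearest-neighbour lookup, which is why queries stay under a few hundred milliseconds.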
KnowledgeMCP/
├── src/ # Source code
│ ├── models/ # Data models (Document, Embedding, etc.)
│ ├── services/ # Core services (KnowledgeService, VectorStore)
│ ├── processors/ # Document processors (PDF, DOCX, etc.)
│ ├── mcp/ # MCP server and tools
│ ├── config/ # Configuration management
│ └── utils/ # Utilities (chunking, validation, logging)
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── e2e_demo.py # End-to-end demonstration
├── docs/ # Documentation
│ └── SERVER_MANAGEMENT.md # Server management guide
├── server.sh # Server management script ⭐
├── quickstart.sh # Quick setup script ⭐
└── README.md # This file
- `server.sh` - Start/stop/status management
- `quickstart.sh` - Automated setup and demo
- `tests/e2e_demo.py` - Full system demonstration
- Changelog - Version history and recent changes
- Implementation Progress - Detailed progress report
- Configuration Guide - Complete configuration reference
- Server Management - Server lifecycle management
- Specification - Feature specification
- Implementation Plan - Technical plan
- Tasks - Task breakdown
- Quickstart Guide - Detailed usage guide
- MCP Tool Contracts - API specifications
```shell
# Format code
black src/ tests/

# Lint
ruff check src/ tests/

# Type check
mypy src/
```

To add support for a new document format:

- Create a processor in `src/processors/`
- Inherit from `BaseProcessor`
- Implement `extract_text()` and `extract_metadata()`
- Register it in `TextExtractor`
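The steps above can be sketched as follows. The `BaseProcessor` interface shown here is a guess at its shape from the method names, and `MarkdownProcessor` is a hypothetical example; check `src/processors/` for the real signatures before implementing.

```python
from abc import ABC, abstractmethod
from pathlib import Path

class BaseProcessor(ABC):
    """Illustrative stand-in for the project's processor base class."""

    @abstractmethod
    def extract_text(self, path: Path) -> str: ...

    @abstractmethod
    def extract_metadata(self, path: Path) -> dict: ...

class MarkdownProcessor(BaseProcessor):
    """Hypothetical processor for .md files."""

    def extract_text(self, path: Path) -> str:
        # Markdown is already plain text, so no conversion is needed
        return path.read_text(encoding="utf-8")

    def extract_metadata(self, path: Path) -> dict:
        text = self.extract_text(path)
        return {"filename": path.name, "char_count": len(text)}
```

The final step would be registering the new class with `TextExtractor` so that files with the matching extension are routed to it.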
✅ US1: Add Knowledge from Documents
- Multi-format document ingestion
- Intelligent text extraction vs OCR
- Async processing with progress tracking
- Multi-context assignment
✅ US2: Search Knowledge Semantically
- Natural language queries
- Relevance-ranked results
- Fast semantic search
- Context-scoped and cross-context search
✅ US3: Manage Knowledge Base
- List all documents
- Remove specific documents
- Clear knowledge base
- View statistics
- Context filtering
✅ US4: Integrate with AI Tools via MCP
- 11 MCP tools implemented (7 document + 4 context tools)
- JSON-RPC compatible
- Ready for AI assistant integration
✅ US5: Multi-Context Organization
- Create and manage contexts
- Add documents to multiple contexts
- Search within specific contexts
- Context isolation with separate ChromaDB collections
MIT
Contributions welcome! Please read the specification and implementation plan before submitting PRs.
- Issues: GitHub Issues
- Documentation: See the `docs/` and `specs/` directories
- Questions: Check quickstart guide and API contracts
Built with: Python, ChromaDB, Sentence Transformers, FastAPI, MCP SDK