A local-first browser extension for chatting with local LLM providers (Ollama, LM Studio, llama.cpp) from the browser sidepanel.
Quick links: Install · Docs · Setup Guide · Privacy · Issues
Ollama Client is a sidepanel chat extension for Chromium browsers, with Firefox support.
It is a local LLM browser extension in which you choose the provider endpoints and models yourself.
Version v0.6.0 introduced multi-provider chat routing and local RAG workflows.
Good fit:
- Developers who want an Ollama client directly in the browser UI.
- Users who want an offline AI assistant workflow with local storage by default.
- Users running local provider servers (Ollama, LM Studio, llama.cpp).
- Contributors interested in browser extension + local model architecture.
Not a good fit:
- Users expecting cloud-SaaS reliability without running local infrastructure.
- Teams requiring centralized cloud sync, SSO, and org admin controls.
- Users who do not want to manage provider endpoints.
Most AI browser tools assume hosted providers and account-based workflows.
This project focuses on local-first usage:
- You configure the model endpoint.
- Chat/session data is stored locally.
- There is no built-in telemetry pipeline.
- Sidepanel-native browser UX instead of a separate desktop app.
- Multi-provider routing in one UI.
- Built-in local retrieval flow (RAG with local LLMs).
- Source-visible behavior for auditing and contribution.
| Major feature | What works now | Current limitation |
|---|---|---|
| Multi-provider chat | Route chat to Ollama, LM Studio, llama.cpp | Routing defaults to Ollama if model mapping is missing |
| Model management | Pull/delete/unload/version support for Ollama | Equivalent management actions are not yet implemented for LM Studio/llama.cpp |
| Streaming | Token streaming via runtime port with cancel support | Message keys and some hook names still use legacy ollama-* naming |
| RAG with local LLMs | Local chunking, embedding, hybrid retrieval, context injection | Embeddings use provider-native/shared routes with Ollama fallback for reliability |
| File ingestion | TXT/MD/PDF/DOCX/CSV/TSV/PSV/HTML processing | Quality depends on file quality and chunking config |
| Persistence | Chat/session/files and vectors stored in Dexie/IndexedDB | SQLite exists as migration/auxiliary path, not primary runtime store |
| Browser support | Chromium workflow and Firefox workflow are supported | Firefox may need explicit origin/CORS setup |
High-level flow:
- Sidepanel/options UI collects prompt and settings.
- UI opens runtime port to background.
- Background resolves provider by selected model mapping.
- Provider stream is relayed back to UI in chunks.
- UI updates message state and persists chat data.
- Optional RAG pipeline retrieves local context and appends it to prompt input.
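The chunk-relay step in the flow above can be sketched as a pure reducer on the UI side. This is a minimal illustration, not the extension's real message shapes or hook names:

```typescript
// Illustrative stream-chunk shape; the real runtime-port messages differ.
type StreamChunk = { delta: string; done: boolean };

// UI-side reducer: fold incoming chunks into the assistant message text.
function applyChunk(current: string, chunk: StreamChunk): string {
  return chunk.done ? current : current + chunk.delta;
}

const chunks: StreamChunk[] = [
  { delta: "Hel", done: false },
  { delta: "lo", done: false },
  { delta: "", done: true },
];
const message = chunks.reduce(applyChunk, "");
// message === "Hello"
```

Keeping the chunk application pure makes it easy to persist the final message state once the `done` chunk arrives.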
Key directories:
- `src/sidepanel/*`
- `src/options/*`
- `src/background/*`
- `src/contents/*`
- `src/lib/providers/*`
- `src/lib/embeddings/*`
Build/runtime notes:
- Extension framework: WXT (`wxt` CLI); moved from Plasmo to WXT for more deterministic MV3 builds and explicit entrypoint/manifest control.
- Settings hooks/storage wrapper: `@plasmohq/storage` (`plasmoGlobalStorage`)
Default provider profiles:
- Ollama (`http://localhost:11434`)
- LM Studio (`http://localhost:1234/v1`)
- llama.cpp server (`http://localhost:8000/v1`)
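A hypothetical representation of these default profiles; only the URLs come from this README, the type and field names are illustrative:

```typescript
// Illustrative provider-profile shape (not the extension's actual config type).
type ProviderProfile = { baseUrl: string; openAICompatible: boolean };

const defaultProfiles: Record<string, ProviderProfile> = {
  // Ollama exposes its own native HTTP API.
  ollama: { baseUrl: "http://localhost:11434", openAICompatible: false },
  // LM Studio and llama.cpp server expose OpenAI-compatible /v1 routes.
  lmstudio: { baseUrl: "http://localhost:1234/v1", openAICompatible: true },
  llamacpp: { baseUrl: "http://localhost:8000/v1", openAICompatible: true },
};
```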
Clarifying examples:
- If both Ollama and LM Studio expose `llama3`, the model mapping decides which backend handles chat.
- If the mapping is absent for a model ID, the fallback provider is Ollama.
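An illustrative resolver for the behavior described above (not the real module API): the mapping picks the backend, and Ollama is the fallback when no mapping exists for a model ID.

```typescript
type ProviderId = "ollama" | "lmstudio" | "llamacpp";

// Resolve which backend handles a chat request for a given model ID.
function resolveProvider(
  modelId: string,
  mapping: Record<string, ProviderId>
): ProviderId {
  // Missing mapping -> default to Ollama, matching the documented fallback.
  return mapping[modelId] ?? "ollama";
}

resolveProvider("llama3", { llama3: "lmstudio" }); // -> "lmstudio"
resolveProvider("mistral", {});                    // -> "ollama" (fallback)
```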
In this project, RAG means local retrieval before generation:
- Uploaded or chat text is chunked.
- Chunks are embedded and stored in local vector storage.
- Query-time retrieval selects relevant chunks.
- Retrieved snippets are appended to generation context.
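The query-time retrieval step above can be sketched as cosine-similarity ranking over stored chunk embeddings. This is a minimal illustration, not the project's actual retrieval module:

```typescript
type Chunk = { text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank chunks by similarity to the query embedding and keep the top k.
function retrieve(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

The real pipeline is hybrid (not pure vector similarity), but the shape is the same: score, rank, and inject the top chunks into the generation context.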
Clarifying example:
- Upload a local API spec PDF, then ask: "What headers are required for `createUser`?"
- Retrieved chunks from that PDF are included in the prompt context before the model responds.
Current RAG runtime is intentionally browser-first:
- extension context only (UI + background worker)
- IndexedDB + in-memory index/cache
- HTTP-based model/embedding access
- graceful fallback over hard failure
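"Graceful fallback over hard failure" can be captured in a small helper. This is an illustrative sketch, not the project's actual API:

```typescript
// Try the primary path; return the fallback result instead of
// surfacing an exception to the UI.
function withFallback<T>(primary: () => T, fallback: () => T): T {
  try {
    return primary();
  } catch {
    return fallback();
  }
}

// Hypothetical use: prefer an in-memory index, fall back to a slower path.
const result = withFallback(
  () => { throw new Error("index cache miss"); },
  () => "rebuilt-from-indexeddb"
);
// result === "rebuilt-from-indexeddb"
```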
Embedding strategy defaults:
- provider-native embeddings when available
- shared canonical target: `all-MiniLM-L6-v2`
- silent background warmup
- Ollama fallback for reliability
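A hypothetical selector for the strategy above (names are illustrative): use the provider's native embedder when it has one, otherwise fall back to Ollama, with both targeting the shared canonical model.

```typescript
type Embedder = { provider: string; model: string };

// The shared canonical embedding target named in this README.
const CANONICAL_MODEL = "all-MiniLM-L6-v2";

// Prefer provider-native embeddings; fall back to Ollama for reliability.
function pickEmbedder(
  provider: string,
  nativeSupport: Record<string, boolean>
): Embedder {
  const backend = nativeSupport[provider] ? provider : "ollama";
  return { provider: backend, model: CANONICAL_MODEL };
}
```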
RAG implementation details and module boundaries:
- Current behavior guide: docs/rag.md
- Full audit and redesign: docs/rag-browser-core.md
- Browser-first RAG contracts (TypeScript interfaces):
src/lib/rag/core/interfaces.ts
- Install extension from the Chrome Web Store.
- Start at least one provider endpoint.
- Configure provider URL in settings.
- Select a model and start chatting in the sidepanel.
Common endpoint examples:
- `http://localhost:11434` (Ollama)
- `http://localhost:1234/v1` (LM Studio)
- `http://localhost:8000/v1` (llama.cpp server)
```sh
git clone https://github.com/Shishir435/ollama-client.git
cd ollama-client
pnpm install
pnpm dev
```

Common commands:

```sh
pnpm lint:check
pnpm test:run
pnpm build
pnpm package
```

Firefox commands:

```sh
pnpm dev:firefox
pnpm build:firefox
pnpm package:firefox
```

- Start provider service.
- Open extension settings and verify connection.
- Select model.
- Send prompt and monitor stream.
- Optionally upload files for retrieval context.
- Fork conversation by editing earlier user messages.
- Enable only providers you need.
- Keep model names unique when possible.
- Re-check mappings after model list changes.
- Per-model parameters are stored locally (`temperature`, `top_p`, `top_k`, etc.).
- Start with defaults, then tune one variable at a time.
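A hypothetical shape for locally stored per-model parameters; the default values below are placeholders for illustration, not the extension's actual defaults.

```typescript
type ModelParams = { temperature: number; top_p: number; top_k: number };

// Placeholder defaults (illustrative values only).
const defaults: ModelParams = { temperature: 0.8, top_p: 0.9, top_k: 40 };

// Tune one variable at a time, starting from the defaults.
const tuned: ModelParams = { ...defaults, temperature: 0.4 };
```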
- Adjust chunk size/overlap and retrieval thresholds.
- Narrow retrieval scope when debugging noisy answers.
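Chunk size and overlap interact as in this minimal fixed-size chunker, an illustration of the tunables rather than the project's actual chunking code (assumes `size > overlap`):

```typescript
// Split text into fixed-size windows that overlap by `overlap` characters.
function chunkText(text: string, size: number, overlap: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
    if (i + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

chunkText("abcdefghij", 4, 2);
// -> ["abcd", "cdef", "efgh", "ghij"]
```

Larger overlap improves recall across chunk boundaries at the cost of more stored (and retrieved) near-duplicate text.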
- Legacy key/message naming (`ollama-*`) remains in parts of the multi-provider code.
- Embedding support varies by provider; the fallback keeps Ollama as the reliability anchor.
- A reranker exists but is disabled by default due to extension CSP constraints.
- Provider parity is incomplete for model management actions.
- Runtime persistence is Dexie-first while SQLite migration path still exists.
- Privacy depends on endpoint choice.
- If you configure a remote endpoint, prompts/responses are sent to that endpoint.
- Do not expose provider APIs publicly without access controls.
- Provider-agnostic naming cleanup.
- Clear single-source persistence strategy.
- Better provider parity for management actions.
- Better retrieval diagnostics.
Potential future architecture may include a desktop helper/local companion for heavier retrieval workloads.
Important constraints:
- The desktop helper is not implemented.
- Browser-only mode remains first-class.
- The core runtime does not depend on helper availability.
- Read CONTRIBUTING.md.
- Keep PRs scoped and testable.
- Include reproduction details for bug fixes.
- Update docs when behavior changes.
Philosophy:
- Local-first operation.
- Explicit behavior over hidden automation.
- User control of endpoint and model settings.
Non-goals:
- Managed cloud LLM platform behavior.
- Hidden telemetry for growth metrics.
- Abstracting away all local infrastructure responsibility.
MIT License: see LICENCE