Vocal Web is a voice-controlled browser extension that lets users navigate and interact with the web using natural language. It combines an LLM, which translates natural-language commands into high-level action plans, with lightweight heuristic-based execution for fast, reliable interactions. Browsing is intuitive and blazing fast compared to compute-heavy alternatives like Claude in Chrome, though some capability is naturally traded away for that speed.
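To make that division of labor concrete, here is a minimal sketch of the idea, assuming hypothetical `llm` and `page` objects. None of these names (`plan_actions`, `Action`, `find_best_match`) come from the actual codebase; it only illustrates that the LLM is called once for the plan while each step runs on cheap local heuristics.

```python
# Illustrative sketch of the plan-then-execute split; all names are
# hypothetical and do not reflect Vocal Web's actual internals.
import json
from dataclasses import dataclass

@dataclass
class Action:
    kind: str     # e.g. "click", "type", "navigate"
    target: str   # natural-language description of the target element
    value: str = ""

def plan_actions(command: str, llm) -> list[Action]:
    """One LLM call produces the whole high-level plan up front."""
    reply = llm.complete(
        "Translate this browsing command into a JSON list of objects "
        f"with keys kind/target/value: {command!r}"
    )
    return [Action(**step) for step in json.loads(reply)]

def execute(action: Action, page) -> None:
    """Each step is resolved by cheap local heuristics, no LLM in the loop."""
    if action.kind == "click":
        page.find_best_match(action.target).click()
    elif action.kind == "type":
        page.find_best_match(action.target).fill(action.value)
    elif action.kind == "navigate":
        page.goto(action.value)
```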
Demo videos: booking a flight (`book_flight_demo.mp4`), shopping (`shopping_demo.mp4`), playing a YouTube podcast (`youtubePodcast.mp4`), and searching Wikipedia (`wikipediaSearch.mp4`).
- Architecture and workflow: `ARCHITECTURE.md`
- Folder summaries: `**/SUMMARY.md`
- Python 3.11+
- Node.js 18+
- `uv`
- `direnv` (recommended for environment management)
- Install Python deps with uv: `uv sync`.
- Set up direnv once, then create your local env file: `cp .envrc.example .envrc`.
- Generate a strong key for `VCAA_API_KEY` (minimum 32 characters; letters, numbers, and `-_` only) using `openssl rand -hex 32`.
- Run `mkcert -install && mkcert localhost 127.0.0.1 ::1` for locally trusted certificates, then point `SSL_KEYFILE`/`SSL_CERTFILE` at the generated files in `.envrc` (see the sample `.envrc` after this list).
- Fill in the other secrets, then run `direnv allow`.
- Recommended to keep `asi1-mini` as the model, which is currently free and performs great.
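As a reference point, a filled-in `.envrc` might look roughly like this. Only `VCAA_API_KEY`, `SSL_KEYFILE`, and `SSL_CERTFILE` are named above, so treat the rest as illustrative and defer to `.envrc.example` for the authoritative variable list; the certificate file names shown follow mkcert's default output for the command above.

```sh
# Placeholder values; copy .envrc.example and fill in your own secrets.
export VCAA_API_KEY="<output of: openssl rand -hex 32>"
export SSL_KEYFILE="$PWD/localhost+2-key.pem"  # produced by mkcert
export SSL_CERTFILE="$PWD/localhost+2.pem"     # produced by mkcert
```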
- Install JS tooling and build the extension bundle: `npm install && npm run build:ext`.
- Load the `extension/dist/` folder as an unpacked extension in Chrome.
- Start the HTTP API bridge: `uv run python -m agents.api_server` (defaults to port `8081`).
- Open the extension and paste the authentication key into the API Key field in settings.
- Test the extension on an active webpage: press and hold Cmd/Ctrl+Shift+L to activate voice input.
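For anyone poking at the extension internals: Chrome extensions typically declare shortcuts like this under the manifest's `commands` key, roughly as below. The command name and description are hypothetical, and true press-and-hold behavior additionally needs keydown/keyup handling in the extension itself, since `commands` only fires once per press.

```json
{
  "commands": {
    "activate-voice-input": {
      "suggested_key": {
        "default": "Ctrl+Shift+L",
        "mac": "Command+Shift+L"
      },
      "description": "Start voice input (hypothetical command name)"
    }
  }
}
```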
- See `docs/security/tls-setup.md` for TLS/HTTPS setup and operational security guidance.
- This tool automates multi-step browser actions and may interact with logged-in accounts, modify data, or take unintended actions. Prompt injection or malicious web content may influence its behavior. Running it in a sandboxed environment is recommended. Use at your own discretion.
- Currently creating my own dataset to further improve the element selection algorithms and to build challenging tests. The use of language models in the navigator component will likely be reintroduced with a process-of-elimination approach (sketched after this list) once the selection algorithms are good enough to justify the additional cost/compute.
- Will make sub-3B local open-source models available for increased privacy and free operation.
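Purely as an illustration of that process-of-elimination idea (speculative; nothing here reflects the actual navigator code): heuristics would first shortlist candidate elements cheaply, and the model would only be asked to strike the least plausible candidate until one remains.

```python
# Speculative sketch, not the project's actual selection algorithm.
def select_by_elimination(target: str, candidates: list[str], llm) -> str:
    """Heuristics supply `candidates`; the LLM only eliminates, never generates."""
    while len(candidates) > 1:
        listing = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        reply = llm.complete(
            f"Target element: {target!r}\nCandidates:\n{listing}\n"
            "Answer with the single index LEAST likely to match."
        )
        candidates.pop(int(reply.strip()))
    return candidates[0]
```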
- We started building this project at a hackathon with the idea that LLMs could help make the web more accessible, particularly for individuals who face challenges using traditional input devices like keyboards and mice. There is a long way to go before this can be considered a true accessibility tool, as large performance improvements are still needed, but I'm very excited to keep building. If you have ideas or feedback, I'd love to hear from you.