Vocal Web is a voice-controlled browser extension that lets users navigate and interact with the web using natural language. It combines an LLM, which translates natural-language commands into high-level action plans, with lightweight heuristic-based execution for fast, reliable interactions. Browsing is intuitive and blazing fast compared to compute-heavy alternatives like Claude in Chrome, though some capability is naturally traded away for that speed.
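To make that division of labor concrete, here is a minimal sketch of the idea, assuming hypothetical `llm` and `page` objects. None of these names (`plan_actions`, `Action`, `find_best_match`) come from the actual codebase; it only illustrates that the LLM is called once for the plan while each step runs on cheap local heuristics.

```python
# Illustrative sketch of the plan-then-execute split; all names are
# hypothetical and do not reflect Vocal Web's actual internals.
import json
from dataclasses import dataclass

@dataclass
class Action:
    kind: str     # e.g. "click", "type", "navigate"
    target: str   # natural-language description of the target element
    value: str = ""

def plan_actions(command: str, llm) -> list[Action]:
    """One LLM call produces the whole high-level plan up front."""
    reply = llm.complete(
        "Translate this browsing command into a JSON list of objects "
        f"with keys kind/target/value: {command!r}"
    )
    return [Action(**step) for step in json.loads(reply)]

def execute(action: Action, page) -> None:
    """Each step is resolved by cheap local heuristics, no LLM in the loop."""
    if action.kind == "click":
        page.find_best_match(action.target).click()
    elif action.kind == "type":
        page.find_best_match(action.target).fill(action.value)
    elif action.kind == "navigate":
        page.goto(action.value)
```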
Demo videos: booking a flight (`book_flight_demo.mp4`), shopping (`shopping_demo.mp4`), playing a YouTube podcast (`youtubePodcast.mp4`), and searching Wikipedia (`wikipediaSearch.mp4`).
- Architecture and workflow: `ARCHITECTURE.md`
- Folder summaries: `**/SUMMARY.md`
- Python 3.11+
- Node.js 18+
- `uv`
- `direnv` (recommended for environment management)
- Install Python deps with uv: `uv sync`.
- Set up direnv once, then create your local env file: `cp .envrc.example .envrc`.
- Generate a strong key for `VCAA_API_KEY` (minimum 32 characters; letters, numbers, and `-_` only) using `openssl rand -hex 32`.
- Run `mkcert -install && mkcert localhost 127.0.0.1 ::1` for locally trusted certificates, then point `SSL_KEYFILE`/`SSL_CERTFILE` at the generated files in `.envrc` (see the sample `.envrc` after this list).
- Fill in the other secrets, then run `direnv allow`.
- Recommended to keep `asi1-mini` as the model, which is currently free and performs great.
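As a reference point, a filled-in `.envrc` might look roughly like this. Only `VCAA_API_KEY`, `SSL_KEYFILE`, and `SSL_CERTFILE` are named above, so treat the rest as illustrative and defer to `.envrc.example` for the authoritative variable list; the certificate file names shown follow mkcert's default output for the command above.

```sh
# Placeholder values; copy .envrc.example and fill in your own secrets.
export VCAA_API_KEY="<output of: openssl rand -hex 32>"
export SSL_KEYFILE="$PWD/localhost+2-key.pem"  # produced by mkcert
export SSL_CERTFILE="$PWD/localhost+2.pem"     # produced by mkcert
```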
- Install JS tooling and build the extension bundle: `npm install && npm run build:ext`.
- Load the `extension/dist/` folder as an unpacked extension in Chrome.
- Start the HTTP API bridge: `uv run python -m agents.api_server` (defaults to port `8081`).
- Open the extension and paste the authentication key into the API Key field in settings.
- Test the extension on an active webpage: press and hold Cmd/Ctrl+Shift+L to activate voice input.
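For anyone poking at the extension internals: Chrome extensions typically declare shortcuts like this under the manifest's `commands` key, roughly as below. The command name and description are hypothetical, and true press-and-hold behavior additionally needs keydown/keyup handling in the extension itself, since `commands` only fires once per press.

```json
{
  "commands": {
    "activate-voice-input": {
      "suggested_key": {
        "default": "Ctrl+Shift+L",
        "mac": "Command+Shift+L"
      },
      "description": "Start voice input (hypothetical command name)"
    }
  }
}
```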
- See `docs/security/tls-setup.md` for TLS/HTTPS setup and operational security guidance.
- This tool automates multi-step browser actions and may interact with logged-in accounts, modify data, or take unintended actions. Prompt injection or malicious web content may influence its behavior. Running it in a sandboxed environment is recommended. Use at your own discretion.
- Currently creating my own dataset to further improve the element selection algorithms and to build challenging tests. The use of language models in the navigator component will likely be reintroduced with a process-of-elimination approach (sketched after this list) once the selection algorithms are good enough to justify the additional cost/compute.
- Will make sub-3B local open-source models available for increased privacy and free operation.
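Purely as an illustration of that process-of-elimination idea (speculative; nothing here reflects the actual navigator code): heuristics would first shortlist candidate elements cheaply, and the model would only be asked to strike the least plausible candidate until one remains.

```python
# Speculative sketch, not the project's actual selection algorithm.
def select_by_elimination(target: str, candidates: list[str], llm) -> str:
    """Heuristics supply `candidates`; the LLM only eliminates, never generates."""
    while len(candidates) > 1:
        listing = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        reply = llm.complete(
            f"Target element: {target!r}\nCandidates:\n{listing}\n"
            "Answer with the single index LEAST likely to match."
        )
        candidates.pop(int(reply.strip()))
    return candidates[0]
```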
- We started building this project at a hackathon with the idea that LLMs could help make the web more accessible, particularly for individuals who face challenges using traditional input devices like keyboards and mice. There is a long way to go before this can be considered a true accessibility tool, as large performance improvements are still needed, but I'm very excited to keep building. If you have ideas or feedback, I'd love to hear from you.