Project Vision — Nova (voice-first agent)
Elevator pitch
Nova is a voice-first AI companion that turns speech into action: real-time transcription,
translation, and desktop automation. Instead of forcing users to learn UI controls, Nova lets
them speak naturally and either (A) use a minimal, unobtrusive UI or (B) hand control entirely to
the agent so it executes actions on their behalf. This update extends our original real-time
speech-to-text vision to center the agent Nova.
Mission statement
Make voice the simplest, most intuitive interface: let people speak to create, translate, control
apps, and get things done — with privacy, reliability, and minimal friction.
Vision
Imagine a user who never needs to hunt for menus: they say what they want and Nova types,
translates, subtitles, opens apps, or performs tasks — seamlessly — while a small unobtrusive
UI is available when needed.
Current status (what we already built)
● Backend: Voice typing (with translation built in), detect_input_boxes (YOLO-based), and fine-tuning work to turn the base LLM into Nova (instruction tuning + tool-aware behavior).
● Frontend: Landing page only (no functional UI yet).
(These extend the original project foundations around real-time transcription and overlay
captioning).
Two primary product modes (core product decision)
1) Traditional Seamless UI mode (minimal UI)
● Minimal, transparent controls (e.g., small overlay/top bar like Cluely).
● User can tap/hotkey to start voice features; overlay shows status, audio device,
language.
● Best for users who want lightweight visual controls with obvious affordances.
2) AI-Mode (conversation-first, agent mode)
● User talks naturally. Nova interprets intent, asks clarifying questions if needed, then calls
functions to act (type text, open apps, create subtitles, etc.).
● No UI required beyond a tiny listening indicator; Nova confirms important destructive
actions before executing.
● Best for hands-free workflows and a companion-like interaction.
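To make this concrete, here is an illustrative exchange (the field names below are a working assumption, not a fixed wire format): the user says "type what I dictate into this chat, in Vietnamese", Nova asks which window to target if that is ambiguous, and then emits a tool call such as:

{
  "tool": "voice_typing",
  "arguments": { "audio_source": "default_mic", "source": "en", "target": "vi" }
}

The runtime executes the call and reports success or failure back to Nova, which relays the outcome to the user.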
Core innovations with Nova
● Agent-first UX: Conversation drives actions (not menus).
● Multilingual real-time typing: clipboard-based paste + smart replacement to support
any language/keyboard.
● Vision + NLP integration: YOLO detects input boxes; Nova chooses and focuses a field before typing (see the detection sketch after this list).
● Function-calling safety: All system actions go through explicit JSON tool calls that the
runtime executes.
● Hybrid affordance: Users can switch at any time between UI mode and AI-mode.
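Below is a minimal sketch of the vision step referenced above, assuming an Ultralytics YOLO model fine-tuned on screenshots of input fields (the weights filename and the single-box auto-click policy are assumptions, not the final implementation):

import pyautogui                       # screenshot + mouse control
from ultralytics import YOLO

model = YOLO("input_box_detector.pt")  # hypothetical fine-tuned weights

def get_input_boxes():
    """Return candidate input-box centers as (x, y, confidence) tuples."""
    shot = pyautogui.screenshot()      # PIL image of the current screen
    results = model(shot)[0]           # run detection on the screenshot
    boxes = []
    for b in results.boxes:
        x1, y1, x2, y2 = b.xyxy[0].tolist()
        boxes.append(((x1 + x2) / 2, (y1 + y2) / 2, float(b.conf[0])))
    return boxes

def focus_single_box():
    """Auto-click when exactly one box is found; otherwise defer to Nova/overlay."""
    boxes = get_input_boxes()
    if len(boxes) == 1:
        x, y, _ = boxes[0]
        pyautogui.click(x, y)          # focus the field before typing
        return True
    return False                       # zero or multiple boxes: let Nova or the overlay decide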
Key functions (runtime API Nova will call)
open_app(app_name: string)
open_web(url: string)
voice_typing(audio_source: string, source: string, target: string)
live_subtitle(audio_source: string)
get_audio_devices()
get_input_boxes()
(These are the primitives Nova uses to interact with the system and apps.)
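A sketch of how the runtime might expose these primitives as a tool registry (the function bodies are placeholders and the dict-shaped return type is an assumption):

from typing import Callable, Dict

def open_app(app_name: str) -> dict: ...
def open_web(url: str) -> dict: ...
def voice_typing(audio_source: str, source: str, target: str) -> dict: ...
def live_subtitle(audio_source: str) -> dict: ...
def get_audio_devices() -> dict: ...
def get_input_boxes() -> dict: ...

# Registry the execution layer can use to look tools up by name.
TOOLS: Dict[str, Callable[..., dict]] = {
    f.__name__: f
    for f in (open_app, open_web, voice_typing, live_subtitle,
              get_audio_devices, get_input_boxes)
}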
Technical architecture (high-level)
● Client (.exe) — captures mic, streams audio to server; small overlay for UI mode;
receives function calls/commands to simulate typing, click, paste.
● Server / RealtimeSTT backend — Whisper/Faster-Whisper or VietASR for Vietnamese;
language-detection fallback; returns incremental transcripts and reset on VAD silence.
● Nova (fine-tuned LLM) — instruction-tuned for structured JSON outputs (tool calls, clarifications); hosted or run locally in the near term, depending on privacy and performance needs.
● Vision service — YOLO model to detect text input boxes + coordinates; returns
candidates for Nova to pick from.
● Controller / Execution layer — receives Nova’s JSON output and safely executes
actions (with confirmation rules, undo safeguards).
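A minimal sketch of the Controller / Execution layer, assuming tool calls arrive as JSON of the form {"tool": ..., "arguments": {...}} (the schema, the stub registry, and the confirmation policy shown are assumptions):

import json

def get_audio_devices() -> dict:
    return {"devices": []}            # stub; the real primitives live in the runtime

TOOLS = {"get_audio_devices": get_audio_devices}
NEEDS_CONFIRMATION = {"open_app", "open_web"}   # illustrative policy only

def confirm(prompt: str) -> bool:
    # Placeholder; in Nova this would be a voice or overlay confirmation.
    return input(f"{prompt} [y/N] ").strip().lower() == "y"

def execute(tool_call_json: str) -> dict:
    call = json.loads(tool_call_json)
    name, args = call["tool"], call.get("arguments", {})
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if name in NEEDS_CONFIRMATION and not confirm(f"Run {name}({args})?"):
        return {"ok": False, "error": "cancelled by user"}
    try:
        return {"ok": True, "result": fn(**args)}
    except Exception as exc:          # never pretend the action succeeded
        return {"ok": False, "error": str(exc)}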
Success criteria & measurable goals
● Voice typing latency: ≤ 500 ms (GPU) for incremental partials; ≤ 1 s on CPU-only setups (see the measurement sketch after this list).
● Transcription accuracy: ≥ 85% word accuracy (i.e., ≤ 15% WER) in controlled indoor conditions (improve with language-specific models for Vietnamese).
● Input safety: Nova must only simulate typing into a field when an input box is detected
and focused.
● User satisfaction (pilot): ≥ 80% positive usability rating in initial internal tests.
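A rough sketch of how the partial-latency target could be checked on the client (the send/receive hook names are placeholders for the client's own audio and transcript callbacks):

import time

class LatencyProbe:
    def __init__(self):
        self.last_audio_sent = None
        self.samples = []

    def on_audio_chunk_sent(self):
        self.last_audio_sent = time.perf_counter()

    def on_partial_received(self, text: str):
        if self.last_audio_sent is not None:
            self.samples.append(time.perf_counter() - self.last_audio_sent)

    def report(self):
        if self.samples:
            avg_ms = 1000 * sum(self.samples) / len(self.samples)
            print(f"avg partial latency: {avg_ms:.0f} ms over {len(self.samples)} partials")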
Roadmap (recommended priorities)
Immediate (now → next 2 weeks)
1. Lock voice-typing reliability (incremental transcripts, efficient paste/undo flow, VAD reset behavior; see the transcription-loop sketch after this list).
2. Finish Nova fine-tuning for JSON tool calling (SFT examples + select/clarify flows).
3. Test YOLO input detection end-to-end (auto-click for single box; overlay for multi-box).
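For item 1, a sketch of the server-side transcription loop, assuming the RealtimeSTT AudioToTextRecorder interface (the constructor arguments shown are assumptions and should be checked against the installed version):

from RealtimeSTT import AudioToTextRecorder

def on_partial(text):
    # Forward incremental partials to the client for paste/undo replacement.
    print("PARTIAL:", text)

def on_final(text):
    # VAD silence ends the utterance; the client commits the text and resets.
    print("FINAL:", text)

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        model="small",                          # Faster-Whisper model size
        language="",                            # empty string: auto language detection
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_partial,
        post_speech_silence_duration=0.7,       # seconds of silence before reset
    )
    while True:
        recorder.text(on_final)                 # blocks until an utterance completes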
Short term (2–6 weeks)
1. Build minimal Seamless UI overlay (status, language pick, quick toggle).
2. Integrate Nova’s tool calls with runtime executor (safeguards + confirm prompts).
3. Internal user testing (students + content creators).
Mid term (6–12 weeks)
1. Expand AI-Mode: robust multi-turn clarifications, session memory (last source/target
language, preferred apps).
2. Improve language handling: integrate PhoWhisper for Vietnamese and fall back to Faster-Whisper otherwise (see the routing sketch after this list).
3. Add live-subtitle overlay and cross-app subtitle positioning.
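For item 2, a sketch of the intended routing, using Faster-Whisper's language detection and handing Vietnamese audio to a PhoWhisper checkpoint (the confidence threshold and model sizes are assumptions):

from faster_whisper import WhisperModel
from transformers import pipeline

fw = WhisperModel("small")
pho = pipeline("automatic-speech-recognition", model="vinai/PhoWhisper-small")

def transcribe(path: str) -> str:
    segments, info = fw.transcribe(path)
    if info.language == "vi" and info.language_probability > 0.6:
        return pho(path)["text"]                    # Vietnamese-specific model
    return " ".join(seg.text for seg in segments)   # default Faster-Whisper path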
UX & interaction rules (principles)
● Agent-first by default but always make the action reversible where possible.
● Affirm & confirm for potentially destructive actions (sending messages, deleting text,
etc.).
● Show status visually (listening, typing-ready, typing-active).
● Only auto-type when an input field is confirmed (single detection + auto-click, or user choice via the overlay).
● Keep the user in control: an always-available hotkey to immediately stop listening and cancel pending actions.
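A small sketch of that kill-switch rule, using the Python keyboard library (the hotkey combination and the stop_all hook are placeholders):

import keyboard   # global hotkeys; may require elevated permissions on some platforms

def stop_all():
    # In the client this would stop the recorder and abort queued tool calls.
    print("Kill switch: stop listening, cancel pending actions.")

keyboard.add_hotkey("ctrl+alt+space", stop_all)
keyboard.wait()   # keep the listener alive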
Privacy, security & ethics
● Process audio in-memory by default; persist only with explicit consent.
● Local-first preference: run models locally where possible to reduce upload of raw
audio.
● Audit & logging: keep optional, opt-in logs for debugging; redact PII automatically (see the redaction sketch after this list).
● Fail-safe: Nova should never pretend to have executed an action; always report
success/failure back to the user.
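A sketch of the automatic PII redaction applied to opt-in debug logs (the two patterns below cover only emails and phone-like numbers; a real redactor would need to go further):

import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<phone>"),
]

def redact(line: str) -> str:
    for pattern, token in PII_PATTERNS:
        line = pattern.sub(token, line)
    return line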
Risks & mitigations
● Wrong-field typing / data loss → Mitigate by requiring field detection and confirmation
before typing; maintain undo-safe paste strategy.
● Model hallucination → Limit Nova to tool calling + factual replies for system actions; route open-ended requests to a safe conversational fallback.
● Privacy leakage → Local models + opt-in telemetry; ensure audio is not stored without
consent.
Next immediate actions (practical 7-day checklist)
1. Finalize Nova SFT dataset for tool calls and clarifications; run a short LoRA SFT pass.
2. Harden the paste/undo replacement flow (copy → paste on first partial; Ctrl+Z + paste on updates until VAD reset; see the sketch after this checklist).
3. Integrate YOLO detector output with get_input_boxes() call and test single/multiple
box flows.
4. Build minimal overlay with “Listening / Typing Ready / Typing” states.
5. Conduct 5 internal sessions and collect latencies & accuracy numbers.
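For item 2, a sketch of the paste/undo replacement flow using pyperclip and pyautogui (the exact key sequence is the current working assumption): the first partial is pasted as-is, each later partial undoes the previous paste and pastes the updated text, and a VAD reset commits the text.

import pyautogui
import pyperclip

_last_partial = None

def type_partial(text: str):
    global _last_partial
    if _last_partial is not None:
        pyautogui.hotkey("ctrl", "z")   # undo the previously pasted partial
    pyperclip.copy(text)
    pyautogui.hotkey("ctrl", "v")       # paste the updated transcript
    _last_partial = text

def on_vad_reset():
    global _last_partial
    _last_partial = None                # keep the committed text; start fresh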