
Project Vision — Nova (voice-first agent)

Elevator pitch​
Nova is a voice-first AI companion that turns speech into action: real-time transcription,
translation, and desktop automation. Instead of forcing users to learn UI controls, Nova lets
them speak naturally and either (A) use a minimal, unobtrusive UI or (B) hand control entirely to
the agent so it executes actions on their behalf. This update extends our original real-time
speech-to-text vision to center the agent Nova.

Mission statement
Make voice the simplest, most intuitive interface: let people speak to create, translate, control
apps, and get things done — with privacy, reliability, and minimal friction.

Vision
Imagine a user who never needs to hunt for menus: they say what they want and Nova types,
translates, subtitles, opens apps, or performs tasks — seamlessly — while a small unobtrusive
UI is available when needed.

Current status (what we already built)


●​ Backend: Voice-typing (with translation included), detect_input_boxes
(YOLO-based), fine-tuning work to make the LLM into Nova (instruction tuning +
tool-aware behavior).​

●​ Frontend: Landing page only (no functional UI yet).​

(These extend the original project foundations around real-time transcription and overlay
captioning).

Two primary product modes (core product decision)

1) Traditional Seamless UI mode (minimal UI)

●​ Minimal, transparent controls (e.g., small overlay/top bar like Cluely).​

●​ User can tap/hotkey to start voice features; overlay shows status, audio device,
language.​

●​ Best for users who want lightweight visual controls with obvious affordances.​

2) AI-Mode (conversation-first, agent mode)

●​ User talks naturally. Nova interprets intent, asks clarifying questions if needed, then calls
functions to act (type text, open apps, create subtitles, etc.).​

● No UI required beyond a tiny listening indicator; Nova confirms important destructive actions before executing.

●​ Best for hands-free workflows and a companion-like interaction.​

Core innovations with Nova


●​ Agent-first UX: Conversation drives actions (not menus).​

● Multilingual real-time typing: clipboard-based paste + smart replacement to support any language/keyboard.

●​ Vision + NLP integration: YOLO detects input boxes; Nova chooses and focuses a
field before typing.​

●​ Function-calling safety: All system actions go through explicit JSON tool calls that the
runtime executes.​

●​ Hybrid affordance: Users can switch at any time between UI mode and AI-mode.​

Key functions (runtime API Nova will call)


open_app(app_name: string)
open_web(url: string)
voice_typing(audio_source: string, source: string, target: string)
live_subtitle(audio_source: string)
get_audio_devices()
get_input_boxes()

(These are the primitives Nova uses to interact with the system and apps.)
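
For illustration, a single tool call from Nova to the runtime could look like the JSON below. Only the function names and signatures above are fixed; the surrounding field names ("tool", "arguments", "requires_confirmation") are placeholders for this sketch, not a final schema.

{
  "tool": "voice_typing",
  "arguments": {
    "audio_source": "default_mic",
    "source": "vi",
    "target": "en"
  },
  "requires_confirmation": false
}

The same payload shape would also be a natural target format for the tool-calling SFT examples planned in the roadmap: each training record pairs a user utterance with the JSON call Nova should emit.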

Technical architecture (high-level)


●​ Client (.exe) — captures mic, streams audio to server; small overlay for UI mode;
receives function calls/commands to simulate typing, click, paste.​

● Server / RealtimeSTT backend — Whisper/Faster-Whisper or VietASR for Vietnamese; language-detection fallback; returns incremental transcripts and resets on VAD silence.

● Nova (fine-tuned LLM) — instruction-tuned for structured JSON outputs (tool calls, clarifications); hosted in the short term or run locally, depending on privacy/performance needs.

●​ Vision service — YOLO model to detect text input boxes + coordinates; returns
candidates for Nova to pick from.​

●​ Controller / Execution layer — receives Nova’s JSON output and safely executes
actions (with confirmation rules, undo safeguards).​
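
A minimal sketch of the Controller / Execution layer, assuming the payload shape shown in the Key functions section and a confirm() callback supplied by the client UI (both assumptions; only the function names come from the runtime API above):

# Dispatch one of Nova's JSON tool calls with confirmation rules and explicit success/failure reporting.
import json

DESTRUCTIVE_TOOLS = {"open_app", "open_web"}  # assumption: which calls need an explicit confirm

def handle_tool_call(raw: str, runtime, confirm) -> dict:
    call = json.loads(raw)
    tool = call.get("tool", "")
    args = call.get("arguments", {})
    fn = getattr(runtime, tool, None)
    if fn is None:
        return {"tool": tool, "ok": False, "error": "unknown tool"}
    if tool in DESTRUCTIVE_TOOLS or call.get("requires_confirmation"):
        if not confirm(f"Run {tool} with {args}?"):
            return {"tool": tool, "ok": False, "error": "declined by user"}
    try:
        return {"tool": tool, "ok": True, "result": fn(**args)}
    except Exception as exc:
        # Never pretend an action succeeded; report the failure back to the user.
        return {"tool": tool, "ok": False, "error": str(exc)}

Returning an explicit ok/error result for every call is what lets Nova honour the fail-safe rule below: it only reports what the runtime actually did.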

Success criteria & measurable goals


●​ Voice typing latency: ≤ 500 ms (GPU) for incremental partials; ≤ 1 s (CPU) for small
setups.​

● Transcription accuracy: word-level accuracy ≥ 85% (WER ≤ 15%) in controlled indoor conditions (improve with language-specific models for Vietnamese).

●​ Input safety: Nova must only simulate typing into a field when an input box is detected
and focused.​

● User satisfaction (pilot): ≥ 80% positive usability rating in initial internal tests.

Roadmap (recommended priorities)
Immediate (now → next 2 weeks)

1.​ Lock voice-typing reliability (incremental transcripts, efficient paste/undo flow, VAD
reset behavior).​

2.​ Finish Nova fine-tuning for JSON tool calling (SFT examples + select/clarify flows).​

3.​ Test YOLO input detection end-to-end (auto-click for single box; overlay for multi-box).​
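
A sketch of the single/multi-box decision from item 3, assuming get_input_boxes() returns a list of dicts with pixel coordinates and a detector confidence (the real return shape may differ):

# Decide between auto-click (one detected field) and overlay selection (several fields).
def choose_input_box(candidates, auto_click, overlay_picker, min_confidence=0.5):
    candidates = [c for c in candidates if c["confidence"] >= min_confidence]
    if not candidates:
        return None  # no input field detected: never auto-type
    if len(candidates) == 1:
        box = candidates[0]
        auto_click(box["x"] + box["w"] // 2, box["y"] + box["h"] // 2)  # focus the single field
        return box
    return overlay_picker(candidates)  # several fields: let the user pick via the overlay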

Short term (2–6 weeks)

1. Build minimal Seamless UI overlay (status, language pick, quick toggle); see the overlay sketch after this list.

2.​ Integrate Nova’s tool calls with runtime executor (safeguards + confirm prompts).​

3.​ Internal user testing (students + content creators).​
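
A minimal sketch of the Seamless UI overlay from step 1, using tkinter from the standard library; the geometry and styling are placeholders, and the real client (.exe) may use a different UI toolkit:

# Tiny always-on-top status overlay cycling through the three pipeline states.
import tkinter as tk

root = tk.Tk()
root.overrideredirect(True)          # no title bar or borders
root.attributes("-topmost", True)    # stay above other windows
root.geometry("+40+40")              # pin near the top-left corner of the screen
status = tk.Label(root, text="Listening", padx=12, pady=4)
status.pack()

def set_state(state: str) -> None:
    # Called by the client when the pipeline changes state:
    # "Listening", "Typing Ready", or "Typing".
    status.config(text=state)

root.mainloop()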

Mid term (6–12 weeks)

1.​ Expand AI-Mode: robust multi-turn clarifications, session memory (last source/target
language, preferred apps).​

2. Improve language handling: integrate PhoWhisper for Vietnamese and fall back to Faster-Whisper otherwise (a routing sketch follows this list).

3.​ Add live-subtitle overlay and cross-app subtitle positioning.​
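
A routing sketch for the PhoWhisper integration in item 2, assuming the "small" Faster-Whisper checkpoint for detection and the vinai/PhoWhisper-small checkpoint on Hugging Face for Vietnamese (swap in whichever checkpoints the project actually ships):

# Detect the spoken language with Faster-Whisper; re-run Vietnamese audio through PhoWhisper.
from faster_whisper import WhisperModel
from transformers import pipeline

general_model = WhisperModel("small")
vietnamese_asr = pipeline("automatic-speech-recognition", model="vinai/PhoWhisper-small")

def transcribe(audio_path: str) -> str:
    segments, info = general_model.transcribe(audio_path)
    if info.language == "vi":
        # Vietnamese detected: use the language-specific model.
        return vietnamese_asr(audio_path)["text"]
    # Any other language: keep the Faster-Whisper result.
    return " ".join(segment.text for segment in segments)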

UX & interaction rules (principles)


●​ Agent-first by default but always make the action reversible where possible.​

●​ Affirm & confirm for potentially destructive actions (sending messages, deleting text,
etc.).​

●​ Show status visually (listening, typing-ready, typing-active).​


●​ Only auto-type when input field is confirmed (single detection + auto-click, or user
choice via overlay).​

● Keep the user in control: an easy, always-available hotkey to stop listening and cancel in-flight actions.

Privacy, security & ethics


●​ Process audio in-memory by default; persist only with explicit consent.​

●​ Local-first preference: run models locally where possible to reduce upload of raw
audio.​

●​ Audit & logging: keep optional logs for debugging with opt-in; redact PII automatically.​

●​ Fail-safe: Nova should never pretend to have executed an action; always report
success/failure back to the user.​

Risks & mitigations


●​ Wrong-field typing / data loss → Mitigate by requiring field detection and confirmation
before typing; maintain undo-safe paste strategy.​

● Model hallucination → Limit Nova to tool-calling + factual replies for system actions; route open-ended requests to a safe fallback.

●​ Privacy leakage → Local models + opt-in telemetry; ensure audio is not stored without
consent.​

Next immediate actions (practical 7-day checklist)
1.​ Finalize Nova SFT dataset for tool calls and clarifications; run a short LoRA SFT pass.​

2. Harden paste/undo replacement flow (copy → paste on first partial; Ctrl+Z + paste on updates until VAD reset); see the sketch after this checklist.

3.​ Integrate YOLO detector output with get_input_boxes() call and test single/multiple
box flows.​

4.​ Build minimal overlay with “Listening / Typing Ready / Typing” states.​

5.​ Conduct 5 internal sessions and collect latencies & accuracy numbers.
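
A sketch of the paste/undo replacement flow from item 2 of this checklist, assuming pyperclip for the clipboard and pyautogui for keystroke simulation (the shipping client may use different input libraries):

# Replace the previously pasted partial with each new transcript update; finalize on VAD reset.
import pyperclip
import pyautogui

class PartialTyper:
    def __init__(self):
        self.has_pasted = False

    def update(self, partial_text: str) -> None:
        if self.has_pasted:
            pyautogui.hotkey("ctrl", "z")   # undo the previous partial paste
        pyperclip.copy(partial_text)
        pyautogui.hotkey("ctrl", "v")       # paste the latest partial
        self.has_pasted = True

    def reset(self) -> None:
        # Called on VAD silence: the current text is final, so stop undoing it.
        self.has_pasted = False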
