Project Vision — Nova (voice-first agent)
Elevator pitch
Nova is a voice-first AI companion that turns speech into action: real-time transcription,
translation, and desktop automation. Instead of forcing users to learn UI controls, Nova lets
them speak naturally and either (A) use a minimal, unobtrusive UI or (B) hand control entirely to
the agent so it executes actions on their behalf. This update extends our original real-time
speech-to-text vision to center the agent Nova.
Mission statement
Make voice the simplest, most intuitive interface: let people speak to create, translate, control
apps, and get things done — with privacy, reliability, and minimal friction.
Vision
Imagine a user who never needs to hunt for menus: they say what they want and Nova types,
translates, subtitles, opens apps, or performs tasks — seamlessly — while a small unobtrusive
UI is available when needed.
Current status (what we already built)
● Backend: Voice typing (with translation built in), detect_input_boxes (YOLO-based), and fine-tuning work to turn the base LLM into Nova (instruction tuning + tool-aware behavior).
● Frontend: Landing page only (no functional UI yet).
(These extend the original project foundations around real-time transcription and overlay
captioning).
Two primary product modes (core product decision)
1) Traditional Seamless UI mode (minimal UI)
● Minimal, transparent controls (e.g., small overlay/top bar like Cluely).
● User can tap/hotkey to start voice features; overlay shows status, audio device,
language.
● Best for users who want lightweight visual controls with obvious affordances.
2) AI-Mode (conversation-first, agent mode)
● User talks naturally. Nova interprets intent, asks clarifying questions if needed, then calls
functions to act (type text, open apps, create subtitles, etc.).
● No UI required beyond a tiny listening indicator; Nova confirms important destructive
actions before executing.
● Best for hands-free workflows and a companion-like interaction.
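To make this concrete, here is an illustrative exchange (the field names below are a working assumption, not a fixed wire format): the user says "type what I dictate into this chat, in Vietnamese", Nova asks which window to target if that is ambiguous, and then emits a tool call such as:

{
  "tool": "voice_typing",
  "arguments": { "audio_source": "default_mic", "source": "en", "target": "vi" }
}

The runtime executes the call and reports success or failure back to Nova, which relays the outcome to the user.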
Core innovations with Nova
● Agent-first UX: Conversation drives actions (not menus).
● Multilingual real-time typing: clipboard-based paste + smart replacement to support
any language/keyboard.
● Vision + NLP integration: YOLO detects input boxes; Nova chooses and focuses a field before typing (see the detection sketch after this list).
● Function-calling safety: All system actions go through explicit JSON tool calls that the
runtime executes.
● Hybrid affordance: Users can switch at any time between UI mode and AI-mode.
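Below is a minimal sketch of the vision step referenced above, assuming an Ultralytics YOLO model fine-tuned on screenshots of input fields (the weights filename and the single-box auto-click policy are assumptions, not the final implementation):

import pyautogui                       # screenshot + mouse control
from ultralytics import YOLO

model = YOLO("input_box_detector.pt")  # hypothetical fine-tuned weights

def get_input_boxes():
    """Return candidate input-box centers as (x, y, confidence) tuples."""
    shot = pyautogui.screenshot()      # PIL image of the current screen
    results = model(shot)[0]           # run detection on the screenshot
    boxes = []
    for b in results.boxes:
        x1, y1, x2, y2 = b.xyxy[0].tolist()
        boxes.append(((x1 + x2) / 2, (y1 + y2) / 2, float(b.conf[0])))
    return boxes

def focus_single_box():
    """Auto-click when exactly one box is found; otherwise defer to Nova/overlay."""
    boxes = get_input_boxes()
    if len(boxes) == 1:
        x, y, _ = boxes[0]
        pyautogui.click(x, y)          # focus the field before typing
        return True
    return False                       # zero or multiple boxes: let Nova or the overlay decide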
Key functions (runtime API Nova will call)
open_app(app_name: string)
open_web(url: string)
voice_typing(audio_source: string, source: string, target: string)
live_subtitle(audio_source: string)
get_audio_devices()
get_input_boxes()
(These are the primitives Nova uses to interact with the system and apps.)
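A sketch of how the runtime might expose these primitives as a tool registry (the function bodies are placeholders and the dict-shaped return type is an assumption):

from typing import Callable, Dict

def open_app(app_name: str) -> dict: ...
def open_web(url: str) -> dict: ...
def voice_typing(audio_source: str, source: str, target: str) -> dict: ...
def live_subtitle(audio_source: str) -> dict: ...
def get_audio_devices() -> dict: ...
def get_input_boxes() -> dict: ...

# Registry the execution layer can use to look tools up by name.
TOOLS: Dict[str, Callable[..., dict]] = {
    f.__name__: f
    for f in (open_app, open_web, voice_typing, live_subtitle,
              get_audio_devices, get_input_boxes)
}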
Technical architecture (high-level)
● Client (.exe) — captures mic, streams audio to server; small overlay for UI mode;
receives function calls/commands to simulate typing, click, paste.
● Server / RealtimeSTT backend — Whisper/Faster-Whisper or VietASR for Vietnamese;
language-detection fallback; returns incremental transcripts and reset on VAD silence.
● Nova (fine-tuned LLM) — instruction-tuned for structured JSON outputs (tool calls, clarifications); hosted or run locally in the near term, depending on privacy and performance needs.
● Vision service — YOLO model to detect text input boxes + coordinates; returns
candidates for Nova to pick from.
● Controller / Execution layer — receives Nova’s JSON output and safely executes
actions (with confirmation rules, undo safeguards).
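A minimal sketch of the Controller / Execution layer, assuming tool calls arrive as JSON of the form {"tool": ..., "arguments": {...}} (the schema, the stub registry, and the confirmation policy shown are assumptions):

import json

def get_audio_devices() -> dict:
    return {"devices": []}            # stub; the real primitives live in the runtime

TOOLS = {"get_audio_devices": get_audio_devices}
NEEDS_CONFIRMATION = {"open_app", "open_web"}   # illustrative policy only

def confirm(prompt: str) -> bool:
    # Placeholder; in Nova this would be a voice or overlay confirmation.
    return input(f"{prompt} [y/N] ").strip().lower() == "y"

def execute(tool_call_json: str) -> dict:
    call = json.loads(tool_call_json)
    name, args = call["tool"], call.get("arguments", {})
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if name in NEEDS_CONFIRMATION and not confirm(f"Run {name}({args})?"):
        return {"ok": False, "error": "cancelled by user"}
    try:
        return {"ok": True, "result": fn(**args)}
    except Exception as exc:          # never pretend the action succeeded
        return {"ok": False, "error": str(exc)}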
Success criteria & measurable goals
● Voice typing latency: ≤ 500 ms (GPU) for incremental partials; ≤ 1 s on CPU-only setups (see the measurement sketch after this list).
● Transcription accuracy: ≥ 85% word accuracy (i.e., ≤ 15% WER) in controlled indoor conditions (improve with language-specific models for Vietnamese).
● Input safety: Nova must only simulate typing into a field when an input box is detected
and focused.
● User satisfaction (pilot): ≥ 80% positive usability rating in initial internal tests.
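A rough sketch of how the partial-latency target could be checked on the client (the send/receive hook names are placeholders for the client's own audio and transcript callbacks):

import time

class LatencyProbe:
    def __init__(self):
        self.last_audio_sent = None
        self.samples = []

    def on_audio_chunk_sent(self):
        self.last_audio_sent = time.perf_counter()

    def on_partial_received(self, text: str):
        if self.last_audio_sent is not None:
            self.samples.append(time.perf_counter() - self.last_audio_sent)

    def report(self):
        if self.samples:
            avg_ms = 1000 * sum(self.samples) / len(self.samples)
            print(f"avg partial latency: {avg_ms:.0f} ms over {len(self.samples)} partials")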
Roadmap (recommended priorities)
Immediate (now → next 2 weeks)
1. Lock voice-typing reliability (incremental transcripts, efficient paste/undo flow, VAD reset behavior; see the transcription-loop sketch after this list).
2. Finish Nova fine-tuning for JSON tool calling (SFT examples + select/clarify flows).
3. Test YOLO input detection end-to-end (auto-click for single box; overlay for multi-box).
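For item 1, a sketch of the server-side transcription loop, assuming the RealtimeSTT AudioToTextRecorder interface (the constructor arguments shown are assumptions and should be checked against the installed version):

from RealtimeSTT import AudioToTextRecorder

def on_partial(text):
    # Forward incremental partials to the client for paste/undo replacement.
    print("PARTIAL:", text)

def on_final(text):
    # VAD silence ends the utterance; the client commits the text and resets.
    print("FINAL:", text)

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        model="small",                          # Faster-Whisper model size
        language="",                            # empty string: auto language detection
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_partial,
        post_speech_silence_duration=0.7,       # seconds of silence before reset
    )
    while True:
        recorder.text(on_final)                 # blocks until an utterance completes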
Short term (2–6 weeks)
1. Build minimal Seamless UI overlay (status, language pick, quick toggle).
2. Integrate Nova’s tool calls with runtime executor (safeguards + confirm prompts).
3. Internal user testing (students + content creators).
Mid term (6–12 weeks)
1. Expand AI-Mode: robust multi-turn clarifications, session memory (last source/target
language, preferred apps).
2. Improve language handling: integrate PhoWhisper for Vietnamese and fall back to Faster-Whisper otherwise (see the routing sketch after this list).
3. Add live-subtitle overlay and cross-app subtitle positioning.
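For item 2, a sketch of the intended routing, using Faster-Whisper's language detection and handing Vietnamese audio to a PhoWhisper checkpoint (the confidence threshold and model sizes are assumptions):

from faster_whisper import WhisperModel
from transformers import pipeline

fw = WhisperModel("small")
pho = pipeline("automatic-speech-recognition", model="vinai/PhoWhisper-small")

def transcribe(path: str) -> str:
    segments, info = fw.transcribe(path)
    if info.language == "vi" and info.language_probability > 0.6:
        return pho(path)["text"]                    # Vietnamese-specific model
    return " ".join(seg.text for seg in segments)   # default Faster-Whisper path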
UX & interaction rules (principles)
● Agent-first by default but always make the action reversible where possible.
● Affirm & confirm for potentially destructive actions (sending messages, deleting text,
etc.).
● Show status visually (listening, typing-ready, typing-active).
● Only auto-type when an input field is confirmed (single detection + auto-click, or user choice via the overlay).
● Keep the user in control: an always-available hotkey to immediately stop listening and cancel pending actions.
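A small sketch of that kill-switch rule, using the Python keyboard library (the hotkey combination and the stop_all hook are placeholders):

import keyboard   # global hotkeys; may require elevated permissions on some platforms

def stop_all():
    # In the client this would stop the recorder and abort queued tool calls.
    print("Kill switch: stop listening, cancel pending actions.")

keyboard.add_hotkey("ctrl+alt+space", stop_all)
keyboard.wait()   # keep the listener alive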
Privacy, security & ethics
● Process audio in-memory by default; persist only with explicit consent.
● Local-first preference: run models locally where possible to reduce upload of raw
audio.
● Audit & logging: keep optional, opt-in logs for debugging; redact PII automatically (see the redaction sketch after this list).
● Fail-safe: Nova should never pretend to have executed an action; always report
success/failure back to the user.
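A sketch of the automatic PII redaction applied to opt-in debug logs (the two patterns below cover only emails and phone-like numbers; a real redactor would need to go further):

import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<phone>"),
]

def redact(line: str) -> str:
    for pattern, token in PII_PATTERNS:
        line = pattern.sub(token, line)
    return line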
Risks & mitigations
● Wrong-field typing / data loss → Mitigate by requiring field detection and confirmation
before typing; maintain undo-safe paste strategy.
● Model hallucination → Limit Nova to tool calling + factual replies for system actions; route open-ended requests to a safe conversational fallback.
● Privacy leakage → Local models + opt-in telemetry; ensure audio is not stored without
consent.
Next immediate actions (practical 7-day checklist)
1. Finalize Nova SFT dataset for tool calls and clarifications; run a short LoRA SFT pass.
2. Harden the paste/undo replacement flow (copy → paste on first partial; Ctrl+Z + paste on updates until VAD reset; see the sketch after this checklist).
3. Integrate YOLO detector output with get_input_boxes() call and test single/multiple
box flows.
4. Build minimal overlay with “Listening / Typing Ready / Typing” states.
5. Conduct 5 internal sessions and collect latencies & accuracy numbers.
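For item 2, a sketch of the paste/undo replacement flow using pyperclip and pyautogui (the exact key sequence is the current working assumption): the first partial is pasted as-is, each later partial undoes the previous paste and pastes the updated text, and a VAD reset commits the text.

import pyautogui
import pyperclip

_last_partial = None

def type_partial(text: str):
    global _last_partial
    if _last_partial is not None:
        pyautogui.hotkey("ctrl", "z")   # undo the previously pasted partial
    pyperclip.copy(text)
    pyautogui.hotkey("ctrl", "v")       # paste the updated transcript
    _last_partial = text

def on_vad_reset():
    global _last_partial
    _last_partial = None                # keep the committed text; start fresh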