Version 8.1.9 | Author: Indu Shekhar Jha | Date: July 7, 2025
Voice-Activated Banking Platform is a secure, user-centric application enabling customers to perform banking operations using natural voice commands. Designed for accessibility and robust security, it integrates advanced voice recognition, liveness detection, and biometric authentication to ensure only authorized users can execute sensitive transactions.
Target: Retail banking customers, financial institutions, and accessibility-focused organizations seeking seamless, hands-free banking experiences.
mindmap
  root((Voice-Activated Banking Platform))
    Executive Summary
      Secure, User-Centric Application
      Natural Voice Commands
      Accessibility & Robust Security
      Integrates Voice Recognition
      Liveness Detection
      Biometric Authentication
    Target Audience
      Retail Banking Customers
      Financial Institutions
      Accessibility-Focused Organizations
    Key Features
      Voice-driven Command Execution
      Multi-factor Authentication
      Real-time Speech Recognition & Synthesis
      Secure Session & Error Management
      Modular, Extensible Architecture
    System Overview
      Frontend Architecture
        Component Hierarchy
          App Root
          Output Display
          State Controller
          SpeechService
          MediaRecorderService
          ApiService
          TransactionTable
        State Management (BankAppState)
          AWAITING_ACTIVATION
          LISTENING_FOR_COMMAND
          PROCESSING
          PRESENTING
        Event Handling & UI Flow
          Activation: 'hello bank'
          Command Recording & Sending
          Verification: OTP Prompt & Record
          Response Presentation & Reset
        Error Recovery
          Errors Routed to PRESENTING
          Graceful UI Inform
          Resets to AWAITING_ACTIVATION
      Backend Architecture
        API Routing: Modular, Blueprint-based
        Authentication & Session Handling
          JWT for Stateless Authentication
          Flask Sessions for OTP/Liveness
        External Integrations
          Voice/Biometric Modules
          Database
        Detailed Overview (Flask-based REST API)
          Key Modules
          API Endpoint Details
          Security & Compliance
    API Specification
      /process_command (POST)
      /generate_otp (GET)
      /verify_otp_audio (POST)
      /secret (GET)
      Example Request/Response
    Database Design
      ER Diagram
        USER
        TRANSACTION_HISTORY
      Table Definitions
        User Table
        TransactionHistory Table
      Sample Queries
        Get user by email
        Get last 5 transactions
    Voice & Biometric Modules
      Speech Recognition & Synthesis
      Intent Handling
        Classifies User Intent
        Extracts Entities (Amount, Recipient)
    Security & Compliance
      Authentication Flow
        User Speaks Command
        Audio for Voice Verification
        Liveness Required (OTP)
        User Repeats OTP
        OTP Audio Sent
        Verification Success
        Original Command Execution
        Result Presentation
      Data Encryption (HTTPS, At Rest)
      Authentication (JWT)
      Liveness (Session-based OTP, Expiry)
      Data Minimization
      Safe Fallback on Error
    Deployment & DevOps Pipeline
      CI/CD Pipeline
      Configuration (Env Vars, Secrets)
      Monitoring (Uptime, Error, Security)
      Rollbacks (Blue/Green, Versioned)
    Frontend Architecture (Detailed)
      High-Level Overview
      State Management & UI Flow
      Low-Level Design Details
    Appendices
      Glossary
      References
      Future Roadmap
Architecture diagram:
User Device
(Mic, Browser)
⬆️⬇️
Frontend (SPA)
(JS, HTML, CSS)
⬆️⬇️
Backend (API)
(App Server)
- Frontend: SPA, event-driven, browser APIs for speech/audio, stateful UI, Tailwind CSS for styling.
- Backend: RESTful API, authentication/session, business logic, biometric/voice integration.
- Database: Relational, stores user, transaction, and session data.
- Voice/Biometric: Speech-to-text, text-to-speech, voice matching, liveness detection.
- App Root
- Output Display
- State Controller
- SpeechService (recognition/synthesis)
- MediaRecorderService (audio capture)
- ApiService (backend communication)
- TransactionTable (dynamic rendering)
- Centralized state machine (`BankAppState`):
  - `AWAITING_ACTIVATION`: Idle, waiting for trigger phrase.
  - `LISTENING_FOR_COMMAND`: Recording and recognizing user command.
  - `PROCESSING`: Sending command/audio to backend, handling OTP/liveness.
  - `PRESENTING`: Presenting results or errors, then resetting.
- All transitions go through a single `transitionToState` method, ensuring cleanup and consistent UI.
- User says "hello bank" → transitions to command listening.
- On command, records audio, sends to backend.
- If liveness/OTP required, prompts and records OTP audio.
- On backend response, presents result or error, then resets.
- All errors (network, backend, speech) are caught and routed to the `PRESENTING` state.
- After presenting, the app always resets to `AWAITING_ACTIVATION`.
- UI and state are never left in an inconsistent state.
- Modular blueprint-based routing for all endpoints.
- Endpoints for command processing, OTP generation/verification, authentication, and user/session management.
- JWT-based authentication for all sensitive endpoints.
- Session management for OTP and liveness flows, with expiry and state tracking.
- Voice/biometric modules for speech recognition, synthesis, and authentication.
- Database for persistent user and transaction data.
The backend is a modular, Flask-based REST API server responsible for all business logic, authentication, session management, voice/biometric verification, and database operations. It is structured for maintainability, security, and extensibility.
- run.py: Application entry point, configures logging, initializes the Flask app, and ensures database schema creation.
- intent.py: Handles all voice command processing, OTP/liveness flows, intent recognition, and transaction logic.
- auth.py: Manages user registration and login, including voice and face biometric enrollment and verification.
- utils.py: Provides utility functions for OTP generation, normalization, fuzzy matching, audio/face processing, and biometric verification.
- Logging is configured at INFO level with timestamps and log levels for traceability.
- The Flask app is created via a factory (`create_app`), ensuring modularity and testability.
- On startup, the database schema is created if not present.
- The app runs in debug mode for development; production should use a WSGI server and disable debug.
- All endpoints are organized into Flask blueprints (`intent_bp`, `auth_bp`) for separation of concerns.
- Each blueprint encapsulates related routes and logic (e.g., intent, authentication).
- JWT Authentication: All sensitive endpoints require a valid JWT access token, created on successful login and checked on each request.
- Session Management: Flask session is used for OTP/liveness state, with `session.permanent = True` to ensure persistence across requests. OTPs have expiry timestamps.
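
The expiry check on session-stored OTPs could be sketched as follows. This is a minimal illustration, not the project's code: a plain dict stands in for the Flask session, and the names `store_otp`, `otp_is_valid`, and the 120-second lifetime are assumptions made for the example.

```python
import time

OTP_TTL_SECONDS = 120  # assumed lifetime; the source does not state the real value


def store_otp(session, otp, now=None):
    """Save the OTP and its expiry timestamp in a session-like mapping."""
    now = time.time() if now is None else now
    session["otp"] = otp
    session["otp_expires_at"] = now + OTP_TTL_SECONDS
    session["liveness_verified"] = False


def otp_is_valid(session, candidate, now=None):
    """Return True only if the candidate matches and the OTP has not expired."""
    now = time.time() if now is None else now
    expires_at = session.get("otp_expires_at")
    if expires_at is None or now > expires_at:
        # Expired or never issued: clear any stale challenge.
        session.pop("otp", None)
        session.pop("otp_expires_at", None)
        return False
    return candidate == session.get("otp")
```

Passing `now` explicitly keeps the expiry logic testable without waiting for real time to elapse.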
- Voice Verification: On registration, user audio is converted to a standard format and stored. On login/command, new audio is compared to the enrolled sample using a voice verification model. Distance thresholds and logging are used for security and debugging.
- Face Verification: On registration, a face encoding is extracted and stored. On login, the provided face image is compared to the stored encoding using facial recognition.
- Liveness/OTP: For every command, unless already verified in the session, a random OTP is generated, spoken to the user, and must be repeated back. The backend verifies both the voice and the recognized OTP text using fuzzy matching.
- User commands are recognized via speech-to-text and sent to `/process_command`.
- Intent is predicted using a trained logistic regression model on TF-IDF features (intent.py).
- Supported intents: CheckBalance, TransferMoney, GetLastTransactions.
- For transfers, entity extraction parses recipient and amount from the command text.
- All actions are logged for traceability.
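
For transfers, the entity extraction step might look roughly like the sketch below. The source does not say how extraction is implemented, so the regular expressions, the assumed command phrasing, and the function name `extract_transfer_entities` are all hypothetical.

```python
import re

# Assumed phrasing: "transfer 250 to alice" or "send 99.50 dollars to bob".
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)")
RECIPIENT_RE = re.compile(r"\bto\s+(.+)$", re.IGNORECASE)


def extract_transfer_entities(command):
    """Return (amount, recipient) parsed from the command text.

    Either element is None when it cannot be found.
    """
    amount_match = AMOUNT_RE.search(command)
    recipient_match = RECIPIENT_RE.search(command.strip())
    amount = float(amount_match.group(1)) if amount_match else None
    recipient = recipient_match.group(1).strip() if recipient_match else None
    return amount, recipient
```

A caller would likely reject the transfer when either value is `None` rather than guess at the missing part.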
- All endpoints use try/except/finally blocks for robust error handling.
- Errors are logged with context and returned as structured JSON with appropriate HTTP status codes.
- Resource cleanup (temp files) is always performed in finally blocks.
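
The try/except/finally pattern with temp-file cleanup can be illustrated with a small standalone function. This is a sketch under assumptions: `handle_upload` and the `verify` callback are hypothetical names, and the function returns a `(body, status)` pair in place of a real Flask response.

```python
import os
import tempfile


def handle_upload(audio_bytes, verify):
    """Write the upload to a temp file, verify it, and always delete the file."""
    path = None
    try:
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            tmp.write(audio_bytes)
            path = tmp.name
        verify(path)  # hypothetical verification step; may raise
        return {"success": True}, 200
    except ValueError as exc:
        # Client-side problem: report the message without a stack trace.
        return {"error": str(exc)}, 400
    except Exception:
        # Unexpected failure: generic message only; details belong in server logs.
        return {"error": "Internal server error"}, 500
    finally:
        # Runs on every path, including the early returns above.
        if path and os.path.exists(path):
            os.remove(path)
```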
| Endpoint | Method | Auth | Parameters | Request Format | Response Format | Error Handling |
|---|---|---|---|---|---|---|
| `/register` | POST | No | email, user_id, audio-file, face-image | multipart/form-data | JSON: message/error | 400/500 JSON error |
| `/login` | POST | No | email, audio-file, face-image | multipart/form-data | JSON: access_token/error | 400/401/404/500 JSON error |
| `/process_command` | POST | Yes | command, voice_sample | multipart/form-data | JSON: result, error | 400/401/404/500 JSON error |
| `/generate_otp` | GET | Yes | - | - | JSON: otp_numeric, text | 400/401 JSON error |
| `/verify_otp_audio` | POST | Yes | otp_audio | multipart/form-data | JSON: success, error | 400/401/404/500 JSON error |
| `/secret` | GET | Yes | access_token (cookie) | - | HTML/JSON | 401/404 JSON error |
- Receives command and voice sample.
- Authenticates user via JWT.
- Verifies voice sample against enrolled audio.
- If liveness/OTP not verified, generates OTP and returns challenge.
- If OTP verified, predicts intent and executes command (balance, transfer, history).
- Logs all actions and errors.
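
The decision flow in the steps above can be condensed into a small function. This is only a sketch: it assumes the JWT check has already passed, takes the outcome of voice verification and the session's liveness flag as plain booleans, and uses the response keys `liveness_required` and `result` as illustrative names, not the documented response format.

```python
def process_command(voice_matches, liveness_verified, intent, handlers):
    """Decide the response for one command, mirroring the listed steps."""
    if not voice_matches:
        return {"error": "Voice verification failed"}, 401
    if not liveness_verified:
        # The caller is expected to generate and speak an OTP challenge next.
        return {"liveness_required": True}, 200
    handler = handlers.get(intent)
    if handler is None:
        return {"error": "Unsupported command"}, 400
    return {"result": handler()}, 200
```

Here `handlers` maps an intent name such as `CheckBalance` to a callable, which keeps the dispatch separate from the verification checks.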
- Receives email, user_id, audio, and face image.
- Converts and stores audio, extracts and stores face encoding.
- Creates new user in database.
- Returns success or error.
- OTP Generation: Random 4-digit numeric and text phrase.
- OTP Normalization & Fuzzy Matching: Cleans and compares recognized text to expected OTP, allowing for minor errors.
- Audio Processing: Converts uploaded audio to standard format for verification.
- Face Processing: Extracts face encodings for biometric matching.
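
OTP generation, normalization, and fuzzy matching could be sketched with the standard library as shown below. The source does not name the matching algorithm, so `difflib.SequenceMatcher`, the 0.85 threshold, and the function names are assumptions chosen for illustration.

```python
import re
import secrets
from difflib import SequenceMatcher

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def generate_otp():
    """Return a random 4-digit OTP as (numeric string, spoken phrase)."""
    digits = [secrets.randbelow(10) for _ in range(4)]
    numeric = "".join(str(d) for d in digits)
    text = " ".join(DIGIT_WORDS[d] for d in digits)
    return numeric, text


def normalize(recognized):
    """Lowercase, drop punctuation, and map digit characters to words."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", recognized.lower())
    cleaned = re.sub(r"(\d)", lambda m: f" {DIGIT_WORDS[int(m.group(1))]} ", cleaned)
    return " ".join(cleaned.split())


def otp_matches(recognized, expected_text, threshold=0.85):
    """Fuzzy-compare recognized speech to the expected phrase (threshold is assumed)."""
    ratio = SequenceMatcher(None, normalize(recognized), expected_text).ratio()
    return ratio >= threshold
```

Normalizing digits to words first means a recognizer that returns "4821" and one that returns "four eight two one" are compared on equal terms.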
- Uses SQLAlchemy ORM for all database operations.
- User table stores email, user_id, audio_file (binary), face_encoding (array), and balance.
- TransactionHistory table records all transactions with sender, recipient, type, amount, and timestamp.
- All queries are parameterized, and the lookup columns (user_id, email, acc_email, sent_to_email) are indexed for performance.
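
A parameterized query with an indexed lookup column can be demonstrated as follows. The backend uses SQLAlchemy; this sketch swaps in the standard-library `sqlite3` module so it stays self-contained, and the function name `last_transactions` and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransactionHistory ("
    "transaction_id INTEGER PRIMARY KEY, acc_email TEXT, sent_to_email TEXT, "
    "transaction_type TEXT, amount REAL, timestamp TEXT)"
)
conn.execute("CREATE INDEX idx_acc_email ON TransactionHistory (acc_email)")


def last_transactions(conn, email, limit=5):
    """Fetch the most recent transactions for an account using bound parameters."""
    cur = conn.execute(
        "SELECT sent_to_email, amount, timestamp FROM TransactionHistory "
        "WHERE acc_email = ? ORDER BY timestamp DESC LIMIT ?",
        (email, limit),
    )
    return cur.fetchall()


# Illustrative rows only; the addresses and amounts are invented.
rows = [("a@example.com", f"user{i}@example.com", "transfer", 10.0 * i,
         f"2025-07-0{i}T12:00:00") for i in range(1, 8)]
conn.executemany(
    "INSERT INTO TransactionHistory "
    "(acc_email, sent_to_email, transaction_type, amount, timestamp) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)
```

Because the email is passed as a bound parameter, input containing quote characters is treated as data and cannot alter the SQL statement.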
- All sensitive data is encrypted in transit (HTTPS) and at rest (database encryption recommended).
- JWT tokens are used for stateless authentication.
- OTPs and session data are never exposed to the client except as needed for liveness.
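- All biometric data is processed and stored securely. Face images are reduced to encodings, the enrolled voice sample is kept as converted audio (the audio_file column) for later comparison, and temporary upload files are deleted after processing.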
- Logging avoids sensitive data exposure.
- All errors are handled gracefully, with no stack traces or sensitive info leaked to clients.
Example Request:
POST /process_command
Authorization: Bearer <token>
Content-Type: multipart/form-data
command=Check my balance
voice_sample=<audio file>

Example Response:
{
  "balance": 1234.56
}

erDiagram
    USER {
        int user_id PK
        string email
        binary audio_file
        float balance
        %% other fields can be added here as needed
    }
    TRANSACTIONHISTORY {
        int transaction_id PK
        string acc_email
        string sent_to_email
        string transaction_type
        float amount
        datetime timestamp
    }
    USER ||--o{ TRANSACTIONHISTORY : has
Text ER diagram:
+---------+      +---------------------+
| User    |<---->| TransactionHistory  |
+---------+      +---------------------+
| user_id |      | transaction_id      |
| email   |      | acc_email           |
| ...     |      | sent_to_email       |
| audio   |      | transaction_type    |
| balance |      | amount              |
+---------+      | timestamp           |
                 +---------------------+
| Table | Columns | Indexes |
|---|---|---|
| User | user_id (PK), email, audio_file, balance, ... | user_id, email |
| TransactionHistory | transaction_id (PK), acc_email, sent_to_email, transaction_type, amount, timestamp | acc_email, sent_to_email |
-- Get user by email
SELECT * FROM User WHERE email = '[email protected]';
-- Get last 5 transactions
SELECT * FROM TransactionHistory WHERE acc_email = '[email protected]' ORDER BY timestamp DESC LIMIT 5;

- Converts user audio to text for command and OTP.
- Synthesizes spoken responses for all outputs.
- Classifies user command intent (balance, transfer, history) using text analysis.
- Extracts entities (amount, recipient) from recognized text.
Authentication flow diagram:
flowchart TD
A[User Command] --> B[Voice Verification]
B -->|Liveness required| C[Generate OTP]
C --> D[Speak OTP]
D --> E[Record OTP]
E --> F[Voice & Speech Match]
F -->|Success| G[Execute command]
F -->|Fail| H[Error & Retry]
- All sensitive data encrypted in transit (HTTPS) and at rest.
- JWT authentication for all protected endpoints.
- Session-based OTP and liveness with expiry.
- No sensitive data exposed in logs or client.
- Fallback: On error, system resets to safe state and requires re-authentication.
- Data privacy: Only minimal, necessary data stored per user.
- CI/CD pipeline for automated testing, build, and deployment.
- Environment configuration via environment variables and secrets management.
- Monitoring and alerting for uptime, errors, and security events.
- Rollback strategies: Blue/green deployments, versioned releases.
- SPA: Single Page Application
- JWT: JSON Web Token
- OTP: One-Time Password
- Liveness Detection: Verifying user is present and not replaying a recording
Roadmap
- Add support for additional biometric factors (face, fingerprint)
- Expand command set (bill pay, account linking)
- Integrate with third-party financial APIs
- Enhance accessibility features (multi-language, screen reader support)
- Advanced fraud detection and anomaly monitoring
The frontend is a modern, event-driven single-page application (SPA) that provides a seamless, voice-first user experience for banking operations. It is designed for accessibility, security, and extensibility, integrating tightly with the backend for authentication, command execution, and biometric verification.
- index.html: Login page, guides user through voice and face authentication.
- register.html: Registration page, collects and enrolls user voice and face biometrics.
- bank_index.html: Main banking dashboard, enables voice-driven banking commands and displays results.
- app.js: Orchestrates the login flow, state machine, and voice/camera capture for authentication.
- bank_script.js: Manages the banking command flow, state machine, OTP/liveness, and backend integration.
- register.js: Handles user registration, including audio recording and face capture.
- voice_ui_components.js: Provides reusable classes for speech recognition, synthesis, and media recording.
App Root (HTML)
├── Output/Status Display
├── State Controller (App/BankApp)
│   ├── SpeechService (recognition/synthesis)
│   ├── MediaRecorderService (audio capture)
│   ├── ApiService (backend communication)
│   └── UI Components (forms, tables, camera, audio)
└── TransactionTable (dynamic rendering)
- Login/Registration: Guides user through multi-step process using voice prompts, audio/face capture, and state transitions.
- Banking Dashboard: Listens for activation, processes commands, handles OTP/liveness, and presents results.
- Centralized state machines in both login (`AuthAppState`) and banking (`BankAppState`) flows.
- All transitions go through a single `transitionToState` method, ensuring cleanup, consistent UI, and robust error recovery.
- States include: activation, input, confirmation, recording, capturing, processing, presenting, error, and reset.
- Voice Activation: Listens for trigger phrase (e.g., "hello bank" or "hello indu") to start flows.
- Speech Recognition: Captures and processes user commands, email, confirmations, and OTPs.
- Audio Recording: Uses MediaRecorder API to capture voice samples for authentication and commands.
- Face Capture: Uses getUserMedia and Canvas APIs to capture and preview face images.
- Form Submission: All data is sent to the backend via `fetch` with FormData, handling both success and error responses.
- Dynamic UI Updates: Output/status fields, transaction tables, and error messages are updated in real time based on state and backend responses.
- All errors (network, backend, speech, device) are caught and routed to a safe state (`PRESENTING` or `ERROR`).
- After presenting a result or error, the app always resets to the initial state, ready for the next user action.
- Defensive coding ensures that UI and state are never left inconsistent, even on unexpected failures.
- Logging and user feedback are provided for all error conditions.
- All API calls are made via `fetch` with proper authentication (JWT in cookies or headers).
- Endpoints for registration, login, command processing, OTP/liveness, and transaction history are fully integrated.
- All backend responses are handled with defensive JSON parsing and error handling.
- State transitions and UI updates are driven by backend responses (e.g., liveness required, OTP success/failure, command results).
- Wraps browser SpeechRecognition and SpeechSynthesis APIs.
- Handles start/stop, error events, and result callbacks.
- Ensures recognition is stopped before speaking to avoid feedback loops.
- Used for all voice input and output throughout the app.
- Wraps MediaRecorder API for audio capture.
- Handles start/stop, data collection, and blob creation.
- Used for both command/OTP audio and registration/login samples.
- Centralizes all backend communication (login, command, OTP, etc.).
- Handles JWT token management and error handling.
- Ensures all requests are authenticated and responses are parsed defensively.
- Each flow (login, registration, banking) is managed by a dedicated class with a state machine.
- All UI actions, API calls, and error handling are routed through state transitions.
- Ensures a consistent, recoverable user experience.
- Responsive, accessible layouts using modern CSS and Tailwind.
- Dynamic elements for audio/face capture, transaction tables, and status/output.
- All user actions are guided by voice and visual prompts.
- Audio and face are captured via browser APIs and attached to a hidden form.
- On submit, FormData is sent to the `/register` endpoint.
- Result or error is displayed and logged.
- Multi-step, voice-driven process: activation → email → confirmation → voice → face → submit.
- All steps are managed by state machine and voice prompts.
- On success, JWT is stored in cookie for subsequent API calls.
- Listens for activation, records command, sends to backend.
- If liveness/OTP required, prompts and records OTP audio, verifies with backend.
- On success, executes original command and presents result.
- Transaction history is dynamically rendered in a styled table.
- All API and device errors are caught and presented to the user.
- State is always reset after error or completion, preventing stuck UI.
- Defensive checks for device permissions, missing files, and backend failures.
- All flows are voice-guided and keyboard-accessible.
- Visual feedback (status, output, transaction tables) is provided at every step.
- Responsive design for desktop and mobile.
```mermaid
flowchart TD
    A[AWAITING_ACTIVATION] -->|trigger phrase| B[LISTENING_FOR_COMMAND]
    B -->|command spoken| C[PROCESSING]
    C -->|liveness required| D[OTP PROMPT]
    D -->|OTP spoken| E[PROCESSING]
    E -->|result| F[PRESENTING]
    F --> G[AWAITING_ACTIVATION]
```
    // Excerpt: a method of the banking app controller class.
    // `output` is assumed to be a DOM element defined elsewhere in the script.
    transitionToState(newState) {
      console.log(`[STATE] Transitioning from ${this.state} to ${newState}`);
      // Clean up the previous state: stop recognition and recording if active.
      switch (this.state) {
        case BankAppState.LISTENING_FOR_COMMAND:
        case BankAppState.PROCESSING:
          this.speechService.stop();
          this.recorderService.stopRecording();
          break;
      }
      this.state = newState;
      // Enter the new state.
      switch (this.state) {
        case BankAppState.AWAITING_ACTIVATION:
          output.textContent = "Say 'hello bank' to begin.";
          // Reset all per-command data before listening again.
          this.userCommand = "";
          this.livenessRequired = false;
          this.otpText = null;
          this.otpNumeric = null;
          this.lastCommandBlob = null;
          this.livenessBlob = null;
          try { this.speechService.start(); } catch (e) { console.error(e); }
          break;
        // ...other states...
      }
    }

- All sensitive data (audio, face images, tokens) is handled in memory and never stored in localStorage or indexedDB.
- JWT tokens are stored in cookies with secure flags.
- All API calls use HTTPS and include authentication headers.
- Device permissions are requested only as needed and released after use.