Abler is an advanced AI-driven accessibility assistant designed to empower individuals with visual, auditory, and mobility impairments. By leveraging state-of-the-art machine learning techniques, Abler provides real-time object detection, depth estimation, gesture recognition, and speech processing, ensuring seamless interaction with the environment.
Abler is built on a modular, end-to-end architecture with robust ML model training, optimization, and real-time inference. The system is divided into:
- Sensor Inputs: Camera (image/video), Microphone (audio)
- Databases:
  - Image/Video Datasets
  - Audio Datasets
  - Gesture & Human Activity Recognition (HAR) Datasets
  - User Preferences & Labeling Data
  - MLflow/DVC for model versioning (tracking sketch below)
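As an illustration of the model-versioning step, the sketch below logs a single training run to MLflow; the experiment name, parameters, metric values, and artifact path are placeholders rather than project settings.

```python
import mlflow

# Illustrative example: log one gesture-model training run to MLflow.
# Experiment name, parameters, metric values, and artifact path are placeholders.
mlflow.set_experiment("abler-gesture-recognition")

with mlflow.start_run():
    mlflow.log_param("base_model", "yolov8x")
    mlflow.log_param("epochs", 50)
    mlflow.log_metric("val_map50", 0.87)                       # placeholder metric
    mlflow.log_artifact("runs/detect/train/weights/best.pt")   # assumed checkpoint path
```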
- Preprocessing Techniques (sketch below):
  - Normalization & Rescaling
  - Augmentations: Rotations, Color Jitter, Mixup
  - Speech: Chunking, Noise Filtering
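A minimal image preprocessing/augmentation sketch in PyTorch, assuming torchvision transforms; the resize target, jitter strengths, and normalization statistics are illustrative choices, and the mixup helper expects one-hot (soft) labels.

```python
import torch
from torchvision import transforms

# Illustrative image preprocessing / augmentation pipeline.
# Resize target, normalization statistics, and jitter strengths are assumptions.
train_transforms = transforms.Compose([
    transforms.Resize((640, 640)),                          # rescaling
    transforms.RandomRotation(degrees=15),                  # rotation augmentation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],        # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def mixup(x, y, alpha=0.2):
    """Mixup: blend a batch of images and one-hot labels with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]
```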
- Gesture Recognition (fine-tuning sketch below):
  - Model: YOLOv8x-tuned-hand-gestures
  - Training Strategy: Fine-tuned on 8,000+ gestures using Curriculum Learning and Knowledge Distillation
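A hedged sketch of how such a fine-tuning run could look with the Ultralytics API; the dataset YAML name, epoch count, and image size are assumptions, and curriculum learning and distillation would sit on top of this basic loop.

```python
from ultralytics import YOLO

# Hypothetical fine-tuning run for the hand-gesture detector.
# Dataset YAML path and hyperparameters are assumptions, not project settings.
model = YOLO("yolov8x.pt")                  # pretrained YOLOv8x weights
model.train(data="hand_gestures.yaml",      # assumed custom gesture dataset config
            epochs=50, imgsz=640)

results = model("gesture_frame.jpg")        # run inference on a single frame
```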
- Object Detection (inference sketch below):
  - YOLOv8 for rapid obstacle detection
  - DETR (ResNet-50 backbone) trained on COCO 2017 (118K images) for precise localization
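The snippet below shows a minimal DETR inference pass using the public COCO-pretrained facebook/detr-resnet-50 checkpoint from Hugging Face; the frame path and confidence threshold are placeholders.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Minimal DETR (ResNet-50 backbone) inference sketch with the public
# COCO-pretrained checkpoint; the image path is a placeholder.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("frame.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold, mapped back to image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes)[0]
for label, box in zip(detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], box.tolist())
```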
- Depth Estimation (fusion sketch below):
  - Uses Apple DepthPro-hf for real-time depth heatmaps
  - Fuses object detection with depth data for spatial awareness
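One way the detection/depth fusion could work is to read off the median depth inside each bounding box, as sketched below with the public apple/DepthPro-hf checkpoint; the fusion rule itself is an illustrative choice, not necessarily the project's exact method.

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Sketch of fusing detections with a depth map: the median depth inside each
# bounding box gives a rough distance estimate. The model ID is the public
# apple/DepthPro-hf checkpoint; the fusion rule is an illustrative choice.
depth_estimator = pipeline("depth-estimation", model="apple/DepthPro-hf")

image = Image.open("frame.jpg").convert("RGB")
depth_map = np.array(depth_estimator(image)["depth"])   # per-pixel depth heatmap

def object_distance(depth_map, box):
    """Median depth inside an (x_min, y_min, x_max, y_max) detection box."""
    x0, y0, x1, y1 = [int(v) for v in box]
    return float(np.median(depth_map[y0:y1, x0:x1]))
```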
- Human Activity Recognition (HAR), sketched below:
  - Model: 3D CNN / I3D / Transformer-based
  - Techniques: Sliding Window Approach for temporal pattern recognition
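A small sketch of the sliding-window idea: buffer the most recent frames and hand a fixed-length clip to the activity model at a regular stride. The window size, stride, and har_model callable are assumptions.

```python
from collections import deque
import numpy as np

# Illustrative sliding-window buffer for temporal activity recognition: the most
# recent N frames are stacked into a clip and passed to a 3D CNN / I3D /
# transformer model. Window size and stride are assumptions.
WINDOW = 16   # frames per clip
STRIDE = 4    # frames between consecutive predictions

frame_buffer = deque(maxlen=WINDOW)

def maybe_classify(frame, frame_idx, har_model):
    """Append a frame; run the HAR model whenever a full window is available."""
    frame_buffer.append(frame)
    if len(frame_buffer) == WINDOW and frame_idx % STRIDE == 0:
        clip = np.stack(frame_buffer)          # (T, H, W, C) clip tensor
        return har_model(clip)                 # placeholder activity model
    return None
```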
- Speech-to-Text (STT), sketched below:
  - wav2vec2 (via SpeechRecognition API)
  - CMUSphinx as a fallback for noisy environments
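A hedged sketch of the STT path: transcribe with a wav2vec2 checkpoint first and fall back to CMUSphinx (offline, via the SpeechRecognition library) when that fails; the model ID, audio file name, and fallback wiring are assumptions.

```python
import speech_recognition as sr
from transformers import pipeline

# Sketch of the STT path: wav2vec2 first, CMUSphinx (offline) as a fallback for
# noisy input. Model ID and fallback policy are assumptions.
wav2vec2_asr = pipeline("automatic-speech-recognition",
                        model="facebook/wav2vec2-base-960h")

def transcribe(audio_path):
    try:
        return wav2vec2_asr(audio_path)["text"]
    except Exception:
        # Fall back to CMUSphinx via the SpeechRecognition library.
        recognizer = sr.Recognizer()
        with sr.AudioFile(audio_path) as source:
            audio = recognizer.record(source)
        return recognizer.recognize_sphinx(audio)
```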
- Text-to-Speech (TTS), example below:
  - pyttsx3, Tacotron 2, WaveGlow for natural speech synthesis
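For the offline path, pyttsx3 alone is enough to speak an alert, as in the minimal example below; the alert text and speech rate are placeholders, and Tacotron 2 + WaveGlow would replace this for more natural synthesis.

```python
import pyttsx3

# Minimal offline TTS example with pyttsx3; alert text and speech rate are placeholders.
engine = pyttsx3.init()
engine.setProperty("rate", 160)            # words per minute
engine.say("Obstacle ahead, two meters to your left.")
engine.runAndWait()
```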
- Shared Backbone: ConvNeXt, Swin Transformer
- Fusion Strategy (multi-task sketch below):
  - Combined Loss Function (Weighted Sum)
  - Parallel Heads for Object, Gesture, Speech Tasks
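The sketch below illustrates the shared-backbone/parallel-heads idea with a ConvNeXt-Tiny feature extractor and a weighted-sum loss; the simple classification heads, class counts, and loss weights are stand-ins for the real task heads.

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

# Shared-backbone / parallel-heads sketch: one ConvNeXt feature extractor feeding
# separate task heads, trained with a weighted sum of per-task losses.
# Head types, class counts, and loss weights are assumptions.
class MultiTaskModel(nn.Module):
    def __init__(self, num_objects=80, num_gestures=18, num_activities=10):
        super().__init__()
        backbone = convnext_tiny(weights="DEFAULT")
        self.backbone = nn.Sequential(backbone.features,
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.object_head = nn.Linear(768, num_objects)
        self.gesture_head = nn.Linear(768, num_gestures)
        self.activity_head = nn.Linear(768, num_activities)

    def forward(self, x):
        features = self.backbone(x)
        return (self.object_head(features),
                self.gesture_head(features),
                self.activity_head(features))

def combined_loss(outputs, targets, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of per-task cross-entropy losses."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(out, tgt) for w, out, tgt in zip(weights, outputs, targets))
```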
- Optimization Techniques (sketch below):
  - Quantization, Pruning, Knowledge Distillation
- Export Formats: ONNX, TFLite, Core ML
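As a rough illustration of the optimization/export step, the snippet applies dynamic INT8 quantization to a stand-in model and exports it to ONNX; the toy architecture, input shape, and file name are placeholders for the trained network.

```python
import torch
import torch.nn as nn

# Illustrative post-training optimization on a stand-in model: dynamic
# quantization of the linear layers, then ONNX export for ONNX Runtime or
# further mobile conversion. Toy model, input shape, and file name are placeholders.
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 224 * 224, 256),
                      nn.ReLU(),
                      nn.Linear(256, 80)).eval()

# Dynamic INT8 quantization of the linear layers.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# ONNX export of the float model.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "abler_model.onnx", opset_version=17)
```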
- Continuous Capture from Camera, Microphone, Gesture Sensors
- Frame Normalization, Resizing
- Audio Chunking & Filtering
- Runs Object Detection, Depth Estimation, Gesture Recognition, HAR in parallel
- Combines multi-modal outputs
- Rule-based or learned fusion for better accuracy (sketched below)
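A sketch of that loop under simple assumptions: the four model callables and their output formats are placeholders for the modules described above, and the 2-meter alert threshold is illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Run the perception models in parallel on each frame and fuse their outputs
# with a simple rule. Model callables, output formats, and the alert threshold
# are assumptions.
executor = ThreadPoolExecutor(max_workers=4)

def process_frame(frame, detect_objects, estimate_depth, detect_gestures, classify_activity):
    futures = {
        "objects": executor.submit(detect_objects, frame),     # [{"label", "box"}, ...]
        "depth": executor.submit(estimate_depth, frame),        # per-pixel depth map
        "gestures": executor.submit(detect_gestures, frame),
        "activity": executor.submit(classify_activity, frame),
    }
    results = {name: future.result() for name, future in futures.items()}

    # Rule-based fusion: flag any detected object whose median depth inside its
    # bounding box is closer than ~2 meters, for an audio/haptic alert.
    alerts = []
    for obj in results["objects"]:
        x0, y0, x1, y1 = [int(v) for v in obj["box"]]
        distance = float(np.median(results["depth"][y0:y1, x0:x1]))
        if distance < 2.0:
            alerts.append((obj["label"], distance))
    return results, alerts
```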
- TTS (Audio Alerts)
- STT (Voice Commands)
- Haptic & Visual Feedback
- React Native / Flutter or Native (Swift/Kotlin)
- Integrates ML inference modules
- Local storage for user preferences
- ML-Ops & CI/CD Pipelines
- Hybrid Inference (Cloud + On-Device)
- Logging, Analytics, User Feedback
- PyTorch / TensorFlow
- OpenCV (Image Processing)
- Transformers (Hugging Face)
- ONNX Runtime
- SpeechRecognition, pyttsx3, Tacotron 2
- React Native / Flutter
- Firebase / SQLite (Local Storage)
- FastAPI / Flask (Backend APIs; endpoint sketch after this list)
- Docker / Kubernetes
- AWS Lambda / GCP Functions
- DVC / MLflow (Model Versioning)
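For the cloud half of hybrid inference, a backend endpoint could look like the hypothetical FastAPI sketch below; the route, response shape, and stubbed run_detection helper are assumptions, not the project's actual API.

```python
from fastapi import FastAPI, File, UploadFile

# Hypothetical FastAPI endpoint for the cloud side of hybrid inference; the
# route name and response shape are assumptions, and the detection call is
# stubbed out where the real model module would plug in.
app = FastAPI(title="Abler Inference API")

def run_detection(image_bytes: bytes) -> list[dict]:
    """Stub standing in for the object-detection module described above."""
    return []

@app.post("/detect")
async def detect(image: UploadFile = File(...)):
    contents = await image.read()
    return {"detections": run_detection(contents)}
```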
- Personalized AI Recommendations based on user behavior
- Multi-Language Support for broader accessibility
- Edge AI Processing for ultra-fast inference
- Adaptive Learning Models that improve over time
We welcome contributions from the community! Feel free to fork, raise issues, or submit PRs to improve Abler.
See LICENSE for details.
Abler – Empowering Accessibility with AI!
Developed with ❤️ for the world by Team TheSopranos (but mainly for the eXathon competition).