Advanced Emotional Recognition and Observation Network
AERONET is an AI-powered, deep learning-based system designed to detect signs of mental distress in real time using multimodal inputs such as voice, facial expressions, and behavior patterns. It aids mental health professionals by providing remote psychological assessments without relying on wearable devices.
The increasing prevalence of mental health disorders has placed immense pressure on healthcare systems. Major challenges include:
- Shortage of trained mental health professionals
- Social stigma preventing people from seeking help
- Lack of real-time monitoring tools
- Inefficient early detection of conditions like depression, anxiety, or emotional distress
AERONET addresses these challenges by offering:
- Real-time detection of emotional distress using audio-visual input
- No wearables required — ideal for scalable public health screening
- Visual display of emotional trends and severity levels
- Easy integration for remote access and analysis by professionals
Key research questions include:
- How can signs of mental distress be detected in real time?
- How accurately can emotional states be classified?
- Are speech and facial cues sufficient to assess mental health conditions?
- To what extent can vocal and facial distress signals be detected effectively?
- How fast and reliably can the system perform across environments?
Key features include:
- Real-time facial expression and hand gesture tracking via MediaPipe
- Custom-trained LSTM deep learning model for classification
- Visual feedback showing predicted emotional/gesture class and probability
- Simple and interactive GUI using OpenCV
Results and observations:
- The model achieves 95% training accuracy and 87% validation accuracy, showing promising results for real-time emotional classification.
- It successfully distinguishes between 10 emotional/gesture-based labels with high precision.
- Misclassifications mainly occur between similar expressions such as "Neutral" and "Sad".
- The system is robust across varying lighting conditions and webcam resolutions.
- Audio-visual fusion helps improve classification reliability compared to using a single modality.
- On real-world test data, it performs with over 80% consistency in emotional prediction across multiple users.
- Works well without requiring any external wearable sensors, enhancing usability and accessibility.
```bash
AERONET/
├── DataPreProcessed/
│   ├── A/
│   ├── B/
│   └── C/
├── RawData/
│   ├── A/
│   ├── B/
│   ├── C/
│   ├── Q/
│   └── t/
├── TrainandValidation/
│   ├── train/
│   ├── validation/
│   └── __pycache__/
├── application.py
├── datacollection.py
├── function.py
├── model.h5
├── model.json
├── predata.py
├── tempCodeRunnerFile.py
├── trainingmodel.py
├── LICENSE
└── README.md
```
Script: datacollection.py
- Uses OpenCV to record hand signs for each alphabet letter (A-Z).
- Captures images from the webcam and saves them in the `RawData/` folder under corresponding alphabet subfolders (e.g., A, B, C...).
- Only the Region of Interest (ROI) from (0, 40) to (300, 400) is captured to improve accuracy and focus on the hand region (see the capture sketch below).
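As a point of reference, here is a minimal sketch of such a capture loop, assuming a hypothetical "press 's' to save, 'q' to quit" key scheme and simple numeric file names; the exact logic in `datacollection.py` may differ.

```python
import os
import cv2

LETTER = "A"                                   # target alphabet subfolder (illustrative)
SAVE_DIR = os.path.join("RawData", LETTER)
os.makedirs(SAVE_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Region of Interest from (0, 40) to (300, 400), as described above
    roi = frame[40:400, 0:300]
    cv2.rectangle(frame, (0, 40), (300, 400), (0, 255, 0), 2)
    cv2.imshow("Data Collection", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):                        # save the current ROI crop
        cv2.imwrite(os.path.join(SAVE_DIR, f"{count}.jpg"), roi)
        count += 1
    elif key == ord("q"):                      # quit
        break

cap.release()
cv2.destroyAllWindows()
```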
Script: predata.py
- Loads the recorded hand sign images.
- Uses MediaPipe to extract 21 hand landmarks (keypoints) per hand.
- Converts each 30-frame gesture sequence into a `.npy` file containing the keypoints.
- Each sequence represents one gesture corresponding to a specific alphabet letter.
- Processed data is saved in the `DataPreProcessed/` directory, mirroring the subfolder structure of the raw data (see the extraction sketch below).
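A rough sketch of this extraction step is shown below, assuming 21 landmarks with (x, y, z) coordinates per frame (63 values per frame); the file paths and sequence naming are illustrative, not taken from `predata.py`.

```python
import os
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_keypoints(image_bgr, hands):
    """Return a flat array of 21 (x, y, z) hand landmarks, or zeros if no hand is found."""
    results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0]
        return np.array([[p.x, p.y, p.z] for p in lm.landmark]).flatten()  # 63 values
    return np.zeros(21 * 3)

# Process one 30-frame sequence for letter "A" (paths are illustrative)
with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    sequence = []
    for frame_no in range(30):
        img = cv2.imread(os.path.join("RawData", "A", f"{frame_no}.jpg"))
        sequence.append(extract_keypoints(img, hands))
    os.makedirs(os.path.join("DataPreProcessed", "A"), exist_ok=True)
    np.save(os.path.join("DataPreProcessed", "A", "0.npy"), np.array(sequence))  # shape (30, 63)
```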
Script: trainingmodel.py
- A deep LSTM (Long Short-Term Memory) neural network is used for sequence classification.
- Model architecture (sketched below):
  - 3 LSTM layers with ReLU activation and `return_sequences=True`
  - 2 Dense layers with ReLU activation
  - 1 output Dense layer with Softmax activation for multi-class classification
- Training is done using `model.fit()` for 200 epochs
- TensorBoard is used for logging training metrics
- After training:
  - Model weights are saved to `model.h5`
  - Model architecture is saved to `model.json`
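A minimal sketch of such a training setup is shown below. The layer sizes, class count, and the `X_train.npy`/`y_train.npy` input files are assumptions for illustration; note that the final LSTM layer here uses `return_sequences=False` so its output can feed the Dense layers.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

NUM_CLASSES = 10             # assumed number of gesture/emotion labels
SEQ_LEN, FEATURES = 30, 63   # 30 frames x 21 landmarks x (x, y, z)

model = Sequential([
    # Layer sizes are illustrative, not taken from trainingmodel.py
    LSTM(64, return_sequences=True, activation="relu", input_shape=(SEQ_LEN, FEATURES)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="Adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])

# X: (num_sequences, 30, 63) keypoint sequences, y: one-hot labels (hypothetical files)
X = np.load("X_train.npy")
y = np.load("y_train.npy")
model.fit(X, y, epochs=200, callbacks=[TensorBoard(log_dir="Logs")])

# Persist weights and architecture separately, as described above
model.save_weights("model.h5")
with open("model.json", "w") as f:
    f.write(model.to_json())
```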
Script: application.py
- Loads the trained model from `model.json` and `model.h5`.
- Opens a live webcam feed using OpenCV.
- Uses MediaPipe to extract hand landmarks in real time.
- Maintains a buffer of the last 30 frames to form a sequence for prediction.
- Displays:
- Predicted gesture/class
- Prediction confidence score
- UI overlays using OpenCV for visual feedback
- Real-time probability bars show the model's confidence for each class.
- Sentence history (recognized gestures over time) is displayed at the top of the webcam frame.
- Prediction confidence is updated live for user transparency.
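A simplified sketch of this prediction loop is shown below, omitting the probability bars and sentence-history overlay; the `LABELS` list and overlay text are placeholders, not the exact output of `application.py`.

```python
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.models import model_from_json

# Load architecture and weights as described above
with open("model.json") as f:
    model = model_from_json(f.read())
model.load_weights("model.h5")

LABELS = ["A", "B", "C"]          # illustrative label list
mp_hands = mp.solutions.hands

sequence = []                     # rolling buffer of the last 30 frames
cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0]
            keypoints = np.array([[p.x, p.y, p.z] for p in lm.landmark]).flatten()
        else:
            keypoints = np.zeros(21 * 3)

        sequence.append(keypoints)
        sequence = sequence[-30:]  # keep only the last 30 frames

        if len(sequence) == 30:
            probs = model.predict(np.expand_dims(sequence, axis=0), verbose=0)[0]
            label = LABELS[int(np.argmax(probs))]
            cv2.putText(frame, f"{label} ({probs.max():.2f})", (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        cv2.imshow("AERONET", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
```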
Tech stack:
- TensorFlow / Keras – for deep learning model building and training
- MediaPipe – for hand tracking and keypoint extraction
- OpenCV – for webcam streaming and UI rendering
- NumPy – for numerical data handling
- scikit-learn – for dataset splitting and basic preprocessing