EEG Classification Project

A complete machine learning pipeline for classifying EEG data to distinguish between healthy subjects and those with schizophrenia. This project implements a 5-step workflow that converts raw EEG data to a deployed classification model.

Project Overview

This project provides an end-to-end solution for EEG-based mental health classification:

Data Conversion: Convert EDF (.edf, European Data Format) recordings to NumPy .npy arrays for fast loading
Dataset Curation: Segment and label data for machine learning
Hyperparameter Tuning: Automatically discover optimal model architecture (CNN and LSTM)
Validation & Training: Robust evaluation and final model training
Inference: Real-world prediction tool for new EEG recordings

Model Architectures

This project implements two different deep learning approaches:

CNN (Convolutional Neural Network): 1D convolutions along the time axis that learn local waveform patterns across EEG channels
LSTM (Long Short-Term Memory): Designed to capture temporal dependencies and sequential patterns

Both models are automatically tuned and cross-validated to determine the best architecture for your specific dataset.

Project Structure

eeg-classify/
├── 1_convert_edf_to_npy.py      # Step 1: EDF to NPY conversion
├── 2_curate_dataset.py          # Step 2: Dataset segmentation and labeling
├── 3a_tune_cnn_model.py         # Step 3a: CNN hyperparameter tuning
├── 3b_tune_lstm_model.py        # Step 3b: LSTM hyperparameter tuning
├── 4a_train_and_validate_cnn.py # Step 4a: CNN cross-validation and training
├── 4b_train_and_validate_lstm.py # Step 4b: LSTM cross-validation and training
├── 5_predict.py                 # Step 5: Enhanced inference tool (batch support)
├── compare_models.py            # Model comparison utility
├── predict_drag_drop.bat        # Windows drag-and-drop prediction tool
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── data/                        # Data directory
│   ├── raw/                     # Original EDF files
│   │   ├── healthy/             # Healthy subjects
│   │   └── schizophrenia/       # Schizophrenia subjects
│   ├── npy/                     # Converted NPY files
│   ├── X_dataset.npy            # Final feature dataset
│   ├── y_dataset.npy            # Final labels
│   ├── best_hyperparameters_cnn.txt # CNN optimal hyperparameters
│   └── best_hyperparameters_lstm.txt # LSTM optimal hyperparameters
├── models/                      # Trained models
│   ├── eeg_classify_cnn_v1.h5   # Final trained CNN model
│   ├── eeg_classify_lstm_v1.h5  # Final trained LSTM model
│   ├── scaler_cnn.pkl           # CNN data scaler
│   └── scaler_lstm.pkl          # LSTM data scaler
└── logs/                        # Training logs and tuning results
    ├── tuner_cnn/               # CNN hyperparameter tuning logs
    └── tuner_lstm/              # LSTM hyperparameter tuning logs

Quick Start

Prerequisites

Python 3.7 or higher
At least 8GB RAM (16GB recommended for large datasets)
GPU support recommended for model training

Installation

Clone or download this project
Install dependencies:
```
pip install -r requirements.txt
```

Data Preparation

Prepare your EDF data by organizing it in this structure:

data/raw/
├── healthy/
│   ├── subject1.edf
│   ├── subject2.edf
│   └── ...
└── schizophrenia/
    ├── subject1.edf
    ├── subject2.edf
    └── ...

Usage Guide

Step 1: Convert EDF to NPY

Convert your EDF files to efficient NumPy format:

python 1_convert_edf_to_npy.py

What it does:

Recursively finds all .edf files in data/raw/
Loads each file with MNE and extracts numerical data
Transposes to (timesteps, channels) format
Saves as .npy files in data/npy/
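
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of 1_convert_edf_to_npy.py: the helper names `npy_path_for` and `convert_all` are hypothetical, and the output is assumed to mirror the healthy/ and schizophrenia/ subfolders under data/npy/. MNE is imported inside the function so the path helper can be used on its own.

```python
from pathlib import Path

import numpy as np


def npy_path_for(edf_path: Path, raw_root: Path, npy_root: Path) -> Path:
    """Map <raw_root>/<class>/<name>.edf to <npy_root>/<class>/<name>.npy."""
    relative = edf_path.relative_to(raw_root)
    return (npy_root / relative).with_suffix(".npy")


def convert_all(raw_root: Path = Path("data/raw"),
                npy_root: Path = Path("data/npy")) -> int:
    """Convert every .edf file under raw_root; return the number converted."""
    import mne  # imported lazily; only needed when files are actually read

    edf_files = sorted(raw_root.rglob("*.edf"))
    print(f"Found {len(edf_files)} EDF files to convert")
    for edf_path in edf_files:
        raw = mne.io.read_raw_edf(edf_path, preload=True, verbose=False)
        # MNE returns (channels, timesteps); transpose to (timesteps, channels).
        data = raw.get_data().T
        out_path = npy_path_for(edf_path, raw_root, npy_root)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(out_path, data)
        print(f"Converted: {edf_path.name} -> Shape: {data.shape}")
    return len(edf_files)
```

Calling `convert_all()` from the project root would process data/raw/ with the default paths.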

Expected output:

Found 45 EDF files to convert
✓ Converted: subject1.edf -> Shape: (12000, 19)
✓ Converted: subject2.edf -> Shape: (15000, 19)
...
Conversion completed successfully!

Step 2: Curate Dataset

Create segmented, labeled dataset for training:

python 2_curate_dataset.py

What it does:

Loads all .npy files from both healthy and schizophrenia folders
Creates overlapping segments (default: 1000 timesteps with 500 overlap)
Assigns labels (0 = healthy, 1 = schizophrenia)
Saves final dataset as X_dataset.npy and y_dataset.npy
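
The overlapping segmentation can be illustrated with a short sketch. The function names are hypothetical, and dropping a trailing remainder shorter than one segment is an assumption; the actual script may handle the tail differently.

```python
import numpy as np


def count_segments(n_timesteps: int, segment_length: int = 1000,
                   overlap: int = 500) -> int:
    """Number of full windows that fit in a recording of n_timesteps."""
    if not 0 <= overlap < segment_length:
        raise ValueError("overlap must be in [0, segment_length)")
    if n_timesteps < segment_length:
        return 0
    step = segment_length - overlap
    return (n_timesteps - segment_length) // step + 1


def segment_recording(data: np.ndarray, segment_length: int = 1000,
                      overlap: int = 500) -> np.ndarray:
    """Split a (timesteps, channels) array into overlapping windows.

    Returns shape (n_segments, segment_length, channels). A trailing
    remainder shorter than segment_length is dropped (an assumption).
    """
    step = segment_length - overlap
    n_segments = count_segments(data.shape[0], segment_length, overlap)
    if n_segments == 0:
        return np.empty((0, segment_length, data.shape[1]), dtype=data.dtype)
    return np.stack([data[i * step: i * step + segment_length]
                     for i in range(n_segments)])


recording = np.zeros((12000, 19))            # same shape as subject1 in Step 1
segments = segment_recording(recording)
labels = np.zeros(len(segments), dtype=int)  # 0 = healthy, 1 = schizophrenia
print(segments.shape, labels.shape)
```

With the default settings each new window starts 500 timesteps after the previous one, so a 12000-timestep recording yields (12000 - 1000) // 500 + 1 = 23 segments.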

Expected output:

Processing 25 files from data/npy/healthy
Processing 20 files from data/npy/schizophrenia
Final Dataset Summary:
  X shape: (2847, 1000, 19) (samples, timesteps, channels)
  y shape: (2847,) (samples,)
  Total segments: 2847
    - Healthy (label=0): 1423
    - Schizophrenia (label=1): 1424

Step 3: Hyperparameter Tuning

The project supports two different model architectures. You can train both and compare their performance:

Option A: CNN Model

python 3a_tune_cnn_model.py

Option B: LSTM Model

python 3b_tune_lstm_model.py

What they do:

CNN: Explores 1D convolutional architectures that learn local patterns along the time axis across all channels
LSTM: Explores Long Short-Term Memory architectures for temporal sequence modeling
Both use KerasTuner to find optimal hyperparameters automatically
Save results to data/best_hyperparameters_cnn.txt (CNN) and data/best_hyperparameters_lstm.txt (LSTM)

This step may take several hours! The scripts will show progress:

Starting hyperparameter search...
Trial 1/50: Training model with [architecture details]...
...
Best Hyperparameters Found:
  [Model-specific parameters]
  Validation Accuracy: 0.8764

Compare Both Models

python compare_models.py

This utility compares the performance of both architectures and recommends which to use for final training.

Step 4: Final Training & Validation

Based on your model comparison results, train the better-performing architecture:

For CNN Model:

python 4a_train_and_validate_cnn.py

For LSTM Model:

python 4b_train_and_validate_lstm.py

What they do:

Load optimal hyperparameters from Step 3
Perform 5-fold stratified cross-validation for robust evaluation
Train final model on entire dataset
Save trained models:
- CNN: models/eeg_classify_cnn_v1.h5
- LSTM: models/eeg_classify_lstm_v1.h5
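
To make the cross-validation step concrete, here is a hand-rolled sketch of stratified fold assignment using only NumPy. It is not the project's code (scikit-learn's StratifiedKFold is the usual tool for this), and the names `stratified_folds` and `summarize` are hypothetical.

```python
import numpy as np


def stratified_folds(y: np.ndarray, n_splits: int = 5, seed: int = 0):
    """Yield (train_idx, val_idx) pairs with class proportions preserved."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Deal this class's samples round-robin across the folds.
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for fold in range(n_splits):
        val_idx = np.flatnonzero(fold_of == fold)
        train_idx = np.flatnonzero(fold_of != fold)
        yield train_idx, val_idx


def summarize(fold_accuracies) -> str:
    """Format per-fold accuracies as 'mean ± std'."""
    scores = np.asarray(fold_accuracies, dtype=float)
    return f"{scores.mean():.4f} ± {scores.std():.4f}"


y = np.array([0] * 1423 + [1] * 1424)  # class counts from the Step 2 example
for fold, (train_idx, val_idx) in enumerate(stratified_folds(y), start=1):
    print(f"Fold {fold}/5: {len(train_idx)} train, {len(val_idx)} validation")
```

This sketch splits by segment. Overlapping segments from one recording are highly correlated, so a split that keeps each subject's segments in a single fold gives a stricter estimate; which approach the training scripts take is not stated here.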

Expected output:

Fold 1/5 Accuracy: 0.8723
Fold 2/5 Accuracy: 0.8891
...
Cross-validation accuracy: 0.8756 ± 0.0134
Final model training accuracy: 0.9234
Model saved to: models/[model_name].h5

Step 5: Make Predictions

The enhanced prediction tool supports multiple modes for processing EEG recordings:

Single File Prediction

python 5_predict.py --model_path models/eeg_classify_cnn_v1.h5 --files data/test.edf

Batch Processing (Multiple Files)

python 5_predict.py --model_path models/eeg_classify_cnn_v1.h5 --files file1.edf file2.edf file3.edf

Directory Processing (All EDF files in a folder)

python 5_predict.py --model_path models/eeg_classify_cnn_v1.h5 --batch_dir data/test_subjects/

Drag and Drop (Windows)

Use the batch file: Simply drag EDF files onto predict_drag_drop.bat
Command line: python 5_predict.py --model_path models/eeg_classify_cnn_v1.h5 file1.edf file2.edf

CSV Export for Batch Results

python 5_predict.py --model_path models/eeg_classify_cnn_v1.h5 --batch_dir data/ --csv_output results.csv

Example batch output:

🔍 Finding EDF files...
📁 Found 15 EDF file(s) to process

[1/15] Processing: subject001.edf
📁 subject001.edf: Healthy (Confidence: 89.2%)

[2/15] Processing: subject002.edf  
📁 subject002.edf: Schizophrenia (Confidence: 76.4%)

...

BATCH PROCESSING SUMMARY
========================================
Total files processed: 15
Successful predictions: 15
Failed predictions: 0

Prediction Distribution:
  Healthy: 8 (53.3%)
  Schizophrenia: 7 (46.7%)
  Average confidence: 82.1%
  Total processing time: 45.67 seconds
  Average time per file: 3.04 seconds

Using LSTM Model

python 5_predict.py --model_path models/eeg_classify_lstm_v1.h5 --files data/test.edf --scaler_path models/scaler_lstm.pkl

Detailed Analysis Mode

python 5_predict.py --model_path models/eeg_classify_cnn_v1.h5 --files data/test.edf --verbose

Note: The prediction script defaults to the CNN model and scaler. When using the LSTM model, specify the corresponding scaler path.

Configuration

Segmentation Parameters

You can modify segmentation parameters in 2_curate_dataset.py:

SEGMENT_LENGTH = 1000    # Number of time steps per segment
OVERLAP = 500           # Number of overlapping time steps

Important: If you change these values, you must:

Re-run Step 2 to recreate the dataset
Re-run Steps 3-4 to retune and retrain the model

Step 5 uses the same values automatically, so no change is needed there.

Model Architecture

The CNN hyperparameter search in Step 3a explores:

Conv1D layers: 1-3 layers with varying filter counts (32-512)
Kernel sizes: 3, 5, 7, or 11 timesteps
Dropout rates: 0.1-0.6 for regularization
Dense layers: 1-2 fully connected layers (32-512 units)
Learning rates: 0.0001-0.01 (log scale)
Pooling: Global average or max pooling
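
The ranges listed above could be declared on a KerasTuner `HyperParameters` object roughly as follows. This is a sketch only: the hyperparameter names (`num_conv_layers`, `filters_0`, and so on) and the step size of 32 are assumptions, not values taken from 3a_tune_cnn_model.py.

```python
def cnn_search_space(hp):
    """Declare the CNN search space on a keras_tuner.HyperParameters object.

    Returns the sampled values as a plain dict so a model-building
    function can read them back when constructing layers.
    """
    space = {
        "num_conv_layers": hp.Int("num_conv_layers", min_value=1, max_value=3),
        "kernel_size": hp.Choice("kernel_size", values=[3, 5, 7, 11]),
        "dropout": hp.Float("dropout", min_value=0.1, max_value=0.6),
        "num_dense_layers": hp.Int("num_dense_layers", min_value=1, max_value=2),
        "dense_units": hp.Int("dense_units", min_value=32, max_value=512, step=32),
        "learning_rate": hp.Float("learning_rate", min_value=1e-4,
                                  max_value=1e-2, sampling="log"),
        "pooling": hp.Choice("pooling", values=["avg", "max"]),
    }
    # One filter count per convolutional layer (step of 32 is an assumption).
    space["filters"] = [
        hp.Int(f"filters_{i}", min_value=32, max_value=512, step=32)
        for i in range(space["num_conv_layers"])
    ]
    return space
```

With KerasTuner installed, the model-building function passed to a tuner such as `keras_tuner.Hyperband` receives an `hp` argument and could call `cnn_search_space(hp)` before assembling the Keras layers.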

Performance Metrics

The pipeline provides several evaluation metrics:

Cross-validation accuracy: Robust estimate of model performance
Per-fold results: Consistency across different data splits
Confidence scores: Uncertainty estimation for predictions
Segment-level analysis: How consistent predictions are across time
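
One plausible way to turn per-segment sigmoid outputs into a file-level label, a confidence score, and a consistency measure is sketched below. How 5_predict.py actually aggregates segments is not documented here, so the averaging rule, the 0.5 threshold, and the function name `aggregate_segments` are all assumptions.

```python
def aggregate_segments(segment_probs, threshold=0.5):
    """Combine per-segment P(schizophrenia) values into one file-level result."""
    if not segment_probs:
        raise ValueError("at least one segment probability is required")
    mean_prob = sum(segment_probs) / len(segment_probs)
    is_positive = mean_prob >= threshold
    # Confidence is the mean probability assigned to the predicted class.
    confidence = mean_prob if is_positive else 1.0 - mean_prob
    # Agreement is the fraction of segments that vote for the file-level label.
    votes = sum((p >= threshold) == is_positive for p in segment_probs)
    return {
        "label": "Schizophrenia" if is_positive else "Healthy",
        "confidence": confidence,
        "segment_agreement": votes / len(segment_probs),
    }


result = aggregate_segments([0.9, 0.8, 0.7, 0.4])
print(f"{result['label']} (Confidence: {result['confidence']:.1%}, "
      f"segment agreement: {result['segment_agreement']:.0%})")
```

A low segment agreement alongside a moderate confidence suggests the prediction varies across the recording and deserves a closer look.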

Troubleshooting

Common Issues

1. "No .edf files found"

Ensure your data is in data/raw/healthy/ and data/raw/schizophrenia/
Check file extensions are exactly .edf

2. "Dataset files do not exist"

Run previous steps in order (1 → 2 → 3 → 4 → 5)
Each step depends on outputs from previous steps

3. "Out of memory" during training

Reduce batch size in the model training scripts
Use a machine with more RAM
Consider reducing segment length

4. Poor model performance

Ensure you have enough training data (>20 subjects per class)
Check data quality and preprocessing
Try different segmentation parameters
Run hyperparameter tuning longer (increase MAX_TRIALS)

Data Quality Checks

Before running the pipeline, verify:

EDF files load correctly with MNE
All files have the same number of channels
Sampling rates are consistent
No corrupted or truncated recordings
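
These checks can be automated with a small script. The sketch below assumes MNE is installed for reading headers; the helper names `read_header` and `find_inconsistent` are hypothetical, and treating the most common (channels, sampling rate) pair as the expected one is a design choice made for this example.

```python
from collections import Counter


def read_header(path):
    """Return (n_channels, sampling_rate_hz) for one EDF file using MNE."""
    import mne  # imported lazily; only needed when files are actually read

    raw = mne.io.read_raw_edf(path, preload=False, verbose=False)
    return len(raw.ch_names), float(raw.info["sfreq"])


def find_inconsistent(headers):
    """Return the files whose header differs from the most common one.

    headers maps a file name to its (n_channels, sampling_rate_hz) pair.
    """
    if not headers:
        return {}
    expected, _ = Counter(headers.values()).most_common(1)[0]
    return {name: header for name, header in headers.items()
            if header != expected}


example = {
    "subject1.edf": (19, 250.0),
    "subject2.edf": (19, 250.0),
    "subject3.edf": (21, 250.0),  # illustrative values, not real data
}
print(find_inconsistent(example))
```

In practice `headers` would be built by calling `read_header` on each file under data/raw/; a file that fails to load at that point is itself a sign of a corrupted or truncated recording.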

Expected Results

With a good dataset, you should expect:

Cross-validation accuracy: 80-90%
Run time per step:
- Step 1: Minutes
- Step 2: Minutes
- Step 3: Hours (depending on MAX_TRIALS)
- Step 4: 30-60 minutes
- Step 5: Seconds per prediction

Technical Details

Model Architecture

The CNN model is a 1D Convolutional Neural Network for time-series EEG data, built from the following components:

Input: EEG segments of shape (timesteps, channels)
Conv1D layers: Extract temporal patterns
Batch normalization: Stabilize training
Max pooling: Reduce dimensionality
Dropout: Prevent overfitting
Global pooling: Aggregate features
Dense layers: Final classification
Sigmoid output: Binary probability

Data Processing

Standardization: Z-score normalization per channel
Segmentation: Overlapping windows for data augmentation
Stratification: Balanced train/validation splits
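
Per-channel z-score standardization of a segmented dataset can be sketched with NumPy as follows. The project stores its scalers as .pkl files, so the real scripts may use a library scaler instead; the function names here are hypothetical.

```python
import numpy as np


def fit_channel_stats(X: np.ndarray):
    """Compute per-channel mean and std of X with shape (samples, timesteps, channels)."""
    mean = X.mean(axis=(0, 1))
    std = X.std(axis=(0, 1))
    # Guard against division by zero for a constant (flat) channel.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def standardize(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Apply z-score normalization channel by channel via broadcasting."""
    return (X - mean) / std


rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(50, 100, 4))  # synthetic data
mean, std = fit_channel_stats(X_train)
X_scaled = standardize(X_train, mean, std)
print(X_scaled.mean(axis=(0, 1)).round(6), X_scaled.std(axis=(0, 1)).round(6))
```

The statistics should be fitted on training data only and then reused unchanged for validation folds and for inference, which is why the scaler is saved alongside each model.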

References

This implementation builds on the following tools and methods:

EEG Processing: MNE-Python library
Deep Learning: TensorFlow/Keras
Hyperparameter Optimization: Keras Tuner with Hyperband algorithm
Evaluation: Stratified k-fold cross-validation

Contributing

Feel free to submit issues, suggestions, or improvements. This pipeline can be adapted for other EEG classification tasks.

Note: This is a research tool. Any medical applications should undergo proper validation and regulatory approval.
