AI-Powered Medical Research Data Extraction Platform
A production-ready web application for extracting structured data from clinical research papers (PDFs) using multi-agent AI, with a focus on systematic reviews of neurosurgical literature.
Features • Quick Start • Documentation • Architecture • Contributing
Advanced PDF Processing
- Interactive text layers with coordinate tracking
- Geometric figure extraction via PDF.js operator interception
- Geometric table extraction with Y/X coordinate clustering
- Visual bounding box provenance (color-coded by extraction method)
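As a rough illustration of Y/X coordinate clustering, the sketch below groups positioned text fragments into rows by Y proximity and then orders each row by X. The `TextItem` shape, the `clusterRows` name, and the default tolerance are assumptions made for this example; they are not taken from the project's code.

```typescript
// Hypothetical shape for a positioned text fragment (not the project's real type).
interface TextItem {
  text: string;
  x: number; // left edge, in page units
  y: number; // vertical position, in page units
}

interface RowCluster {
  y: number; // Y of the first item placed in the row
  items: TextItem[];
}

// Group items whose Y lies within `tolerance` of the row's first item,
// then sort each row left to right by X.
// Rows come out in ascending Y; reverse them if Y grows upward, as in PDF user space.
function clusterRows(items: TextItem[], tolerance = 2): string[][] {
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const rows: RowCluster[] = [];
  let current: RowCluster | null = null;
  for (const item of sorted) {
    if (current !== null && Math.abs(item.y - current.y) <= tolerance) {
      current.items.push(item);
    } else {
      current = { y: item.y, items: [item] };
      rows.push(current);
    }
  }
  return rows.map((row) =>
    [...row.items].sort((a, b) => a.x - b.x).map((cell) => cell.text)
  );
}
```

A fixed tolerance is the simplest choice; a real extractor would probably derive it from the font size or line height of the page.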
Multi-Agent AI Pipeline
- 6 specialized medical research agents powered by Google Gemini
  - StudyDesignExpertAgent (92% accuracy) - Research methodology
  - PatientDataSpecialistAgent (88% accuracy) - Demographics
  - SurgicalExpertAgent (91% accuracy) - Procedures
  - OutcomesAnalystAgent (89% accuracy) - Statistics
  - NeuroimagingSpecialistAgent (92% accuracy) - Imaging
  - TableExtractorAgent (100% structural validation)
- Multi-agent consensus voting with confidence scoring (95-96% accuracy)
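One plausible way to combine agent outputs is a confidence-weighted vote. This is a minimal sketch under assumptions: the `AgentResult` shape and the `voteByConfidence` function are hypothetical, and the project's actual consensus rule may differ.

```typescript
// Hypothetical result shape; the real agents' output format is not shown in this README.
interface AgentResult {
  agent: string;
  value: string;
  confidence: number; // expected range 0..1
}

interface Consensus {
  value: string;
  confidence: number; // share of the total confidence weight backing the winner
}

// Sum the confidence behind each candidate value and pick the heaviest one.
function voteByConfidence(results: AgentResult[]): Consensus | null {
  const totals: Record<string, number> = {};
  let sum = 0;
  for (const r of results) {
    totals[r.value] = (totals[r.value] ?? 0) + r.confidence;
    sum += r.confidence;
  }
  if (sum <= 0) return null; // no usable votes
  let best: Consensus | null = null;
  for (const value of Object.keys(totals)) {
    const share = totals[value] / sum;
    if (best === null || share > best.confidence) {
      best = { value, confidence: share };
    }
  }
  return best;
}
```

Reporting the winner's share of the total weight gives a score that drops when agents disagree, which makes it a natural flag for manual review.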
Extraction Methods
- Manual text selection with mouse
- AI-powered PICO-T extraction
- Automated table and figure analysis
- Citation provenance tracking (sentence-level coordinates)
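Sentence-level coordinates can be pictured as the smallest box that encloses every text fragment of a sentence. The sketch below assumes a hypothetical `Rect` shape and a `sentenceBounds` helper; neither name comes from the project.

```typescript
// Hypothetical rectangle for one text fragment on a page.
interface Rect {
  page: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Smallest rectangle enclosing all fragments that share the first fragment's page.
function sentenceBounds(fragments: Rect[]): Rect | null {
  const first = fragments[0];
  if (first === undefined) return null;
  let left = Infinity;
  let top = Infinity;
  let right = -Infinity;
  let bottom = -Infinity;
  for (const f of fragments) {
    if (f.page !== first.page) continue; // skip fragments that spill onto another page
    left = Math.min(left, f.x);
    top = Math.min(top, f.y);
    right = Math.max(right, f.x + f.width);
    bottom = Math.max(bottom, f.y + f.height);
  }
  return { page: first.page, x: left, y: top, width: right - left, height: bottom - top };
}
```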
Export Formats
- JSON with complete audit trail
- CSV for spreadsheet analysis
- Excel (XLSX) with multiple sheets
- HTML audit reports with source citations
- Google Sheets integration
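For the CSV export, the main pitfall is quoting fields that contain commas, quotes, or line breaks. The following sketch follows the usual RFC 4180 quoting rules; the `ExportRow` type and the `toCsv` function are illustrative assumptions, not the project's export API.

```typescript
// Hypothetical flat record shape used only for this sketch.
type ExportRow = Record<string, string | number | null | undefined>;

// Quote a field when it contains a comma, quote, or line break; double any embedded quotes.
function escapeCsvField(value: string | number | null | undefined): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Emit a header line followed by one line per row, in the given column order.
function toCsv(rows: ExportRow[], columns: string[]): string {
  const header = columns.map((col) => escapeCsvField(col)).join(",");
  const lines = rows.map((row) =>
    columns.map((col) => escapeCsvField(row[col])).join(",")
  );
  return [header, ...lines].join("\r\n");
}
```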
Production Features
- Automatic crash detection and recovery
- Circuit breaker for API fault tolerance
- LRU caching for performance
- Comprehensive error handling
- LocalStorage persistence
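To make the LRU caching item concrete, here is a minimal sketch built on the insertion-order guarantee of `Map`. The `LruCache` class is a hypothetical example, not the project's cache implementation.

```typescript
// Minimal LRU cache that relies on Map iterating keys in insertion order.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so the entry becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A cache like this could, for example, hold rendered page text keyed by page number so that revisiting a page does not repeat the extraction work.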
Prerequisites:
- Node.js 16 or higher
- Python 3.11+ with Poetry (for the backend)
- Gemini API key (a free key is available at ai.google.dev)
Option 1: Backend-First (Recommended for Production)
This option provides enhanced security by keeping API keys on the backend only.
# 1. Clone the repository
git clone https://github.com/matheus-rech/clinical-extractor.git
cd clinical-extractor
# 2. Setup backend
cd backend
poetry install
cp .env.example .env
# Edit backend/.env and add your Gemini API key:
# GEMINI_API_KEY=your_api_key_here
# Start backend (in Terminal 1)
poetry run uvicorn app.main:app --reload
# Backend running at http://localhost:8000
# 3. Setup frontend (in Terminal 2)
cd .. # Back to project root
npm install
echo 'VITE_BACKEND_URL=http://localhost:8000' > .env.local
# Start frontend
npm run dev
# Frontend running at http://localhost:3000
Option 2: Frontend-Only (Development/Fallback)
For quick testing or development without backend setup.
# 1. Clone the repository
git clone https://github.com/matheus-rech/clinical-extractor.git
cd clinical-extractor
# 2. Install dependencies
npm install
# 3. Configure environment variables
cp .env.example .env.local
# Edit .env.local and add your Gemini API key:
# VITE_GEMINI_API_KEY=your_api_key_here
# 4. Start development server
npm run dev
# 5. Open browser at http://localhost:3000
Option 3: Docker Compose (Production-Ready)
Deploy the entire stack with Docker for production or containerized development.
# 1. Clone the repository
git clone https://github.com/matheus-rech/clinical-extractor.git
cd clinical-extractor
# 2. Configure environment variables
cp .env.example .env
# Edit .env and add your Gemini API key:
# GEMINI_API_KEY=your_api_key_here
# 3. Build and start all services
docker-compose up -d
# Services will be available at:
# - Frontend: http://localhost:3000
# - Backend: http://localhost:8000
# 4. View logs
docker-compose logs -f
# 5. Stop services
docker-compose down
# 6. Rebuild after code changes
docker-compose up -d --build
Features:
- Isolated containers for frontend and backend
- Automatic health checks and restart policies
- Nginx-based frontend serving with gzip compression
- Multi-worker FastAPI backend (4 workers)
- Production-optimized builds
- Easy scaling with docker-compose scale
Basic usage:
- Upload PDF - Click "Choose PDF File" or "Load Sample PDF"
- Extract Data - Use manual selection or click "Generate PICO"
- Review - Navigate through 8-step wizard to verify extracted data
- Export - Download as JSON, CSV, Excel, or submit to Google Sheets
| Document | Description |
|---|---|
| Architecture Guide | Multi-agent AI pipeline, services architecture |
| Testing Guide | Unit tests, E2E tests (95 Playwright tests) |
| Deployment Guide | Production deployment, CI/CD setup |
| Development Guide | Development workflow, best practices |
| API Integration | Gemini API, agent prompts, medical agents |
| Features | Complete feature verification and status |
| CLAUDE.md | AI assistant guide (for Claude Code) |
Clinical Extractor
├── Frontend (TypeScript + Vite)
│   ├── PDF Pipeline (PDF.js)
│   ├── AI Service (Gemini API)
│   ├── Multi-Agent Orchestrator
│   │   ├── StudyDesignExpertAgent
│   │   ├── PatientDataSpecialistAgent
│   │   ├── SurgicalExpertAgent
│   │   ├── OutcomesAnalystAgent
│   │   ├── NeuroimagingSpecialistAgent
│   │   └── TableExtractorAgent
│   ├── Citation Service (Provenance Tracking)
│   ├── Export Manager (JSON/CSV/Excel/HTML)
│   └── Error Recovery System
└── Backend (Python + FastAPI) - Optional
    ├── ChromaDB (Vector Database)
    ├── Advanced AI Processing
    └── Data Persistence
Application Initialization (src/main.ts)
- Dependency injection pattern
- Module orchestration (33 specialized modules)
PDF Pipeline (src/pdf/)
- PDFLoader, PDFRenderer, TextSelection
- Geometric figure & table extraction
AI Service (src/services/AIService.ts)
- 7 Gemini AI functions
- Circuit breaker pattern for fault tolerance
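The circuit breaker pattern mentioned here can be sketched as a small wrapper around any async call. Everything in this example (the `CircuitBreaker` class, its thresholds, and the injectable clock) is an assumption for illustration and is not taken from src/services/AIService.ts.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Opens after `maxFailures` consecutive failures, rejects calls while open,
// and lets one trial call through once `resetMs` has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly maxFailures: number;
  private readonly resetMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, resetMs = 30000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
    this.now = now; // injectable clock, which keeps the breaker easy to test
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetMs) {
        throw new Error("Circuit open: request skipped");
      }
      this.state = "half-open"; // cooldown elapsed, allow a single trial request
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.maxFailures) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

Wrapping each Gemini request in `breaker.call(...)` would let the application fail fast during an outage instead of queuing requests that are likely to time out.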
Multi-Agent System (src/services/AgentOrchestrator.ts)
- 6 specialized medical agents
- Consensus voting & confidence scoring
Data Management (src/data/ExtractionTracker.ts)
- Complete audit trails
- LocalStorage persistence
Error Handling (src/utils/errorBoundary.ts)
- Crash detection & recovery
- Session restoration
# Unit Tests (Jest)
npm test # Run all unit tests
npm run test:watch # Run tests in watch mode
npm run test:coverage # Generate coverage report
# E2E Tests (Playwright) - 95 tests across 8 suites
npm run test:e2e # Run all E2E tests (headless)
npm run test:e2e:headed # Run with visible browser
npm run test:e2e:debug # Step-through debugging
# Type Checking
npm run lint # TypeScript type checking
Test Results:
- 77/96 tests pass without API key (80% - infrastructure only)
- 96/96 tests pass with API key (100% - including AI tests)
See docs/TESTING.md for comprehensive testing guide.
# Start development server
npm run dev # Opens on http://localhost:3000
# Build for production
npm run build
# Preview production build
npm run preview
clinical-extractor/
├── src/
│   ├── main.ts              # Entry point & orchestration
│   ├── types/               # TypeScript interfaces
│   ├── config/              # Configuration
│   ├── state/               # State management (Observer pattern)
│   ├── data/                # Extraction tracking & persistence
│   ├── forms/               # Multi-step form wizard
│   ├── pdf/                 # PDF.js pipeline
│   ├── services/            # 16 specialized services
│   └── utils/               # Utilities & error handling
├── tests/
│   ├── unit/                # Jest unit tests (6 suites)
│   └── e2e-playwright/      # Playwright E2E tests (8 suites, 95 tests)
├── docs/                    # Documentation
├── backend/                 # Python FastAPI backend (optional)
└── archives/                # Historical development records
See docs/DEVELOPMENT.md for development workflow and best practices.
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make your changes:
  - Follow existing code style
  - Add tests for new features
  - Update documentation as needed
- Run tests: npm test && npm run test:e2e
- Type check: npm run lint
- Commit changes: git commit -m "feat: Add your feature"
- Push to branch: git push origin feature/your-feature
- Open a Pull Request
We follow Conventional Commits:
- feat: New features
- fix: Bug fixes
- docs: Documentation changes
- test: Test additions/changes
- refactor: Code refactoring
- chore: Build/tooling changes
Current Version: 1.0.0 (Production Ready)
- Core extraction features complete
- Multi-agent AI pipeline operational (95-96% accuracy)
- 95 E2E tests + 6 unit test suites
- Citation provenance system
- Error recovery & fault tolerance
- Production deployment ready
- Comprehensive documentation
Browser Support:
- Chrome/Edge (Recommended)
- Firefox
- Safari
Requirements:
- Node.js 16+
- Modern browser with ES2022 support
- Gemini API key (free tier available)
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Built with Google Gemini AI for advanced medical text analysis
- PDF processing powered by PDF.js
- Testing with Playwright and Jest
- Frontend tooling: Vite + TypeScript
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Made with ❤️ for the medical research community