A conversational AI system for analyzing technical drawings, powered by Qwen3-VL-8B-Instruct with vLLM.
- Conversational Q&A: Ask questions about technical drawings in natural language
- Instant Upload: Upload once, ask multiple questions without re-uploading
- View Detection: Automatically detect and highlight different views (front, side, top, sections)
- Bounding Box Visualization: See detected elements highlighted on the drawing
- Fast Inference: Optimized with flash-attention and vLLM
- German & English: Supports both languages
- Python 3.12+
- Node.js 20+ (for frontend)
- CUDA 12.0+ with 48GB+ VRAM (for GPU acceleration)
- Conda or venv
# Create and activate environment
conda create -n deepseek-ocr python=3.12
conda activate deepseek-ocr
# Install dependencies
cd backend
pip install -r requirements_qwen.txt
# (Optional) Install flash-attention for 20-30% speedup
./fix_flash_attn.sh
# Start backend
./start_backend.sh

Backend will be available at http://localhost:8000
# Install dependencies
cd frontend
npm install
# Start development server
npm run dev

Frontend will be available at http://localhost:5173
┌──────────────┐
│   Frontend   │  React + Vite + TailwindCSS
│  Port 5173   │
└──────┬───────┘
       │ HTTP
┌──────▼───────┐
│   Backend    │  FastAPI + vLLM
│  Port 8000   │
└──────┬───────┘
       │
┌──────▼──────────────────┐
│  Qwen3-VL-8B-Instruct   │  Vision-Language Model
│  with flash-attention   │  Direct, concise answers
└─────────────────────────┘
Default: Qwen/Qwen3-VL-8B-Instruct (non-reasoning, direct answers)
To change the model, edit backend/qwen_vision_service.py:
def __init__(self, model_path: str = "Qwen/Qwen3-VL-8B-Instruct"):

Alternative models:
- Qwen/Qwen3-VL-8B-Thinking - shows reasoning (verbose, with mixed results)
- Qwen/Qwen3-VL-4B-Instruct - smaller, faster
- Qwen/Qwen3-VL-30B-A3B-Instruct - larger, more capable (needs more VRAM, at least 100GB)
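As a sketch of what that change could look like in code (the class name below is only assumed from the file name and is not confirmed by this README):

```python
# Hypothetical sketch: the service class name is assumed from the file name.
from qwen_vision_service import QwenVisionService

# Swap in a smaller model when VRAM is limited:
service = QwenVisionService(model_path="Qwen/Qwen3-VL-4B-Instruct")
```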
Temperature & Response Length:
# In qwen_vision_service.py, adjust:
temperature=0.0 # 0.0 = most concise, 0.7 = more creative
max_tokens=512    # Lower = shorter answers
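With vLLM these two values typically feed into a SamplingParams object; a minimal sketch of that mapping (the exact wiring in qwen_vision_service.py may differ):

```python
from vllm import SamplingParams

# Minimal sketch: how temperature and max_tokens map onto vLLM generation settings.
sampling_params = SamplingParams(
    temperature=0.0,  # 0.0 = deterministic, most concise answers
    max_tokens=512,   # hard cap on the answer length
)
```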
GPU Memory:
# In qwen_vision_service.py:
gpu_memory_utilization=0.90  # Use 90% of GPU memory
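For context, this setting is normally handed straight to vLLM's engine constructor; a minimal sketch, assuming the service wraps vllm.LLM:

```python
from vllm import LLM

# Minimal sketch: gpu_memory_utilization bounds the fraction of VRAM vLLM pre-allocates.
llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",
    gpu_memory_utilization=0.90,
)
```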
Automatically detects and highlights different views in technical drawings:
- Front, side, top views
- Section views (A-A, B-B, etc.)
- Detail views
- Isometric/3D views
See VIEW_DETECTION.md for details.
Upload → Chat → Bounding boxes appear automatically
The system can return bounding box coordinates for detected elements in JSON format:
[
{"bbox_2d": [x1, y1, x2, y2], "label": "front view"},
{"bbox_2d": [x1, y1, x2, y2], "label": "Γ76"}
]

Grounding is automatically enabled for questions containing keywords like:
- ansicht, view, zeige, show, wo ist, finde
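The detected_elements returned by the API can be rendered onto the original drawing. A minimal Pillow sketch (not part of the repo), assuming bbox_2d coordinates are pixels in the reported image_width x image_height space:

```python
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, detected_elements: list, out_path: str = "annotated.png") -> None:
    """Draw each bbox_2d rectangle and its label onto the drawing."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for el in detected_elements:
        x1, y1, x2, y2 = el["bbox_2d"]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), el["label"], fill="red")
    img.save(out_path)
```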
Upload an image for analysis.
Request:
file: Image or PDF file
Response:
{
"session_id": "abc-123-def-456",
"filename": "drawing.pdf",
"status": "ready",
"message": "Bild erfolgreich hochgeladen",
"detection_status": "processing"
}
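A hypothetical client call for this endpoint; the route name below is an assumption (the exact path is not spelled out here), so check the actual FastAPI app before relying on it:

```python
import requests

# Assumed route "/upload" on the local backend; verify against the FastAPI routes.
with open("drawing.pdf", "rb") as f:
    resp = requests.post("http://localhost:8000/upload", files={"file": f})
resp.raise_for_status()
session_id = resp.json()["session_id"]
print(session_id)
```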
Ask a question about an uploaded image.
Request:
session_id: Session ID from upload
question: Question text
use_grounding: Enable bounding boxes (default: true)
Response:
{
"text": "90 mm",
"markdown": "90 mm",
"detected_elements": [...],
"image_width": 1920,
"image_height": 1080,
"processing_time": 2.34
}
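A matching hypothetical call for asking a question; the route and payload shape are assumptions:

```python
import requests

session_id = "abc-123-def-456"  # value returned by the upload call

# Assumed route "/ask" and JSON body; verify against the actual FastAPI endpoint.
resp = requests.post(
    "http://localhost:8000/ask",
    json={
        "session_id": session_id,
        "question": "What is the diameter of the bore?",
        "use_grounding": True,
    },
)
print(resp.json()["text"])
```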
Check background view detection status.
Response:
{
"detection_status": "completed",
"detected_elements": [...],
"elements_count": 4
}
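A hypothetical polling loop for the background detection; the status route is likewise an assumption:

```python
import time
import requests

session_id = "abc-123-def-456"  # value returned by the upload call

# Assumed route "/status/{session_id}"; verify against the actual FastAPI endpoint.
while True:
    status = requests.get(f"http://localhost:8000/status/{session_id}").json()
    if status["detection_status"] == "completed":
        print(status["elements_count"], "views detected")
        break
    time.sleep(1)
```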
# Check if environment is activated
conda activate deepseek-ocr
# Check if port 8000 is available
lsof -ti:8000
# Check logs
tail -f backend/logs/*.log

# The system automatically falls back to eager mode (works, but slower)
# To fix properly:
cd backend
./fix_flash_attn.sh
# If it still fails, see backend/FLASH_ATTN_FIX.md

The first startup downloads the ~16GB model. This is normal and only happens once.
# Check download progress
watch -n 1 'du -sh ~/.cache/huggingface/'

# Reduce GPU memory usage in qwen_vision_service.py:
gpu_memory_utilization=0.80 # Reduce from 0.90
# Or use smaller model:
model_path="Qwen/Qwen3-VL-4B-Instruct"

| Metric | Value |
|---|---|
| Model Size | ~16GB |
| VRAM Usage | ~20GB (with 8B model) |
| Upload Time | < 1 second |
| First Response | 1-3 seconds |
| View Detection | 5-10 seconds (background) |
| Tokens/Query | ~50-200 tokens |
With flash-attention:
- 20-30% faster inference
- Lower memory usage
- Better throughput
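To confirm whether the flash-attention path is actually active in your environment, a quick check (not part of the repo's scripts) is to test that the package imports:

```python
import importlib.util

# If flash_attn is missing, the backend falls back to eager attention and the
# speedup listed above does not apply.
print("flash_attn available:", importlib.util.find_spec("flash_attn") is not None)
```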