Local-first video point tracking with a React frontend, an Express/BullMQ backend, ffmpeg-based video processing, and OpenAI-compatible multimodal model inference through Ollama, LM Studio, or llama.cpp.
- Upload a local video file.
- Choose a local provider and vision-capable model.
- Set a tracking target such as ball, hand, or player.
- Sample frames at a configurable FPS and send them to a local multimodal endpoint.
- Receive normalized 2D points per frame.
- Preview the track over the original video in the browser.
- Download a rendered tracked MP4 and the raw JSON result.
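
As an illustration of what "normalized 2D points per frame" means for the preview overlay, here is a minimal sketch that maps a normalized point to pixel coordinates. The `TrackPoint` shape and its field names are assumptions made for this example; the actual JSON result may be structured differently.

```typescript
// Hypothetical shape of one tracked point; the real result schema may differ.
interface TrackPoint {
  frameIndex: number; // index of the sampled frame
  timeSec: number;    // timestamp of the frame in the source video
  x: number;          // normalized horizontal position, 0..1
  y: number;          // normalized vertical position, 0..1
}

interface PixelPoint {
  px: number;
  py: number;
}

// Clamp into [0, 1] so a slightly out-of-range model answer still lands on the frame.
function clamp01(v: number): number {
  return Math.min(1, Math.max(0, v));
}

// Map a normalized point to pixel coordinates for a given display size.
function toPixels(p: TrackPoint, width: number, height: number): PixelPoint {
  return {
    px: Math.round(clamp01(p.x) * width),
    py: Math.round(clamp01(p.y) * height),
  };
}

const sample: TrackPoint = { frameIndex: 0, timeSec: 0, x: 0.5, y: 0.25 };
console.log(toPixels(sample, 1280, 720)); // { px: 640, py: 180 }
```

Because the points are normalized, the same result can be drawn over the browser preview and the rendered MP4 even when the two have different resolutions.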
- Frontend: React 18, TypeScript, Vite, Zustand, Axios
- Backend: Node 20, Express 4, BullMQ, Redis, OpenAI SDK, fluent-ffmpeg, Zod, Winston
- Infra: Docker, Docker Compose, nginx, Redis
.
├── backend
├── frontend
├── nginx
├── docker-compose.yml
├── docker-compose.dev.yml
└── .env.example
ollama pull llava
cp .env.example .env
docker compose up --build
Open http://localhost:3000.
- Open LM Studio.
- Load a vision-capable model and keep it loaded.
- Start the local OpenAI-compatible server on port 1234.
cp .env.example .env
Edit .env and set:
LLM_PROVIDER=lmstudio
Then run:
docker compose up --build
Notes:
- The backend uses LM Studio JSON-schema output when the loaded model supports it, which improves coordinate parsing reliability.
- If a loaded LM Studio vision model has a custom ID that does not match the usual vision-name heuristics, the app now falls back to showing all loaded LM Studio models instead of hiding them.
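
To show why JSON-schema output helps coordinate parsing, the sketch below builds an OpenAI-compatible chat request body with a `response_format` of type `json_schema` and parses the reply defensively. The schema fields (`visible`, `x`, `y`), the prompt wording, and the helper names `buildPointRequest` and `parsePoint` are assumptions for illustration, not the backend's actual code.

```typescript
// JSON Schema for the answer we ask for: one normalized point plus a visibility flag.
const pointSchema = {
  type: "object",
  properties: {
    visible: { type: "boolean" },
    x: { type: "number", minimum: 0, maximum: 1 },
    y: { type: "number", minimum: 0, maximum: 1 },
  },
  required: ["visible", "x", "y"],
  additionalProperties: false,
} as const;

// Build a chat-completions request body for one frame (hypothetical helper).
function buildPointRequest(model: string, target: string, frameDataUrl: string) {
  return {
    model,
    temperature: 0,
    messages: [
      {
        role: "user" as const,
        content: [
          {
            type: "text" as const,
            text: `Locate the ${target}. Answer with normalized x and y in [0, 1].`,
          },
          { type: "image_url" as const, image_url: { url: frameDataUrl } },
        ],
      },
    ],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "tracked_point", strict: true, schema: pointSchema },
    },
  };
}

interface ParsedPoint {
  visible: boolean;
  x: number;
  y: number;
}

// Parse the model reply; return null instead of throwing on malformed output.
function parsePoint(content: string): ParsedPoint | null {
  try {
    const value: unknown = JSON.parse(content);
    if (typeof value !== "object" || value === null) return null;
    const { visible, x, y } = value as Record<string, unknown>;
    if (typeof visible !== "boolean") return null;
    if (typeof x !== "number" || typeof y !== "number") return null;
    if (x < 0 || x > 1 || y < 0 || y > 1) return null;
    return { visible, x, y };
  } catch {
    return null;
  }
}
```

Even with schema-constrained output, keeping a tolerant parser like `parsePoint` is a reasonable safeguard for models or providers that ignore `response_format`.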
Start the multimodal server first:
./llava-server \
-m llava-v1.6-mistral-7b.gguf \
--mmproj mmproj-model-f16.gguf \
--port 8080 \
--host 0.0.0.0
Then:
cp .env.example .env
Edit .env and set:
LLM_PROVIDER=llamacpp
Run:
docker compose up --build

Start the Redis-backed backend and Vite frontend with hot reload:
cp .env.example .env
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build
- Frontend dev server: http://localhost:5173
- Backend API: http://localhost:4000
- Full proxied app: http://localhost:3000
- POST /api/track
- GET /api/track/progress/:jobId
- GET /api/track/result/:jobId
- GET /api/track/download/:jobId/:filename
- GET /api/models?provider=ollama
- GET /api/health
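
The routes above can be wrapped in a small URL helper so a client does not assemble paths by hand. This is only a sketch: `createEndpoints` is a hypothetical name, and the request and response bodies for each route are not described here, so the helper deliberately builds URLs only.

```typescript
// Build URLs for the documented routes relative to a base address.
function createEndpoints(base: string) {
  const root = base.replace(/\/+$/, ""); // drop trailing slashes
  const enc = encodeURIComponent;
  return {
    track: () => `${root}/api/track`,
    progress: (jobId: string) => `${root}/api/track/progress/${enc(jobId)}`,
    result: (jobId: string) => `${root}/api/track/result/${enc(jobId)}`,
    download: (jobId: string, filename: string) =>
      `${root}/api/track/download/${enc(jobId)}/${enc(filename)}`,
    models: (provider: string) => `${root}/api/models?provider=${enc(provider)}`,
    health: () => `${root}/api/health`,
  };
}

// Example against the backend address used in development.
const api = createEndpoints("http://localhost:4000/");
console.log(api.progress("job-1")); // http://localhost:4000/api/track/progress/job-1
console.log(api.models("ollama"));  // http://localhost:4000/api/models?provider=ollama
```

A typical client would presumably submit to `track()`, follow `progress(jobId)` until the job finishes, then read `result(jobId)` and `download(jobId, filename)`.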
See .env.example for the full list. The key values are:
- LLM_PROVIDER
- OLLAMA_BASE_URL
- LMSTUDIO_BASE_URL
- LLAMACPP_BASE_URL
- MAX_UPLOAD_MB
- MAX_VIDEO_SECS
- QUEUE_CONCURRENCY
- FRAME_TMP_DIR
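
A minimal sketch of how these variables could be read and validated at startup follows. The variable names come from the list above; every default value, the `AppConfig` shape, and the `loadConfig` helper are assumptions for illustration, so treat .env.example as the source of truth.

```typescript
const PROVIDERS = ["ollama", "lmstudio", "llamacpp"] as const;
type Provider = (typeof PROVIDERS)[number];

interface AppConfig {
  provider: Provider;
  baseUrl: string;
  maxUploadMb: number;
  maxVideoSecs: number;
  queueConcurrency: number;
  frameTmpDir: string;
}

function isProvider(v: string): v is Provider {
  return (PROVIDERS as readonly string[]).indexOf(v) !== -1;
}

// Parse a positive integer, falling back when the variable is unset or empty.
function positiveInt(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const provider = env.LLM_PROVIDER ?? "ollama"; // assumed default
  if (!isProvider(provider)) {
    throw new Error(`LLM_PROVIDER must be one of ${PROVIDERS.join(", ")}`);
  }
  // Placeholder defaults using each tool's usual local port; adjust for Docker networking.
  const baseUrls: Record<Provider, string> = {
    ollama: env.OLLAMA_BASE_URL ?? "http://localhost:11434/v1",
    lmstudio: env.LMSTUDIO_BASE_URL ?? "http://localhost:1234/v1",
    llamacpp: env.LLAMACPP_BASE_URL ?? "http://localhost:8080/v1",
  };
  return {
    provider,
    baseUrl: baseUrls[provider],
    maxUploadMb: positiveInt(env.MAX_UPLOAD_MB, 200, "MAX_UPLOAD_MB"),       // assumed default
    maxVideoSecs: positiveInt(env.MAX_VIDEO_SECS, 60, "MAX_VIDEO_SECS"),     // assumed default
    queueConcurrency: positiveInt(env.QUEUE_CONCURRENCY, 1, "QUEUE_CONCURRENCY"), // assumed default
    frameTmpDir: env.FRAME_TMP_DIR ?? "/tmp/frames",                         // assumed default
  };
}
```

In a Node process this would be called as `loadConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.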
- The in-browser result player overlays points on the original uploaded clip for immediate inspection.
- The backend also renders a downloadable tracked MP4 using ffmpeg drawbox filters.
- In Docker, the backend prefers system ffmpeg and ffprobe binaries; FFMPEG_PATH and FFPROBE_PATH can override detection if needed.
- When the SSE client disconnects, the backend aborts the active tracking job and cleans up runtime artifacts.
- Completed job artifacts are scheduled for deletion 30 minutes after completion.
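
As a rough illustration of the drawbox-based rendering mentioned in the notes above, the sketch below turns timed, normalized points into an ffmpeg filter chain in which each box is only enabled during its own time window. The `TimedPoint` shape, the box size, the color, and the `buildDrawboxFilter` helper are assumptions; the backend's real filter graph may be built differently.

```typescript
// Hypothetical input: one normalized point per sampled frame.
interface TimedPoint {
  timeSec: number; // when the point starts being shown
  x: number;       // normalized 0..1
  y: number;       // normalized 0..1
}

// Build a comma-separated chain of drawbox filters, one per point.
// Each box is centered on the point and visible for frameStepSec seconds.
function buildDrawboxFilter(
  points: TimedPoint[],
  frameStepSec: number,
  boxSize: number = 24,
): string {
  if (points.length === 0) return "null"; // ffmpeg's pass-through video filter
  const half = boxSize / 2;
  return points
    .map((p) => {
      const start = p.timeSec.toFixed(3);
      const end = (p.timeSec + frameStepSec).toFixed(3);
      // iw and ih are ffmpeg expression variables for input width and height.
      return (
        `drawbox=x=iw*${p.x.toFixed(4)}-${half}:y=ih*${p.y.toFixed(4)}-${half}` +
        `:w=${boxSize}:h=${boxSize}:color=red@0.9:t=fill` +
        `:enable='between(t,${start},${end})'`
      );
    })
    .join(",");
}

console.log(buildDrawboxFilter([{ timeSec: 0, x: 0.5, y: 0.25 }], 0.5));
// drawbox=x=iw*0.5000-12:y=ih*0.2500-12:w=24:h=24:color=red@0.9:t=fill:enable='between(t,0.000,0.500)'
```

The resulting string is meant to be passed as a video filter argument (for example `-vf`). For long clips with many points, a chain of this kind can get very long, so a real implementation might batch or simplify it.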
The frontend production build and backend TypeScript build are part of the implementation workflow. The final Compose integration still depends on a locally available multimodal provider and a real sample video.