Production deployment of PP-StructureV3 on Triton Inference Server, extended for XYNE/Juspay workloads. This project starts from the PaddleX High-Stability Serving (HPS) SDK for PP-StructureV3[^1][^2] and layers operational tooling, performance tuning, and optional multimodal captioning that we run in production at xynehq/xyne and across other Juspay services.
- `paddlex-server` container packages the PP-StructureV3 layout pipeline as a Triton Python backend (model `layout-parsing`) and ships the pipeline configs from `server/`.
- `blip-server` (optional) hosts a separate Triton instance serving BLIP image captioning (`blip-caption`) that `model.py` can call asynchronously for richer markdown output.
- `status_server.py` exposes `/instance_status` and `/paddlex_instance_status` on port `8081`, aggregating per-instance JSON heartbeats written by each Triton backend.
- All services inherit the PaddleX HPS runtime (HTTP :8000, gRPC :8001, Metrics :8002) from `paddle/hps:paddlex3.1-gpu`, with additional Python dependencies for PyMuPDF, BLIP, and Triton clients installed in the `Dockerfile`.
```
+-------------------+     gRPC/HTTP      +-----------+
|  paddlex-server   | <----------------> |  clients  |
|  layout-parsing   |                    +-----+-----+
|  status_server    |                          |
+---------+---------+                          |
          | async captions                     |
          v                                    |
+-------------------+  gRPC :8004 / HTTP :8003 |
|    blip-server    | <------------------------+
|    blip-caption   |
+-------------------+
```
- Instance-aware scheduling: `server/model_repo/layout-parsing/1/model.py` integrates `InstanceStatusTracker` with `/tmp/triton_instance_status_*.json` heartbeats, exposes configured/active counts, and honours overrides from `TRITON_INSTANCE_COUNT` and `config_*.pbtxt`.[^3]
- Async image captioning pipeline: `CaptionCoordinator` in `layout_captioning.py` streams markdown images to the BLIP Triton model, merges captions back into `parsing_res_list`, and is gated by `caption_config.yaml` plus the global `IMAGE_CAPTIONING_ENABLED` env flag.[^4]
- Memory hygiene for long PDFs: page rendering streams via bounded queues, `gc.collect()` is invoked per page, `paddle.device.cuda.empty_cache()` is called after each request, and a `PADDLE_TRIM_CACHE_EVERY_N` knob controls periodic GPU cache trimming.[^3]
- Resilient media uploads: `app_common.postprocess_images` is wrapped with `_POSTPROCESS_UPLOAD_TIMEOUT_SECONDS`, giving fast failures and actionable logs when downstream storage stalls.
- Consolidated configuration: model/backend matrices live in `configFiles/modelSupport.json` and `configFiles/modeSupportGpu.json`, while pipeline variants for the Paddle, Paddle+HPI, and ONNX Runtime runtimes sit alongside the shipping `server/pipeline_config.yaml`.
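The memory-hygiene cadence above can be sketched roughly as follows. This is an illustrative helper, not the backend code: the real logic lives in `model.py` and calls `paddle.device.cuda.empty_cache()` directly, whereas here the trimmer is injected so the cadence is testable without a GPU.

```python
import gc
import os

# PADDLE_TRIM_CACHE_EVERY_N controls how often the GPU cache is trimmed;
# 0 disables periodic trimming.
TRIM_EVERY_N = int(os.environ.get("PADDLE_TRIM_CACHE_EVERY_N", "0"))

def process_pages(pages, render_page, trim_cache, trim_every_n=TRIM_EVERY_N):
    """Render pages one at a time, collecting garbage after each page and
    trimming the GPU cache every `trim_every_n` pages."""
    results = []
    for i, page in enumerate(pages, start=1):
        results.append(render_page(page))
        gc.collect()  # release per-page Python objects promptly
        if trim_every_n and i % trim_every_n == 0:
            trim_cache()  # e.g. paddle.device.cuda.empty_cache()
    return results
```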
- `server/model_repo/layout-parsing/1/` – Triton Python backend using the PaddleX pipeline, caption orchestration, and active-instance tracking.
- `server/model_repo/blip-caption/1/` – Triton Python backend that lazy-loads and caches `Salesforce/blip-image-captioning-large` with JSON IO.
- `server/status_server.py` – thread-safe HTTP server that aggregates `/tmp/*_instance_status*.json`.
- `configFiles/` – curated runtime configs (`pipeline_config*.yaml`) and backend capability matrices (`modelSupport.json`, `modeSupportGpu.json`).
- `Dockerfile`, `docker-compose*.yml` – container build/test orchestration for local dev and production.
- `client/` – lightweight gRPC helper (`client.py`) and requirements for smoke tests against :8001.
Choose the right pipeline manifest and set `PADDLEX_HPS_PIPELINE_CONFIG_PATH` before launching `server.sh`:
| File | Purpose |
|---|---|
| `server/pipeline_config.yaml` | Default shipping config with Paddle backend and tuned thresholds for production. |
| `configFiles/pipeline_config Paddle.yaml` | Legacy pure-Paddle baseline (HP Inference disabled) for compatibility testing. |
| `configFiles/pipeline_config paddleHpiTrue.yaml` | Paddle backend with `use_hpip: True` to enable High Performance Inference acceleration on supported modules. |
| `configFiles/pipeline_config onnxRuntime.yaml` | Hybrid config pushing layout detection and OCR classifiers through ONNX Runtime for lower latency on NVIDIA GPUs. |
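The HPI variant differs from the baseline chiefly in the high-performance-inference switch. A minimal, illustrative fragment (surrounding keys follow PaddleX pipeline-config conventions and are not copied verbatim from the shipped files):

```yaml
# pipeline_config paddleHpiTrue.yaml (illustrative fragment)
pipeline_name: PP-StructureV3
use_hpip: True   # route supported modules through High Performance Inference
```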
`server/server.sh` copies the chosen manifest into the runtime image, resolves `config_{gpu,cpu}.pbtxt` symlinks, and boots Triton with `--model-repository` set to `/paddlex/var/paddlex_model_repo`.
- Layout parsing uses `config_gpu_paddlex.pbtxt` (`instance_group.count: 6`) for GPU deployments and `server/model_repo/layout-parsing/config_cpu.pbtxt` for CPU fallbacks.
- BLIP captioning relies on `config_gpu_blip.pbtxt` / `server/model_repo/blip-caption/config_gpu.pbtxt` (`instance_group.count: 4`, dynamic batching up to 16) and shares the same image.
- Update `TRITON_INSTANCE_COUNT` to ensure the status tracker reflects manual scaling, or edit the `count:` fields and reload the model.
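An `instance_group` stanza matching the GPU counts above looks roughly like this in standard Triton `config.pbtxt` syntax (fragment only; the other `config_gpu_paddlex.pbtxt` fields are omitted):

```
instance_group [
  {
    count: 6
    kind: KIND_GPU
  }
]
```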
We expose PaddleX model-to-backend compatibility as JSON so downstream services can reason about runtime choices; see `configFiles/modelSupport.json` and `configFiles/modeSupportGpu.json` for the exhaustive matrix.
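A downstream service could consume the matrix along these lines. The flat `{model: [backends]}` shape used here is an assumption for illustration; check the JSON files for the actual schema:

```python
import json
from pathlib import Path

def supported_backends(matrix_path, model_name):
    """Return the runtime backends listed for `model_name` in a
    compatibility matrix such as configFiles/modelSupport.json.

    Assumes an illustrative {"<model>": ["paddle", "onnxruntime", ...]}
    layout; adapt the lookup to the real schema as needed.
    """
    matrix = json.loads(Path(matrix_path).read_text())
    return matrix.get(model_name, [])
```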
Prerequisites: Docker Engine with the NVIDIA Container Toolkit, a CUDA-capable GPU (≥12 GB recommended), and access to the PaddleX 3.1 base image registry.
```shell
# Build the Triton-serving image
docker build -t paddlex/app:latest .

# Start layout parsing only (captioning disabled, lighter footprint)
IMAGE_CAPTIONING_ENABLED=false docker compose up paddlex-server

# Start both layout parsing and blip-caption (captioning enabled)
COMPOSE_PROFILES=image-captioning IMAGE_CAPTIONING_ENABLED=true \
  docker compose up

# Tail logs
docker logs -f paddlex-server
docker logs -f blip-server   # only if profile enabled
```

- `docker-compose.yml` mounts the repo into `/app`, reuses `config_gpu_paddlex.pbtxt`, and exposes ports 8000/8001/8002/8081.
- Captioning defaults on; flip `IMAGE_CAPTIONING_ENABLED` or remove the `image-captioning` profile to opt out.
```shell
# Build once, then deploy with production compose
docker compose -f docker-compose.prod.yml build

# Keep captioning enabled
COMPOSE_PROFILES=image-captioning IMAGE_CAPTIONING_ENABLED=true \
  docker compose -f docker-compose.prod.yml up -d

# Or disable captioning globally while keeping layout parsing online
IMAGE_CAPTIONING_ENABLED=false \
  docker compose -f docker-compose.prod.yml up -d paddlex-server
```

Production compose avoids bind-mounting the workspace, instead mounting curated model caches (`paddlex/official_models`) and optional BLIP cache directories. Ensure the external Docker network `xyne` exists (`docker network create xyne`) before bringing up the stack.
For on-host Triton, execute `server/server.sh` after setting:

```shell
export PADDLEX_HPS_PIPELINE_CONFIG_PATH=/path/to/pipeline_config.yaml
export IMAGE_CAPTIONING_ENABLED=true
./server/server.sh
```

The script mirrors the repo to `/paddlex/var/paddlex_model_repo`, boots the status server, and launches Triton with explicit model loads.
- Health metrics live at `/paddlex_instance_status` (layout-parsing active/idle counts) and `/instance_status` (layout + BLIP configured instances) on port `8081`.
- Runtime tuning knobs:
  - `TRITON_POSTPROCESS_TIMEOUT_SECONDS` – fail fast when image uploads hang.
  - `PADDLE_TRIM_CACHE_EVERY_N` – force `paddle.device.cuda.empty_cache()` periodically for long-running jobs.
  - `IMAGE_CAPTIONING_ENABLED` – global kill-switch for caption orchestration.
  - `TRITON_INSTANCE_COUNT` – advertises pre-scaled pool sizes to the status tracker.
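For alerting, the heartbeat endpoint can be polled and condensed. In this sketch the `configured`/`active` key names are assumptions for illustration; the authoritative payload shape comes from `status_server.py`:

```python
import json
from urllib.request import urlopen

def summarize_status(payload):
    """Condense an /instance_status-style payload into (configured, active).

    Key names are illustrative; adjust to the JSON actually emitted by
    status_server.py.
    """
    data = json.loads(payload) if isinstance(payload, str) else payload
    return int(data.get("configured", 0)), int(data.get("active", 0))

def poll_status(url="http://localhost:8081/instance_status", timeout=5):
    """Fetch and summarize the aggregated instance status."""
    with urlopen(url, timeout=timeout) as resp:
        return summarize_status(resp.read().decode("utf-8"))
```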
The `client/` folder includes a minimal smoke-test harness:

```shell
pip install -r client/requirements.txt
python client/client.py --file path/to/document.pdf --file-type 0 --url localhost:8001
```

Generated markdown and images are written to `./markdown_<page>/`, mirroring how downstream services consume the layout-parsing output.
We evaluated representative PDFs (Juspay pitch deck, LIC forms) across backends and GPUs. Times are wall-clock per document with a single instance unless noted.
| Mode | GPU / Machine | Document | Pages | Size | Latency | Peak GPU Mem | Quality | Notes |
|---|---|---|---|---|---|---|---|---|
| Paddle basic (UAT) | RTX 4090 (local) | juspay.pdf | 41 | 22 MB | 28 s | 6 GB + spike | Excellent | Baseline profile currently in UAT. |
| Paddle basic (UAT) | L4 (UAT) | juspay.pdf | 41 | 22 MB | 112 s | 6 GB + spike | Excellent | Longer wall-time on L4 due to lower clocks. |
| Paddle basic (UAT) | RTX 4090 (local) | LIC.pdf | 93 | 12 MB | 65 s | 6 GB + spike | Excellent | — |
| Paddle basic (UAT) | L4 (UAT) | LIC.pdf | 93 | 12 MB | 150–180 s | 6 GB + spike | Excellent | — |
| TensorRT fp16 | RTX 4090 (local) | juspay.pdf | 41 | 22 MB | 13 s | 9 GB + 4 GB | Poor | Conversion skipped layers → unacceptable accuracy. |
| TensorRT fp32 | RTX 4090 (local) | juspay.pdf | 41 | 22 MB | 18 s | 12 GB + 4 GB | Poor | Same degradation as fp16. |
| ONNX Runtime | RTX 4090 (local) | juspay.pdf | 41 | 22 MB | 19 s | 12 GB + 6 GB | Excellent | Competitive alternative. |
| Paddle + TensorRT subgraph fp32 | RTX 4090 (local) | juspay.pdf | 41 | 22 MB | 17 s | 6 GB + 2.5 GB | Excellent | Selected for production rollout. |
| Paddle + TensorRT subgraph fp32 | A100 | juspay.pdf | 41 | 22 MB | 53 s | 6 GB + 2.5 GB | Excellent | Multi-instance friendly. |
| Paddle + TensorRT subgraph fp32 | RTX 4090 (local) | LIC.pdf | 93 | 12 MB | 35 s | 6 GB + 2.5 GB | Excellent | 40 % faster than UAT baseline. |
| Paddle + TensorRT subgraph fp32 | A100 | LIC.pdf | 93 | 12 MB | 92 s | 6 GB + 2.5 GB | Excellent | Up to 7 instances per A100 feasible. |
Notes:
- Latency grows ~30 % when scaling out instances; the status tracker helps right-size pools.
- Formula recognition and seal detection remain disabled to prioritise throughput (adds 0.5–1 s per page when enabled).
- Horizontal scaling is VRAM-bound; A100 deployments comfortably host seven concurrent instances (vs three on UAT).
- PaddleOCR, PP-StructureV3, and PaddleX assets are licensed under Apache License 2.0. Review the upstream license text at PaddleOCR.
- BLIP weights (`Salesforce/blip-image-captioning-large`) inherit their respective Hugging Face licensing terms.
- Cite the upstream research when publishing results built on this stack.[^4][^5]
Footnotes

[^1]: PaddleX HPS PP-StructureV3 SDK download – https://paddle-model-ecology.bj.bcebos.com/paddlex/PaddleX3.0/deploy/paddlex_hps/public/sdks/v3.3/paddlex_hps_PP-StructureV3_sdk.tar.gz
[^2]: PaddleX High-Stability Serving documentation – https://paddlepaddle.github.io/PaddleX/latest/en/pipeline_deploy/serving.html#13-invoke-the-service
[^3]: `server/model_repo/layout-parsing/1/layout_captioning.py`
[^4]: Cheng Cui et al. PaddleOCR 3.0 Technical Report. arXiv:2507.05595, 2025.
[^5]: Cheng Cui et al. PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model. arXiv:2510.14528, 2025.