Everything you need to know about Fleek, our pricing, and how we make AI inference faster and cheaper.
AI inference platform. We serve optimized versions of popular models via API. Bring your own models too—we'll optimize them. Up to 70% lower cost, 3x faster, zero quality loss.
Most inference platforms just host your model on a GPU. We optimize two layers:

1. Model — NVFP4 quantization, custom kernels, precision tuning. 3x faster.
2. GPU — MicroVM infrastructure, sub-second cold starts, 95%+ utilization vs. 30-50% typical.

Both layers compound. That's where the 70% savings come from.
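Here's a rough sketch of how the compounding shows up in per-request cost. The request durations are illustrative assumptions, not benchmarks; only the $0.0025/GPU-second rate comes from our pricing.

```python
# Illustrative arithmetic only; the request durations below are assumptions,
# not published benchmarks.
FLEEK_RATE = 0.0025  # dollars per GPU-second

# Model layer: assume a request that takes 3.0 GPU-seconds unoptimized
# finishes in 1.0 GPU-second after optimization (the "3x faster" claim).
unoptimized_cost = 3.0 * FLEEK_RATE  # $0.0075 per request
optimized_cost = 1.0 * FLEEK_RATE    # $0.0025 per request

savings = 1 - optimized_cost / unoptimized_cost
print(f"model layer alone: ~{savings:.0%} cheaper per request")  # ~67%

# GPU layer: 95%+ utilization vs. the typical 30-50% keeps the per-GPU-second
# rate itself low, which is where the rest of the quoted ~70% comes from.
```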
At launch: popular open-source models for image, video, multimodal, and select LLMs (DeepSeek R1, Llama, FLUX, etc.). Custom model optimization is coming soon—you'll be able to upload any model (public or private) and get the same optimization at the same $0.0025/GPU-sec pricing.
Coming soon. We're building support for any model—not just open-source. Upload your fine-tuned weights, proprietary model, or any PyTorch model. Same optimization process, same pricing, no custom model premium. Launching in the coming weeks.
Our research lab. Weyl focuses on fundamental breakthroughs in efficient inference. The work there powers Fleek's products.
$0.0025 per GPU-second. Not per token, not per image. You pay for compute time. When our optimization makes models faster, you pay less automatically. Save up to 70% vs. competitors.
One second of GPU compute time. Simple, transparent pricing at $0.0025 per GPU-second. Our optimization gains pass directly to you as lower costs.
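To see how that bills out in practice, here's a quick example. The per-request duration and monthly volume below are illustrative assumptions; only the $0.0025/GPU-second rate is ours.

```python
# Hypothetical usage numbers; only the $0.0025/GPU-second rate comes from our pricing.
RATE_PER_GPU_SECOND = 0.0025

gpu_seconds_per_request = 1.2   # assumed duration of one optimized inference call
requests_per_month = 100_000    # assumed monthly volume

cost_per_request = gpu_seconds_per_request * RATE_PER_GPU_SECOND  # $0.003
monthly_cost = cost_per_request * requests_per_month              # $300

print(f"${cost_per_request:.4f} per request, ${monthly_cost:,.2f} per month")
# -> $0.0030 per request, $300.00 per month
```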
The range reflects our current B200 infrastructure and our optimized GB200 NVL72 infrastructure as it rolls out. The lower price is what you'll pay as we deploy the new hardware. You automatically get the best available rate.
No. Same rates. When custom model optimization launches (coming soon), you'll paste a HuggingFace URL or upload your private weights, we'll optimize the model, and the same $0.0025/GPU-sec pricing applies. No custom model premium.
Yes. Monthly cap in your dashboard. Alerts at 80%, hard stop at 100%.
We built our own inference stack with s4 codegen triggering native Blackwell FP4 tactics. We achieve industry-leading throughput, and our GPU-second pricing passes those gains directly to you.
Each tenant gets isolated GPU access through our abstraction layer. Your workloads can't see or interfere with other tenants. We handle the multiplexing—you just send requests.
On the roadmap. Same optimization stack, smaller devices: Jetson Orin, Jetson Xavier, and Jetson Thor.
We run on NVIDIA B200, GB200 NVL72, and RTX PRO 6000 (all Blackwell architecture) with native FP4 support. DGX Spark and Jetson Thor support coming soon for edge and embedded AI deployments.
Yes. Beyond standard encryption and SOC 2 Type II (in progress), our infrastructure is built on formally verified foundations. Core components are proven correct in Lean4 with cryptographic attestation at every layer. This isn't marketing—it's math. Enterprise customers can get private deployments with VPCs, audit logging, and access to our verification proofs. Contact us to learn more.
No. You only pay for active compute time. When your inference request completes, billing stops. No idle charges, no minimums, no reserved capacity fees.
Free tier: Email support and Discord community. Pro: Priority email support. Enterprise: Dedicated Slack channel, 24/7 support, and a named account manager.
Yes. Enterprise customers with high-volume workloads qualify for custom pricing. Contact sales to discuss your specific needs.
Yes. Enterprise on-prem deployment is available. Bring your own GPUs and we'll run our optimization stack on your hardware. Contact sales to get started.
On the roadmap. Claude, Cursor, and Windsurf integrations coming.
Full REST API with OpenAPI spec at launch. Python SDK planned.
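Until the OpenAPI spec is published, here's a minimal sketch of what a request could look like. The base URL, endpoint path, payload fields, and model identifier are placeholders, not the real API.

```python
# Hypothetical request shape: the URL, JSON fields, and response handling below
# are placeholders, not the published OpenAPI spec.
import os
import requests

API_BASE = "https://api.fleek.example/v1"   # placeholder base URL
API_KEY = os.environ["FLEEK_API_KEY"]       # assumed bearer-token auth

resp = requests.post(
    f"{API_BASE}/inference",                # placeholder endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "flux-schnell",            # placeholder model identifier
        "input": {"prompt": "a lighthouse at dusk"},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # billing is per GPU-second of compute used by this request
```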
$5 in credits. No card required.