Estimate GPU VRAM requirements for Hugging Face LLMs without downloading model weights.
The LLM Resource Planner is a lightweight Python CLI tool that analyzes Hugging Face model configurations and estimates the GPU memory required for inference.
It enables developers to perform AI infrastructure planning before downloading large model checkpoints.
pip install llm-resource-planner
llm-plan microsoft/Phi-3.5-mini-instruct
Example output:
--- Analyzing microsoft/Phi-3.5-mini-instruct ---
Estimated Parameters: ~3.62B
Memory (Weights): 6.75 GB
Memory (KV Cache @ 4k): 1.50 GB
Total Recommended VRAM: 8.55 GB
llm-plan meta-llama/Meta-Llama-3-8B
llm-plan mistralai/Mistral-7B-Instruct
llm-plan microsoft/Phi-3.5-mini-instruct
Show command help:
llm-plan --help
Basic usage:
llm-plan <huggingface-model-id>
Example:
llm-plan meta-llama/Meta-Llama-3-8B
You can optionally check whether a model fits within a given GPU memory budget:
llm-plan meta-llama/Meta-Llama-3-8B --gpu 24
Example output:
Total Recommended VRAM: 19.82 GB
GPU Memory Provided: 24.00 GB
✓ Model should fit in available VRAM
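The budget check can be pictured as a simple comparison. This is a minimal sketch, not the tool's actual code: the helper name `fits_in_vram` is hypothetical, and treating an exact match as a fit is an assumption.

```python
def fits_in_vram(total_vram_gb: float, gpu_memory_gb: float) -> bool:
    """Return True when the estimate is within the GPU budget.

    Assumption: an estimate exactly equal to the budget counts as fitting.
    """
    return total_vram_gb <= gpu_memory_gb


# Values taken from the example output above.
result = fits_in_vram(total_vram_gb=19.82, gpu_memory_gb=24.0)
print(result)
```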
The planner retrieves a model's configuration metadata from Hugging Face using:
transformers.AutoConfig
It extracts architectural parameters such as:
- hidden size
- number of transformer layers
- number of attention heads
Using these values, the tool estimates:
- Model parameter count
- Memory required for model weights
- Memory required for the attention KV cache
- A buffered VRAM estimate for inference
This analysis occurs without downloading model weights.
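As a rough illustration of the metadata step, the sketch below pulls the three fields from a config object. It assumes the config exposes the common `hidden_size`, `num_hidden_layers`, and `num_attention_heads` attributes. The helper name `extract_arch` and the stand-in values are hypothetical, not the planner's actual code.

```python
from types import SimpleNamespace


def extract_arch(config):
    """Collect the architectural fields the estimator relies on."""
    return {
        "hidden_size": config.hidden_size,
        "num_layers": config.num_hidden_layers,
        "num_heads": config.num_attention_heads,
    }


# With network access, a real config could be fetched (metadata only, no weights):
#   from transformers import AutoConfig
#   config = AutoConfig.from_pretrained("microsoft/Phi-3.5-mini-instruct")
# A stand-in object with assumed values keeps this sketch offline.
config = SimpleNamespace(
    hidden_size=3072, num_hidden_layers=32, num_attention_heads=32
)
arch = extract_arch(config)
print(arch)
```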
The tool uses a heuristic approximation commonly applied to transformer architectures.
params ≈ hidden_size² × num_layers × 12
This approximates the parameter count for standard transformer blocks.
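The factor of 12 is usually explained as roughly 4 × hidden_size² for the attention projections plus 8 × hidden_size² for a feed-forward block with a 4× expansion; embeddings are ignored. A minimal sketch of the heuristic follows, with a hypothetical function name and input values (hidden size 3072, 32 layers) assumed for illustration.

```python
def estimate_params(hidden_size: int, num_layers: int) -> int:
    """Heuristic: about 12 * hidden_size^2 parameters per transformer block."""
    return 12 * hidden_size**2 * num_layers


# Assumed architecture values for illustration.
params = estimate_params(hidden_size=3072, num_layers=32)
print(f"Estimated Parameters: ~{params / 1e9:.2f}B")  # ~3.62B
```

With these inputs the result lines up with the `~3.62B` figure in the example output.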
weight_memory = params × dtype_bytes
Where dtype_bytes depends on the numeric precision:
| Precision | Bytes |
|---|---|
| FP32 | 4 |
| FP16 | 2 |
| INT8 | 1 |
| INT4 | 0.5 |
(Current CLI defaults to FP16.)
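To make the table concrete, here is a small sketch that applies it to a fixed parameter count. The names are hypothetical, and dividing by 2³⁰ is an assumption: it is the conversion that reproduces the 6.75 GB shown in the example output for FP16.

```python
# Bytes per parameter, copied from the table above.
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(params: int, precision: str = "fp16") -> float:
    """Weight memory in GiB (bytes / 2**30) for a given precision."""
    return params * DTYPE_BYTES[precision] / 2**30


params = 3_623_878_656  # assumed count, from the 12 * h^2 * L heuristic
for precision in DTYPE_BYTES:
    print(f"{precision}: {weight_memory_gib(params, precision):.2f}")
```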
The KV cache memory is approximated as:
kv_cache = 2 × hidden_size × num_layers × bytes_per_param × context_length
The current implementation assumes:
context_length = 4096
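A short sketch of this formula for a single sequence is shown below. The function name and input values are assumptions for illustration. Note that the formula treats keys and values as full `hidden_size` vectors, so models using grouped-query attention would likely need less than this estimate.

```python
def kv_cache_bytes(
    hidden_size: int,
    num_layers: int,
    bytes_per_param: float = 2,  # FP16
    context_length: int = 4096,
) -> float:
    """KV cache size in bytes for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * hidden_size * num_layers * bytes_per_param * context_length


kv = kv_cache_bytes(hidden_size=3072, num_layers=32)
print(f"Memory (KV Cache @ 4k): {kv / 2**30:.2f} GB")  # 1.50 GB
```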
A safety margin is applied:
total_vram ≈ weight_memory + (kv_cache × 1.2)
This accounts for runtime memory overhead.
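Plugging the figures from the first example output into this formula gives the reported total. The function and parameter names below are hypothetical; the 1.2 factor is applied to the KV cache only, as the formula is written.

```python
def total_vram(weight_memory: float, kv_cache: float, kv_margin: float = 1.2) -> float:
    """Buffered VRAM estimate: weights plus KV cache scaled by a safety margin."""
    return weight_memory + kv_cache * kv_margin


# 6.75 GB of weights and a 1.50 GB KV cache, as in the example output.
total = total_vram(weight_memory=6.75, kv_cache=1.50)
print(f"Total Recommended VRAM: {total:.2f} GB")  # 8.55 GB
```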
Some Hugging Face models require authentication.
Set your Hugging Face token:
export HUGGINGFACE_API_TOKEN="your_token_here"
The planner will automatically use the token when retrieving model metadata.
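One way such a lookup could work is sketched below. This is an assumption about the mechanism, not the planner's actual code: the helper `auth_kwargs` is hypothetical, and the `token` keyword is the one accepted by `from_pretrained` in recent `transformers` releases.

```python
import os


def auth_kwargs(env=os.environ) -> dict:
    """Build optional keyword arguments from HUGGINGFACE_API_TOKEN, if set."""
    token = env.get("HUGGINGFACE_API_TOKEN")
    return {"token": token} if token else {}


# Possible use (requires network access and the transformers package):
#   from transformers import AutoConfig
#   config = AutoConfig.from_pretrained(model_id, **auth_kwargs())
print(auth_kwargs({"HUGGINGFACE_API_TOKEN": "your_token_here"}))
```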
Clone the repository:
git clone https://github.com/deepagency/llm-resource-planner.git
cd llm-resource-planner
Install in editable mode:
pip install -e .
Run the tool:
llm-plan microsoft/Phi-3.5-mini-instruct
This tool provides heuristic estimates.
Results may differ depending on:
- inference engine (vLLM, Ollama, TensorRT-LLM, etc.)
- batching strategies
- runtime graph optimizations
- GPU memory fragmentation
- custom model architectures
The estimator is primarily designed for standard transformer architectures.
For production deployments, maintain a 10–20% safety margin.
Contributions are welcome.
If you have any of the following to report or contribute:
- models that produce inaccurate estimates
- improved parameter estimation heuristics
- support for additional architectures
please open an issue or submit a pull request.
See CONTRIBUTING.md for development guidelines.
This project is licensed under the MIT License.
See the LICENSE file for details.
Built for the open-source AI community.