🧠 LLM Resource Planner

Estimate GPU VRAM requirements for Hugging Face LLMs without downloading model weights.

The LLM Resource Planner is a lightweight Python CLI tool that analyzes Hugging Face model configurations and estimates the GPU memory required for inference.

It enables developers to perform AI infrastructure planning before downloading large model checkpoints.

🚀 Quick Start

Install

pip install llm-resource-planner

Run the planner

llm-plan microsoft/Phi-3.5-mini-instruct

Example output:

--- Analyzing microsoft/Phi-3.5-mini-instruct ---
Estimated Parameters: ~3.62B
Memory (Weights): 6.75 GB
Memory (KV Cache @ 4k): 1.50 GB
Total Recommended VRAM: 8.55 GB

Example Models

llm-plan meta-llama/Meta-Llama-3-8B
llm-plan mistralai/Mistral-7B-Instruct
llm-plan microsoft/Phi-3.5-mini-instruct

CLI Usage

Show command help:

llm-plan --help

Basic usage:

llm-plan <huggingface-model-id>

Example:

llm-plan meta-llama/Meta-Llama-3-8B
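For readers wiring up something similar, here is a minimal sketch of how an entry point with this interface could be declared using Python's standard `argparse` module. Only the positional model ID and the `--gpu` option come from this README. The function name `build_parser` and the help strings are hypothetical and do not reflect the tool's actual source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented llm-plan interface."""
    parser = argparse.ArgumentParser(
        prog="llm-plan",
        description="Estimate GPU VRAM for a Hugging Face model from its config.",
    )
    # Positional Hugging Face model ID, e.g. "meta-llama/Meta-Llama-3-8B".
    parser.add_argument("model_id", help="Hugging Face model ID")
    # Optional GPU memory budget in GB, used for the fit check.
    parser.add_argument("--gpu", type=float, default=None,
                        help="available GPU memory in GB")
    return parser


args = build_parser().parse_args(["meta-llama/Meta-Llama-3-8B", "--gpu", "24"])
print(args.model_id, args.gpu)  # meta-llama/Meta-Llama-3-8B 24.0
```

`argparse` generates the `--help` output automatically from these declarations.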

GPU Fit Check

You can optionally check whether a model fits within a given GPU memory budget:

llm-plan meta-llama/Meta-Llama-3-8B --gpu 24

Example output:

Total Recommended VRAM: 19.82 GB

GPU Memory Provided: 24.00 GB
✔ Model should fit in available VRAM
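The fit check comes down to comparing the estimate with the budget passed via `--gpu`. The sketch below assumes a simple `<=` comparison and reuses the wording of the example output. The helper name `format_fit_check` and the failure message are hypothetical, because the README shows only the success case.

```python
def format_fit_check(total_vram_gb: float, gpu_gb: float) -> str:
    """Hypothetical helper: compare a VRAM estimate with a GPU budget (both in GB)."""
    lines = [
        f"Total Recommended VRAM: {total_vram_gb:.2f} GB",
        f"GPU Memory Provided: {gpu_gb:.2f} GB",
    ]
    if total_vram_gb <= gpu_gb:
        lines.append("✔ Model should fit in available VRAM")
    else:
        # Assumed wording; only the success message appears in the README.
        lines.append("✘ Model may not fit in available VRAM")
    return "\n".join(lines)


print(format_fit_check(19.82, 24))
```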

What the Tool Does

The planner retrieves a model's configuration metadata from Hugging Face using:

transformers.AutoConfig

It extracts architectural parameters such as:

hidden size
number of transformer layers
number of attention heads

Using these values, the tool estimates:

Model parameter count
Memory required for model weights
Memory required for the attention KV cache
A buffered VRAM estimate for inference

This analysis fetches only the model's configuration file (config.json). Model weights are never downloaded.
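To show which metadata is involved, here is a sketch that reads the three fields from a loaded config. It assumes the config exposes the standard `transformers` attribute names `hidden_size`, `num_hidden_layers`, and `num_attention_heads`, as Llama-, Mistral-, and Phi-3-style configs do. The helper names `extract_arch` and `load_arch` are hypothetical, and configs lacking these attributes would need architecture-specific handling.

```python
from types import SimpleNamespace

# Standard attribute names used by Llama-, Mistral- and Phi-3-style configs.
REQUIRED_FIELDS = ("hidden_size", "num_hidden_layers", "num_attention_heads")


def extract_arch(config) -> dict:
    """Pull the fields the estimator needs from a transformers-style config object."""
    missing = [name for name in REQUIRED_FIELDS if getattr(config, name, None) is None]
    if missing:
        raise ValueError(f"config is missing: {', '.join(missing)}")
    return {name: int(getattr(config, name)) for name in REQUIRED_FIELDS}


def load_arch(model_id: str) -> dict:
    """Fetch a model's config from the Hub; needs `transformers` and network access."""
    from transformers import AutoConfig  # lazy import; fetches config.json, not weights

    return extract_arch(AutoConfig.from_pretrained(model_id))


# Offline example with illustrative values (no network needed).
cfg = SimpleNamespace(hidden_size=3072, num_hidden_layers=32, num_attention_heads=32)
print(extract_arch(cfg))
# {'hidden_size': 3072, 'num_hidden_layers': 32, 'num_attention_heads': 32}
```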

Estimation Method

The tool uses a heuristic approximation commonly applied to transformer architectures.

Parameter Count Estimate

params ≈ hidden_size² × num_layers × 12

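This approximates the parameter count of standard transformer blocks. The factor of 12 counts about 4 × hidden_size² for the attention projections (query, key, value, and output) and 8 × hidden_size² for a feed-forward block whose intermediate size is 4 × hidden_size. Embedding, bias, and normalization parameters are not included.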
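Expressed in Python, the heuristic is a one-liner. The sketch below checks it against the Phi-3.5-mini figure from the Quick Start example, using hidden_size = 3072 and 32 layers from that model's config. The function name `estimate_params` is illustrative.

```python
def estimate_params(hidden_size: int, num_layers: int) -> int:
    """Approximate transformer parameter count: 12 * hidden_size^2 per layer.

    Embedding, bias and normalization parameters are ignored.
    """
    return 12 * hidden_size**2 * num_layers


params = estimate_params(hidden_size=3072, num_layers=32)
print(f"~{params / 1e9:.2f}B")  # ~3.62B
```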

Weight Memory

weight_memory = params × dtype_bytes

where dtype_bytes is the number of bytes stored per parameter at the chosen precision:

Precision	Bytes per parameter
FP32	4
FP16	2
INT8	1
INT4	0.5

(The CLI currently defaults to FP16.)
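The precision table maps directly onto a lookup. This sketch reports memory in units of 1024³ bytes, which is the unit that reproduces the 6.75 GB weight figure in the Phi-3.5-mini example. The names `DTYPE_BYTES` and `weight_memory_gb` are illustrative, not taken from the tool.

```python
# Bytes per parameter for each precision listed in the table.
DTYPE_BYTES = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 1024**3  # the example output's "GB" matches 1024^3-byte units


def weight_memory_gb(params: int, dtype: str = "fp16") -> float:
    """Memory for model weights: params * bytes per parameter, in 1024^3-byte units."""
    return params * DTYPE_BYTES[dtype.lower()] / GIB


params = 12 * 3072**2 * 32  # Phi-3.5-mini heuristic estimate (~3.62B)
for dtype in DTYPE_BYTES:
    print(f"{dtype}: {weight_memory_gb(params, dtype):.2f} GB")
```

At FP16 this yields 6.75 GB, matching the "Memory (Weights)" line of the Quick Start output.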

KV Cache Estimate

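The KV cache memory is approximated as:

kv_cache = 2 × hidden_size × num_layers × bytes_per_param × context_length

The factor of 2 covers the key and value vectors (each of size hidden_size) cached for every token in every layer. The formula has no batch term, so it describes a single sequence.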

The current implementation assumes:

context_length = 4096
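Here is a sketch of the KV cache formula, again in 1024³-byte units. Exposing `context_length` as a parameter is an assumption made for illustration, since the README states that the current implementation fixes it at 4096. The function name `kv_cache_gb` is hypothetical.

```python
GIB = 1024**3


def kv_cache_gb(hidden_size: int, num_layers: int,
                bytes_per_param: float = 2.0, context_length: int = 4096) -> float:
    """KV cache for one sequence: a key and a value vector (2x) of size
    hidden_size per token per layer, in 1024^3-byte units."""
    total_bytes = 2 * hidden_size * num_layers * bytes_per_param * context_length
    return total_bytes / GIB


print(f"{kv_cache_gb(3072, 32):.2f} GB")  # 1.50 GB (Phi-3.5-mini, FP16, 4k context)
```

The cache grows linearly with context length, so `kv_cache_gb(3072, 32, context_length=8192)` gives 3.0 GB. The formula assumes every layer caches full-width keys and values, as in standard multi-head attention. For models using grouped-query attention, which keep fewer key/value heads than query heads, the real cache is smaller, so the result acts as an upper bound.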

Recommended VRAM

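A safety margin is applied to the KV cache term:

total_vram ≈ weight_memory + (kv_cache × 1.2)

The 1.2 multiplier adds 20% to the KV cache estimate to account for runtime memory overhead. The weight memory itself is not scaled.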
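Combining the formulas of this section reproduces the Phi-3.5-mini output from the Quick Start. This is a self-contained sketch, not the tool's source. The function name `plan` and its keyword arguments are hypothetical, and GB here means 1024³ bytes, the unit that matches the example figures.

```python
GIB = 1024**3


def plan(hidden_size: int, num_layers: int, bytes_per_param: float = 2.0,
         context_length: int = 4096, kv_margin: float = 1.2) -> dict:
    """Apply the README's heuristics end to end (GB = 1024^3 bytes)."""
    params = 12 * hidden_size**2 * num_layers
    weights_gb = params * bytes_per_param / GIB
    kv_gb = 2 * hidden_size * num_layers * bytes_per_param * context_length / GIB
    total_gb = weights_gb + kv_gb * kv_margin  # margin applies to the KV cache only
    return {"params": params, "weights_gb": weights_gb,
            "kv_gb": kv_gb, "total_gb": total_gb}


est = plan(hidden_size=3072, num_layers=32)  # Phi-3.5-mini, FP16
print(f"Estimated Parameters: ~{est['params'] / 1e9:.2f}B")  # ~3.62B
print(f"Memory (Weights): {est['weights_gb']:.2f} GB")       # 6.75 GB
print(f"Memory (KV Cache @ 4k): {est['kv_gb']:.2f} GB")      # 1.50 GB
print(f"Total Recommended VRAM: {est['total_gb']:.2f} GB")   # 8.55 GB
```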

Authentication

Some Hugging Face models require authentication.

Set your Hugging Face token:

export HUGGINGFACE_API_TOKEN="your_token_here"

The planner will automatically use the token when retrieving model metadata.
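Here is a sketch of how a token stored in `HUGGINGFACE_API_TOKEN` could be forwarded when fetching a gated model's config. `AutoConfig.from_pretrained` accepts a `token` argument in recent `transformers` releases. Whether the planner passes the token this way is an assumption, and the helper names `get_hf_token` and `load_config` are hypothetical.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HUGGINGFACE_API_TOKEN") -> Optional[str]:
    """Return the token from the environment, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def load_config(model_id: str):
    """Fetch config.json for a (possibly gated) model; needs `transformers` and network."""
    from transformers import AutoConfig  # lazy import

    # token=None leaves authentication to huggingface_hub's defaults (e.g. a cached login).
    return AutoConfig.from_pretrained(model_id, token=get_hf_token())
```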

Development Installation

Clone the repository:

git clone https://github.com/deepagency/llm-resource-planner.git
cd llm-resource-planner

Install in editable mode:

pip install -e .

Run the tool:

llm-plan microsoft/Phi-3.5-mini-instruct

Assumptions and Limitations

This tool provides heuristic estimates.

Results may differ depending on:

inference engine (vLLM, Ollama, TensorRT-LLM, etc.)
batching strategies
runtime graph optimizations
GPU memory fragmentation
custom model architectures

The estimator is primarily designed for standard transformer architectures.

For production deployments, keep an additional 10–20% safety margin on top of the estimate.
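As a worked example, the extra headroom is a simple multiplier on the planner's output. The helper name `with_headroom` is hypothetical. The 8.55 GB input is the Phi-3.5-mini total from the Quick Start output.

```python
def with_headroom(estimate_gb: float, margin: float) -> float:
    """Add a fractional safety margin (e.g. 0.10 for 10%) to a VRAM estimate."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return estimate_gb * (1 + margin)


for margin in (0.10, 0.20):
    print(f"{margin:.0%} headroom: {with_headroom(8.55, margin):.2f} GB")
```

For the Phi-3.5-mini example this puts the production budget at roughly 9.4–10.3 GB.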

🤝 Contributing

Contributions are welcome.

If you discover:

models producing inaccurate estimates
improved parameter estimation heuristics
support for additional architectures

please open an Issue or submit a Pull Request.

See CONTRIBUTING.md for development guidelines.

📄 License

This project is licensed under the MIT License.

See the LICENSE file for details.

Built for the open-source AI community.
