deepagency/llm-resource-planner

🧠 LLM Resource Planner

Estimate GPU VRAM requirements for Hugging Face LLMs without downloading model weights.

The LLM Resource Planner is a lightweight Python CLI tool that analyzes Hugging Face model configurations and estimates the GPU memory required for inference.

It enables developers to perform AI infrastructure planning before downloading large model checkpoints.

License: MIT Python 3.8+ PyPI Maintenance


🚀 Quick Start

Install

pip install llm-resource-planner

Run the planner

llm-plan microsoft/Phi-3.5-mini-instruct

Example output:

--- Analyzing microsoft/Phi-3.5-mini-instruct ---
Estimated Parameters: ~3.62B
Memory (Weights): 6.75 GB
Memory (KV Cache @ 4k): 1.50 GB
Total Recommended VRAM: 8.55 GB

Example Models

llm-plan meta-llama/Meta-Llama-3-8B
llm-plan mistralai/Mistral-7B-Instruct
llm-plan microsoft/Phi-3.5-mini-instruct

CLI Usage

Show command help:

llm-plan --help

Basic usage:

llm-plan <huggingface-model-id>

Example:

llm-plan meta-llama/Meta-Llama-3-8B

GPU Fit Check

You can optionally check whether a model fits within a given GPU memory budget:

llm-plan meta-llama/Meta-Llama-3-8B --gpu 24

Example output:

Total Recommended VRAM: 19.82 GB

GPU Memory Provided: 24.00 GB
✔ Model should fit in available VRAM
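
The fit check reduces to comparing the estimate against the budget. A minimal sketch, assuming a plain less-than-or-equal test; the function name `fits_in_gpu` is hypothetical, and the tool's actual comparison may differ (for example, it could reserve extra headroom):

```python
def fits_in_gpu(total_vram_gb: float, gpu_memory_gb: float) -> bool:
    """Return True when the estimated VRAM does not exceed the GPU budget.

    Assumption: a simple <= comparison with no additional headroom.
    """
    return total_vram_gb <= gpu_memory_gb


# Figures taken from the example output above.
print(fits_in_gpu(19.82, 24.0))  # True
```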

What the Tool Does

The planner retrieves a model's configuration metadata from Hugging Face using:

transformers.AutoConfig

It extracts architectural parameters such as:

  • hidden size
  • number of transformer layers
  • number of attention heads

Using these values, the tool estimates:

  1. Model parameter count
  2. Memory required for model weights
  3. Memory required for the attention KV cache
  4. A buffered VRAM estimate for inference

This analysis occurs without downloading model weights.
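
The retrieval step might look like the sketch below. It assumes the `transformers` package is installed and that the config exposes the common attribute names `hidden_size`, `num_hidden_layers`, and `num_attention_heads` (some architectures use different names). The helper names `extract_arch` and `load_arch` are hypothetical and are not taken from the tool's source.

```python
from types import SimpleNamespace


def extract_arch(config):
    """Pull the architectural fields the estimator needs from a config object."""
    return {
        "hidden_size": config.hidden_size,
        "num_layers": config.num_hidden_layers,
        "num_heads": config.num_attention_heads,
    }


def load_arch(model_id):
    """Fetch only the model's config (no weights) from the Hugging Face Hub."""
    # Imported lazily so extract_arch can be used without transformers installed.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_id)
    return extract_arch(config)


# Offline demonstration with a stand-in config object (illustrative values).
demo = SimpleNamespace(hidden_size=3072, num_hidden_layers=32, num_attention_heads=32)
arch = extract_arch(demo)
print(arch)
```

`AutoConfig.from_pretrained` downloads the model's small configuration file rather than the checkpoint, which is what keeps this step lightweight.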


Estimation Method

The tool uses a heuristic approximation commonly applied to transformer architectures.

Parameter Count Estimate

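params ≈ hidden_size² × num_layers × 12

This approximates the parameter count of the standard transformer blocks (attention and feed-forward weights). Embeddings, biases, and normalization parameters are not included.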
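
As a worked example, the formula can be written as a small Python function. The name `estimate_params` is hypothetical, and the inputs (hidden size 3072, 32 layers) are assumed values chosen because they reproduce the ~3.62B figure in the Quick Start output.

```python
def estimate_params(hidden_size: int, num_layers: int) -> int:
    """Heuristic parameter count: 12 * hidden_size^2 per transformer layer.

    The factor 12 is roughly 4*h^2 for the attention projections plus
    8*h^2 for a feed-forward block with a 4x-wide hidden layer.
    """
    return 12 * hidden_size ** 2 * num_layers


params = estimate_params(hidden_size=3072, num_layers=32)
print(f"~{params / 1e9:.2f}B")  # ~3.62B
```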


Weight Memory

weight_memory = params × dtype_bytes

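Where dtype_bytes depends on the numeric precision:

| Precision | Bytes per parameter |
|-----------|---------------------|
| FP32      | 4                   |
| FP16      | 2                   |
| INT8      | 1                   |
| INT4      | 0.5                 |

(The CLI currently defaults to FP16.)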
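
A minimal sketch of the weight-memory calculation follows. The names `DTYPE_BYTES` and `weight_memory_gb` are hypothetical. Dividing by 1024³ is an assumption inferred from the Quick Start numbers (it reproduces the 6.75 GB figure), not something confirmed from the tool's source.

```python
# Bytes per parameter for each precision listed in the table above.
DTYPE_BYTES = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Assumption: "GB" in the tool's output means 1024**3 bytes.
BYTES_PER_GB = 1024 ** 3


def weight_memory_gb(params: int, precision: str = "fp16") -> float:
    """Memory needed to hold the weights at the given precision."""
    return params * DTYPE_BYTES[precision] / BYTES_PER_GB


weights = weight_memory_gb(3_623_878_656)  # parameter estimate from the Quick Start
print(f"{weights:.2f} GB")  # 6.75 GB
```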


KV Cache Estimate

The KV cache memory is approximated as:

kv_cache = 2 × hidden_size × num_layers × bytes_per_param × context_length

The current implementation assumes:

context_length = 4096
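
The KV cache formula can be sketched the same way. The function name `kv_cache_gb` and the example inputs (hidden size 3072, 32 layers, FP16) are assumptions chosen to match the 1.50 GB figure in the Quick Start output.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: "GB" means 1024**3 bytes


def kv_cache_gb(hidden_size: int, num_layers: int,
                bytes_per_param: float = 2.0,
                context_length: int = 4096) -> float:
    """KV cache size for a single sequence.

    The leading factor 2 covers the key tensor and the value tensor
    stored for every layer and every token position.
    """
    total_bytes = 2 * hidden_size * num_layers * bytes_per_param * context_length
    return total_bytes / BYTES_PER_GB


kv = kv_cache_gb(hidden_size=3072, num_layers=32)
print(f"{kv:.2f} GB")  # 1.50 GB
```

Because the formula uses the full `hidden_size`, it will tend to overestimate the cache for models that share key/value heads across attention heads (grouped-query attention).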

Recommended VRAM

A safety margin is applied to the KV cache term:

total_vram ≈ weight_memory + (kv_cache × 1.2)

The 1.2 multiplier adds a 20% buffer to account for runtime memory overhead.
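
Plugging in the Quick Start figures shows how the total is assembled. The function name `total_vram_gb` and the `kv_buffer` parameter are hypothetical; the default of 1.2 comes from the formula above.

```python
def total_vram_gb(weight_memory_gb: float, kv_cache_gb: float,
                  kv_buffer: float = 1.2) -> float:
    """Recommended VRAM: weights plus a buffered KV cache."""
    return weight_memory_gb + kv_cache_gb * kv_buffer


# 6.75 GB of weights and a 1.50 GB KV cache, as in the example output.
total = total_vram_gb(6.75, 1.50)
print(f"{total:.2f} GB")  # 8.55 GB
```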


Authentication

Some Hugging Face models require authentication.

Set your Hugging Face token:

export HUGGINGFACE_API_TOKEN="your_token_here"

The planner will automatically use the token when retrieving model metadata.
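
One way such token handling could work is sketched below. This is an assumption about the mechanism, not the tool's actual code: the helper names `resolve_token` and `load_config` are hypothetical, and the `token=` keyword is accepted by recent `transformers` releases (older releases used `use_auth_token=`).

```python
import os

TOKEN_ENV_VAR = "HUGGINGFACE_API_TOKEN"  # variable name from the instructions above


def resolve_token(env=None):
    """Return the token from the environment, or None if unset or empty."""
    env = os.environ if env is None else env
    return env.get(TOKEN_ENV_VAR) or None


def load_config(model_id):
    """Fetch a (possibly gated) model config, passing the token if present."""
    # Lazy import so resolve_token works without transformers installed.
    from transformers import AutoConfig

    return AutoConfig.from_pretrained(model_id, token=resolve_token())


# Demonstration with an explicit mapping instead of the real environment.
print(resolve_token({"HUGGINGFACE_API_TOKEN": "your_token_here"}))
```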


Development Installation

Clone the repository:

git clone https://github.com/deepagency/llm-resource-planner.git
cd llm-resource-planner

Install in editable mode:

pip install -e .

Run the tool:

llm-plan microsoft/Phi-3.5-mini-instruct

Assumptions and Limitations

This tool provides heuristic estimates.

Results may differ depending on:

  • inference engine (vLLM, Ollama, TensorRT-LLM, etc.)
  • batching strategies
  • runtime graph optimizations
  • GPU memory fragmentation
  • custom model architectures

The estimator is primarily designed for standard transformer architectures.

For production deployments, maintain a 10–20% safety margin.


🤝 Contributing

Contributions are welcome.

If you discover:

  • models producing inaccurate estimates
  • improved parameter estimation heuristics
  • support for additional architectures

please open an Issue or submit a Pull Request.

See CONTRIBUTING.md for development guidelines.


📄 License

This project is licensed under the MIT License.

See the LICENSE file for details.


Built for the open-source AI community.

About

A simple CLI tool to fetch Hugging Face model metadata and estimate required VRAM/RAM for inference.
