A smart GPU resource manager that automatically holds idle GPU memory and maintains controlled utilization to prevent resource preemption.
doma (DOg in the MAnager) is a lightweight daemon designed to intelligently occupy idle GPU resources. It monitors GPU usage patterns and, when GPUs become idle, automatically claims memory and maintains a specified utilization level, preventing resource preemption due to low utilization.
- 🤖 Automatic GPU Detection: Monitors all available CUDA GPUs automatically
- ⏱️ Smart Idle Detection: Waits for configurable idle periods before claiming resources
- 🎛️ Precise Utilization Control: Maintains target GPU utilization using adaptive algorithms
- 💾 Memory Management: Configurable memory holding with automatic cleanup
- 🔧 Daemon Architecture: Runs as a background service with socket-based control
- 📊 Real-time Monitoring: Continuous tracking of GPU memory and utilization metrics
- 🛡️ Safe Resource Handling: Graceful cleanup and release of GPU resources
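The daemon architecture relies on a local control socket (the troubleshooting section below mentions `/tmp/doma/doma.sock`). doma's actual wire protocol is not documented here, but a control client might look like this minimal sketch; the `send_command` helper and the plain-text one-line protocol are illustrative assumptions, not doma's API:

```python
import socket

def send_command(sock_path: str, command: str) -> str:
    """Send a one-line command over the daemon's Unix socket and return the reply.

    Hypothetical protocol: newline-terminated command, single text reply.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(command.encode() + b"\n")
        return s.recv(4096).decode().strip()

# Hypothetical usage against a running daemon:
# print(send_command("/tmp/doma/doma.sock", "status"))
```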
```bash
# Install using uv (recommended)
git clone <repository-url>
cd doma
uv tool install .
```
Launch the doma server:

```bash
doma launch
```

Start holding idle GPUs:

```bash
doma start
```

Check server status:

```bash
doma status
```

Stop holding GPUs (keeps server running):

```bash
doma stop
```

Shutdown the server:

```bash
doma shutdown
```
Starts the doma daemon server in the background.
Options:
- `--log-path`: Path to log file (default: `/tmp/doma/doma.log`)
Begins monitoring and holding idle GPUs with specified configuration.
Options:
- `--wait-minutes`: Minutes to wait before holding a GPU (default: 10)
- `--mem-threshold`: Memory threshold in GB for idle detection (default: 0.5)
- `--hold-mem`: Memory to hold in GB (default: 10)
- `--hold-util`: Target GPU utilization to maintain (0–1, default: 0.5)
Algorithm Options:
- `--operator-gb`: Operator size in GB for control precision (default: 1.0)
- `--util-eps`: Utilization epsilon for convergence (default: 0.01)
- `--max-sleep-time`: Initial maximum sleep time in seconds for the binary search (default: 1)
- `--min-sleep-time`: Initial minimum sleep time in seconds for the binary search (default: 0)
- `--inspect-interval`: Interval in seconds between GPU utilization inspections during the binary search (default: 1)
- `--util-samples-num`: Number of utilization samples to take during the binary search (default: 5)
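The binary search these options tune can be pictured with a small simulation. This is a sketch of the idea rather than doma's implementation: shorter sleeps between compute bursts drive utilization up, so the sleep interval is bisected until the measured utilization lands within `--util-eps` of the target. The function names and the toy utilization model below are illustrative assumptions:

```python
def tune_sleep_time(measure_util, target, eps=0.01, lo=0.0, hi=1.0, max_iters=50):
    """Binary-search the per-cycle sleep time until utilization hits the target.

    measure_util(sleep_s) -> observed utilization in [0, 1]; assumed to
    decrease monotonically as sleep_s grows.
    """
    for _ in range(max_iters):
        mid = (lo + hi) / 2
        util = measure_util(mid)
        if abs(util - target) <= eps:
            return mid
        if util > target:      # too busy: the answer needs a longer sleep
            lo = mid
        else:                  # too idle: the answer needs a shorter sleep
            hi = mid
    return (lo + hi) / 2

# Toy model: each compute burst takes 0.1 s, so util = 0.1 / (0.1 + sleep).
model = lambda sleep_s: 0.1 / (0.1 + sleep_s)
sleep = tune_sleep_time(model, target=0.5, eps=0.01)
```

In doma itself, `measure_util` would correspond to averaging `--util-samples-num` utilization readings taken every `--inspect-interval` seconds, with `--min-sleep-time`/`--max-sleep-time` as the initial bisection bounds.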
Releases all GPUs and restarts with new configuration.
Stops holding GPUs and releases all resources (server continues running).
Completely shuts down the doma server.
Shows current server status.
Doma continuously monitors each GPU's memory usage and utilization. A GPU is considered "idle" when:
- Memory usage stays below the configured threshold (`--mem-threshold`)
- This condition persists for the specified waiting period (`--wait-minutes`)
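The idle check can be sketched as a small state machine. Here the memory probe and the clock are injected so the logic stands alone; in doma itself the reading would come from CUDA. The class name and method signature are illustrative, not doma's API:

```python
import time

class IdleDetector:
    """Flags a GPU as idle once memory stays below a threshold long enough."""

    def __init__(self, mem_threshold_gb=0.5, wait_minutes=10.0, clock=time.monotonic):
        self.mem_threshold_gb = mem_threshold_gb
        self.wait_seconds = wait_minutes * 60
        self.clock = clock
        self._below_since = None  # when memory first dropped under the threshold

    def update(self, mem_used_gb: float) -> bool:
        """Feed one memory sample; return True once the GPU counts as idle."""
        now = self.clock()
        if mem_used_gb >= self.mem_threshold_gb:
            self._below_since = None   # any activity resets the waiting period
            return False
        if self._below_since is None:
            self._below_since = now
        return now - self._below_since >= self.wait_seconds
```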
When a GPU becomes idle, doma:
- Allocates the specified amount of memory (`--hold-mem`)
- Maintains the target utilization (`--hold-util`) through controlled compute operations
- Uses adaptive algorithms to precisely control utilization levels
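Claiming the hold memory itself amounts to keeping a tensor of the right size alive. The helper below is a sketch, not doma's code: it converts a `--hold-mem` value in GiB to a float32 element count, and the commented lines show how the actual allocation would look with PyTorch on a CUDA machine:

```python
def gb_to_float32_elems(hold_mem_gb: float) -> int:
    """Number of float32 elements (4 bytes each) occupying hold_mem_gb GiB."""
    return int(hold_mem_gb * (1024 ** 3)) // 4

n = gb_to_float32_elems(10.0)  # the default --hold-mem of 10 GB
# On a CUDA machine (illustrative, not doma's implementation):
# import torch
# held = torch.empty(n, dtype=torch.float32, device="cuda")
```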
Resources are automatically released when:
- The `stop` command is issued
- The server is shut down
```bash
doma start --wait-minutes 15 --hold-util 0.3 --hold-mem 2.0
doma start --wait-minutes 5 --hold-util 0.8 --mem-threshold 0.1
doma start --util-eps 0.005 --operator-gb 0.5 --util-samples-num 10
doma launch --log-path /var/log/doma/doma.log

# Change configuration without restarting server
doma restart --hold-util 0.7 --wait-minutes 5

# Launch with custom log path
doma launch --log-path /opt/doma/logs/doma.log

# Start with production settings
doma start --wait-minutes 20 --hold-util 0.6 --mem-threshold 0.5
```

- Python ≥ 3.11
- CUDA-capable GPU(s)
- PyTorch with CUDA support
- NVIDIA drivers
```bash
git clone <repository-url>
cd doma
uv sync --group dev
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Resource Management: Doma is designed for responsible resource sharing. Always ensure you have permission to use GPU resources in shared environments.
- Memory Safety: The tool includes automatic cleanup mechanisms, but system crashes may require manual GPU memory cleanup.
- Compatibility: Requires NVIDIA GPUs with CUDA support. AMD GPUs are not currently supported.
- Performance Impact: Holding operations use minimal resources but may slightly impact system performance.
- GPU Memory Calculation: Doma uses `torch.cuda.device_memory_used` to calculate GPU memory usage; the value may differ from what the `nvidia-smi` command reports.
Server won't start:

```bash
# Check if socket file exists
ls -la /tmp/doma/
# Remove if necessary
rm -f /tmp/doma/doma.sock
```

GPU memory not released:

```bash
# Force shutdown and restart
doma shutdown
# Wait a moment, then relaunch
doma launch
```

Permission issues:

```bash
# Ensure proper CUDA permissions
nvidia-smi
# Check if user has access to CUDA devices
```

Author: TideDra ([email protected])
Version: 0.1.0