An Ansible project for managing a small cluster with a high-performance networking backbone.
- Automatically set up SSH keys
- Configure cluster backbone network with jumbo frames and bonding
- Automated network performance testing
- Easy cluster management with Ansible
- uv - fast Python package installer (an install one-liner is shown below)
- Two NVIDIA DGX Sparks connected by a QSFP cable
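If uv is not installed yet, the standalone installer is one option (the one-liner below is uv's documented installer for Unix-like systems):

```bash
# Install uv via the official standalone installer
curl -LsSf https://astral.sh/uv/install.sh | sh
```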
```bash
uv sync
```

This installs Ansible and all required collections.
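To sanity-check the environment, you can run Ansible through uv (these invocations are illustrative; any `uv run <command>` works the same way):

```bash
uv run ansible --version               # Ansible from the project environment
uv run ansible-galaxy collection list  # installed collections
```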
Edit `inventory/hosts.yml` to match your cluster:
Update hostnames and IP addresses:
```yaml
spark-0:
  ansible_host: 192.168.0.194  # Change to your node's IP/hostname
spark-1:
  ansible_host: 192.168.0.221  # Change to your node's IP/hostname
```

Select the correct network interfaces for bonding:
The bond members should be the interfaces you want to bond together. Interface naming:
- `np0` = left port
- `np1` = right port
Example:
```yaml
cluster_bond:
  members:
    - enp1s0f0np0    # First interface, left port
    - enP2p1s0f0np0  # Second interface, left port
  address: 172.16.0.1/30  # Cluster backbone IP
  mtu: 9000               # Jumbo frames
```

Verify interface names on your nodes with `ip link show`.
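Putting both pieces together, a complete `inventory/hosts.yml` might look like the sketch below. The `cluster` group name and the per-host placement of `cluster_bond` are assumptions, so mirror whatever structure the file already has; note that a /30 has exactly two usable addresses, so each node gets its own (`.1` and `.2`):

```yaml
# Hypothetical complete inventory; group name and variable placement
# are assumptions, not this repository's exact layout.
all:
  children:
    cluster:
      hosts:
        spark-0:
          ansible_host: 192.168.0.194
          cluster_bond:
            members: [enp1s0f0np0, enP2p1s0f0np0]
            address: 172.16.0.1/30
            mtu: 9000
        spark-1:
          ansible_host: 192.168.0.221
          cluster_bond:
            members: [enp1s0f0np0, enP2p1s0f0np0]
            address: 172.16.0.2/30
            mtu: 9000
```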
Configure SSH keys and passwordless sudo:
```bash
uv run ./playbooks/setup.yml
```

Enter your SSH password when prompted. This will:
- Update all packages and reboot if needed
- Generate an SSH keypair in `data/`
- Configure passwordless SSH access
- Enable passwordless sudo (sketched below)
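As a rough illustration of the passwordless-sudo step (not this playbook's actual task), an Ansible task for it typically drops a validated sudoers fragment:

```yaml
# Illustrative sketch only; the file name is an assumption.
- name: Enable passwordless sudo for the Ansible user
  ansible.builtin.copy:
    dest: /etc/sudoers.d/99-ansible
    content: "{{ ansible_user }} ALL=(ALL) NOPASSWD: ALL\n"
    mode: "0440"
    validate: /usr/sbin/visudo -cf %s
  become: true
```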
To undo these changes:
```bash
uv run ./playbooks/setup.yml -e "state=absent"
```
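The same `-e "state=absent"` switch appears on the other playbooks below. As a hedged sketch of the underlying pattern (this task is illustrative, not lifted from the repository; the key path is an assumption), tasks pass the variable straight through to modules that accept `present`/`absent`:

```yaml
# Illustrative present/absent pattern; key path is an assumption.
- name: Manage the cluster SSH public key
  ansible.posix.authorized_key:
    user: "{{ ansible_user }}"
    key: "{{ lookup('file', 'data/id_ed25519.pub') }}"
    state: "{{ state | default('present') }}"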
Set up network bonding with performance testing:
```bash
uv run ./playbooks/cluster-backbone.yml
```

This will:
- Install NetworkManager and iperf3
- Create bonded network interfaces (mode: balance-xor)
- Configure jumbo frames (MTU 9000)
- Run automated performance tests between nodes (a manual spot-check is sketched below)
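Once the playbook has run, the bond can be spot-checked by hand. The commands below assume the bond device is named `bond0` and that `spark-1`'s backbone address is `172.16.0.2` (both assumptions; adjust to your configuration):

```bash
# Confirm bonding mode and jumbo frames on a node
grep 'Bonding Mode' /proc/net/bonding/bond0
ip link show bond0            # look for "mtu 9000"

# Manual throughput test (the playbook automates this):
#   on spark-1:  iperf3 -s
#   on spark-0:
iperf3 -c 172.16.0.2 -P 4     # 4 parallel streams
```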
To undo these changes:
```bash
uv run ./playbooks/cluster-backbone.yml -e "state=absent"
```
Make it easy to SSH to cluster nodes from your workstation:
```bash
uv run ./playbooks/configure-ssh-client.yml
```

After this, you can simply run `ssh spark-0` or `ssh spark-1`.
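Under the hood this amounts to host entries in your SSH client configuration. The sketch below shows the kind of entry such a playbook generates (the `User` and `IdentityFile` values are assumptions):

```
# Illustrative ~/.ssh/config entry; User and IdentityFile are assumptions.
Host spark-0
    HostName 192.168.0.194
    User youruser
    IdentityFile ~/.ssh/cluster_key
```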
To undo these changes:
```bash
uv run ./playbooks/configure-ssh-client.yml -e "state=absent"
```
Keep your cluster up to date:
```bash
uv run ./playbooks/update.yml
```

This will:
- Update package cache
- Upgrade all packages
- Refresh firmware metadata and upgrade firmware
- Check if reboot is required
- Automatically reboot nodes that need it
- Wait for nodes to come back online (the reboot logic is sketched below)
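The reboot steps follow a common Ansible idiom on Ubuntu-family systems: check for `/var/run/reboot-required`, then let the built-in `reboot` module block until the node is reachable again. A hedged sketch (the timeout value is an assumption):

```yaml
# Illustrative reboot-if-required tasks, not this repository's exact code.
- name: Check whether a reboot is required
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: reboot_required

- name: Reboot and wait for the node to return
  ansible.builtin.reboot:
    reboot_timeout: 600  # seconds; assumption
  become: true
  when: reboot_required.stat.exists
```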
Install NVIDIA Workbench on all cluster nodes:
```bash
uv run ./playbooks/install-workbench.yml
```

This will:
- Download the latest nvwb-cli for your platform (a download-step sketch follows this list)
- Install NVIDIA Workbench with Docker support
- Run in non-interactive mode
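Only as a rough illustration of the download step (the `nvwb_cli_url` variable and install path below are hypothetical, not defined by this repository; check the playbook for the real source), such a task might look like:

```yaml
# Illustrative download step; nvwb_cli_url is a hypothetical variable.
- name: Download nvwb-cli for this platform
  ansible.builtin.get_url:
    url: "{{ nvwb_cli_url }}"
    dest: /usr/local/bin/nvwb-cli
    mode: "0755"
  become: true
```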
To uninstall NVIDIA Workbench:
```bash
uv run ./playbooks/install-workbench.yml -e "state=absent"
```