Spark Cluster Management

Ansible project for managing a small cluster with high-performance networking backbone.

Features

Automatically setup SSH keys
Configure cluster backbone network with jumbo frames and bonding
Automated network performance testing
Easy cluster management with Ansible

Prerequisites

uv - Fast Python package installer (Install uv)
Two NVIDIA DGX Sparks connected by a QSFP cable

Getting Started

1. Install Dependencies

uv sync

This installs Ansible and all required collections.

2. Configure Your Inventory

Edit inventory/hosts.yml to match your cluster:

Update hostnames and IP addresses:

spark-0:
  ansible_host: 192.168.0.194  # Change to your node's IP/hostname
spark-1:
  ansible_host: 192.168.0.221  # Change to your node's IP/hostname

Select correct network interfaces for bonding:

The bond members should be the interfaces you want to bond together. Interface naming:

np0 = left port
np1 = right port

Example:

cluster_bond:
  members:
    - enp1s0f0np0    # First interface, left port
    - enP2p1s0f0np0  # Second interface, left port
  address: 172.16.0.1/30  # Cluster backbone IP
  mtu: 9000               # Jumbo frames

Verify interface names on your nodes with: ip link show

3. Run Initial Setup

Configure SSH keys and passwordless sudo:

uv run ./playbooks/setup.yml

Enter your SSH password when prompted. This will:

Update all packages and reboot if needed
Generate SSH keypair in data/
Configure passwordless SSH access
Enable passwordless sudo

Undo these changes.

uv run ./playbooks/setup.yml -e "state=absent"

4. Configure Cluster Backbone

Set up network bonding with performance testing:

uv run ./playbooks/cluster-backbone.yml

This will:

Install NetworkManager and iperf3
Create bonded network interfaces (mode: balance-xor)
Configure jumbo frames (MTU 9000)
Run automated performance tests between nodes

Undo these changes.

uv run ./playbooks/cluster-backbone.yml -e "state=absent"

5. (Optional) Configure Local SSH Client

Make it easy to SSH to cluster nodes from your workstation:

uv run ./playbooks/configure-ssh-client.yml

After this, you can simply: ssh spark-0 or ssh spark-1

Undo these changes.

uv run ./playbooks/configure-ssh-client.yml -e "state=absent"

Maintenance

Update and Reboot Nodes

Keep your cluster up to date:

uv run ./playbooks/update.yml

This will:

Update package cache
Upgrade all packages
Refresh firmware metadata and upgrade firmware
Check if reboot is required
Automatically reboot nodes that need it
Wait for nodes to come back online

Install NVIDIA Workbench

Install NVIDIA Workbench on all cluster nodes:

uv run ./playbooks/install-workbench.yml

This will:

Download the latest nvwb-cli for your platform
Install NVIDIA Workbench with Docker support
Run in non-interactive mode

Uninstall NVIDIA Workbench.

uv run ./playbooks/install-workbench.yml -e "state=absent"

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.vscode		.vscode
data		data
inventory		inventory
playbooks		playbooks
templates		templates
.ansible-lint		.ansible-lint
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
ansible.cfg		ansible.cfg
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark Cluster Management

Features

Prerequisites

Getting Started

1. Install Dependencies

2. Configure Your Inventory

3. Run Initial Setup

4. Configure Cluster Backbone

5. (Optional) Configure Local SSH Client

Maintenance

Update and Reboot Nodes

Install NVIDIA Workbench

About

Uh oh!

Releases

Packages

Languages

rmkraus/spark-cluster

Folders and files

Latest commit

History

Repository files navigation

Spark Cluster Management

Features

Prerequisites

Getting Started

1. Install Dependencies

2. Configure Your Inventory

3. Run Initial Setup

4. Configure Cluster Backbone

5. (Optional) Configure Local SSH Client

Maintenance

Update and Reboot Nodes

Install NVIDIA Workbench

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages