
Node0: A collaborative event powered by Protocol Learning, our decentralized approach to AI development


Dashboard Website Discord

✨ Description

Node0 is a collaborative event powered by Protocol Learning, our decentralized approach to AI development.

This is the first pretraining run open to the public that can use commodity hardware, so anyone can join and help train this model. This training run supports participants with compute resources as small as a single consumer GPU with 16GB of memory (e.g., an Nvidia RTX 3090). For the first time, small devices are contributing to a global training cluster, demonstrating that the online community can collaboratively train massive models.

Each participant (node) holds a small portion of the model's computation graph. Despite being connected over the internet (roughly 100x slower than datacenter interconnects), we are able to achieve training speed and performance on par with centralized systems using the same amount of compute. The swarm of nodes is fully dynamic, meaning participants can join and leave at any time. As a result, the system scales horizontally – the more nodes that join, the faster training proceeds.

Because decentralized protocols allow us to pool compute resources from many sources, they enable access to more computational power than is typically available in centralized datacenters. This opens up the possibility to train larger models than ever before, giving the community the ability to push the boundaries of AI training.

This is a hands-on demonstration of our vision: to make AI more accessible and less centralized by pooling compute resources from a global community.

Each participant's progress is logged – check your contribution in the dashboard.

📑 Table of Contents

  • 📋 Requirements
  • 🔧 Installation
  • 🚀 Usage
  • 🚨 Important
  • 🔍 Troubleshooting
  • 📚 Citations
  • 📄 License
  • 🙏 Acknowledgements

📋 Requirements

Hardware Requirements

PC/Server with Nvidia GPU:

  • Minimum 16GB GPU memory
  • Minimum 32GB RAM
  • Minimum 80GB disk space (required for Docker image)

Operating System Requirements

Linux, or Windows via WSL (Docker is the preferred installation route on Linux; installing from source is preferred on WSL – see the Installation section).

Network Requirements

The port that you are exposing (default is 49200) must be accessible to external connections. Follow this guide to open the port.
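
Once the node is running, a quick way to confirm that the port is actually reachable is sketched below (assuming the default port 49200 and that the standard ss and nc tools are installed):

# On the machine running the node: check that something is listening on the port
ss -tlnp | grep 49200

# From a different machine: check that the port is reachable over the internet
nc -vz <your_public_ip> 49200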

Cloud Services

Follow this guide for how to set up cloud instances that meet the requirements.

Authentication

Create a Hugging Face access token (instructions). The token doesn't need any read or write permissions, as it is only used for authorization.

🔧 Installation

Check out this list of useful links if you need further help with the installation.

Option 1: Using Docker (Preferred on Linux)

The easiest way to get started is using Docker:

# Clone repository
git clone https://github.com/PluralisResearch/node0
cd node0

# Build the image
docker build . -t pluralis_node0

Note: on some systems you may need to run all Docker commands with sudo.

If Docker fails to use the GPU, configure Docker to use the NVIDIA GPU as the default runtime:

sudo apt install jq

# Create the config file with an empty JSON object if it is missing or empty (jq needs valid JSON input)
sudo bash -c '[ -s /etc/docker/daemon.json ] || echo "{}" > /etc/docker/daemon.json'

sudo bash -c 'jq ". += {\"default-runtime\": \"nvidia\", \"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]}" /etc/docker/daemon.json > /etc/docker/daemon.json.tmp && mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json'

sudo systemctl restart docker
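
After restarting Docker, you can verify that the configuration was written correctly; the file should now contain at least the two keys added above:

cat /etc/docker/daemon.json
# Expected to include:
# {"default-runtime": "nvidia", "exec-opts": ["native.cgroupdriver=cgroupfs"]}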

Option 2: From Source (Preferred on WSL)

Prerequisites

  • Python 3.11
  • pip or conda package manager

Conda (install)

# Clone repository
git clone https://github.com/PluralisResearch/node0
cd node0

# Create conda environment
conda create -n node0 python=3.11
conda activate node0

# Install node0
pip install .
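
If you prefer pip with a plain virtual environment instead of conda, an equivalent setup looks like this (a sketch assuming Python 3.11 is installed and available as python3.11):

# Clone repository
git clone https://github.com/PluralisResearch/node0
cd node0

# Create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install node0
pip install .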

🚀 Usage

Generate starting script

To create the script that starts the server, run the following command:

Using Docker

python3 generate_script.py --use_docker

Without Docker

python3 generate_script.py

This command generates a start_server.sh file.

Parameters

The generate_script.py script requires a few authorization parameters:

  • HuggingFace token (follow instructions in the Requirements section to obtain one)
  • Email address (optional) - please provide your email address so we can get in touch with you

You can either enter these interactively (default) or pass them as command-line arguments:

  • --token <HF_token>
  • --email <email_address>

To skip interactive prompts entirely, use the --skip_input flag.
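
For example, to generate the Docker start script in a single non-interactive step (substitute your own token and, optionally, email address):

python3 generate_script.py --use_docker --token <HF_token> --email <email_address> --skip_input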

Multiple GPUs

Currently, Node0 supports single-GPU training only.

If you're running on a multi-GPU machine, you can specify which GPU to use with the --gpu_id flag:

--gpu_id <ID>

By default, the script uses GPU 0.

If you want to run multiple training instances (each on a different GPU) on the same machine, each instance needs its own node0 repository folder, as otherwise private keys and logs get overwritten. Each instance should also use a different host/announce port; see the next subsection for how to change the default exposed port.
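
As a sketch, a second instance pinned to GPU 1 could be set up in its own clone with its own port (the directory name and port number below are arbitrary; remember to open the extra port as described in the Network Requirements section):

# Second instance in a separate clone, pinned to GPU 1 and port 49201
git clone https://github.com/PluralisResearch/node0 node0-gpu1
cd node0-gpu1
python3 generate_script.py --use_docker --gpu_id 1 --host_port 49201 --announce_port 49201
./start_server.sh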

Changing exposed port

By default, Node0 exposes port 49200 for P2P connections. If you need to modify the exposed port, use the following flags:

--host_port <port>     # The internal port the library will listen on
--announce_port <port> # The external port other peers will connect to

In most cases, host_port and announce_port should be identical. However, some compute providers (e.g., RunPod) assign random external port mappings. In this case:

  1. Set --host_port to the internal port you want the library to listen on
  2. Set --announce_port to the external port that has been mapped by your provider
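
For example, if your provider maps external port 31672 to internal port 49200 on your instance (the numbers here are purely illustrative), you would generate the script with:

python3 generate_script.py --host_port 49200 --announce_port 31672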

Start the server

To join the experiment, run:

./start_server.sh

The stdout from the Docker container is saved to the run.out file. To keep track of training, you can follow the file in the terminal with:

tail -f run.out

If running outside of Docker, we recommend running the code inside a tmux session, which allows the script to persist when you disconnect from the machine and provides better control for resource cleanup when stopping the script.
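
A minimal tmux workflow looks like this (assuming tmux is installed):

# Start a named session and launch the server inside it
tmux new -s node0
./start_server.sh

# Detach with Ctrl+B then D; reattach later with:
tmux attach -t node0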

Stop the server

You can stop the server at any time and rejoin the experiment later — your contribution will be saved.

Do not delete private.key file between runs (see Important section).

To stop the server, run:

Using Docker

# Find the container name or ID
docker ps

# Stop the server
docker stop <container_name_or_ID>

# Remove the container
docker rm <container_name_or_ID>

# Update file ownership
sudo chown -R <linux_user> .
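
If you prefer a single step and your Node0 containers were created from the pluralis_node0 image built earlier, you can stop and remove them by filtering on that image name (a sketch; adjust the image name if you tagged it differently):

# Stop and remove all containers created from the pluralis_node0 image
docker ps --filter ancestor=pluralis_node0 -q | xargs -r docker stop
docker ps -a --filter ancestor=pluralis_node0 -q | xargs -r docker rm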

Without Docker

Press Ctrl + C in the terminal running the server to stop it.

Remove temporary files and free ports:

# Remove socket files
rm /tmp/hivemind*

# Install lsof (omit sudo if running on cloud providers without sudo access)
sudo apt-get install lsof

# Kill all processes using host port (default port number is 49200)
# (omit sudo if running on cloud providers without sudo access)
for i in $(sudo lsof -t -i tcp:<host_port>); do kill -9 $i; done

⚠️ Note: On some cloud providers sudo may not be available. In such cases, simply omit sudo from the commands above.

If GPU memory is not released, you can run the following command to kill all Python processes that are using the GPU. ⚠️ WARNING: do not run this if any other Python code of yours is using the GPU, as it will be killed as well.

# Install lsof (omit sudo if running on cloud providers without sudo access)
sudo apt-get install lsof

# Kill processes (omit sudo if running on cloud providers without sudo access)
for i in $(sudo lsof /dev/nvidia* | grep python  | awk '{print $2}' | sort -u); do kill -9 $i; done
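
Before and after doing this, nvidia-smi is a quick way to check which processes are still holding GPU memory:

# List processes currently using the GPU and their memory usage
nvidia-smi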

Restart the server

If the server has been stopped, you can restart it as follows:

./start_server.sh

Make sure you've followed all the steps in the Stop the server section to properly stop the old run first.

⚠️ Note: If you experience any problems, try restarting your computer and then attempt to start the server again.

Verify training

It may take a few minutes for the server to find other peers and join the training. Check the training logs (logs/server.log) or stdout to monitor the process. If you're running the code inside Docker, stdout is saved to the run.out file.
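
When running from source, you can follow the server log directly:

tail -f logs/server.log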

At first, you will see that authorization is completed and new parameters are downloaded from a peer:

INFO:node0.security.authorization: Access for user username has been granted until 2025-04-15 12:59:12.613600+00:00 UTC
INFO:node0.security.authorization: Authorization completed
 
 ...

INFO:hivemind.averaging.averager: Downloading parameters from peer <...>
INFO:hivemind.averaging.averager: Finished downloading state in 0.309s from <...>

After some time the training will start and you will see similar logs:

INFO:hivemind.moe.server.runtime: Processed 51 batches in last 60 seconds:
INFO:hivemind.moe.server.runtime: body2.0.919_backward: 27 batches (100.62 batches/s), 108 examples (402.50 examples/s), avg batch size 4.00
INFO:hivemind.moe.server.runtime: body2.0.919_forward: 24 batches (382.51 batches/s), 96 examples (1530.02 examples/s), avg batch size 4.00

Sanity check: a healthy peer will periodically report Processed [N] batches in last 60 seconds and Averaged gradients/parameters with [N > 1] peers.

🚨 Important

private.key

The code generates a private.key file during initial setup. This file:

  • Contains your node's cryptographic identity
  • Is required for secure communication within the network
  • Should be kept confidential and never shared publicly
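
A simple precaution (a sketch, assuming the file sits in the repository root where it is generated) is to tighten its permissions and keep a private backup:

# Restrict access to the owner only
chmod 600 private.key

# Keep a copy in a safe location outside the repository (example path)
cp private.key ~/node0-private.key.backup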

Docker files

Files created inside the Docker container are owned by a different user (typically root). To modify or delete them outside the container, you need to reclaim ownership.

Linux:

sudo chown -R <linux_user> <path/to/project>

🔍 Troubleshooting

Network-Related Issues

The following errors in logs are typically related to internet connectivity:

  • Averaging failed with <class 'TimeoutError'>

    This occurs when slow internet connections prevent averaging operations from completing in time. This is generally acceptable.

    ⚠️ Note: If you only see averaging errors and never see "Averaged gradients/averaged parameters" messages, your internet connection may be too slow for proper operation.

  • Averaging step failed: could not find a group

    Similar to the above error, this is caused by network connectivity issues during averaging operations.

  • An error occurred during the speed test, skipping

    Indicates that the library couldn't perform an internet speed test. This error can be safely ignored as it doesn't affect core functionality.

  • Your node cannot be reached by other peers. Please check that the announced port is open to inbound connections.

    Verify that your port is open for both inbound and outbound connections (follow this guide). Check that no firewall rules are blocking the connection.

  • Failed to load state from peers.

This error is likely caused by a slow internet connection.

Authorization Issues

The following errors may occur during the authorization process:

  • Invalid token

    Verify that you've provided a valid Hugging Face token.

  • Verification failed

    If you've made local changes, revert them or use a clean copy of the repository. If you recently pulled new commits from the repository:

    • Reinstall the library (if installed from source)
    • Rebuild the Docker image (if using Docker)
  • This peer_id is already used by another user

This error occurs when you try to rejoin the experiment with a different Hugging Face account than the one your existing private.key was registered under. Delete the private.key file and try again.

📚 Citations

If you use this project in your research, please cite:

Gil Avraham*, Yan Zuo*, Violetta Shevchenko, Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin Hewa Koneputugodage, and Alexander Long. Node0: Model Parallel Training over the Internet with Protocol Models. 2025.

*equal contribution

@misc{avraham2025node0,
    title={Node0: Model Parallel Training over the Internet with Protocol Models}, 
    author={Gil Avraham and Yan Zuo and Violetta Shevchenko and Hadi Mohaghegh Dolatabadi and Thalaiyasingam Ajanthan and Sameera Ramasinghe and Chamin Hewa Koneputugodage and Alexander Long},
    year={2025},
    url={https://github.com/PluralisResearch/node0}, 
}

 

Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, and Alexander Long. Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism. 2025.

@misc{ramasinghe2025protocolmodelsscalingdecentralized,
    title={Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism}, 
    author={Sameera Ramasinghe and Thalaiyasingam Ajanthan and Gil Avraham and Yan Zuo and Alexander Long},
    year={2025},
    eprint={2506.01260},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2506.01260}, 
}

 

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, and Alexander Long. Nesterov Method for Asynchronous Pipeline Parallel Optimization. ICML. 2025.

@article{ajanthan2025asyncpp,
    title={Nesterov Method for Asynchronous Pipeline Parallel Optimization},
    author={Ajanthan, Thalaiyasingam and Ramasinghe, Sameera and Zuo, Yan and Avraham, Gil and Long, Alexander},
    journal={ICML},
    year={2025}
}

📄 License

Distributed under the Apache-2.0 License. See LICENSE for more information.

Third-party dependencies and their licenses are listed in THIRD_PARTY_LICENSES.md.

🙏 Acknowledgements

Core Framework

This project is built upon the Hivemind library for decentralized deep learning, distributed under the MIT License.

Datasets

This project uses the FineWeb-Edu dataset by HuggingFace, made available under the Open Data Commons Attribution License (ODC-BY) v1.0.
