Node0 is a collaborative event powered by Protocol Learning, our decentralized approach to AI development.
This is the first public pretraining run that can run on commodity hardware, so anyone can join and help train the model. Participants can contribute compute resources as small as a 16GB consumer GPU (e.g., an Nvidia 3090). For the first time, small devices are contributing to a global training cluster, demonstrating that the online community can collaboratively train massive models.
Each participant (node) holds a small portion of the model's computation graph. Despite being connected over the internet (roughly 100x slower than datacenter interconnects), we achieve training speed and performance on par with centralized systems using the same amount of compute. The swarm of nodes is fully dynamic, meaning participants can join and leave at any time. As a result, the system scales horizontally: the more nodes that join, the faster training proceeds.
Because decentralized protocols allow us to pool compute resources from many sources, they enable access to more computational power than is typically available in centralized datacenters. This opens up the possibility to train larger models than ever before, giving the community the ability to push the boundaries of AI training.
This is a hands-on demonstration of our vision: to make AI more accessible and less centralized by pooling compute resources from a global community.
Each participant's progress is logged; you can check your contribution on the dashboard.
PC/Server with Nvidia GPU:
- Minimum 16GB GPU memory
- Minimum 32GB RAM
- Minimum 80GB disk space (required for Docker image)
- Linux
- Windows + WSL2 (enable CUDA support in WSL)
The port that you are exposing (default is 49200) must be accessible to external connections. Follow this guide to open the port.
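For example, on a machine that uses ufw as its firewall (an assumption; your setup may rely on a different firewall or a cloud security group instead), opening the default port could look like this:

# Allow inbound TCP traffic on the default Node0 port
sudo ufw allow 49200/tcp

# Check that the rule is active
sudo ufw status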
Follow this guide for how to set up cloud instances that meet the requirements.
Create a HuggingFace access token (instructions). The token doesn't need any read or write permissions, as it is only used for authorization.
Check out this list of useful links if you need further help with the installation.
The easiest way to get started is using Docker:
# Clone repository
git clone https://github.com/PluralisResearch/node0
cd node0
# Build the image
docker build . -t pluralis_node0

Note: on some systems you may need to run all Docker commands with sudo.
If Docker fails to use GPU, configure Docker to use NVIDIA GPU as the default runtime:
sudo apt install jq
sudo touch /etc/docker/daemon.json
sudo bash -c 'jq ". += {\"default-runtime\": \"nvidia\", \"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]}" /etc/docker/daemon.json > /etc/docker/daemon.json.tmp && mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json'
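# After the command above, /etc/docker/daemon.json is expected to contain
# (merged with any keys that were already present) something like:
# {"default-runtime": "nvidia", "exec-opts": ["native.cgroupdriver=cgroupfs"]}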
sudo systemctl restart docker

To install from source instead of using Docker, you will need:
- Python 3.11
- pip or conda package manager
Conda (install)
# Clone repository
git clone https://github.com/PluralisResearch/node0
cd node0
# Create conda environment
conda create -n node0 python=3.11
conda activate node0
# Install node0
pip install .

To create the script that starts the server, run the following command:
# If you built the Docker image:
python3 generate_script.py --use_docker

# If you installed from source:
python3 generate_script.py

This command generates a start_server.sh file.
The generate_script.py script requires a few authorization parameters:
- HuggingFace token (follow instructions in the Requirements section to obtain one)
- Email address (optional) - please provide your email address so we can get in touch with you
You can either enter these interactively (default) or pass them as command-line arguments:
--token <HF_token>
--email <email_address>

To skip interactive prompts entirely, use the --skip_input flag.
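For example, a non-interactive invocation for a Docker-based setup could look like the following sketch (the flag combination is assembled from the options above; substitute your own token and email address):

python3 generate_script.py --use_docker --skip_input --token <HF_token> --email <email_address>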
Currently, Node0 supports single-GPU training only.
If you're running on a multi-GPU machine, you can specify which GPU to use with the --gpu_id flag:
--gpu_id <ID>

By default, the script uses GPU 0.
If you want to run multiple training scripts (each one using a different GPU) on the same machine, each instance needs its own node0 repository folder (otherwise private keys and logs get overwritten). Each instance should also use a different host/announce port; see below for how to change the default exposed port, and the two-instance sketch that follows.
By default, Node0 exposes port 49200 for P2P connections. If you need to modify the exposed port, use the following flags:
--host_port <port> # The internal port the library will listen on
--announce_port <port> # The external port other peers will connect to

In most cases, host_port and announce_port should be identical. However, some compute providers (e.g., RunPod) assign random external port mappings. In this case:
- Set --host_port to the internal port you want the library to listen on
- Set --announce_port to the external port that has been mapped by your provider
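As an illustrative sketch (folder name and port numbers are arbitrary), a second instance on GPU 1 of the same machine could live in its own clone of the repository and use its own port:

# Second clone for a second GPU (hypothetical folder name)
git clone https://github.com/PluralisResearch/node0 node0_gpu1
cd node0_gpu1

# Generate a start script that uses GPU 1 and a different port
python3 generate_script.py --use_docker --gpu_id 1 --host_port 49201 --announce_port 49201

Remember that any new announce port must be reachable from outside, just like the default one.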
To join the experiment, run:
./start_server.sh

The stdout from the Docker container is saved in the run.out file. To keep track of the training, you can use this command to see the file updates in the terminal:
tail -f run.out

If running outside of Docker, we recommend running the code inside a tmux session, which allows the script to persist when you disconnect from the machine and provides better control for resource cleanup when stopping the script.
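A minimal tmux workflow (assuming tmux is installed) looks like this:

# Start a named session and launch the server inside it
tmux new -s node0
./start_server.sh

# Detach with Ctrl+b then d; the server keeps running in the background.
# Reattach later with:
tmux attach -t node0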
You can stop the server at any time and rejoin the experiment later — your contribution will be saved.
Do not delete the private.key file between runs (see the Important section).
To stop the server, run:
# Find the container name or ID
docker ps
# Stop the server
docker stop <container_name_or_ID>
# Remove the container
docker rm <container_name_or_ID>
# Update file ownership
sudo chown -R <linux_user> .

If you installed from source: press Ctrl + Z in the terminal running the server to stop it.
Remove temporary files and free ports:
# Remove socket files
rm /tmp/hivemind*
# Install lsof (omit sudo if running on cloud providers without sudo access)
sudo apt-get install lsof
# Kill all processes using host port (default port number is 49200)
# (omit sudo if running on cloud providers without sudo access)
for i in $(sudo lsof -t -i tcp:<host_port>); do kill -9 $i; done

If the GPU memory is not released, you can run the following command to kill all Python processes that use the GPU.
# Install lsof (omit sudo if running on cloud providers without sudo access)
sudo apt-get install lsof
# Kill processes (omit sudo if running on cloud providers without sudo access)
for i in $(sudo lsof /dev/nvidia* | grep python | awk '{print $2}' | sort -u); do kill -9 $i; done

If the server has been stopped, you can restart it as follows:
./start_server.sh

Make sure you've followed all the steps in the Stop the server section to properly stop the old run first.
It may take a few minutes for the server to find other peers and join the training. Check the training logs (logs/server.log) or stdout to monitor the process. If you're running the code inside Docker, stdout is saved in the run.out file.
At first, you will see that authorization is completed and new parameters are downloaded from a peer:
INFO:node0.security.authorization: Access for user username has been granted until 2025-04-15 12:59:12.613600+00:00 UTC
INFO:node0.security.authorization: Authorization completed
...
INFO:hivemind.averaging.averager: Downloading parameters from peer <...>
INFO:hivemind.averaging.averager: Finished downloading state in 0.309s from <...>
After some time, training will start and you will see logs similar to the following:
INFO:hivemind.moe.server.runtime: Processed 51 batches in last 60 seconds:
INFO:hivemind.moe.server.runtime: body2.0.919_backward: 27 batches (100.62 batches/s), 108 examples (402.50 examples/s), avg batch size 4.00
INFO:hivemind.moe.server.runtime: body2.0.919_forward: 24 batches (382.51 batches/s), 96 examples (1530.02 examples/s), avg batch size 4.00
Sanity check: a healthy peer will periodically report Processed [N] batches in last 60 seconds and Averaged gradients/parameters with [N > 1] peers.
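One way to run this sanity check (a sketch; the log path is taken from the paragraph above, and if you run inside Docker you can use run.out instead) is to grep the log for those messages:

# Count healthy-looking log lines
grep -c "Processed" logs/server.log
grep -c "Averaged" logs/server.log

# Or watch for them live
tail -f logs/server.log | grep --line-buffered -E "Processed|Averaged"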
The code generates a private.key file during initial setup. This file:
- Contains your node's cryptographic identity
- Is required for secure communication within the network
- Should be kept confidential and never shared publicly
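Since the key identifies your node and ties your contribution to your account, it can be worth keeping a backup copy in a safe location outside the repository folder (a general suggestion, not a requirement of the software):

# Back up the node identity (choose any private location)
cp private.key ~/node0_private.key.backup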
Files created inside the Docker container are owned by a different user. To modify or delete them outside of the container, you need to reclaim ownership.
Linux:
sudo chown -R <linux_user> <path/to/project>

The following errors in logs are typically related to internet connectivity:
- Averaging failed with <class 'TimeoutError'>
  This occurs when a slow internet connection prevents averaging operations from completing in time. This is generally acceptable.
  ⚠️ Note: If you only see averaging errors and never see "Averaged gradients/averaged parameters" messages, your internet connection may be too slow for proper operation.
- Averaging step failed: could not find a group
  Similar to the error above, this is caused by network connectivity issues during averaging operations.
- An error occurred during the speed test, skipping
  Indicates that the library couldn't perform an internet speed test. This error can be safely ignored as it doesn't affect core functionality.
- Your node cannot be reached by other peers. Please check that the announced port is open to inbound connections.
  Verify that your port is open for both inbound and outbound connections (follow this guide) and that no firewall rules are blocking the connection; a quick reachability check is sketched after this list.
- Failed to load state from peers.
  This error is most likely caused by a slow internet connection.
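A quick reachability check (a sketch assuming the default port 49200 and access to a second machine with netcat installed) is to probe the port from outside your network:

# Run from a different machine, replacing <public_ip> with your node's public address
nc -vz <public_ip> 49200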
The following errors may occur during the authorization process:
- Invalid token
  Verify that you've provided a valid Hugging Face token.
- Verification failed
  If you've made local changes, revert them or use a clean copy of the repository. If you recently pulled new commits from the repository:
  - Reinstall the library (if installed from source)
  - Rebuild the Docker image (if using Docker)
- This peer_id is already used by another user
  This error occurs when attempting to join an experiment with a different Hugging Face account. Delete the private.key file and try again.
If you use this project in your research, please cite:
Gil Avraham*, Yan Zuo*, Violetta Shevchenko, Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin Hewa Koneputugodage, and Alexander Long. Node0: Model Parallel Training over the Internet with Protocol Models. 2025.
*equal contribution
@misc{avraham2025node0,
title={Node0: Model Parallel Training over the Internet with Protocol Models},
author={Gil Avraham and Yan Zuo and Violetta Shevchenko and Hadi Mohaghegh Dolatabadi and Thalaiyasingam Ajanthan and Sameera Ramasinghe and Chamin Hewa Koneputugodage and Alexander Long},
year={2025},
url={https://github.com/PluralisResearch/node0},
}
Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, and Alexander Long. Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism. 2025.
@misc{ramasinghe2025protocolmodelsscalingdecentralized,
title={Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism},
author={Sameera Ramasinghe and Thalaiyasingam Ajanthan and Gil Avraham and Yan Zuo and Alexander Long},
year={2025},
eprint={2506.01260},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.01260},
}
Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, and Alexander Long. Nesterov Method for Asynchronous Pipeline Parallel Optimization. ICML. 2025.
@article{ajanthan2025asyncpp,
title={Nesterov Method for Asynchronous Pipeline Parallel Optimization},
author={Ajanthan, Thalaiyasingam and Ramasinghe, Sameera and Zuo, Yan and Avraham, Gil and Long, Alexander},
journal={ICML},
year={2025}
}

Distributed under the Apache-2.0 License. See LICENSE for more information.
Third-party dependencies and their licenses are listed in THIRD_PARTY_LICENSES.md.
This project is built upon the Hivemind library for decentralized deep learning, distributed under the MIT License.
This project uses the FineWeb-Edu dataset by HuggingFace, made available under the Open Data Commons Attribution License (ODC-BY) v1.0.