Commit a96f7cd

Minor editorial changes on README.md; changed the default value for ping_timeout to 360.
1 parent b739f5f commit a96f7cd

3 files changed

Lines changed: 88 additions & 81 deletions

README.md

Lines changed: 84 additions & 77 deletions
@@ -1,7 +1,9 @@
-## Plato: A New Framework for Federated Learning Research
+# Plato: A New Framework for Federated Learning Research
 
 Welcome to *Plato*, a new software framework to facilitate scalable federated learning research.
 
+## Installation
+
 ### Installing Plato with PyTorch
 
 To install *Plato*, first clone this repository to the desired directory.
@@ -73,6 +75,77 @@ It goes without saying that `/absolute/path/to/project/home/directory` should be
 
 **Tip:** When working in Visual Studio Code as the development environment, one of the project developer's colour theme favourites is called `Bluloco`, both of its light and dark variants are excellent and very thoughtfully designed. The `Pylance` extension is also strongly recommended, which represents Microsoft's modern language server for Python.
 
+### Installing YOLOv5 as a Python package
+
+If object detection using the YOLOv5 model and any of the COCO datasets is needed, it is necessary to install YOLOv5 as a Python package first:
+
+```shell
+cd packages/yolov5
+pip install .
+```
+
+### Installing Plato with MindSpore
+
+Plato is designed to support multiple deep learning frameworks, including PyTorch, TensorFlow, and MindSpore. For MindSpore support, Plato currently supports MindSpore 1.1.1 (1.2.1 and 1.3.0 are not supported, as [they do not support `Tensor` objects to be pickled](https://gitee.com/mindspore/mindspore/issues/I43RPP?from=project-issue) and sent over a network). Though we provided a `Dockerfile` for building a Docker container that supports MindSpore 1.1.1, in rare cases it may still be necessary to install Plato with MindSpore in a GPU server running Ubuntu Linux 18.04 (which MindSpore requires). Similar to a PyTorch installation, we need to first create a new environment with Python 3.7.5 (which MindSpore 1.1.1 requires), and then install the required packages:
+
+```shell
+conda create -n mindspore python=3.7.5
+pip install -r requirements.txt
+```
+
+We should now install MindSpore 1.1.1 with the command provided by the [official MindSpore website](https://mindspore.cn/install).
+
+MindSpore 1.1.1 may also need additional packages, which should installed if they do not exist:
+
+```shell
+sudo apt-get install libssl-dev
+sudo apt-get install build-essential
+```
+
+If CuDNN has not yet been installed, it needs to be installed with the following commands:
+
+```shell
+wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
+sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
+sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
+sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
+sudo apt-get update
+sudo apt-get install libcudnn8=8.0.5.39-1+cuda10.1
+```
+
+To check the current CuDNN version, the following commands are helpful:
+
+```shell
+function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
+function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
+check libcudnn
+```
+
+To check if MindSpore is correctly installed on the GPU server, try to run the command:
+
+```shell
+python -c "import mindspore"
+```
+
+Finally, to use trainers and servers based on MindSpore, assign `true` to `use_mindspore` in the `trainer` section of the configuration file. If GPU is not available when MindSpore is used, assign `true` to `cpuonly` in the `trainer` section as well. These variables are unassigned by default, and *Plato* would use PyTorch as its default framework.
+
+## Running Plato
+
+### Running Plato using a configuration file
+
+To start a federated learning training workload, run [`run`](run) from the repository's root directory. For example:
+
+```shell
+./run --config=configs/MNIST/fedavg_lenet5.yml
+```
+
+* `--config` (`-c`): the path to the configuration file to be used. The default is `config.yml` in the project's home directory.
+* `--log` (`-l`): the level of logging information to be written to the console. Possible values are `critical`, `error`, `warn`, `info`, and `debug`, and the default is `info`.
+
+*Plato* uses the YAML format for its configuration files to manage the runtime configuration parameters. Example configuration files have been provided in the `configs` directory.
+
+*Plato* can opt to use `wandb` to produce and collect logs in the cloud. If this is needed, add `use_wandb: true` to the `trainer` section in your configuration file.
+
 ### Running Plato in a Docker container
 
 Most of the codebase in *Plato* is designed to be framework-agnostic, so that it is relatively straightfoward to use *Plato* with a variety of deep learning frameworks beyond PyTorch, which is the default framwork it is using. One example of such deep learning frameworks that *Plato* currently supports is [MindSpore 1.1.1](https://www.mindspore.cn). Due to the wide variety of tricks that need to be followed correctly for running *Plato* without Docker, it is strongly recommended to run Plato in a Docker container, on either a CPU-only or a GPU-enabled server.
@@ -119,30 +192,7 @@ The provided `Dockerfile` helps to build a Docker image running Ubuntu 20.04, wi
 
 If MindSpore support is needed, the provided `Dockerfile_MindSpore` contains two pre-configured environments for CPU and GPU environments, respectively, called `plato_cpu` or `plato_gpu`. They support [MindSpore 1.1.1](https://github.com/mindspore-ai/mindspore) and Python 3.7.5 (which is the Python version that MindSpore requires). Both Dockerfiles have GPU support enabled. Once an image is built and a Docker container is running, one can use Visual Studio Code to connect to it and start development within the container.
 
-### Installing YOLOv5 as a Python package
-
-If object detection using the YOLOv5 model and any of the COCO datasets is needed, it is necessary to install YOLOv5 as a Python package first:
-
-```shell
-cd packages/yolov5
-pip install .
-```
-### Running Plato
-
-To start a federated learning training workload, run [`run`](run) from the repository's root directory. For example:
-
-```shell
-./run --config=configs/MNIST/fedavg_lenet5.yml
-```
-
-* `--config` (`-c`): the path to the configuration file to be used. The default is `config.yml` in the project's home directory.
-* `--log` (`-l`): the level of logging information to be written to the console. Possible values are `critical`, `error`, `warn`, `info`, and `debug`, and the default is `info`.
-
-*Plato* uses the YAML format for its configuration files to manage the runtime configuration parameters. Example configuration files have been provided in the `configs` directory.
-
-*Plato* can opt to use `wandb` to produce and collect logs in the cloud. If this is needed, add `use_wandb: true` to the `trainer` section in your configuration file.
-
-### Potential Runtime Errors
+### Potential runtime errors
 
 If runtime exceptions occur that prevent a federated learning session from running to completion, the potential issues could be:
 
@@ -152,21 +202,21 @@ If runtime exceptions occur that prevent a federated learning session from runni
 
 * The time that a client waits for the server to respond before disconnecting is too short. This could happen when training with large neural network models. If you get an `AssertionError` saying that there are not enough launched clients for the server to select, this could be the reason. But make sure you first check if it is due to the *out of CUDA memory* error.
 
-*Potential solutions:* Add `ping_timeout` in the `server` section in your configuration file. The default value for `ping_timeout` is 20 (seconds). You could specify a larger timeout value, such as 120.
+*Potential solutions:* Add `ping_timeout` in the `server` section in your configuration file. The default value for `ping_timeout` is 360 (seconds).
 
-For example, to run a training session with the CIFAR-10 dataset and the ResNet-18 model, and if 10 clients are selected per round, `ping_timeout` needs to be 120. Consider an even larger number if you run with larger models and more clients.
+For example, to run a training session on [Google Colaboratory or Compute Canada](https://github.com/TL-System/plato/blob/main/docs/Running.md) with the CIFAR-10 dataset and the ResNet-18 model, and if 10 clients are selected per round, `ping_timeout` needs to be 360 when clients' local datasets are non-iid by symmetric Dirichlet distribution with the concentration of 0.01. Consider an even larger number if you run with larger models and more clients.
 
 * Running processes have not been terminated from previous runs.
 
 *Potential solutions:* Use the command `pkill python` to terminate them so that there will not be CUDA errors in the upcoming run.
 
-### Client Simulation Mode
+### Client simulation mode
 
 Plato supports a *client simulation mode*, in which the actual number of client processes launched equals the number of clients to be selected by the server per round, rather than the total number of clients. This supports a simulated federated learning environment, where the set of selected clients by the server will be simulated by the set of client processes actually running. For example, with a total of 10000 clients, if the server only needs to select 100 of them to train their models in each round, only 100 client processes will be launched in client simulation mode, and a client process may assume a different client ID in each round.
 
 To turn on the client simulation mode, add `simulation: true` to the `clients` section in the configuration file.
 
-### Plotting Runtime Results
+### Plotting runtime results
 
 If the configuration file contains a `results` section, the selected performance metrics, such as accuracy, will be saved in a `.csv` file in the `results/` directory. By default, the `results/` directory is under the path to the used configuration file, but it can be easily changed by modifying `Config.result_dir` in [`config.py`](config.py).
 
@@ -178,60 +228,17 @@ python plot.py --config=config.yml
 
 * `--config` (`-c`): the path to the configuration file to be used. The default is `config.yml` in the project's home directory.
 
-### Running Unit Tests
+### Running unit tests
 
 All unit tests are in the `tests/` directory. These tests are designed to be standalone and executed separately. For example, the command `python lr_schedule_tests.py` runs the unit tests for learning rate schedules.
 
-### Installing Plato with MindSpore
-
-Plato is designed to support multiple deep learning frameworks, including PyTorch, TensorFlow, and MindSpore. For MindSpore support, Plato currently supports MindSpore 1.1.1 (1.2.1 and 1.3.0 are not supported, as [they do not support `Tensor` objects to be pickled](https://gitee.com/mindspore/mindspore/issues/I43RPP?from=project-issue) and sent over a network). Though we provided a `Dockerfile` for building a Docker container that supports MindSpore 1.1.1, in rare cases it may still be necessary to install Plato with MindSpore in a GPU server running Ubuntu Linux 18.04 (which MindSpore requires). Similar to a PyTorch installation, we need to first create a new environment with Python 3.7.5 (which MindSpore 1.1.1 requires), and then install the required packages:
-
-```shell
-conda create -n mindspore python=3.7.5
-pip install -r requirements.txt
-```
-
-We should now install MindSpore 1.1.1 with the command provided by the [official MindSpore website](https://mindspore.cn/install).
-
-MindSpore 1.1.1 may also need additional packages, which should installed if they do not exist:
-
-```shell
-sudo apt-get install libssl-dev
-sudo apt-get install build-essential
-```
-
-If CuDNN has not yet been installed, it needs to be installed with the following commands:
-
-```shell
-wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
-sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
-sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
-sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
-sudo apt-get update
-sudo apt-get install libcudnn8=8.0.5.39-1+cuda10.1
-```
-
-To check the current CuDNN version, the following commands are helpful:
-
-```shell
-function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
-function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
-check libcudnn
-```
-
-To check if MindSpore is correctly installed on the GPU server, try to run the command:
-
-```shell
-python -c "import mindspore"
-```
-
-Finally, to use trainers and servers based on MindSpore, assign `true` to `use_mindspore` in the `trainer` section of the configuration file. If GPU is not available when MindSpore is used, assign `true` to `cpuonly` in the `trainer` section as well. These variables are unassigned by default, and *Plato* would use PyTorch as its default framework.
+## Deploying Plato
 
-### Deploying Plato Servers in a Production Environment in the Cloud
+### Deploying Plato servers in a production environment in the cloud
 
 The Plato federated learning server is designed to use Socket.IO over HTTP and HTTPS, and can be easily deployed in a production server environment in the public cloud. See `/docs/Deploy.md` for more details on how the nginx web server can be used as a reverse proxy for such a deployment in production servers.
 
-### Uninstalling Plato
+## Uninstalling Plato
 
 Remove the `conda` environment used to run *Plato* first, and then remove the directory containing *Plato*'s git repository.
 
@@ -244,6 +251,6 @@ where `federated` (or `mindspore`) is the name of the `conda` environment that *
 
 For more specific documentation on how Plato can be run on GPU cluster environments such as Google Colaboratory or Compute Canada, refer to `docs/Running.md`.
 
-### Technical support
+## Technical Support
 
 Technical support questions should be directed to the maintainer of this software framework: Baochun Li ([email protected]).

docs/Running.md

Lines changed: 3 additions & 3 deletions
@@ -228,9 +228,9 @@ If runtime exceptions occur that prevent a federated learning session from runni
 
 * The time that a client waits for the server to respond before disconnecting is too short. This could happen when training with large neural network models. If you get an `AssertionError` saying that there are not enough launched clients for the server to select, this could be the reason. But make sure you first check if it is due to the *out of CUDA memory* error.
 
-*Potential solutions:* Add `ping_timeout` in the `server` section in your configuration file. The default value for `ping_timeout` is 20 (seconds). You could specify a larger timeout value, such as 120.
-
-For example, to run a training session with the CIFAR-10 dataset and the ResNet-18 model, and if 10 clients are selected per round, `ping_timeout` needs to be 120. Consider an even larger number if you run with larger models and more clients.
+*Potential solutions:* Add `ping_timeout` in the `server` section in your configuration file. The default value for `ping_timeout` is 360 (seconds).
+
+For example, to run a training session with the CIFAR-10 dataset and the ResNet-18 model, and if 10 clients are selected per round, `ping_timeout` needs to be 360 when clients' local datasets are non-iid by symmetric Dirichlet distribution with the concentration of 0.01. Consider an even larger number if you run with larger models and more clients.
 
 * Running processes have not been terminated from previous runs.
 
236236

plato/servers/base.py

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ def start(self, port=Config().server.port):
         ping_interval = Config().server.ping_interval if hasattr(
             Config().server, 'ping_interval') else 3600
         ping_timeout = Config().server.ping_timeout if hasattr(
-            Config().server, 'ping_timeout') else 20
+            Config().server, 'ping_timeout') else 360
         self.sio = socketio.AsyncServer(ping_interval=ping_interval,
                                         max_http_buffer_size=2**31,
                                         ping_timeout=ping_timeout)
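
The fallback changed in this hunk can be illustrated in isolation. The following is a minimal sketch, not Plato's actual code: `resolve_ping_timeout` and `DEFAULT_PING_TIMEOUT` are hypothetical names, and a `SimpleNamespace` stands in for the parsed `server` section of a configuration file. It uses `getattr` with a default, which behaves the same as the `hasattr(...) else ...` expression in the diff.

```python
from types import SimpleNamespace

# Fallback in seconds after this commit (it was 20 before).
DEFAULT_PING_TIMEOUT = 360


def resolve_ping_timeout(server_config):
    """Return the configured ping_timeout, or the default when it is unset.

    Hypothetical helper: equivalent to
    `cfg.ping_timeout if hasattr(cfg, 'ping_timeout') else 360`.
    """
    return getattr(server_config, 'ping_timeout', DEFAULT_PING_TIMEOUT)


# Hypothetical stand-ins for the `server` section of a configuration file.
without_timeout = SimpleNamespace(port=8000)
with_timeout = SimpleNamespace(port=8000, ping_timeout=600)

print(resolve_ping_timeout(without_timeout))  # falls back to the default
print(resolve_ping_timeout(with_timeout))     # uses the configured value
```

With this behaviour, a configuration that omits `ping_timeout` from its `server` section picks up the new default, while one that sets it explicitly is unaffected by the commit.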
