Elastic Deep Learning using PaddlePaddle and Kubernetes

gavin1332/edl

EDL: Elastic Deep Learning

EDL is an Elastic Deep Learning framework that helps deep learning cloud service providers build cluster cloud services with deep learning frameworks such as PaddlePaddle and TensorFlow. EDL includes a Kubernetes controller, the PaddlePaddle auto-scaler, which adjusts the number of processes of distributed jobs to match the idle hardware resources in the cluster, and a new fault-tolerant architecture.

EDL is an incubation-stage project of the LF AI Foundation.

While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes

  1. the global utilization of the cluster, and
  2. the waiting time of job submitters.
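The sketch below illustrates, in a simplified form, how elastic scaling can serve both goals at once: idle GPUs are handed to running jobs, and running jobs shrink back toward their minimum size when a new job arrives. It is not EDL's actual scheduling algorithm. The `Job` class, the `rebalance` function, the one-GPU-per-trainer assumption, and the greedy policy are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of an elastic training job."""
    name: str
    min_trainers: int  # the job cannot run with fewer processes (assumed >= 1)
    max_trainers: int  # the job gains nothing beyond this many processes
    trainers: int = 0  # processes currently assigned


def rebalance(jobs, total_gpus):
    """Assign trainers to jobs, assuming one GPU per trainer.

    Illustrative greedy policy, not EDL's real one:
    1. admit jobs in submission order at their minimum size;
    2. give each remaining idle GPU to the running job with the
       fewest trainers that is still below its maximum.
    Returns the number of GPUs left idle.
    """
    free = total_gpus
    for job in jobs:
        job.trainers = 0

    # Step 1: admit as many jobs as possible, which keeps waiting time low.
    for job in jobs:
        if job.min_trainers <= free:
            job.trainers = job.min_trainers
            free -= job.min_trainers

    # Step 2: spend the idle GPUs, which keeps cluster utilization high.
    while free > 0:
        candidates = [j for j in jobs if 0 < j.trainers < j.max_trainers]
        if not candidates:
            break
        smallest = min(candidates, key=lambda j: j.trainers)
        smallest.trainers += 1
        free -= 1

    return free
```

Calling `rebalance` again whenever a job is submitted or finishes mimics the elastic behavior described above: with two jobs on an 8-GPU cluster both can grow past their minimum, and when a third job is submitted they shrink so that it can start immediately instead of waiting in the queue.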

For more about the EDL project, please refer to this invited blog post on the official Kubernetes blog.

EDL includes two parts:

  1. a Kubernetes controller for the elastic scheduling of distributed deep learning jobs, and

  2. changes that make PaddlePaddle a fault-tolerant deep learning framework.

This directory contains the Kubernetes controller. For more information about fault tolerance, please refer to the design.

We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to graduate students of Tsinghua University. The performance test report of EDL on this cluster is here.

Tutorials

Design Docs

FAQ

TBD

License

PaddlePaddle EDL is provided under the Apache-2.0 license.
