kube-openmpi provides mainly two things:
- Kubernetes manifest templates (powered by Helm) to run Open MPI jobs on a Kubernetes cluster. See the `chart` directory for details.
- Base docker images on DockerHub for building your custom docker images. Currently we provide only Ubuntu 16.04 based images. To support distributed deep learning workloads, we provide CUDA based images, too. Supported tags are below:
- Plain Ubuntu based:
  - `2.1.2-16.04-0.7.0` / `0.7.0`
  - naming convention: `$(OPENMPI_VERSION)-$(UBUNTU_IMAGE_TAG)-$(KUBE_OPENMPI_VERSION)`. `$(UBUNTU_IMAGE_TAG)` refers to tags of `ubuntu`.
- CUDA (with cuDNN7) based:
  - cuda8.0: `2.1.2-8.0-cudnn7-devel-ubuntu16.04-0.7.0` / `0.7.0-cuda8.0`
  - cuda9.0: `2.1.2-9.0-cudnn7-devel-ubuntu16.04-0.7.0` / `0.7.0-cuda9.0`
  - cuda9.1: `2.1.2-9.1-cudnn7-devel-ubuntu16.04-0.7.0` / `0.7.0-cuda9.1`
  - naming convention is `$(OPENMPI_VERSION)-$(CUDA_IMAGE_TAG)-$(KUBE_OPENMPI_VERSION)`. `$(CUDA_IMAGE_TAG)` refers to tags of `nvidia/cuda`.
  - see `Dockerfile`
- Chainer, CuPy, ChainerMN images:
  - cuda8.0: `0.7.0-cuda8.0-nccl2.1.4-1-chainer4.0.0b4-chainermn1.2.0`
  - cuda9.0: `0.7.0-cuda9.0-nccl2.1.15-1-chainer4.0.0b4-chainermn1.2.0`
  - cuda9.1: `0.7.0-cuda9.1-nccl2.1.15-1-chainer4.0.0b4-chainermn1.2.0`
  - naming convention is `$(KUBE_OPENMPI_VERSION)-$(CUDA_VERSION)-nccl$(NCCL_CUDA80_PACKAGE_VERSION)-chainer$(CHAINER_VERSION)-chainermn$(CHAINER_MN_VERSION)`
  - see `Dockerfile.chainermn`
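The naming conventions above can be illustrated with plain shell variable expansion; the values below are taken from the cuda9.0 tag listed earlier.

```shell
# Assemble a CUDA image tag from its components.
OPENMPI_VERSION=2.1.2
CUDA_IMAGE_TAG=9.0-cudnn7-devel-ubuntu16.04   # a tag of nvidia/cuda
KUBE_OPENMPI_VERSION=0.7.0
echo "${OPENMPI_VERSION}-${CUDA_IMAGE_TAG}-${KUBE_OPENMPI_VERSION}"
# → 2.1.2-9.0-cudnn7-devel-ubuntu16.04-0.7.0
```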
- Quick Start
- Use your own custom docker image
- Inject your code to your containers from Github
- Run kube-openmpi cluster as non-root user
- How to use gang-scheduling (i.e. schedule a group of pods at once)
- Run ChainerMN Job
- Release Notes
## Quick Start

Prerequisites:
- kubectl: follow the installation step
- helm client: follow the installation step
- Kubernetes cluster (minikube is super-handy for a local test)
```shell
# generate a temporary ssh key
$ ./gen-ssh-key.sh

# edit your values.yaml
$ $EDITOR values.yaml

$ MPI_CLUSTER_NAME=__CHANGE_ME__
$ KUBE_NAMESPACE=__CHANGE_ME__
$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -

# wait until $MPI_CLUSTER_NAME-master is ready
$ kubectl get -n $KUBE_NAMESPACE po $MPI_CLUSTER_NAME-master

# You can now run mpiexec via 'kubectl exec'!
# A hostfile is automatically generated at '/kube-openmpi/generated/hostfile'
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
    --hostfile /kube-openmpi/generated/hostfile \
    --display-map -n 4 -npernode 1 \
    sh -c 'echo $(hostname):hello'
```
```
Data for JOB [43686,1] offset 0
======================== JOB MAP ========================
Data for node: MPI_CLUSTER_NAME-worker-0 Num slots: 2 Max slots: 0 Num procs: 1
 Process OMPI jobid: [43686,1] App: 0 Process rank: 0 Bound: UNBOUND
Data for node: MPI_CLUSTER_NAME-worker-1 Num slots: 2 Max slots: 0 Num procs: 1
 Process OMPI jobid: [43686,1] App: 0 Process rank: 1 Bound: UNBOUND
Data for node: MPI_CLUSTER_NAME-worker-2 Num slots: 2 Max slots: 0 Num procs: 1
 Process OMPI jobid: [43686,1] App: 0 Process rank: 2 Bound: UNBOUND
Data for node: MPI_CLUSTER_NAME-worker-3 Num slots: 2 Max slots: 0 Num procs: 1
 Process OMPI jobid: [43686,1] App: 0 Process rank: 3 Bound: UNBOUND
=============================================================
MPI_CLUSTER_NAME-worker-1:hello
MPI_CLUSTER_NAME-worker-2:hello
MPI_CLUSTER_NAME-worker-0:hello
MPI_CLUSTER_NAME-worker-3:hello
```
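For reference, the generated hostfile simply lists the compute pods with their slot counts. A hypothetical example matching the 4-worker job map above (actual host names depend on your cluster name and namespace):

```
# /kube-openmpi/generated/hostfile (hypothetical contents)
MPI_CLUSTER_NAME-worker-0 slots=2
MPI_CLUSTER_NAME-worker-1 slots=2
MPI_CLUSTER_NAME-worker-2 slots=2
MPI_CLUSTER_NAME-worker-3 slots=2
```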
MPI workers form a StatefulSet, so you can scale the cluster up or down.
```shell
# scale workers from 4 to 3
$ kubectl -n $KUBE_NAMESPACE scale statefulsets $MPI_CLUSTER_NAME-worker --replicas=3
statefulset "MPI_CLUSTER_NAME-worker" scaled

# Then you can run mpiexec again.
# The hostfile is updated automatically, every 15 seconds by default.
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
    --hostfile /kube-openmpi/generated/hostfile \
    --display-map -n 3 -npernode 1 \
    sh -c 'echo $(hostname):hello'
...
MPI_CLUSTER_NAME-worker-0:hello
MPI_CLUSTER_NAME-worker-2:hello
MPI_CLUSTER_NAME-worker-1:hello
```
To tear down the cluster:

```shell
$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE delete -f -
```
## Use your own custom docker image

Please edit the image section in your values.yaml:

```yaml
image:
  repository: yourname/kube-openmpi-based-custom-image
  tag: latest
```
It is expected that your custom image is based on our base image (everpeace/kube-openmpi) and does NOT change any ssh/sshd configurations defined in image/Dockerfile.
Please refer to Custom ChainerMN image example on kube-openmpi for details.
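A minimal sketch of such a custom image might look like the following; the base tag and the installed packages are illustrative assumptions, not requirements:

```dockerfile
# Sketch only: pick a real tag from the list above.
FROM everpeace/kube-openmpi:2.1.2-16.04-0.7.0

# Add your own dependencies here. Do NOT touch the ssh/sshd setup
# inherited from image/Dockerfile.
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3-pip \
 && rm -rf /var/lib/apt/lists/*

# Bake your application code into the image (path is an example).
COPY your-app/ /your-app/
```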
If your image is in a private registry, please create a Secret of docker-registry type in your namespace by referring here.
Then, specify the secret name in your values.yaml:
```yaml
image:
  repository: <your_registry>/<your_org>/<your_image_name>
  tag: <your_tag>
  pullSecrets:
  - name: <docker_registry_secret_name>
```
## Inject your code to your containers from Github

kube-openmpi supports importing your code hosted on Github into your containers. To do so, please edit the appCodesToSync section in values.yaml. You can define multiple Github repositories.
```yaml
appCodesToSync:
- name: your-app-name
  gitRepo: https://github.com/org/your-app-name.git
  gitBranch: master
  fetchWaitSecond: "120"
  mountPath: /repo
```
When your code is in a private git repository, the repository must be accessible via ssh.
Please note that this feature requires `securityContext.runAs: 0` for the side-car containers that fetch your code into the mpi containers.
You need to register an ssh key for the repo. We recommend setting up Deploy Keys for your private repo because a deploy key is valid only for the target repository and is read-only.
- github: Managing Deploy Keys | Github Developer Guide
- bitbucket: Use access keys | Bitbucket Support
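One way to generate a dedicated key pair for use as a deploy key (the file name and comment below are just examples):

```shell
# Generate an ssh key pair with no passphrase for the deploy key.
ssh-keygen -t rsa -b 2048 -f deploy_key -N "" -C "kube-openmpi-git-sync"
# deploy_key.pub -> register as a Deploy Key on your git hosting service
# deploy_key     -> use as the private key file in the next step
```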
Create a generic type Secret which has a key `ssh` whose value is the private key:

```shell
$ kubectl create -n $KUBE_NAMESPACE secret generic <git-sync-cred-name> --from-file=ssh=<deploy-private-key-file>
```
Then, you can define `appCodesToSync` entries with the secret:

```yaml
- name: <your-secret-repo>
  gitRepo: git@<git-server>:<your-org>/<your-secret-repo>.git
  gitBranch: master
  fetchWaitSecond: "120"
  mountPath: <mount-point>
  gitSecretName: <git-sync-cred-name>
```
## Run kube-openmpi cluster as non-root user

By default, kube-openmpi runs your mpi cluster as the root user. However, from a security standpoint, you might want to run it as a non-root user. There are two ways to achieve this.
To use the provided user: the kube-openmpi base docker images on DockerHub ship with a normal user openmpi with uid=1000/gid=1000. To run your mpi cluster as this user, edit your values.yaml to specify a securityContext like below:
```yaml
# values.yaml
...
mpiMaster:
  securityContext:
    runAsUser: 1000
    fsGroup: 1000
...
mpiWorkers:
  securityContext:
    runAsUser: 1000
    fsGroup: 1000
```
Then you can run mpiexec as the openmpi user. (If you already had a kube-openmpi cluster deployed, you need to tear it down and re-deploy it.)
```shell
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec \
    --hostfile /kube-openmpi/generated/hostfile \
    --display-map -n 4 -npernode 1 \
    sh -c 'echo $(hostname):hello'
...
```
To use your own custom user: you need to build your own custom base image, because a user with your desired uid/gid must exist (be embedded) in the docker image. To do this, just run make with the options below:
```shell
$ cd images
$ make REPOSITORY=<your_org>/<your_repo> SSH_USER=<username> SSH_UID=<uid> SSH_GID=<gid>
```
This creates an ubuntu based image, a cuda8 (cudnn7) image, and a cuda9 (cudnn7) image.
Then, set the image in your values.yaml and set your uid/gid in runAsUser/fsGroup as in the previous section.
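Putting it together, a hypothetical values.yaml fragment for a custom image built with SSH_UID=2000/SSH_GID=2000 might look like this (repository name, tag, and uid/gid are placeholders):

```yaml
# values.yaml (sketch): custom image with an embedded user uid=2000/gid=2000
image:
  repository: your_org/your_repo
  tag: latest
mpiMaster:
  securityContext:
    runAsUser: 2000
    fsGroup: 2000
mpiWorkers:
  securityContext:
    runAsUser: 2000
    fsGroup: 2000
```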
## How to use gang-scheduling (i.e. schedule a group of pods at once)

As stated in kubeflow/tf-operator#165, spawning multiple kube-openmpi clusters can cause deadlock. To prevent it, you might want gang-scheduling (i.e. scheduling multiple pods all together) in kubernetes. Currently, kubernetes-incubator/kube-arbitrator supports it by using the kube-batchd scheduler and a PodDisruptionBudget.
Please follow these steps:
1. Edit the `mpiWorkers.customScheduling` section in your `values.yaml` like this:

   ```yaml
   mpiWorkers:
     customScheduling:
       enabled: true
       schedulerName: <your_kube-batchd_scheduler_name>
       podDisruptionBudget:
         enabled: true
   ```

2. Deploy your kube-openmpi cluster.
## Run ChainerMN Job

We published Chainer/ChainerMN (with CuPy and NCCL2) based images. Let's use one. In this example, we run the train_mnist example from the ChainerMN repo. If you want to build your own docker image, please refer to Custom ChainerMN image example on kube-openmpi for details.
- Edit your `values.yaml` so that:
  - kube-openmpi uses the image,
  - `2` mpi workers are allocated and `1` GPU resource is assigned to each mpi worker,
  - an `appCodesToSync` entry is added to run the `train_mnist` example with ChainerMN.
```yaml
image:
  repository: everpeace/kube-openmpi
  tag: 0.7.0-cuda8.0-nccl2.1.4-1-chainer4.0.0b4-chainermn1.2.0
...
mpiWorkers:
  num: 2
  resources:
    limits:
      nvidia.com/gpu: 1
...
appCodesToSync:
- name: chainermn
  gitRepo: https://github.com/chainer/chainermn.git
  gitBranch: master
  fetchWaitSecond: "120"
  mountPath: /chainermn-examples
  subPath: chainermn/examples
...
```
- Deploy your kube-openmpi cluster:

  ```shell
  $ MPI_CLUSTER_NAME=__CHANGE_ME__
  $ KUBE_NAMESPACE=__CHANGE_ME__
  $ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -
  ```
- Run `train_mnist` with GPU:

  ```shell
  $ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
      --hostfile /kube-openmpi/generated/hostfile \
      --display-map -n 2 -npernode 1 \
      python3 /chainermn-examples/mnist/train_mnist.py -g
  ```
```
======================== JOB MAP ========================
Data for node: MPI_CLUSTER_NAME-worker-0 Num slots: 8 Max slots: 0 Num procs: 1
 Process OMPI jobid: [28697,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]
Data for node: MPI_CLUSTER_NAME-worker-1 Num slots: 8 Max slots: 0 Num procs: 1
 Process OMPI jobid: [28697,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]
=============================================================
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
...
1 0.224002 0.102322 0.9335 0.9695 17.1341
2 0.0733692 0.0672879 0.977967 0.9765 24.7188
...
20 0.00531046 0.105093 0.998267 0.9799 160.794
```
## Release Notes

- docker base images:
  - fix `init.sh` so that a non-root user won't fail to run `init.sh`
- kubernetes manifests:
  - add master pod to compute nodes. Now openmpi jobs can run in the master pod, too, which enables single-node openmpi jobs.
- docker base images:
  - CMD was changed from `start_sshd.sh` to `init.sh`. When `ONE_SHOT` is `true`, `init.sh` executes the user command passed as arguments to `init.sh` right after sshd is up.
- kubernetes manifests:
  - `oneShot` mode is supported. Automatically scaling down workers is also supported.
  - In `mpiMaster.oneShot` mode, `mpiMaster.oneShot.command` is automatically executed in the master once the cluster is up. If `mpiMaster.oneShot.autoScaleDownWorkers` is enabled and `mpiMaster.oneShot.command` completes successfully (i.e. return code is `0`), the worker cluster is scaled down to `0`.
- docker base images:
  - cuda9.0 support added
  - ChainerMN images for each cuda version (8.0, 9.0, 9.1)
- kubernetes manifests:
  - supported docker registry secrets to pull docker images from a private docker registry
  - supported fetching code from private git repositories
- kubernetes manifests:
  - To prevent potential deadlock when scheduling multiple kube-openmpi clusters, gang-scheduling (scheduling a group of pods all together) for mpi workers is now available via `kube-batchd` in `kube-arbitrator`.
- kubernetes manifests:
  - support user defined `volumes`/`volumeMounts`
  - kube-openmpi managed volume names changed
- Documents:
  - made the Run step simpler. Changed to use `kubectl exec -it -- mpiexec` directly.
- docker images:
  - `root` can ssh to both mpi-master and mpi-workers when containers run as root
- kubernetes manifests:
  - the mpi cluster now runs as `root` by default
  - you can use the `openmpi` user as before by setting `runAsUser`/`fsGroup` in `values.yaml`
  - you no longer need to dig a tunnel to use the `mpiexec` command!
  - documented how to use a custom user with custom uid/gid
- docker images:
  - added `orte_keep_fqdn_hostnames=t` to `openmpi-mca-params.conf`
- kubernetes manifests:
  - you no longer need the `CustomPodDNS` feature gate!!
  - the `bootstrap` job was removed
  - `hostfile-updater` was introduced. Now you can scale your mpi cluster up/down dynamically!
    - It runs next to the `mpi-master` pod as a side-car container.
  - the path of the auto-generated `hostfile` was moved to `/kube-openmpi/generated/hostfile`
- docker images:
  - removed the s6-overlay init process and introduced a self-managed sshd script to support `securityContext` (e.g. `securityContext.runAs`) (#1)
- kubernetes manifests:
  - supported custom `securityContext` (#1)
  - improved the mpi-cluster cleanup process
  - fixed a broken network-policy manifest
- docker images:
  - fixed the cuda-aware openMPI installation script; the build now ensures `mca:mpi:base:param:mpi_built_with_cuda_support:value:true` when a cuda based image is built. You can NOT use Open MPI with CUDA on `0.1.0`, so please use `0.2.0`.
- kubernetes manifests:
  - fixed: `resources` in `values.yaml` was ignored
  - `workers` can now resolve `master` in DNS
- initial release
TODO:
- automate the process (create a kube-openmpi command?)
- document chart parameters
- add additional persistent volume claims