Create a GPU docker container for Datalab. #1367
Conversation
Based off of the nvidia Ubuntu 16.04 container. Also switching the non-GPU container to an Ubuntu 16.04 base image for consistency.
LGTM, just a few little questions
containers/base/build.gpu.sh
Outdated
    trap 'rm -rf pydatalab' exit

    BASE_IMAGE_SUBSTITUTION="s/_base_image_/nvidia\/cuda:8.0-cudnn5-devel-ubuntu16.04/"
Small thing: you can use characters other than / to avoid backslash escaping, e.g.

    s,_base_image_,nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04,
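The suggestion can be sanity-checked with a quick pipeline (a sketch; the "FROM _base_image_" line stands in for the templated Dockerfile content):

```shell
# Using ',' as the sed delimiter means the '/' characters inside the
# image name need no backslash escaping.
echo "FROM _base_image_" | sed "s,_base_image_,nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04,"
# prints: FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
```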
Done.
    mkdir -p /srcs && \
    cd /srcs && \
-   apt-get source -d wget git python-zmq ca-certificates pkg-config libpng-dev && \
+   apt-get source --allow-unauthenticated -d wget git python-zmq ca-certificates pkg-config libpng-dev && \
Maybe add a note about why we need --allow-unauthenticated? (Is it a temporary thing?)
Done. It's because Ubuntu can't find the keys to the source git repos. Since we only download these for licensing reasons and don't actually use them, I think it's fine. The apt-get installs above are still authenticated.
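If written inline, the note might read roughly like this (a sketch; the comment wording paraphrases the reply above):

```
# --allow-unauthenticated: apt cannot verify the signing keys for these
# source packages. They are downloaded only for licensing compliance and
# never installed; the binary apt-get install steps above remain authenticated.
RUN mkdir -p /srcs && \
    cd /srcs && \
    apt-get source --allow-unauthenticated -d wget git python-zmq ca-certificates pkg-config libpng-dev
```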
    MAINTAINER Google Cloud DataLab

    # Download and Install GPU specific packages
    RUN pip install -U --upgrade-strategy only-if-needed --no-cache-dir tensorflow-gpu==1.0.1 && \
Just to confirm: we're sure that installing tensorflow-gpu over an existing tensorflow install will correctly replace things as needed?
Yes.
LGTM
Do we also want to change the rollback script?
Adding it at the end to cover the case where it doesn't exist.
containers/base/Dockerfile.gpu
Outdated
@@ -0,0 +1,22 @@
    # Copyright 2015 Google Inc. All rights reserved.
Is that copyright year correct?
Fixed
containers/base/Dockerfile.in
Outdated
    # limitations under the License.

-   FROM debian:jessie
+   FROM _base_image_
Instead of a template, let's make this a local Docker tag (e.g. datalab-base-image), and then whatever the base is will be based on that tag rather than having to munge template files.
Done for the base image. We could do this for the top-level image as well, but that already has a template for the version numbers, so leaving it as is for now.
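The tag-based approach might look roughly like this (a sketch; datalab-base-image is the tag name proposed in the comment above):

```
# Point a fixed local tag at whichever base image is wanted for this build:
docker tag nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04 datalab-base-image

# The Dockerfile then references the tag instead of a template placeholder:
# FROM datalab-base-image
```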
tools/release/rollback.sh
Outdated
    gcloud docker -- push gcr.io/${PROJECT_ID}/datalab:local

    echo "Pulling the rollback GPU images: ${DATALAB_GPU_IMAGE}"
    gcloud docker -- pull ${DATALAB_GPU_IMAGE}
Since this is new, we should gracefully handle the situation where the image we are trying to rollback to does not exist.
The only graceful thing to do here is to exit. As this is the last step in the rollback and the other images have already rolled back successfully, failing here will leave things in the desired state.
    trap 'rm -rf pydatalab' exit

    docker pull nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
I suspect that this may work just as well as a second step in the main build.sh file, but I don't feel strongly enough about it to block this change.
I mainly kept these separate for development purposes. The GPU build takes significantly longer to complete, and if you are developing locally, you really only want one of them.
tools/release/rollback.sh
Outdated
    # This will fail and exit if the previous GPU image doesn't exist.
    # This will happen if we try to rollback the first GPU release, and
    # that is fine since there is nothing to rollback to.
    gcloud docker -- pull ${DATALAB_GPU_IMAGE}
I worry about the following scenario:
- We do a new release with the first GPU image.
- We have to roll that release back
- The rollback gets to this step and fails
- The Jenkins job performing the rollback shows up as a failure
- The release engineer retries the rollback, causing us to rollback one more release than intended.
How about just adding a || exit 0 to the end of this line?
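The effect of the suggested guard can be sketched with 'false' standing in for a pull that fails because the rollback image does not exist:

```shell
# '|| exit 0' converts the failure into a successful exit, so the
# Jenkins job does not report the rollback as failed.
sh -c 'false || exit 0'
echo "exit status: $?"
# prints: exit status: 0
```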
Done.
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_NETWORK = 'datalab-network'
    _DATALAB_NETWORK_DESCRIPTION = 'Network for Google Cloud Datalab instances'

    _DATALAB_FIREWALL_RULE = 'datalab-network-allow-ssh'
This, and the following line, are unused and can be deleted.
Done
tools/cli/commands/creategpu.py
Outdated
    _NVIDIA_PACKAGE = 'cuda-repo-ubuntu1604_8.0.61-1_amd64.deb'
    _DATALAB_NETWORK = 'datalab-network'
    _DATALAB_NETWORK_DESCRIPTION = 'Network for Google Cloud Datalab instances'
This is unused and can be deleted.
Done
tools/cli/commands/creategpu.py
Outdated
    '--no-connect' flag.""")

    _NVIDIA_PACKAGE = 'cuda-repo-ubuntu1604_8.0.61-1_amd64.deb'
    _DATALAB_NETWORK = 'datalab-network'
This is used, but I'd rather delete it and switch the one use to create.DATALAB_NETWORK (i.e. expose the other constant to this package).
Done
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_FIREWALL_RULE = 'datalab-network-allow-ssh'
    _DATALAB_FIREWALL_RULE_DESCRIPTION = 'Allow SSH access to Datalab instances'

    _DATALAB_DEFAULT_DISK_SIZE_GB = 200
This and the following line can be deleted.
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_DISK_DESCRIPTION = (
        'Persistent disk for a Google Cloud Datalab instance')

    _DATALAB_NOTEBOOKS_REPOSITORY = 'datalab-notebooks'
I'd also replace this with create.DATALAB_NOTEBOOKS_REPOSITORY
Done
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_NOTEBOOKS_REPOSITORY = 'datalab-notebooks'

    _DATALAB_STARTUP_SCRIPT = """#!/bin/bash
There's a lot of code duplicated from the create.py file here. Can we move that off to something like a base_startup_script constant, and then have the two packages just define suffixes for the base startup script?
Done.
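The shape of the suggested refactoring could look roughly like this (a hypothetical sketch; the constant and function names are illustrative, not the actual ones in create.py or creategpu.py):

```python
# create.py would expose the startup script shared by both instance types.
_BASE_STARTUP_SCRIPT = """#!/bin/bash
# ...setup shared by the GPU and non-GPU instances...
"""

# creategpu.py would define only the GPU-specific tail.
_GPU_STARTUP_SCRIPT_SUFFIX = """# ...GPU-only steps (e.g. driver installation)...
"""


def gpu_startup_script():
    """Compose the GPU startup script: the shared base plus the GPU suffix."""
    return _BASE_STARTUP_SCRIPT + _GPU_STARTUP_SCRIPT_SUFFIX
```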
Based off of the nvidia Ubuntu 16.04 container. Also switching the non-GPU container to an Ubuntu 16.04 base image for consistency. There were a few required changes to the Dockerfile to make it build with Ubuntu.
The switch to Ubuntu adds ~80MB, which increases startup time by a few seconds, but has the advantage that TensorFlow no longer segfaults, and it's a much newer OS in general.
There is additional work to do before the GPU images can be used seamlessly, as the Container OS VM image we currently use doesn't natively support GPU containers, but it's possible to manually run these with nvidia-docker on a VM where GPU drivers are installed.