Create a GPU docker container for Datalab. #1367
Conversation
Based off of the nvidia Ubuntu 16.04 container. Also switching the non-GPU container to an Ubuntu 16.04 base image for consistency.
LGTM, just a few little questions
containers/base/build.gpu.sh
Outdated
    trap 'rm -rf pydatalab' exit

    BASE_IMAGE_SUBSTITUTION="s/_base_image_/nvidia\/cuda:8.0-cudnn5-devel-ubuntu16.04/"
Small thing: you can use characters other than / to avoid backslash escaping, e.g.

    s,_base_image_,nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04,
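The suggestion can be sanity-checked with a quick pipeline (a sketch; the "FROM _base_image_" line stands in for the templated Dockerfile content):

```shell
# Using ',' as the sed delimiter means the '/' characters inside the
# image name need no backslash escaping.
echo "FROM _base_image_" | sed "s,_base_image_,nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04,"
# prints: FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
```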
Done.
    mkdir -p /srcs && \
    cd /srcs && \
-   apt-get source -d wget git python-zmq ca-certificates pkg-config libpng-dev && \
+   apt-get source --allow-unauthenticated -d wget git python-zmq ca-certificates pkg-config libpng-dev && \
Maybe add a note about why we need --allow-unauthenticated? (Is it a temporary thing?)
Done. It's because Ubuntu can't find the keys to the source git repos. Since we only download these for licensing reasons and don't actually use them, I think it's fine. The apt-get installs above are still authenticated.
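If written inline, the note might read roughly like this (a sketch; the comment wording paraphrases the reply above):

```
# --allow-unauthenticated: apt cannot verify the signing keys for these
# source packages. They are downloaded only for licensing compliance and
# never installed; the binary apt-get install steps above remain authenticated.
RUN mkdir -p /srcs && \
    cd /srcs && \
    apt-get source --allow-unauthenticated -d wget git python-zmq ca-certificates pkg-config libpng-dev
```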
    MAINTAINER Google Cloud DataLab

    # Download and Install GPU specific packages
    RUN pip install -U --upgrade-strategy only-if-needed --no-cache-dir tensorflow-gpu==1.0.1 && \
Just to confirm: we're sure that installing tensorflow-gpu over an existing tensorflow install will correctly replace things as needed?
Yes.
LGTM
Do we also want to change the rollback script?
Adding it at the end to cover the case where it doesn't exist.
containers/base/Dockerfile.gpu
Outdated
@@ -0,0 +1,22 @@
    # Copyright 2015 Google Inc. All rights reserved.
Is that copyright year correct?
Fixed
containers/base/Dockerfile.in
Outdated
    # limitations under the License.

-   FROM debian:jessie
+   FROM _base_image_
Instead of a template, let's make this a local Docker tag (e.g. datalab-base-image), and then whatever the base is will be based on that tag rather than having to munge template files.
Done for the base image. We could do this for the top-level image as well, but that already has a template for the version numbers, so leaving it as is for now.
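The tag-based approach might look roughly like this (a sketch; datalab-base-image is the tag name proposed in the comment above):

```
# Point a fixed local tag at whichever base image is wanted for this build:
docker tag nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04 datalab-base-image

# The Dockerfile then references the tag instead of a template placeholder:
# FROM datalab-base-image
```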
tools/release/rollback.sh
Outdated
    gcloud docker -- push gcr.io/${PROJECT_ID}/datalab:local

    echo "Pulling the rollback GPU images: ${DATALAB_GPU_IMAGE}"
    gcloud docker -- pull ${DATALAB_GPU_IMAGE}
Since this is new, we should gracefully handle the situation where the image we are trying to rollback to does not exist.
The only graceful thing to do here is to exit. As this is the last step in the rollback and the other images have already rolled back successfully, failing here will leave things in the desired state.
    trap 'rm -rf pydatalab' exit

    docker pull nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
I suspect that this may work just as well as a second step in the main build.sh file, but I don't feel strongly enough about it to block this change.
I mainly kept these separate for development purposes. The GPU build takes significantly longer to complete, and if you are developing locally, you really only want one of them.
tools/release/rollback.sh
Outdated
    # This will fail and exit if the previous GPU image doesn't exist.
    # This will happen if we try to rollback the first GPU release, and
    # that is fine since there is nothing to rollback to.
    gcloud docker -- pull ${DATALAB_GPU_IMAGE}
I worry about the following scenario:
- We do a new release with the first GPU image.
- We have to roll that release back
- The rollback gets to this step and fails
- The Jenkins job performing the rollback shows up as a failure
- The release engineer retries the rollback, causing us to rollback one more release than intended.
How about just adding a || exit 0 to the end of this line?
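The effect of the suggested guard can be sketched with 'false' standing in for a pull that fails because the rollback image does not exist:

```shell
# '|| exit 0' converts the failure into a successful exit, so the
# Jenkins job does not report the rollback as failed.
sh -c 'false || exit 0'
echo "exit status: $?"
# prints: exit status: 0
```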
Done.
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_NETWORK = 'datalab-network'
    _DATALAB_NETWORK_DESCRIPTION = 'Network for Google Cloud Datalab instances'

    _DATALAB_FIREWALL_RULE = 'datalab-network-allow-ssh'
This, and the following line, are unused and can be deleted.
Done
tools/cli/commands/creategpu.py
Outdated
    _NVIDIA_PACKAGE = 'cuda-repo-ubuntu1604_8.0.61-1_amd64.deb'
    _DATALAB_NETWORK = 'datalab-network'
    _DATALAB_NETWORK_DESCRIPTION = 'Network for Google Cloud Datalab instances'
This is unused and can be deleted.
Done
tools/cli/commands/creategpu.py
Outdated
    '--no-connect' flag.""")

    _NVIDIA_PACKAGE = 'cuda-repo-ubuntu1604_8.0.61-1_amd64.deb'
    _DATALAB_NETWORK = 'datalab-network'
This is used, but I'd rather delete it and switch the one use to create.DATALAB_NETWORK (i.e. expose the other constant to this package).
Done
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_FIREWALL_RULE = 'datalab-network-allow-ssh'
    _DATALAB_FIREWALL_RULE_DESCRIPTION = 'Allow SSH access to Datalab instances'

    _DATALAB_DEFAULT_DISK_SIZE_GB = 200
This and the following line can be deleted.
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_DISK_DESCRIPTION = (
        'Persistent disk for a Google Cloud Datalab instance')

    _DATALAB_NOTEBOOKS_REPOSITORY = 'datalab-notebooks'
I'd also replace this with create.DATALAB_NOTEBOOKS_REPOSITORY
Done
tools/cli/commands/creategpu.py
Outdated
    _DATALAB_NOTEBOOKS_REPOSITORY = 'datalab-notebooks'

    _DATALAB_STARTUP_SCRIPT = """#!/bin/bash
There's a lot of code duplicated from the create.py file here. Can we move that off to something like a base_startup_script constant, and then have the two packages just define suffixes for the base startup script?
Done.
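The shape of the suggested refactoring could look roughly like this (a hypothetical sketch; the constant and function names are illustrative, not the actual ones in create.py or creategpu.py):

```python
# create.py would expose the startup script shared by both instance types.
_BASE_STARTUP_SCRIPT = """#!/bin/bash
# ...setup shared by the GPU and non-GPU instances...
"""

# creategpu.py would define only the GPU-specific tail.
_GPU_STARTUP_SCRIPT_SUFFIX = """# ...GPU-only steps (e.g. driver installation)...
"""


def gpu_startup_script():
    """Compose the GPU startup script: the shared base plus the GPU suffix."""
    return _BASE_STARTUP_SCRIPT + _GPU_STARTUP_SCRIPT_SUFFIX
```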
Based off of the nvidia Ubuntu 16.04 container. Also switching the non-GPU container to an Ubuntu 16.04 base image for consistency. There were a few required changes to the Dockerfile to make it build with Ubuntu.
The switch to Ubuntu adds ~80MB, which increases startup time by a few seconds, but has the advantage that TensorFlow no longer segfaults, and it's a much newer OS in general.
There is additional work to do before the GPU images can be used seamlessly, as the Container OS VM image we currently use doesn't natively support GPU containers, but it's possible to manually run these with nvidia-docker on a VM where GPU drivers are installed.