TensorFlow Training with Docker and Kubernetes on OpenPower Servers

This is the first part of a two-part article describing TensorFlow deployment for training using Docker and a Kubernetes cluster running on OpenPower servers with NVIDIA Tesla P100 GPUs.

In the second part of the article, we'll look at inference. A special shout-out to my team member Abhishek Dasgupta, without whom this article would not have been possible.

What is TensorFlow?

TensorFlow is open-source software for designing, building, and training deep learning models.
You can find more details on TensorFlow at the following link – https://www.tensorflow.org/

A typical end-to-end workflow with TensorFlow looks like this:

[Figure: tf-workflow – a typical end-to-end TensorFlow workflow]

The first step is training, which can run on either GPU- or CPU-based systems. The trained model is then made available (exported) to applications via TensorFlow Serving. Exporting a model for inference is like deploying any other application, with the same application-specific concerns such as scaling and availability. Once the model is exported, any application can use it for inference. Inference, too, can run on either GPU- or CPU-based systems.

Both training and inference can run on the same Kubernetes cluster or on different clusters. For example, training can happen on an on-prem cluster, whereas inference using the trained model can happen off-prem for test/dev applications.

Pre-requisites

This section describes the pre-requisites for deploying TensorFlow with Docker and Kubernetes on OpenPower servers.

 – Kubernetes
At a minimum, you'll need Kubernetes 1.6, which adds support for multiple GPUs. Kubernetes binaries for Power are available from the project release page. Instructions for setting up Kubernetes on OpenPower servers can be found here.
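
As a quick check that the cluster is at the right level and that Kubernetes sees the GPUs – with 1.6, GPU capacity is reported under the alpha.kubernetes.io/nvidia-gpu resource name, assuming the alpha GPU support is enabled on your nodes:

$ kubectl version --short
$ kubectl describe nodes | grep -i nvidia-gpu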

– NVIDIA software
The following software stack is used on the OpenPower servers (IBM S822LC for HPC) in our setup:

  • Ubuntu-16.04
  • CUDA 8.0 toolkit

Refer to the following link for the CUDA 8.0 download – https://developer.nvidia.com/cuda-downloads?cm_mc_uid=&cm_mc_sid_50200000=#linux-power8
On Ubuntu 16.04, the CUDA 8.0 packages will be available under /usr/local/cuda-8.0/ and /usr/lib/powerpc64le-linux-gnu/ after installation.

  • cuDNN

cuDNN is available for download from the following link – https://developer.nvidia.com/cudnn
Ensure cuDNN is extracted into /usr/local/cuda-8.0/.
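
For example, extracting the cuDNN tarball might look like this – a sketch assuming cuDNN v5.1 for ppc64le, and that the archive unpacks into a cuda/ directory as the Linux tarballs typically do; use the file name you actually downloaded:

$ tar -xzvf cudnn-8.0-linux-ppc64le-v5.1.tgz
$ sudo cp -P cuda/include/cudnn.h /usr/local/cuda-8.0/include/
$ sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-8.0/lib64/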

  • NVIDIA 375 (nvidia-375) driver

Download the NVIDIA driver from the following link – http://www.nvidia.com/download/driverResults.aspx/115753/en-us
The nvidia-375 driver is installed on the host under /usr/lib/nvidia-375/.

The above-mentioned paths are used in the steps below. Ensure you use the correct paths for the CUDA toolkit and NVIDIA driver based on your specific environment.
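
Before moving on, a quick sanity check of the stack, assuming the default install locations above:

$ nvidia-smi                                # driver loaded, Tesla P100 GPUs visible
$ /usr/local/cuda-8.0/bin/nvcc --version    # CUDA 8.0 toolkit present
$ ls /usr/local/cuda-8.0/include/cudnn.h    # cuDNN extracted into the CUDA tree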

Setup Instructions
1. Build TensorFlow

We'll be using the community TensorFlow code with PowerPC architecture patches on top of it. However, if you plan to try out the instructions using pre-compiled binaries, you'll need to use the PowerAI offering. More details are available at the following link – https://www.ibm.com/in-en/marketplace/deep-learning-platform
Example Dockerfiles for PowerAI can be downloaded from the following GitHub project – https://github.com/ibmsoe/Dockerfiles/tree/master/powerai-examples

The following instructions will build TensorFlow:

$ git clone https://github.com/ai-infra/tensorflow-automated-build.git
$ cd tensorflow-automated-build
$ docker build -t ppc64le/tensorflow-bin -f Dockerfile.ppc64le .
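
The build takes a while. Once it finishes, you can confirm that the wheel was produced – /output is the path inside the image that the copy commands below read from:

$ docker run --rm ppc64le/tensorflow-bin ls /output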

Copy the binaries to a host folder. The instructions below assume /foo is the folder on the host where the built binaries will be copied from the Docker image.

$ docker run -v /foo:/foo ppc64le/tensorflow-bin /bin/bash -c "cp -R /output/*.whl /foo"
$ docker run -v /foo:/foo ppc64le/tensorflow-bin /bin/bash -c "cp -R /usr/bin/bazel /foo"

We'll need the Bazel and TensorFlow binaries for the next step.
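
A quick listing confirms both landed on the host; the exact wheel name depends on the TensorFlow version you built:

$ ls /foo
# expect the bazel binary and a tensorflow*.whl file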

2. Build Docker image for Training

We'll build a Docker image for training. We have used the example described in the "How to Fine-Tune a Pre-Trained Model on a New Task" section at the following link – https://github.com/tensorflow/models/tree/master/inception

The following instructions will build the Docker image for training:

$ git clone https://github.com/ai-infra/tensorflow-automated-training.git tf-training
$ cd tf-training
$ cp /foo/bazel .
$ cp /foo/*.whl .

Run the following command to build the Docker image:

$ docker build -t ppc64le/tf-train-flowers -f Dockerfile.ppc64le .
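
You can verify the image is available locally before starting a training run:

$ docker images ppc64le/tf-train-flowers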

3. Start training using standalone Docker

The following command starts the training. The trained model is then available on the host at /root/runs.

$ docker run -it --privileged -v /usr/local/cuda-8.0/:/usr/local/cuda-8.0/ \
         -v /usr/lib/powerpc64le-linux-gnu/:/usr/lib/powerpc64le-linux-gnu/ \
         -v /usr/lib/nvidia-375/:/usr/lib/nvidia-375/ \
         -v /root/runs:/flowers-train ppc64le/tf-train-flowers \
         /bin/bash -c \
         "./run-trainer.sh 10000 && rsync -ah flowers_train/ flowers-train/"
$ ls /root/runs/
checkpoint                             model.ckpt-35000.data-00000-of-00001  model.ckpt-40000.meta                 model.ckpt-49999.index
events.out.tfevents.1490704717.jarvis  model.ckpt-35000.index                model.ckpt-45000.data-00000-of-00001  model.ckpt-49999.meta
model.ckpt-30000.data-00000-of-00001   model.ckpt-35000.meta                 model.ckpt-45000.index
model.ckpt-30000.index                 model.ckpt-40000.data-00000-of-00001  model.ckpt-45000.meta
model.ckpt-30000.meta                  model.ckpt-40000.index                model.ckpt-49999.data-00000-of-00001
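
Since the run also writes a TensorFlow events file (the events.out.tfevents.* entry above), you can point TensorBoard at the output directory to watch training progress – assuming TensorFlow, which includes TensorBoard, is installed on the host, for example from the wheel built in step 1:

$ tensorboard --logdir /root/runs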

4. Start training by deploying on a Kubernetes cluster
Once the Docker image is ready, deploying it in a Kubernetes cluster is a breeze.
An example YAML file is available from the repo.

$ kubectl create -f https://raw.githubusercontent.com/ai-infra/tensorflow-automated-training/master/tf-inception-trainer-flowers.yaml
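
If you'd rather write the pod spec yourself, it broadly mirrors the docker run command from step 3: hostPath mounts for the CUDA and driver directories, a privileged security context, and a GPU limit (Kubernetes 1.6 uses the alpha.kubernetes.io/nvidia-gpu resource name, assuming alpha GPU support is enabled). A minimal sketch, not the exact file from the repo:

$ cat <<'EOF' > tf-train-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-train-flowers
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: ppc64le/tf-train-flowers
    command: ["/bin/bash", "-c", "./run-trainer.sh 10000 && rsync -ah flowers_train/ flowers-train/"]
    securityContext:
      privileged: true
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: cuda
      mountPath: /usr/local/cuda-8.0
    - name: gnu-libs
      mountPath: /usr/lib/powerpc64le-linux-gnu
    - name: nvidia-driver
      mountPath: /usr/lib/nvidia-375
    - name: runs
      mountPath: /flowers-train
  volumes:
  - name: cuda
    hostPath:
      path: /usr/local/cuda-8.0
  - name: gnu-libs
    hostPath:
      path: /usr/lib/powerpc64le-linux-gnu
  - name: nvidia-driver
    hostPath:
      path: /usr/lib/nvidia-375
  - name: runs
    hostPath:
      path: /root/runs
EOF
$ kubectl create -f tf-train-pod.yaml

Once the pod is running, kubectl get pods and kubectl logs tf-train-flowers let you follow the training.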

Let us know if you come across any issues when using TensorFlow with Docker or Kubernetes on OpenPower.

Pradipta Kumar Banerjee

I'm a Cloud and Linux/open-source enthusiast with 16 years of industry experience at IBM. You can find more details about me on LinkedIn.
