In the second part of this article, we’ll look at inference. A special shout-out to my team member Abhishek Dasgupta, without whom this article would not have been possible.
What is TensorFlow?
TensorFlow is open-source software for designing, building, and training deep learning models.
You can find more details on TensorFlow in the following link – https://www.tensorflow.org/
A typical end-to-end workflow with TensorFlow looks like this:
The first step is training, which can run on either GPU- or CPU-based systems. The trained model is then exported and made available to applications via TensorFlow Serving. Exporting a model for inference is like deploying any other application, with the same operational concerns such as scaling and availability. Once the model is available, any application can use it for inference, which can likewise run on either GPU- or CPU-based systems.
Both training and inference can run on the same Kubernetes cluster or on different clusters. For example, training can happen on an on-premises cluster, whereas inference using the trained model can happen off-premises for test/dev applications.
This section describes the prerequisites for deploying TensorFlow with Docker and Kubernetes on OpenPower servers.
At a minimum, you’ll need Kubernetes 1.6, which adds support for multiple GPUs. Kubernetes binaries for Power are available from the project release page. Instructions for setting up Kubernetes on OpenPower servers can be found here.
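As a quick sanity check that GPU scheduling is working on the cluster, you can request a GPU under the alpha resource name that Kubernetes 1.6 uses. The pod spec below is only an illustrative sketch (the pod and image names are placeholders, not from this article’s repos); a real GPU workload would also need the driver libraries mounted in, as shown in the training steps later.

```yaml
# Illustrative only: verify the scheduler will place a pod that requests a GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check            # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: gpu-check
    image: ppc64le/ubuntu:16.04   # placeholder image
    command: ["sleep", "60"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1   # alpha resource name in Kubernetes 1.6
```

If the pod stays `Pending`, `kubectl describe pod gpu-check` should show whether any node advertises the GPU resource.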
– NVIDIA software
The following software stack is used on the OpenPower servers (IBM S822LC for HPC) in our setup:
- CUDA 8.0 toolkit
Refer to the following link for CUDA 8.0 download – https://developer.nvidia.com/cuda-downloads?cm_mc_uid=&cm_mc_sid_50200000=#linux-power8
On Ubuntu 16.04, the CUDA 8.0 packages will be available under /usr/local/cuda-8.0/ and /usr/lib/powerpc64le-linux-gnu/ after installation.
cuDNN is available for download from the following link – https://developer.nvidia.com/cudnn
Ensure cuDNN is extracted into /usr/local/cuda-8.0.
- NVIDIA 375 (nvidia-375) driver
Download the NVIDIA driver from the following link – http://www.nvidia.com/download/driverResults.aspx/115753/en-us
The nvidia-375 driver is installed on the host at /usr/lib/nvidia-375/.
The paths mentioned above are used in the steps below. Be sure to substitute the correct CUDA toolkit and NVIDIA driver paths for your specific environment.
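A missing host path only surfaces as a failure once a container starts, so a quick preflight check can save a broken run later. This sketch simply reports which of the article’s assumed paths exist on the host; adjust the list if your environment uses different locations.

```shell
#!/bin/sh
# Preflight check: report whether the CUDA toolkit and NVIDIA driver
# paths used in the steps below exist on this host. The loop only
# prints "found"/"missing"; it does not change anything.
for p in /usr/local/cuda-8.0 /usr/lib/powerpc64le-linux-gnu /usr/lib/nvidia-375; do
  if [ -d "$p" ]; then
    echo "found:   $p"
  else
    echo "missing: $p"
  fi
done
```

Any path reported as missing should be corrected (or the corresponding `-v` mount adjusted) before building and running the training image.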
1. Build Tensorflow
We’ll be using the community TensorFlow code with PowerPC architecture patches applied on top. However, if you plan to try out the instructions using pre-compiled binaries, you’ll need to use the PowerAI offering. More details are available from the following link – https://www.ibm.com/in-en/marketplace/deep-learning-platform
Example Dockerfiles for PowerAI can be downloaded from the following github project – https://github.com/ibmsoe/Dockerfiles/tree/master/powerai-examples
The following instructions will build TensorFlow:
$ git clone https://github.com/ai-infra/tensorflow-automated-build.git
$ cd tensorflow-automated-build
$ docker build -t ppc64le/tensorflow-bin -f Dockerfile.ppc64le .
Copy the binaries to a host folder. The instructions below assume /foo is the folder on the host where the built binaries will be copied from the Docker image.
$ docker run -v /foo:/foo ppc64le/tensorflow-bin /bin/bash -c "cp -R /output/*.whl /foo"
$ docker run -v /foo:/foo ppc64le/tensorflow-bin /bin/bash -c "cp -R /usr/bin/bazel /foo"
We’ll need the bazel and TensorFlow binaries for the next step.
2. Build Docker image for Training
We’ll build a Docker image for training, using the example described in the "How to Fine-Tune a Pre-Trained Model on a New Task" section in the following link – https://github.com/tensorflow/models/tree/master/inception
The following instructions will build the Docker image for training:
$ git clone https://github.com/ai-infra/tensorflow-automated-training.git tf-training
$ cd tf-training
$ cp /foo/bazel .
$ cp /foo/*.whl .
Run the following command to build the Docker image:
$ docker build -t ppc64le/tf-train-flowers -f Dockerfile.ppc64le .
3. Start training using standalone Docker
The following command starts the training. The trained model will be available on the host at /root/runs:
$ docker run -it --privileged \
    -v /usr/local/cuda-8.0/:/usr/local/cuda-8.0/ \
    -v /usr/lib/powerpc64le-linux-gnu/:/usr/lib/powerpc64le-linux-gnu/ \
    -v /usr/lib/nvidia-375/:/usr/lib/nvidia-375/ \
    -v /root/runs:/flowers-train ppc64le/tf-train-flowers \
    /bin/bash -c "./run-trainer.sh 10000 && rsync -ah flowers_train/ flowers-train/"
$ ls /root/runs/
checkpoint
events.out.tfevents.1490704717.jarvis
model.ckpt-30000.data-00000-of-00001
model.ckpt-30000.index
model.ckpt-30000.meta
model.ckpt-35000.data-00000-of-00001
model.ckpt-35000.index
model.ckpt-35000.meta
model.ckpt-40000.data-00000-of-00001
model.ckpt-40000.index
model.ckpt-40000.meta
model.ckpt-45000.data-00000-of-00001
model.ckpt-45000.index
model.ckpt-45000.meta
model.ckpt-49999.data-00000-of-00001
model.ckpt-49999.index
model.ckpt-49999.meta
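If a later step (an export or serving job, say) needs the newest checkpoint programmatically, the `checkpoint` file written into the run directory records it. The sketch below assumes the standard TensorFlow checkpoint-file format; on a real host you would point `RUN_DIR` at /root/runs, but here a sample file is created so the snippet is self-contained.

```shell
#!/bin/sh
# Sketch: read the latest checkpoint name from a TensorFlow run directory.
# Set RUN_DIR=/root/runs on the training host; the fallback below creates
# a sample file purely so this snippet can run anywhere.
RUN_DIR=${RUN_DIR:-$(mktemp -d)}
[ -f "$RUN_DIR/checkpoint" ] || \
  printf 'model_checkpoint_path: "model.ckpt-49999"\n' > "$RUN_DIR/checkpoint"

# The first line of the checkpoint file names the most recent checkpoint.
latest=$(sed -n 's/^model_checkpoint_path: "\(.*\)"$/\1/p' "$RUN_DIR/checkpoint")
echo "latest checkpoint: $latest"
```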
4. Start training by deploying on a Kubernetes cluster
Once the Docker image is ready, deploying it in a Kubernetes cluster is a breeze.
An example YAML file is available from the repo.
$ kubectl create -f https://raw.githubusercontent.com/ai-infra/tensorflow-automated-training/master/tf-inception-trainer-flowers.yaml
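The YAML file in the repo is the authoritative spec. As a rough sketch of what it needs to express, the pod mirrors the standalone docker run above: the same image, the same hostPath mounts for the CUDA and driver libraries, and a GPU request. Names below are illustrative, not copied from the repo.

```yaml
# Illustrative sketch only; see the repo's tf-inception-trainer-flowers.yaml
# for the actual spec. HostPath volumes mirror the -v flags of the docker run.
apiVersion: v1
kind: Pod
metadata:
  name: tf-inception-trainer-flowers
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: ppc64le/tf-train-flowers
    command: ["/bin/bash", "-c",
              "./run-trainer.sh 10000 && rsync -ah flowers_train/ flowers-train/"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1   # alpha GPU resource in Kubernetes 1.6
    volumeMounts:
    - { name: cuda,     mountPath: /usr/local/cuda-8.0 }
    - { name: gnu-libs, mountPath: /usr/lib/powerpc64le-linux-gnu }
    - { name: nvidia,   mountPath: /usr/lib/nvidia-375 }
    - { name: runs,     mountPath: /flowers-train }
  volumes:
  - { name: cuda,     hostPath: { path: /usr/local/cuda-8.0 } }
  - { name: gnu-libs, hostPath: { path: /usr/lib/powerpc64le-linux-gnu } }
  - { name: nvidia,   hostPath: { path: /usr/lib/nvidia-375 } }
  - { name: runs,     hostPath: { path: /root/runs } }
```

Because the volumes are hostPath mounts, the pod must land on a node where those directories exist, which is another reason the preflight path check matters.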
Let us know if you come across any issues when using TensorFlow with Docker or Kubernetes on OpenPower.