Description
TensorFlow is an end-to-end open source platform for machine learning.
Versions
- Bell: 2.5-rocm4.2-dev, 2.7-rocm5.0-dev
- Negishi: 2.5-rocm4.2-dev, 2.7-rocm5.0-dev
Module
Load the modules with:
module load rocmcontainers
module load tensorflow
Example job
Using #!/bin/sh -l as the shebang in a Slurm job script causes some biocontainer modules to fail. Use #!/bin/bash instead.
To run TensorFlow on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=tensorflow
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers tensorflow
This example demonstrates how to run TensorFlow on AMD GPUs with the rocmcontainers modules.
First, prepare the matrix multiplication example from the TensorFlow documentation:
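# filename: matrixmult.py
import tensorflow as tf
# Log device placement so each op reports the device it runs on
tf.debugging.set_log_device_placement(True)
# Report how many GPUs TensorFlow can see
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Multiply them and print the result
c = tf.matmul(a, b)
print(c)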
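Before running on a GPU, it can help to know what result to expect. The following is a minimal plain-Python sketch of the same 2x3 by 3x2 product that needs no TensorFlow. The helper name matmul and the variable expected are illustrative only and are not part of matrixmult.py.

```python
# Plain-Python check of the product that matrixmult.py computes with tf.matmul.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def matmul(x, y):
    """Multiply two matrices given as lists of rows (illustrative helper)."""
    # zip(*y) iterates over the columns of y.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


expected = matmul(a, b)
print(expected)  # [[22.0, 28.0], [49.0, 64.0]]
```

If the values printed by the GPU run differ from these, the problem is more likely in the environment than in the script.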
Submit a Slurm job, making sure to request a GPU-enabled queue and the desired number of GPUs. For illustration, the following example submits an interactive job requesting one node (${resource.nodecores} cores) and two GPUs in the "gpu" account for 6 hours; the same options apply to production batch jobs:
sinteractive -A gpu -N 1 -n ${resource.nodecores} -t 6:00:00 --gres=gpu:2
salloc: Granted job allocation 5401130
salloc: Waiting for resource configuration
salloc: Nodes ${resource.hostname}-g000 are ready for job
Inside the job, load the necessary modules:
module load rocmcontainers
module load tensorflow/2.5-rocm4.2-dev
And run the application as usual:
python matrixmult.py
Num GPUs Available: 2
2021-09-02 21:07:34.087607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32252 MB memory) -> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:83:00.0)
2021-09-02 21:07:36.265167: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-09-02 21:07:36.266755: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
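For batch jobs, where nobody is watching the terminal, one way to confirm that TensorFlow actually created a GPU device is to scan the job's log for the "Created TensorFlow device" line shown above. The sketch below assumes the log format matches that sample line exactly; the format may differ between TensorFlow versions. The names DEVICE_LINE and created_gpu_devices are hypothetical helpers, not part of TensorFlow or the cluster tooling.

```python
import re

# Pattern modeled on the "Created TensorFlow device" log line in the sample output.
# Assumption: other TensorFlow versions may word this line differently.
DEVICE_LINE = re.compile(
    r"Created TensorFlow device \(\S*device:GPU:(\d+) with (\d+) MB memory\)"
    r" -> physical GPU \(device: \d+, name: ([^,]+),"
)


def created_gpu_devices(log_text):
    """Return (gpu_index, memory_mb, name) for each GPU device reported in the log."""
    return [
        (int(index), int(memory_mb), name.strip())
        for index, memory_mb, name in DEVICE_LINE.findall(log_text)
    ]


# The log line from the sample run, split across string literals for readability.
sample = (
    "2021-09-02 21:07:34.087607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32252 MB memory) "
    "-> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:83:00.0)"
)
devices = created_gpu_devices(sample)
print(devices)  # [(0, 32252, 'Vega 20')]
```

In a batch job, the same function could be applied to the contents of the file named by the --error or --output option, since TensorFlow normally writes these messages to standard error. An empty list suggests that no GPU device was created.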