AMD ROCm containers
What is AMD ROCm
The AMD Infinity Hub contains a collection of advanced AMD GPU software containers and deployment guides for HPC, AI, and machine learning applications, enabling researchers to speed up their time to science. Containerized applications run quickly and reliably in the high-performance computing environment with full support for AMD GPUs. A collection of Infinity Hub tools has been deployed on our clusters to extend their capabilities, enable powerful software, and deliver results faster. By using Singularity with Infinity Hub ROCm-enabled containers, users can focus on building lean models, producing optimal solutions, and gathering insights faster. For more information, please visit AMD Infinity Hub.
Getting Started
Users can download ROCm containers from the AMD Infinity Hub and run them directly, following the Singularity instructions on the corresponding container's catalog page.
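For example, a manual Singularity workflow might look like the sketch below. The image name and tag are illustrative placeholders (Infinity Hub images are generally distributed through the amdih Docker Hub registry), so take the exact pull and run commands from the container's Infinity Hub catalog page:
singularity pull gromacs.sif docker://amdih/gromacs:<tag>
singularity exec --rocm gromacs.sif gmx --version
The --rocm flag makes the host's AMD GPUs and ROCm runtime available inside the container.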
In addition, we provide a subset of pre-downloaded ROCm containers wrapped into convenient software modules. These modules hide the underlying complexity and provide the same commands you would expect from non-containerized versions of each application.
On clusters equipped with AMD GPUs, type the commands below to see the list of deployed ROCm container modules:
module load rocmcontainers
module avail
------------ ROCm-based application container modules for AMD GPUs -------------
cp2k/20210311--h87ec1599
deepspeed/rocm4.2_ubuntu18.04_py3.6_pytorch_1.8.1
gromacs/2020.3 (D)
namd/2.15a2
openmm/7.4.2
pytorch/1.8.1-rocm4.2-ubuntu18.04-py3.6
pytorch/1.9.0-rocm4.2-ubuntu18.04-py3.6 (D)
specfem3d/20201122--h9c0626d1
specfem3d_globe/20210322--h1ee10977
tensorflow/2.5-rocm4.2-dev
[....]
Deployed Applications
cp2k
Description
CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modeling methods such as DFT using the mixed Gaussian and plane waves approaches (GPW and GAPW). Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimization, and transition state optimization using NEB or the dimer method. CP2K is written in Fortran 2008 and can be run efficiently in parallel using a combination of multi-threading, MPI, and HIP/CUDA.
Versions
- Bell: 8.2, 20210311--h87ec1599
- Negishi: 8.2, 20210311--h87ec1599
Module
You can load the modules by:
module load rocmcontainers
module load cp2k
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run cp2k on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=cp2k
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers cp2k
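The script above ends after loading the containerized CP2K module; append your actual CP2K command to complete it. A minimal sketch, assuming an input file named input.inp of your own (and that the module exposes the standard cp2k.psmp binary, which may vary by version):
cp2k.psmp -i input.inp -o output.out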
deepspeed
Description
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
Versions
- Bell: rocm4.2_ubuntu18.04_py3.6_pytorch_1.8.1
- Negishi: rocm4.2_ubuntu18.04_py3.6_pytorch_1.8.1
Module
You can load the modules by:
module load rocmcontainers
module load deepspeed
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run deepspeed on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=deepspeed
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers deepspeed
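The script above ends after loading the module; append your own training command. A minimal sketch, assuming a hypothetical training script train.py that uses the DeepSpeed engine:
deepspeed --num_gpus=1 train.py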
gromacs
Description
GROMACS is a molecular dynamics application designed to simulate Newtonian equations of motion for systems with hundreds to millions of particles. GROMACS is designed to simulate biochemical molecules like proteins, lipids, and nucleic acids that have a lot of complicated bonded interactions.
Versions
- Bell: 2020.3, 2022.3.amd1
- Negishi: 2020.3, 2022.3.amd1
Module
You can load the modules by:
module load rocmcontainers
module load gromacs
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run gromacs on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=gromacs
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers gromacs
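Complete the script with your GROMACS command. A minimal sketch, assuming a prepared run input md.tpr of your own and offloading the non-bonded interactions to the GPU:
gmx mdrun -deffnm md -nb gpu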
lammps
Description
LAMMPS stands for Large-scale Atomic/Molecular Massively Parallel Simulator and is a classical molecular dynamics (MD) code.
Versions
- Bell: 2022.5.04
- Negishi: 2022.5.04
Module
You can load the modules by:
module load rocmcontainers
module load lammps
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run lammps on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=lammps
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers lammps
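Complete the script with your LAMMPS run. A minimal sketch, assuming an input script in.melt of your own and that the module exposes the lmp executable (the wrapped binary name may differ):
lmp -in in.melt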
namd
Description
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
Versions
- Bell: 2.15a2, 3.0a9
- Negishi: 2.15a2, 3.0a9
Module
You can load the modules by:
module load rocmcontainers
module load namd
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run namd on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=namd
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers namd
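Complete the script with your NAMD run. A minimal sketch for the multicore NAMD 2.15a2 build, assuming a configuration file md.namd of your own; +p matches the 8 cores requested above:
namd2 +p8 md.namd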
openmm
Description
OpenMM is a high-performance toolkit for molecular simulation. It can be used as an application, a library, or a flexible programming environment. OpenMM includes extensive language bindings for Python, C, C++, and even Fortran. The code is open source and developed on GitHub, licensed under MIT and LGPL.
Versions
- Bell: 7.4.2, 7.7.0
- Negishi: 7.4.2, 7.7.0
Module
You can load the modules by:
module load rocmcontainers
module load openmm
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run openmm on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=openmm
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers openmm
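OpenMM is typically driven from Python, so the run step is simply your own script. A minimal sketch, assuming a hypothetical simulate.py that builds and runs an OpenMM Simulation:
python simulate.py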
pytorch
Description
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
Versions
- Bell: 1.8.1-rocm4.2-ubuntu18.04-py3.6, 1.9.0-rocm4.2-ubuntu18.04-py3.6, 1.10.0-rocm5.0-ubuntu18.04-py3.7
- Negishi: 1.8.1-rocm4.2-ubuntu18.04-py3.6, 1.9.0-rocm4.2-ubuntu18.04-py3.6, 1.10.0-rocm5.0-ubuntu18.04-py3.7
Module
You can load the modules by:
module load rocmcontainers
module load pytorch
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run pytorch on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=pytorch
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers pytorch
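Complete the script with your training or inference command. A minimal sketch, assuming a hypothetical train.py; the ROCm build of PyTorch exposes AMD GPUs through the usual torch.cuda interface:
python train.py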
rochpcg
Description
HPCG is an HPC benchmark intended to better represent the computational and data access patterns that characterize a broad set of scientific workloads. This container implements the HPCG benchmark on top of AMD's ROCm platform.
Versions
- Bell: 3.1.0
- Negishi: 3.1.0
Module
You can load the modules by:
module load rocmcontainers
module load rochpcg
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run rochpcg on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=rochpcg
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers rochpcg
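Complete the script with the benchmark invocation. The sketch below is an assumption about the launcher name and arguments (HPCG-style runs take the local grid dimensions and a runtime in seconds); consult the module's help or the Infinity Hub page for the exact syntax:
rochpcg 280 280 280 60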
rochpl
Description
HPL, or High-Performance Linpack, is a benchmark which solves a uniformly random system of linear equations and reports the floating-point execution rate. This container implements the HPL benchmark on top of AMD's ROCm platform.
Versions
- Bell: 5.0.5
- Negishi: 5.0.5
Module
You can load the modules by:
module load rocmcontainers
module load rochpl
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run rochpl on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=rochpl
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers rochpl
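Complete the script with the benchmark invocation. The sketch below is an assumption about the launcher name and flags (an HPL run is parameterized by the process grid P x Q, the problem size N, and the block size NB); consult the module's help or the Infinity Hub page for the exact syntax:
rochpl -P 1 -Q 1 -N 45056 --NB 512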
specfem3d
Description
SPECFEM3D Cartesian simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic, or seismic wave propagation in any type of conforming mesh of hexahedra (structured or not). It can, for instance, model seismic waves propagating in sedimentary basins or any other regional geological model following earthquakes. It can also be used for non-destructive testing or for ocean acoustics.
Versions
- Bell: 20201122--h9c0626d1
- Negishi: 20201122--h9c0626d1
Module
You can load the modules by:
module load rocmcontainers
module load specfem3d
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run specfem3d on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=specfem3d
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers specfem3d
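The script above only loads the module; a SPECFEM3D run is normally a multi-step workflow executed from a prepared example directory. A rough sketch, assuming your DATA/ directory and Par_file are already set up and that the module exposes the standard SPECFEM3D executables (names may differ in the container):
xmeshfem3D
xgenerate_databases
xspecfem3D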
specfem3d_globe
Description
SPECFEM3D Globe simulates global and regional continental-scale seismic wave propagation.
Versions
- Bell: 20210322--h1ee10977
- Negishi: 20210322--h1ee10977
Module
You can load the modules by:
module load rocmcontainers
module load specfem3d_globe
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run specfem3d_globe on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=specfem3d_globe
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers specfem3d_globe
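As with SPECFEM3D, complete the script with the actual mesher and solver steps. A rough sketch, assuming a prepared DATA/Par_file and the standard SPECFEM3D_GLOBE executable names (which may differ in the container):
xmeshfem3D
xspecfem3D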
tensorflow
Description
TensorFlow is an end-to-end open source platform for machine learning.
Versions
- Bell: 2.5-rocm4.2-dev, 2.7-rocm5.0-dev
- Negishi: 2.5-rocm4.2-dev, 2.7-rocm5.0-dev
Module
You can load the modules by:
module load rocmcontainers
module load tensorflow
Example job
Using #!/bin/sh -l as the shebang in your Slurm job script will cause some of these container modules to fail. Please use #!/bin/bash instead.
To run tensorflow on our clusters:
#!/bin/bash
#SBATCH -A gpu
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus-per-node=1
#SBATCH --job-name=tensorflow
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --error=%x-%J-%u.err
#SBATCH --output=%x-%J-%u.out
module --force purge
ml rocmcontainers tensorflow
This example demonstrates how to run TensorFlow on AMD GPUs with the rocmcontainers modules.
First, prepare the matrix multiplication example from the TensorFlow documentation:
# filename: matrixmult.py
import tensorflow as tf
# Report how many GPUs TensorFlow can see
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Log device placement
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
Submit a Slurm job, making sure to request a GPU-enabled queue and the desired number of GPUs. For illustration purposes, the following example shows an interactive job submission asking for one node (${resource.nodecores} cores) in the "gpu" account and two GPUs for 6 hours, but the same applies to your production batch jobs as well:
sinteractive -A gpu -N 1 -n ${resource.nodecores} -t 6:00:00 --gres=gpu:2
salloc: Granted job allocation 5401130
salloc: Waiting for resource configuration
salloc: Nodes ${resource.hostname}-g000 are ready for job
Inside the job, load necessary modules:
module load rocmcontainers
module load tensorflow/2.5-rocm4.2-dev
And run the application as usual:
python matrixmult.py
Num GPUs Available: 2
[...]
2021-09-02 21:07:34.087607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 32252 MB memory) -> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:83:00.0)
[...]
2021-09-02 21:07:36.265167: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-09-02 21:07:36.266755: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)