Container solutions for HPC

Document Author: Simon Butcher, QMUL Research IT

Date: 10th January 2017

Version 1.1

1. Summary

Linux containers are self-contained execution environments that share a Linux kernel with the host, but have isolated resources for CPU, I/O, memory, etc. To an end user this resembles a virtual machine, but there are key differences besides the lower resource overhead.

This document describes some of the benefits and restrictions of containers on HPC, and outlines the proposed solution.

2. Key features of containers

  • Linux containers are typically stateless and do not have persistent storage (data created during one session does not remain available for another session). However, containers commonly allow you to mount storage from outside the container, such as a home directory (see the example after this list).
  • Containers such as Docker are intended to run a single application: rather than being general-purpose environments, they are designed around the single service or application being run. Larger services therefore tend to be built from multiple containers working together.
  • Because data lives on a shared volume rather than in the container itself, containers are relatively disposable.
  • Containers can be deployed in a matter of seconds, versus a few minutes for a VM and a few days for a physical server.
  • Most container solutions are relatively immature technology: new features are added regularly, and APIs/compatibility may change between releases in a short space of time.
  • Containers allow portability of environments, such that an Ubuntu container can be run on the HPC cluster (running Red Hat) and also copied to a user's laptop and run there with identical results.

Source: http://kubernetes.io/docs/whatisk8s/
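
As an illustration of the storage-mounting point above (a minimal sketch; the image name and paths are purely illustrative), a Docker container can have a host directory bound into it at run time with the -v option:

$ docker run -v /data/$USER:/data ubuntu ls /data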

2.1 Container solutions available

Thanks to the Open Container Project, there are a variety of container solutions available, each with a slightly different focus but maintaining some level of compatibility.

  • Docker expanded upon the original LXC containers and is the best known and de facto standard. (In 2015 Google announced they would discontinue their own lmctfy container solution in favour of using Docker, instead merging their concepts into the underlying libcontainer library)^3

Since Univa provides some integration to facilitate running Docker containers, we will look at this option despite Docker not being designed for HPC, and also take a closer look at container systems developed specifically for HPC. The best-known ones are:

  • Singularity is actively developed and appears to be gaining popularity among big players in the HPC community due to its simplicity and ease of use in a variety of environments.

  • Shifter was initially developed for NERSC's Cray system. It has potential but is not as outward-focused or as generic as Singularity.

2.2 Differences between container solutions

  • Docker, because of its role in the devops toolchain, is service-centric: it is commonly used to deploy a service (perhaps a scalable collection of containers running a website, each providing a single service). HPC containers are commonly focused on encapsulating a single application and its dependencies in a container that runs for the lifetime of a compute job.
  • Docker relies on an underlying daemon, a service that runs as root. Docker also needs to be installed on every machine that will use it, and additional work is required to create the logical volume that Docker uses.
  • Singularity (and others such as runc) is portable: it can be run from a shared location and loaded as a module on an HPC cluster.
  • Some container solutions (e.g. Docker) allow privilege escalation within a container, whether by fair means or foul. Other solutions are created with the explicit goal of disallowing this, i.e. the user outside the container is the same as the user inside the container.

2.3 Use of containers in software companies

The evolution of container technology such as Docker has been largely driven by the devops culture, putting power into the hands of developers to prepare, test and deploy their code changes with confidence (and with the superuser rights they were rarely given in a typical corporate infrastructure), perhaps in a continuous integration environment.

At the same time, applications have gone from large monolithic code blocks to smaller distributed services that can be worked on separately - consider the practicalities of a company like Google, with 25,000 developers making 45,000 changes per day to 2 billion lines of code^1.

The desire to run a company's entire infrastructure in containers has given rise to container management systems (e.g. Kubernetes, Docker swarm). In 2014, still the early days of containers, Google claimed that "Everything runs in a container...over 2 billion containers are started every week"^2

Due to their design goals and target audience, some container solutions are not necessarily suitable for an HPC environment (particularly with regard to security).

2.4 Requirements of containers on HPC

HPC clusters have different requirements from those of a software company: the main drivers for HPC are portability, security, performance and repeatability. These requirements are rather different from those of running scalable web sites, or of developers on a large application who want to test code in their own environments.

Various container solutions have come about to satisfy the HPC niche. HPC clusters typically have shared filesystems and many users with varying levels of ability and trustworthiness. Typically, an HPC container solution needs the following features:

  • does not allow any elevation of privileges
  • maintains the stability and security of the system (does not cause kernel panics)
  • isolates resources (does not impinge on other users' shared use of the system)
  • works well with a job scheduler
  • is easy to install and maintain across a whole cluster
  • is not too invasive (does not require services or major changes to compute or management nodes)
  • performs well (if I/O in a container delivers only 50% of native performance, it is not suitable)
  • ideally supports MPI and GPU applications

2.4.1 What existing problem would containers solve?

The QMUL cluster serves a wide variety of users and applications. We use environment modules to serve applications from a shared location. This allows us to maintain multiple versions of applications and removes the requirement to install packages directly onto every compute node that may potentially run the application. Direct installation of rpms is avoided, with a preference to make applications available via the shared filesystem.

Usually this means compiling an application from source, sometimes tuning compilation parameters for optimal performance. If an application depends upon many libraries, or on libraries not available in the currently installed operating system, this increases the workload and complexity hugely (if the application can be installed at all), and also creates a tangle of dependencies that would need to be individually built and updated whenever the application version changes. It becomes a large support burden and slows the rate at which applications can be deployed.

Containers allow us to run an application built on CentOS 7 or Ubuntu on our current SL6.2 cluster, and to make use of pre-built packages from repositories that provide recent versions. Since the resulting packages and dependencies are installed into the container, the solution becomes portable without manual builds or direct package installation onto compute nodes. This feature of containers is a major bonus for supportability.

2.4.2 Which applications would work well in HPC containers?

We would expect an HPC container to typically contain one application and its dependencies. For example, pandoc requires hundreds of dependencies, which are easily satisfied from repositories but would be very time-consuming to compile by hand into a custom location for a shared environment such as HPC.

So, in the case of pandoc, we can provide a container with the RPMs and their dependencies installed; these reside only in the container and require no extra packages on the compute nodes. A sketch of such a container definition is shown below.
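
As a minimal sketch (the mirror URL follows the standard CentOS layout, and we assume pandoc is available from the EPEL repository), a Singularity 2.2 definition file for such a container might look like this:

BootStrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%{OSVERSION}/%{OSVERSION}/os/$basearch/
Include: yum

%post
    yum -y install epel-release
    yum -y install pandoc

%runscript
    exec /usr/bin/pandoc "$@"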

Similarly, there may be R environments, suited to particular groups of users, that would normally require significant work installing packages into a personal R library. Providing these as ready-built containers would give users a portable and consistent R environment.

There are also quite a few bioinformatics applications whose install scripts pull in a huge number of dependencies, often automatically from repositories using the system package manager. Building all these dependencies into /share/apps for users is often not feasible, and these installation scripts frequently perform yum installs or download code from GitHub. This does not fit the usual model of installing applications by hand into /share/apps, nor is installation to a home directory practical because of the sheer number of dependencies. One such example is FRAMA, which has a huge list of dependencies including texlive, Trinity, R, Perl, samtools, etc. Running the install script in a container will install the 450 dependencies in a safe, repeatable and isolated place.

2.4.3 What wouldn't work well?

In our environment, HPC containers should not (and usually cannot) try to replace virtual machines (i.e. a large bundled suite of applications) or web services (a collection of services running daemons, network services, root privileges and web servers). The container should ideally be a self-contained application that can be called from a Grid Engine qsub file, either on its own or as part of a pipeline with other applications or containers, as sketched below.
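
The sketch below shows the kind of job script we have in mind (the image name, script names and resource requests are illustrative only); the containerised step mixes freely with native commands in a pipeline:

#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=1:0:0

module load singularity
singularity exec ./Centos7.img /usr/bin/python analysis.py input.dat > results.txt
gzip results.txt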

3. Testing Docker on an HPC Cluster at QMUL

Univa have added tighter Docker integration within their job scheduler in the last 12 months^4. This allows a more seamless method of running and selecting Docker containers on a Grid Engine cluster in a user-friendly manner, allowing selection of images from the Grid Engine command line. Univa only support a certain range of Docker versions, due to rapid changes in the Docker API.

To perform the testing, we installed the latest Univa release (8.4.3) on the test cluster and docker-1.10.3-44.el7.centos.x86_64 on the compute nodes (later CentOS builds of Docker did not work with Univa despite being within the accepted version range, illustrating the difficulty of tight integration with Docker).

3.1 Issues with Docker

There were a number of issues experienced with Docker testing that we consider show-stoppers:

  • Despite assurances by Univa that Grid Engine would allow strict control of the Docker containers that can be run on the system, Univa's Docker integration will download any image specified by the qsub command, even if it is not resident on the system. This would allow potentially vulnerable, un-vetted containers to be run on the cluster.
  • Univa uses a wildcard format to select images. Coupled with the first point, this can lead to mistakenly loading bad containers, e.g.
qsub -b y -l docker,docker_images="*ubuntu*" \
 -cwd -S /bin/bash "ls;cat /etc/lsb-release"
  • Docker allows privilege escalation within a container. Combined with the ability to mount filesystems in the container, this allows a user to easily gain root access to mounted filesystems.

Example: a user can upload a container image which has a known root password. Once inside the container, the user can switch to root and then traverse any mounted filesystems as root, even adding themselves to the sudoers file on the host machine. This has been demonstrated on our test system.

Since GPFS filesystems are mounted on the compute nodes, this means that in its current state we cannot use Docker on any system with shared filesystems, or on any system where users are not supposed to have root privileges.

3.2 Future developments

User namespaces^5, a feature of Docker since v1.10, would still allow root access inside the container but would prevent interaction with external systems under escalated privileges. This is a new feature with certain restrictions ("In general, user namespaces are an advanced feature and will require coordination with other capabilities"^6). It may not work with GPFS, and likely requires a later kernel than the one provided with CentOS 6 (user namespace support arrived in Linux kernel v3.8). It is also only a small part of the security picture for Docker, which was not designed with security in mind.
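
For reference, a minimal sketch of how user namespace remapping is switched on (the exact mechanism depends on the Docker version and distribution; the subordinate UID/GID ranges must also be configured in /etc/subuid and /etc/subgid):

# option passed to the Docker daemon, e.g. via OPTIONS in /etc/sysconfig/docker
--userns-remap=default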

3.3 Docker Summary

For us to consider Docker more closely, the Univa integration would need to improve in terms of logging, container management, restrictions and security, and the aforementioned Docker security issues would need to be resolved.

Even if Univa fix the issues experienced, the Docker feature set would be frozen at the version supported by the scheduler, locking us in to older releases.

4. Testing Singularity on an HPC Cluster

Singularity has been created with HPC in mind, although its use is not restricted to HPC. One of its defining characteristics is that "A user inside a Singularity container is the same user as outside the container". As we have seen with Docker, this is a critical requirement for multi-user systems and shared filesystems. With Singularity, if a user cannot gain root privileges outside the container, they cannot gain root inside the container^7.

4.1 Satisfaction of Requirements

4.1.1 Does it work with the job scheduler?

After loading the singularity module, it can be run as a normal application.

$ module load singularity
$ singularity exec ./Centos7.img /usr/bin/python hello.py
Hello World: The Python version is 2.7.5

4.1.2 Is it a relatively mature product, with a future?

Nothing is guaranteed, but the project is open source, comes from Lawrence Berkeley National Laboratory, and is gaining good traction. The project leader has a good track record, having been the CentOS lead in the early days. Project development is very active, judging by the frequency of GitHub commits and version releases: our bug reports have been fixed within a day, and our own code contributions have been accepted.

4.1.3 Will we be locking ourselves in to a legacy system if the project dies?

Due to the compatible nature of the container images, we would not be totally locked in should something happen to the lead developer (although there are multiple developers on the project), or if the project went off course regarding features and strategy. We would hopefully be able to move to a similar project without changing the containers, although definition files, our workflows and qsub scripts would need to be changed to accommodate different syntax. Since it is an open-source project developed in sight of the community, the risk is lower than with a closed-source proprietary product.

4.1.4 Is it secure?

Singularity does not make any changes to the host network interfaces, nor does it install a service or daemon. As previously discussed (and tested), it does not allow privilege escalation within the container. Since the kernel is shared with the host, any kernel vulnerability would need to be present in the host system to be exploited. It does, however, seem possible to request remote containers (e.g. from the Docker registry). We would prefer to control which containers can be used, e.g. local containers owned by a specific user account; this feature has been discussed with the developers and may appear in a future version.

4.1.5 Is it easy to use?

The command-line options are simple (typically exec, run, shell and test as different ways of running code in the container). In use, it is almost seamless.
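
For illustration (using the image name from the earlier example; a sketch only), the main subcommands look like this:

$ singularity shell ./Centos7.img                          # interactive shell inside the container
$ singularity run ./Centos7.img                            # execute the container's default runscript
$ singularity exec ./Centos7.img cat /etc/redhat-release   # run an arbitrary command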

4.1.6 Is it easy to deploy?

Once it is installed on a shared filesystem and a module file has been created, compute nodes can run containers without any further installation (root access is required to create and bootstrap new containers in the current version, 2.2 at the time of writing).
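
A sketch of the root-only build steps under version 2.2 (the image size and definition file name are illustrative):

$ sudo singularity create --size 2048 Centos7.img
$ sudo singularity bootstrap Centos7.img centos7.def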

4.1.7 How does it perform?

The common consensus is that containerised applications take less than 5% extra time to run^8, and that Singularity's performance is similar to Docker's. However, we have also had some outlying results, for example up to 20% longer to run a certain piece of Python code inside a container; the reasons for this are being investigated. Grid Engine also records higher memory use for containerised jobs than for identical native Python code. The pilot phase will give a wider real-world view.

4.1.8 How can I try it out?

Documentation for the pilot is provided on this page.