// Stopping Containers Correctly

Stopping a container with


$ podman stop container-name
or

$ docker stop container-name

will send SIGTERM to the first process (PID 1) and shut down the container when the process terminates. If this doesn't happen within a certain time frame (default is 10s), the runtime will send SIGKILL to the process and take the container down.

So far so good, things are getting interesting when your container process isn't PID 1.

This is already the case if the process is started via a shell script.


#!/bin/bash

... 

java $FLAGS $APP

Attempting to stop this container will terminate the script, while the JVM will keep running. The container runtime is usually smart enough to notice that a process is still active after the script terminated and will wait the grace period anyway, before shutting down the container forcefully. The JVM however won't notice anything and won't have the opportunity to call shutdown hooks, write JFR dumps or finish transactions.

signal delegation

One way to solve this is by delegating the signal from the shell script to the main process:


... 
java $FLAGS $APP & # detach process from script
PID=$!             # remember process ID

trap 'kill -TERM $PID' INT TERM # delegate kill signal to JVM

wait $PID   # attach script to JVM again; note: TERM signal unblocks this wait
trap - TERM INT
wait $PID   # wait for JVM to exit after signal delegation
EXIT_STATUS=$?

The second wait prevents the script from exiting before the JVM finished termination and is required since the first wait is unblocked as soon the script received the signal.

it still didn't work

Interestingly, after implementing this (and trying out other variations of the same concept) it still didn't work for some reason - debugging showed the trap never fired.

Turns out that nothing was wrong with the signal delegation - signals just never reached the script :). So I searched a bit around and found this article which basically described the same async/wait/delegate method in greater detail (thats where I stole the EXIT_STATUS line from), so I knew it had to work. Another great article gave me the idea to check the Dockerfile again.


FROM ...
...
CMD ./start.sh

The sh shell interpreting the bash script was the first process!


$ podman ps
CONTAINER ID  IMAGE                     COMMAND             ...
de216106ff39  localhost/test:latest  /bin/sh -c ./start.sh  ...

htop (on the host) in tree view shows it more clearly:


$ htop
    1 root       ... /sbin/init
15643 podpilot   ...  - /usr/libexec/podman/conmon --api-version ...
15646 100996     ... |   - /bin/sh -c ./start.sh ...
15648 100996     ... |      - /bin/bash ./start.sh
15662 100996     ... |         - /home/jdk/bin/java -Xshare:on ...

To fix this a different CMD (or ENTRYPOINT) syntax is needed:


FROM ...
...
CMD [ "./start.sh" ]

Lets check again after rebuild + run:


$ podman ps
CONTAINER ID  IMAGE                     COMMAND         ...
72e3e60ed60b  localhost/test:latest  ./start.sh  ...

$ htop
    1 root       ... /sbin/init
15746 podpilot   ...  - /usr/libexec/podman/conmon --api-version ...
15749 100996     ... |   - /bin/bash ./start.sh ...
15771 100996     ... |      - /home/jdk/bin/java -Xshare:on ...

Much better. Since the script is now executed directly, it is able to receive and delegate the signals to the JVM. The Java Flight Recorder records also appeared in the volume, which meant that the JVM had enough time to convert the JFR repository into a single record file. The podman stop command also returned within a fraction of a second.

Since the trap is listening to SIGINT too, even the CTRL+C signal is properly propagated when the container is started in non-detached mode. Nice bonus for manual testing.

alternatives

Starting the JVM with


exec java $FLAGS $APP

will replace the shell process with the JVM process without changing PID or process name. Disadvantage: no java commandline in top and the shell won't execute any lines after the exec line (because it basically doesn't exist anymore).

... and if you don't care about the container life cycle too much you can always tell the JVM directly to shutdown, this will close all parent shells bottom-up until PID 1 terminated which will finally stop the container.


podman exec -it container sh -c "kill \$(jps | grep -v Jps | cut -f1 -d' ')"

- - -

lessons learned:
sometimes two square brackets make the difference :)


// [Java in] Rootless Containers with Podman

I have always been a little surprised how quickly it became acceptable to run applications wrapped in containers as root processes. Nobody would have run a web server as root before docker became mainstream if there was some way to avoid it. But with docker it became OK to have the docker daemon and the container processes all running as root. The first item in most docker tutorials became how to elevate your user rights so that you don't have to type sudo before every docker command.

But this doesn't have to be the case of course. One project I had an eye on was podman which is a container engine implementing the docker command line interface with quite good support for rootless operations. With the release of Podman 2.0.x (and the fact that it is slowly making it into the debian repositories) I started to experiment with it a bit more. (for the experimental rootless mode of Docker check out this page)

cgroups v2

Containers rely heavily on kernel namespaces and a feature called control groups. To properly run rootless containers the kernel must be supporting and running with cgroups v2 enabled. To check if cgroups v2 are enabled simply run:


ls /sys/fs/cgroup
cgroup.controllers  cgroup.max.depth  cgroup.max.descendants  cgroup.procs ...

If the files are prefixed with cgroup. you are running cgroups v2, if not, its still v1.

Many distributions will still run with cgroups v1 enabled by default for backwards compatibility. But you can enable cgroups v2 with the systemd kernel flag: systemd.unified_cgroup_hierarchy=1. To do this with grub for example:

  • edit /etc/default/grub and
  • add systemd.unified_cgroup_hierarchy=1 to the key GRUB_CMDLINE_LINUX_DEFAULT (space separated list)
  • then run sudo grub-mkconfig -o /boot/grub/grub.cfg and reboot.

... and make sure you are not running an ancient linux kernel.

crun

The underlying OCI implementation has to support cgroups v2 too. I tested mostly on crun which is a super fast and lightweight alternative to runc. The runtime can be passed to podman via the --runtime flag


podman --runtime /usr/bin/crun <commands>

but it got picked up automatically in my case after I installed the package (manjaro linux, runc is still installed too).


podman info | grep -A5 ociRuntime
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 0.14.1

subordinate uid and gids

The last step required to set up rootless containers are /etc/subuid and /etc/subgid. If the files don't exist yet, create them and add a mapping range from your user name to container users. For example the line:

duke:100000:65536

Gives duke the right to create 65536 users in container images, starting from UID 100000. Duke himself will be mapped by default to root (0) in the container. (Same must be done for groups in subgid). The range should never overlap with UIDs on the host system. Details in man subuid. More on users later in the volumes section.

rootless containers

Some things to keep in mind:
  • rootless podman runs containers with less privileges than the user which started the container
    • some of these restrictions can be lifted (via --privileged, for example)
    • but rootless containers will never have more privileges than the user that launched them
    • root in the container is the user on the host
  • rootless containers have no IP or MAC address, because nw device association requires root privileges
    • podman uses slirp4netns for user mode networking
    • pinging something from within a container won't work out of the box - but don't panic: it can be configured if desired

podman

Podman uses the same command-line interface as Docker and it also understands Dockerfiles. So if everything is configured correctly it should all look familiar:

$ podman version
Version:      2.0.2
API Version:  1
Go Version:   go1.14.4
Git Commit:   201c9505b88f451ca877d29a73ed0f1836bb96c7
Built:        Sun Jul 12 22:46:58 2020
OS/Arch:      linux/amd64

$ podman pull debian:stable-slim
...
$ podman images
REPOSITORY                       TAG          IMAGE ID      CREATED       SIZE
docker.io/library/debian         stable-slim  56fae066253c  4 days ago    72.5 MB
...
$ podman run --rm debian:stable-slim cat /etc/debian_version
10.4

Setting alias docker=podman allows existing scripts to be reused without modification - but I stick here to podman to not cause confusion.

container communication

Rootless containers don't have their own IP addresses but you can bind them to ports (>1024). Host2container communication works therefore analog to communicating with any service you would have running on the host.


$ podman run --name wiki --rm -d -p 8443:8443 jspwiki-jetty
$ podman port -a
fd4c06b454ee	8443/tcp -> 0.0.0.0:8443
$ firefox https://localhost:8443/wiki

To setup quick and dirty container2container communication you can let them communicate over the IP address of the host (or host name) and ports, if the firewall is OK with that. But a better maintainable approach are pods. Pods are groups of containers which belong together. It is basically a infrastructure container, containing the actual containers. All containers in a pod share the same localhost and use it for pod-local communication. The outside world is reached via opened ports on the pod.

Lets say we have a blog and a db. The blog requires the db but all the host cares about is the https port of the blog container. So we can simply put blog container and db container into a blog-pod and let both communicate via pod-local localhost (podhost?). The https port is opened on the blog-pod for the host while the db isn't reachable from the outside.


$ podman pod create --name blogpod -p 8443:8443

# note: a pod starts out with one container already in it,
# it is the infrastructure container - basically the pod itself
$ podman pod list
POD ID        NAME     STATUS   CREATED        # OF CONTAINERS  INFRA ID
39ad88b8892f  blogpod  Created  7 seconds ago  1                af7baf0e7fde

$ podman run --pod blogpod --name blogdb --rm -d blog-db
$ podman run --pod blogpod --name apacheroller --rm -d roller-jetty

$ podman pod list
POD ID        NAME     STATUS   CREATED         # OF CONTAINERS  INFRA ID
39ad88b8892f  blogpod  Created  30 seconds ago  3                af7baf0e7fde

$ firefox https://localhost:8443/blog

Now we already have two containers able to communicate with each other and a host which is able to communicate with a container in the pod - and no sudo in sight.

volumes and users

We already know that the user on the outside is root on the inside, but lets quickly check it just to be sure:


$ whoami
duke
$ id -u
1000
$ mkdir /tmp/outside
$ podman run --rm -it -v /tmp/outside:/home/inside debian:stable-slim bash
root@2fbc9edaa0ee:/$ id -u
0
root@2fbc9edaa0ee:/$ touch /home/inside/hello_from_inside && exit
$ ls -l /tmp/outside
-rw-r--r-- 1 duke duke 0 31. Jul 06:55 hello_from_inside

Indeed, duke's UID of 1000 was mapped to 0 on the inside.

Since we are using rootless containers and not half-rootless containers we can let the blog and the db within their containers run in their own user namespaces too, what if they write logs to mounted volumes? That is when the subuid and subgid files come into play we configured earlier.

Lets say the blog-db container process should run in the namespace of the user dbduke. Since dbduke doesn't have root rights on the inside (as intended), dbduke won't also have rights to write to the mounted volume which is owned by duke. The easiest way to solve this problem is to simply chown the volume folder on the host to the mapped user of the container.


# scripts starts blog-db
# query the UID from the container and chown the volumes folder
UID_INSIDE=$(podman run --name UID_probe --rm blog-db /usr/bin/id -u)
podman unshare chown -R $UID_INSIDE /path/to/volume

podman run -v /path/to/volume:/path/inside ...  blog-db

Podman ships with a tool called unshare (the name is going to make less sense the longer you think about it) which lets you execute commands in the namespace of a different user. The command podman unshare allows to use the rights of duke to chown a folder to the internal UID of dbduke.

If we would check the folder rights from both sides, we would see that the UID was mapped from:


podman run --name UID_probe --rm blog-db /usr/bin/id -u
998
to

$ ls -l volume/
drwxr-xr-x 2 100997 100997 4096 31. Jul 07:54 logs
on the outside which is within the range specified in the /etc/subuid file - so everything works as intended. This allows user namespace isolation between containers (dbduke, wikiduke etc) and also between containers and the user on the host who launched the containers (duke himself).

And still no sudo in sight.

memory and cpu limits [and java]

Memory limits should work out of the box in rootless containers

$ podman run -it --rm -m 256m blog-db java -Xlog:os+container -version
[0.003s][info][os,container] Memory Limit is: 268435456
...

This allows the JVM to make smarter choices without having to provide absolute -Xmx flags (but you still can).

Setting CPU limits might not work out of the box without root (tested on Manjaro which is basically Arch), since the cgroups config might have user delegation turned off by default. But it is very easy to change:


# assuming your user id is 1000 like duke
$ sudo systemctl edit user@1000.service

# now modify the file so that it contains
[Service]
Delegate=yes

# and check if it worked
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.controllers
cpuset cpu io memory pids

You might have to reboot - it worked right away in my case.


# default CPU settings uses all cores
$ podman run -it --rm blog-db sh -c\
 "echo 'Runtime.getRuntime().availableProcessors();/exit' | jshell -q"
jshell> Runtime.getRuntime().availableProcessors()$1 ==> 4

# assign specific cores to container
$ podman run -it --rm --cpuset-cpus 1,2 blog-db sh -c\
 "echo 'Runtime.getRuntime().availableProcessors();/exit' | jshell -q"
jshell> Runtime.getRuntime().availableProcessors()$1 ==> 2

Container CPU core limits should become less relevant in the java world going forward, especially once projects like Loom [blog post] have been integrated. Since most things in java will run on virtual threads on top of a static carrier thread pool, it will be really easy to restrict the parallelism level of a JVM (basically -Djdk.defaultScheduler.parallelism=N and maybe another flag to limit max GC thread count).

But it works if you need it for rootless containers too.

class data sharing

Podman uses fuse-overlayfs for image management by default, which is overlayfs running in user mode.

$ podman info | grep -A5 overlay.mount_program
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: Unknown
      Version: |-
        fusermount3 version: 3.9.2
        fuse-overlayfs: version 1.1.0

This means that JVM class data sharing is also supported out of the box if the image containing the class data archive is shared in the image graph between multiple rootless containers.

The class data stored in debian-slim-jdk (a local image I created) will be mapped only once into memory and shared between all child containers. Which are in the example below blog-db, roller-jetty and wiki-jetty.


$ podman image tree --whatrequires debian-slim-jdk
Image ID: 57c885825969
Tags:     [localhost/debian-slim-jdk:latest]
Size:     340.1MB
Image Layers
└── ID: ab5467b188d7 Size: 267.6MB Top Layer of: [localhost/debian-slim-jdk:latest]
 ├── ID: fa04d6485aa5 Size:  7.68kB
 │   └── ID: 649de1f63ecc Size: 11.53MB Top Layer of: [localhost/debian-slim-jdk-jetty:latest]
 │    ├── ID: 34ce3917399d Size: 8.192kB
 │    │   └── ID: d128b826459d Size:  56.7MB Top Layer of: [localhost/roller-jetty:latest]
 │    ├── ID: 9a9c51927e42 Size: 8.192kB
 │    └── ID: d5c7176c1073 Size: 27.56MB Top Layer of: [localhost/wiki-jetty:latest]
 ├── ID: 06acb45dd590 Size:  7.68kB
 └── ID: 529ff7e97882 Size: 1.789MB Top Layer of: [localhost/blog-db:latest]

stopping java containers cleanly

A little tip in the end: In my experience JVMs are quite often launched from scripts in containers. This has automatically the side effect that the JVM process won't be PID 1, since it isn't the entry point. Commands like podman stop <container> will post SIGKILL to the shell script (!) wait 10s then simply kill the container process without the JVM ever knowing what was going on. More on that in the stopping containers blog entry.

Shutdown actions like dumping the JFR file won't be executed and IO writes might not have completed. So unless you trap the signal in the script and send it to the JVM somehow, there are more direct ways to stop a java container:


# script to cleanly shut down a container
$ podman exec -it $(container) sh -c\
 "kill \$(/jdk/bin/jps | grep -v Jps | cut -f1 -d' ')"

#... or if killall is installed (it usually isn't)
$ podman exec -it $(container) killall java

Once the JVM has cleanly shut down the launch script will finish which will be also noticed by the container once PID 1 is gone and it will cleanly shutdown too.

- - -

fly safe podman