// [Java in] Rootless Containers with Podman

I have always been a little surprised how quickly it became acceptable to run applications wrapped in containers as root processes. Nobody would have run a web server as root before docker became mainstream if there was some way to avoid it. But with docker it became OK to have the docker daemon and the container processes all running as root. The first item in most docker tutorials became how to elevate your user rights so that you don't have to type sudo before every docker command.

But this doesn't have to be the case of course. One project I had an eye on was podman which is a container engine implementing the docker command line interface with quite good support for rootless operations. With the release of Podman 2.0.x (and the fact that it is slowly making it into the debian repositories) I started to experiment with it a bit more. (for the experimental rootless mode of Docker check out this page)

cgroups v2

Containers rely heavily on kernel namespaces and a feature called control groups. To properly run rootless containers the kernel must be supporting and running with cgroups v2 enabled. To check if cgroups v2 are enabled simply run:


ls /sys/fs/cgroup
cgroup.controllers  cgroup.max.depth  cgroup.max.descendants  cgroup.procs ...

If the files are prefixed with cgroup. you are running cgroups v2, if not, its still v1.

Many distributions will still run with cgroups v1 enabled by default for backwards compatibility. But you can enable cgroups v2 with the systemd kernel flag: systemd.unified_cgroup_hierarchy=1. To do this with grub for example:

  • edit /etc/default/grub and
  • add systemd.unified_cgroup_hierarchy=1 to the key GRUB_CMDLINE_LINUX_DEFAULT (space separated list)
  • then run sudo grub-mkconfig -o /boot/grub/grub.cfg and reboot.

... and make sure you are not running an ancient linux kernel.

crun

The underlying OCI implementation has to support cgroups v2 too. I tested mostly on crun which is a super fast and lightweight alternative to runc. The runtime can be passed to podman via the --runtime flag


podman --runtime /usr/bin/crun <commands>

but it got picked up automatically in my case after I installed the package (manjaro linux, runc is still installed too).


podman info | grep -A5 ociRuntime
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 0.14.1

subordinate uid and gids

The last step required to set up rootless containers are /etc/subuid and /etc/subgid. If the files don't exist yet, create them and add a mapping range from your user name to container users. For example the line:

duke:100000:65536

Gives duke the right to create 65536 users in container images, starting from UID 100000. Duke himself will be mapped by default to root (0) in the container. (Same must be done for groups in subgid). The range should never overlap with UIDs on the host system. Details in man subuid. More on users later in the volumes section.

rootless containers

Some things to keep in mind:
  • rootless podman runs containers with less privileges than the user which started the container
    • some of these restrictions can be lifted (via --privileged, for example)
    • but rootless containers will never have more privileges than the user that launched them
    • root in the container is the user on the host
  • rootless containers have no IP or MAC address, because nw device association requires root privileges
    • podman uses slirp4netns for user mode networking
    • pinging something from within a container won't work out of the box - but don't panic: it can be configured if desired

podman

Podman uses the same command-line interface as Docker and it also understands Dockerfiles. So if everything is configured correctly it should all look familiar:

$ podman version
Version:      2.0.2
API Version:  1
Go Version:   go1.14.4
Git Commit:   201c9505b88f451ca877d29a73ed0f1836bb96c7
Built:        Sun Jul 12 22:46:58 2020
OS/Arch:      linux/amd64

$ podman pull debian:stable-slim
...
$ podman images
REPOSITORY                       TAG          IMAGE ID      CREATED       SIZE
docker.io/library/debian         stable-slim  56fae066253c  4 days ago    72.5 MB
...
$ podman run --rm debian:stable-slim cat /etc/debian_version
10.4

Setting alias docker=podman allows existing scripts to be reused without modification - but I stick here to podman to not cause confusion.

container communication

Rootless containers don't have their own IP addresses but you can bind them to ports (>1024). Host2container communication works therefore analog to communicating with any service you would have running on the host.


$ podman run --name wiki --rm -d -p 8443:8443 jspwiki-jetty
$ podman port -a
fd4c06b454ee	8443/tcp -> 0.0.0.0:8443
$ firefox https://localhost:8443/wiki

To setup quick and dirty container2container communication you can let them communicate over the IP address of the host (or host name) and ports, if the firewall is OK with that. But a better maintainable approach are pods. Pods are groups of containers which belong together. It is basically a infrastructure container, containing the actual containers. All containers in a pod share the same localhost and use it for pod-local communication. The outside world is reached via opened ports on the pod.

Lets say we have a blog and a db. The blog requires the db but all the host cares about is the https port of the blog container. So we can simply put blog container and db container into a blog-pod and let both communicate via pod-local localhost (podhost?). The https port is opened on the blog-pod for the host while the db isn't reachable from the outside.


$ podman pod create --name blogpod -p 8443:8443

# note: a pod starts out with one container already in it,
# it is the infrastructure container - basically the pod itself
$ podman pod list
POD ID        NAME     STATUS   CREATED        # OF CONTAINERS  INFRA ID
39ad88b8892f  blogpod  Created  7 seconds ago  1                af7baf0e7fde

$ podman run --pod blogpod --name blogdb --rm -d blog-db
$ podman run --pod blogpod --name apacheroller --rm -d roller-jetty

$ podman pod list
POD ID        NAME     STATUS   CREATED         # OF CONTAINERS  INFRA ID
39ad88b8892f  blogpod  Created  30 seconds ago  3                af7baf0e7fde

$ firefox https://localhost:8443/blog

Now we already have two containers able to communicate with each other and a host which is able to communicate with a container in the pod - and no sudo in sight.

volumes and users

We already know that the user on the outside is root on the inside, but lets quickly check it just to be sure:


$ whoami
duke
$ id -u
1000
$ mkdir /tmp/outside
$ podman run --rm -it -v /tmp/outside:/home/inside debian:stable-slim bash
root@2fbc9edaa0ee:/$ id -u
0
root@2fbc9edaa0ee:/$ touch /home/inside/hello_from_inside && exit
$ ls -l /tmp/outside
-rw-r--r-- 1 duke duke 0 31. Jul 06:55 hello_from_inside

Indeed, duke's UID of 1000 was mapped to 0 on the inside.

Since we are using rootless containers and not half-rootless containers we can let the blog and the db within their containers run in their own user namespaces too, what if they write logs to mounted volumes? That is when the subuid and subgid files come into play we configured earlier.

Lets say the blog-db container process should run in the namespace of the user dbduke. Since dbduke doesn't have root rights on the inside (as intended), dbduke won't also have rights to write to the mounted volume which is owned by duke. The easiest way to solve this problem is to simply chown the volume folder on the host to the mapped user of the container.


# scripts starts blog-db
# query the UID from the container and chown the volumes folder
UID_INSIDE=$(podman run --name UID_probe --rm blog-db /usr/bin/id -u)
podman unshare chown -R $UID_INSIDE /path/to/volume

podman run -v /path/to/volume:/path/inside ...  blog-db

Podman ships with a tool called unshare (the name is going to make less sense the longer you think about it) which lets you execute commands in the namespace of a different user. The command podman unshare allows to use the rights of duke to chown a folder to the internal UID of dbduke.

If we would check the folder rights from both sides, we would see that the UID was mapped from:


podman run --name UID_probe --rm blog-db /usr/bin/id -u
998
to

$ ls -l volume/
drwxr-xr-x 2 100997 100997 4096 31. Jul 07:54 logs
on the outside which is within the range specified in the /etc/subuid file - so everything works as intended. This allows user namespace isolation between containers (dbduke, wikiduke etc) and also between containers and the user on the host who launched the containers (duke himself).

And still no sudo in sight.

memory and cpu limits [and java]

Memory limits should work out of the box in rootless containers

$ podman run -it --rm -m 256m blog-db java -Xlog:os+container -version
[0.003s][info][os,container] Memory Limit is: 268435456
...

This allows the JVM to make smarter choices without having to provide absolute -Xmx flags (but you still can).

Setting CPU limits might not work out of the box without root (tested on Manjaro which is basically Arch), since the cgroups config might have user delegation turned off by default. But it is very easy to change:


# assuming your user id is 1000 like duke
$ sudo systemctl edit user@1000.service

# now modify the file so that it contains
[Service]
Delegate=yes

# and check if it worked
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.controllers
cpuset cpu io memory pids

You might have to reboot - it worked right away in my case.


# default CPU settings uses all cores
$ podman run -it --rm blog-db sh -c\
 "echo 'Runtime.getRuntime().availableProcessors();/exit' | jshell -q"
jshell> Runtime.getRuntime().availableProcessors()$1 ==> 4

# assign specific cores to container
$ podman run -it --rm --cpuset-cpus 1,2 blog-db sh -c\
 "echo 'Runtime.getRuntime().availableProcessors();/exit' | jshell -q"
jshell> Runtime.getRuntime().availableProcessors()$1 ==> 2

Container CPU core limits should become less relevant in the java world going forward, especially once projects like Loom [blog post] have been integrated. Since most things in java will run on virtual threads on top of a static carrier thread pool, it will be really easy to restrict the parallelism level of a JVM (basically -Djdk.defaultScheduler.parallelism=N and maybe another flag to limit max GC thread count).

But it works if you need it for rootless containers too.

class data sharing

Podman uses fuse-overlayfs for image management by default, which is overlayfs running in user mode.

$ podman info | grep -A5 overlay.mount_program
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: Unknown
      Version: |-
        fusermount3 version: 3.9.2
        fuse-overlayfs: version 1.1.0

This means that JVM class data sharing is also supported out of the box if the image containing the class data archive is shared in the image graph between multiple rootless containers.

The class data stored in debian-slim-jdk (a local image I created) will be mapped only once into memory and shared between all child containers. Which are in the example below blog-db, roller-jetty and wiki-jetty.


$ podman image tree --whatrequires debian-slim-jdk
Image ID: 57c885825969
Tags:     [localhost/debian-slim-jdk:latest]
Size:     340.1MB
Image Layers
└── ID: ab5467b188d7 Size: 267.6MB Top Layer of: [localhost/debian-slim-jdk:latest]
 ├── ID: fa04d6485aa5 Size:  7.68kB
 │   └── ID: 649de1f63ecc Size: 11.53MB Top Layer of: [localhost/debian-slim-jdk-jetty:latest]
 │    ├── ID: 34ce3917399d Size: 8.192kB
 │    │   └── ID: d128b826459d Size:  56.7MB Top Layer of: [localhost/roller-jetty:latest]
 │    ├── ID: 9a9c51927e42 Size: 8.192kB
 │    └── ID: d5c7176c1073 Size: 27.56MB Top Layer of: [localhost/wiki-jetty:latest]
 ├── ID: 06acb45dd590 Size:  7.68kB
 └── ID: 529ff7e97882 Size: 1.789MB Top Layer of: [localhost/blog-db:latest]

stopping java containers cleanly

A little tip in the end: In my experience JVMs are quite often launched from scripts in containers. This has automatically the side effect that the JVM process won't be PID 1, since it isn't the entry point. Commands like podman stop <container> will post SIGKILL to the shell script (!) wait 10s then simply kill the container process without the JVM ever knowing what was going on. More on that in the stopping containers blog entry.

Shutdown actions like dumping the JFR file won't be executed and IO writes might not have completed. So unless you trap the signal in the script and send it to the JVM somehow, there are more direct ways to stop a java container:


# script to cleanly shut down a container
$ podman exec -it $(container) sh -c\
 "kill \$(/jdk/bin/jps | grep -v Jps | cut -f1 -d' ')"

#... or if killall is installed (it usually isn't)
$ podman exec -it $(container) killall java

Once the JVM has cleanly shut down the launch script will finish which will be also noticed by the container once PID 1 is gone and it will cleanly shutdown too.

- - -

fly safe podman


// Running Debian Buster with 64bit mainline Kernel on a Raspberry Pi 3b+

The Raspberry Pi 3b+ runs on a Cortex A53 quad core processor which is using the Armv8 architecture. This allows it to run 64bit (AArch64) kernels, but be also backwards compatible with Armv7 and run 32bit (Armhf) kernels.

One of the most convenient ways to get started with a Raspberry Pi is to use the Debian based Raspbian distribution. Raspbian maintains its own customized linux kernel, drivers and firmware and provides 32bit OS images.

For my usecase however, I required the following:

  • get a server friendly 64bit OS running on the Pi3b+ (for example Debian)
  • use a mainline kernel
  • avoid proprietary binary blobs if possible
  • boot from a SSD instead of a SD card or USB stick (connected via USB adapter)

It turned out most of it was fairly easy to achieve thanks to the RaspberryPi3 debian port efforts. The raspi3-image-spec project on github contains instructions and the configuration for building a 64bit Debian Buster image which is using the mainline linux-image-arm64 kernels, ready to be booted from.

The image creation process is configured via raspi3.yaml which I modified to leave out wifi packages (i don't need wifi) and use netfilter tables (nft) as replacement for the now deprecated iptables but this could have been changed after initial boot too if desired.

The image will contain two partitions, the first is formatted as vfat labeled RASPIFIRM and contains the kernel, firmware and raspi specific boot configuration. The second ext4 partition is labeled RASPIROOT and contains the OS. Since partition labels are used the raspi should be able to find the partitions regardless of how they are connected or which drive type they are on (SSD, SD card reader, USB stick..).

Once the custom 64bit Debian image is built, simply write it to the SSD using dd, plug it in and power the raspi on. Once it booted don't forget to change the root password, create users, review ssh settings (disable root login), configure netfilter tables etc.

Some additional tweaks

Reduced the GPU memory RAM partition to 32MB to have more RAM left while still being able to use a terminal via a 1080p HDMI screen. Set cma=32M in /boot/firmware/cmdline.txt. I did also add a swapfile to free up additional memory by writing inactive pages to the SSD. The default image does not have swap configured.

Added
force_turbo=1

flag to /boot/firmware/config.txt. This will lock the ARM cores to their max clock frequency (1.4GHz). Since the linux kernel doesn't ship with a Pi 3 compatible cpufreq implementation yet, the Pi3b+ would just boot with idle clock (600Mhz) and never change it based on load. This is just a workaround until the CPU frequency scaling drivers made it into the kernel (update: commits made it into kernel 5.3). I didn't experience any CPU temperature issues so far, but i am also not running CPU intensive tasks.

The files will get overwritten on kernel updates, I added most of the /etc folder into a local git repository which makes diffs trivial and simplifies maintenance. If the raspi stops booting you can simply plug the SSD USB adapter into a workstation and edit the config files from there, diff, revert etc - very convenient.

USB Issues

I ran into an issue where the SSD connected via USB got disconnected after 5-30 days. This was very annoying to debug because it would not finish writing the kernel log file since virtually everything was on that disconnected drive. After swapping out everything hardware wise, compiling custom kernels (on the raspberry pi btw :) ), even falling back to the raspberry kernel which uses different USB drivers (dwc_otg instead of dwc2) I could not get rid of the disconnects - only delay it. Newer kernels seemed to improve the situation but what finally solved the issue for me was an old USB 2.0 hub i put in front of the SSD-USB adapter. Problem solved.


Thats about it. In case you are wondering what i am using the raspi for: This weblog you are reading right now is running on it (beside other things like a local wiki), for over six month by now, while only using 2-5W, fairly stable :)