// CRaC - Coordinated Restore at Checkpoint

Last year I experimented a bit with the instant restoration of started and warmed-up Java programs from disk, besides a couple of other potential use cases for checkpoints. To achieve this, I accessed a rootless build of CRIU directly from Java via its C/RPC API (using Panama as the binding layer). Although it worked surprisingly well, it quickly became clear that a proper implementation would require help from the JVM at a lower level, and also an API to coordinate checkpoint/restore events between libraries.

I was pleased to see that there is a decent chance this might actually happen: a new project with the name CRaC is currently in the voting stage to be officially started as an OpenJDK sub-project. Let's take a look at the prototype.

update: CRaC has been approved (OpenJDK project, github).

With a little Help from the JVM

Why would checkpoint/restore benefit from JVM and OpenJDK support? Several reasons. CRIU does not like it when files change between C/R: a simple log file might spoil the fun if a JVM is restored, shut down and then restored again (which will fail). A JVM is also in an excellent position to run heap cleanup and compaction prior to calling CRIU to dump the process to disk. Checkpointing could also be done after driving the JVM into a safe point and making sure that everything has stopped.

The CRaC prototype covers all of that already and more:

  • CheckpointException is thrown if files or sockets are open at a checkpoint
  • a simple API allows coordination with C/R events
  • Heap is cleaned and compacted, and the checkpoint is made once the JVM has reached a safe point
  • CRaC handles some JVM-produced files automatically (no need to set -XX:-UsePerfData, for example)
  • The jcmd tool can be used to checkpoint a JVM from a shell
  • CRIU is bundled in the JDK as a bonus - no need to have it installed

Since CRaC could one day be part of OpenJDK, it could manage the files of JFR repositories automatically and help with other tasks like re-seeding SecureRandom instances or updating SSL certificates in the future - which would be difficult (or impossible) to achieve as a third-party library.

Coordinated Restore at Checkpoint

The API is very simple and somewhat similar to what I wrote for JCRIU; the main difference is that the current implementation does not allow the JVM to continue running after a checkpoint is created (but I don't see why this couldn't change in the future).


Core.checkpointRestore();

currently serves both as checkpoint and program exit; it is also the entry point for a restore.


Core.getGlobalContext().register(resource);

A global context is used to register resources which will be notified before a checkpoint is created and in reverse order after the process is restored.

Minimal Example

Let's say we have a class CRACTest which can write Strings to a file (like a logger). To coordinate with C/R, it would need to close the file before a checkpoint and reopen it after a restore.


public class CRACTest implements Resource, AutoCloseable {

    private OutputStreamWriter writer;

    public CRACTest() {
        writer = newWriter();
        Core.getGlobalContext().register(this); // register as resource
    }
...
...
    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        System.out.println("resource pre-checkpoint");
        writer.close();
        writer = null;
    }

    @Override 
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        System.out.println("resource post-restore");
        writer = newWriter();
    }
    
    public static void main(String[] args) throws IOException {
        System.out.println(Runtime.version());
        
        try (CRACTest writer = new CRACTest()) {
            writer.append("hello");
            try {
                System.out.println("pre-checkpoint PID: "+ProcessHandle.current().pid());
                Core.checkpointRestore();   // exit and restore point
                System.out.println("post-restore PID: "+ProcessHandle.current().pid());
            } catch (CheckpointException | RestoreException ex) {
                throw new RuntimeException("C/R failed", ex);
            }
            writer.append(" there!\n");
        }
    }
}
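
For completeness: the elided helper methods could look roughly like this. This is my sketch, not the original post's code - the path matches the output below, and the usual java.io plus java.nio.charset imports are assumed:

    private static OutputStreamWriter newWriter() {
        try {
            // append mode, so a restored run continues the same file
            return new OutputStreamWriter(
                    new FileOutputStream("/tmp/test/CRACTest/out.txt", true), StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public void append(String str) throws IOException {
        writer.append(str);
        writer.flush();
    }

    @Override
    public void close() throws IOException {
        if (writer != null) writer.close();
    }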

start + checkpoint + exit:


$CRaC/bin/java -XX:CRaCCheckpointTo=/tmp/cp -cp target/CRACTest-0.1-SNAPSHOT.jar dev.mbien.CRACTest
14-crac+0-adhoc..crac-jdk
pre-checkpoint PID: 12119
resource pre-checkpoint

restore at checkpoint:


$CRaC/bin/java -XX:CRaCRestoreFrom=/tmp/cp -cp target/CRACTest-0.1-SNAPSHOT.jar dev.mbien.CRACTest
resource post-restore
post-restore PID: 12119

Let's see what we wrote to the file:


cat /tmp/test/CRACTest/out.txt
hello there!

restore 3 more times as a test:


./restore.sh
resource post-restore
post-restore PID: 12119
./restore.sh
resource post-restore
post-restore PID: 12119
./restore.sh
resource post-restore
post-restore PID: 12119

cat /tmp/test/CRACTest/out.txt
hello there!
 there!
 there!
 there!

works as expected.

What happens when we leave an I/O stream open? Let's remove writer.close() from beforeCheckpoint() and attempt to run a fresh instance.


./run.sh
14-crac+0-adhoc..crac-jdk
pre-checkpoint PID: 12431
resource pre-checkpoint
resource post-restore
Exception in thread "main" java.lang.RuntimeException: C/R failed
	at dev.mbien.cractest.CRACTest.main(CRACTest.java:72)
Caused by: jdk.crac.CheckpointException
	at java.base/jdk.crac.Core.checkpointRestore1(Core.java:134)
	at java.base/jdk.crac.Core.checkpointRestore(Core.java:177)
	at dev.mbien.cractest.CRACTest.main(CRACTest.java:69)
	Suppressed: jdk.crac.impl.CheckpointOpenFileException: /tmp/test/CRACTest/out.txt
		at java.base/jdk.crac.Core.translateJVMExceptions(Core.java:76)
		at java.base/jdk.crac.Core.checkpointRestore1(Core.java:137)
		... 2 more

The JVM will detect which files are still open and tell us before a checkpoint is attempted. In this case no checkpoint is made and the JVM continues to run. By adding this restriction, CRaC avoids a long list of potential restore failures.

Tool Integration

Checkpoints can also be triggered externally using the jcmd tool.


jcmd 15119 JDK.checkpoint
15119:
Command executed successfully

Context and Resources

The Context itself implements Resource, which allows hierarchies of custom contexts to be registered with the global context. Since the context of a resource is passed to the beforeCheckpoint and afterRestore methods, it can be used to carry information that assists in the C/R of specific resources.
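
For illustration, a minimal custom context could look roughly like this - a sketch against the prototype API used above (assuming Context exposes an abstract register method, as in the prototype sources), mirroring the notification order of the global context; error handling is omitted:

import jdk.crac.Context;
import jdk.crac.Resource;

import java.util.ArrayList;
import java.util.List;

public class SimpleContext extends Context<Resource> {

    private final List<Resource> children = new ArrayList<>();

    @Override
    public synchronized void register(Resource resource) {
        children.add(resource);
    }

    @Override
    public synchronized void beforeCheckpoint(Context<? extends Resource> ctx) throws Exception {
        for (Resource r : children)                     // registration order
            r.beforeCheckpoint(this);
    }

    @Override
    public synchronized void afterRestore(Context<? extends Resource> ctx) throws Exception {
        for (int i = children.size() - 1; i >= 0; i--)  // reverse order
            children.get(i).afterRestore(this);
    }
}

Registered via Core.getGlobalContext().register(new SimpleContext()), it is notified like any other resource and can forward the events to its own children.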

Performance

As demonstrated with JCRIU, restoring initialized and warmed-up Java applications can be really fast - CRaC however can be even faster, since its process image is much more compact. The average time to restore the JVM running this blog from a checkpoint using JCRIU was ~200 ms, while CRaC can restore JVMs in ~50 ms - although this will depend on the size of the process image and I/O read speed.

Potential use-cases beside instant restore

CRaC seems to concentrate mainly on the use case of restoring a started and warmed-up JVM as fast as possible. This of course makes sense: why would someone start a JVM in a container, on demand, when it could have been started already when the container image was built? The purpose of the container is most likely to run business logic, not to start programs.

However, if CRaC allowed programs to continue running after a checkpoint, it would open up many other possibilities. For example:

  • time traveling debuggers, stepping backwards to past breakpoints (checkpoints)
  • snapshotting a production JVM to restore and test/inspect it locally, take heap dumps etc.
  • maybe some niche use-cases of periodic checkpoints and automatic restoration on failure (incremental dumps)
  • instantly starting IDEs (although this won't be a small task)

in any case... exciting times :)

Thanks to Anton Kozlov from Azul for immediately fixing a bug I encountered during testing.


- - - sidenotes - - -

jdk14-crac/lib/criu and jdk14-crac/lib/action-script might require cap_sys_ptrace to be set on some systems to not fail during restore.

The rootless mode for CRIU hasn't made it into the master branch yet, which means that the JVM or criu has to be run with root privileges for now.

C/R of UIs doesn't work at all, since disposing a window will still leave some cached resources behind (open sockets, file descriptors etc.) - but this is another aspect which could only be solved at the JDK level (although this won't be trivial).


// Defrosting Warmed-up Java [using Rootless CRIU and Project Panama]

I needed a toy project to experiment with JEP 389 of Project Panama (modern JNI), but I also wanted to take a better look at CRIU (Checkpoint/Restore In Userspace). So I thought: let's try to combine both - and created JCRIU. The immediate questions I had were: how fast can it defrost a warmed-up JVM, and can it make a program time travel?

Let's attempt to investigate the first question in this blog entry.

CRIU Crash Course

CRIU can dump process trees to disk (checkpoint) and restore them any time later, implemented entirely in user space - it's all in the name.

Let's run a minimal test first.


#!/bin/bash
echo my pid: $$
i=0
while true
do
    echo $i && ((i=i+1)) && sleep 1
done

The script above will print its PID initially and then continue to print and increment a number. It isn't important that this is a bash script, it could be any process.

shell 1:


$ sh test.sh 
my pid: 14255
0
1
...
9
Killed

shell 2:


$ criu dump -t 14255 --shell-job -v -D dump/
...
(00.021161) Dumping finished successfully

This command will let CRIU dump (checkpoint) the process with the specified PID and store its image in ./dump (overwriting any older image on the same path). The flag --shell-job tells CRIU that the process is attached to a console. Dumping a process will automatically kill it, like in this example, unless -R is specified.

shell 2:


$ criu restore --shell-job -D dump/
10
11
12
...

To restore, simply replace "dump" with "restore", without specifying the PID. As expected the program continues counting in shell 2, right where it was stopped in shell 1.

Rootless CRIU

As of now (Nov. 2020) the CRIU commands above still require root permissions, but this might change soon. Linux 5.9 received cap_checkpoint_restore (patch) and CRIU is also already being prepared. To test rootless CRIU, simply build the non-root branch and set cap_checkpoint_restore on the resulting binary (no need to install, you can use criu directly).


sudo setcap cap_checkpoint_restore=eip /path/to/criu/binary

Note: Depending on your Linux distribution you might have to set cap_sys_ptrace too. Some features might not work yet, for example restoring as --shell-job or using the CRIU API. Use a recent kernel (at least 5.9.8) before trying to restore a JVM.

CRIU + Java + Panama = JCRIU

JCRIU uses Panama's jextract tool at build time to generate a low-level (1:1) binding directly from the header of the CRIU API. The low-level binding isn't exposed through the public API however; it's just an implementation detail. Both jextract and the foreign function module are part of Project Panama; early access builds are available here. JEP 389: Foreign Linker API has been (today) accepted for inclusion as a JDK 16 incubator module - it might appear in mainline builds soon.

The main entry point is CRIUContext, which implements AutoCloseable to cleanly dispose resources after use. Potential errors are mapped to CRIUExceptions. Checkpointing should be fairly robust, since the communication with the actual CRIU process happens over RPC - crashing CRIU most likely won't take the JVM down with it.


    public static void main(String[] args) throws IOException, InterruptedException {
        
        // create empty dir for images
        Path image = Paths.get("checkpoint_test_image");

        if (!Files.exists(image))
            Files.createDirectory(image);
        
        // checkpoint the JVM every second
        try (CRIUContext criu = CRIUContext.create()
                .logLevel(WARNING).leaveRunning(true).shellJob(true)) {
            
            int n = 0;
            
            while(true) {
                Thread.sleep(1000);

                criu.checkpoint(image); // checkpoint and entry point for a restore

                long pid = ProcessHandle.current().pid();
                System.out.println("my PID: "+pid+" checkpont# "+n++);
            }
        }
    }

The above example is somewhat similar to the simple bash script. The main difference is that the Java program is checkpointing itself every second. This allows us to CTRL+C at any time - when restored, the program will keep counting and checkpointing where it left off.


[mbien@longbow JCRIUTest]$ sudo sh start-demo.sh 
WARNING: Using incubator modules: jdk.incubator.foreign
my PID: 16195 checkpont# 0
my PID: 16195 checkpont# 1
my PID: 16195 checkpont# 2
my PID: 16195 checkpont# 3
my PID: 16195 checkpont# 4
my PID: 16195 checkpont# 5
CTRL+C
[mbien@longbow JCRIUTest]$ sudo criu restore --shell-job -D checkpoint_test_image/
my PID: 16195 checkpont# 5
my PID: 16195 checkpont# 6
my PID: 16195 checkpont# 7
my PID: 16195 checkpont# 8
my PID: 16195 checkpont# 9
CTRL+C
[mbien@longbow JCRIUTest]$ sudo criu restore --shell-job -D checkpoint_test_image/
my PID: 16195 checkpont# 9
my PID: 16195 checkpont# 10
my PID: 16195 checkpont# 11
my PID: 16195 checkpont# 12
my PID: 16195 checkpont# 13
my PID: 16195 checkpont# 14
CTRL+C

Note: start-demo.sh just sets env variables pointing to an early access JDK 16 Panama build, enables jdk.incubator.foreign etc. The project README has the details.

Important Details and Considerations

  • CRIU restores images with the same PIDs the processes had during checkpoint. This won't cause much trouble in containers since the namespace should be quite empty, but might conflict from time to time on a workstation. If the same image should be restored multiple times concurrently, it will have to run in its own PID namespace. This can be achieved with sudo unshare -p -m -f [restore command]. See man unshare for details.
  • Opened files are not allowed to change (in size) between checkpoint and restore. If they do, the restore operation will fail. (watch out for log files, JFR repos, JVM perf data or temporary files)
  • If the application has established TCP connections you have to tell CRIU via the --tcp-established flag (or the similarly named method in CRIUContext). CRIU will try to restore all connections in their correct states. wiki link to more options
  • The first checkpoint or restore after system boot can take a few seconds because CRIU has to gather information about the system configuration first; this information is cached for subsequent uses
  • Some application-dependent post-restore tasks might be required, for example keystore/cert replacement or RNG re-initialization (see the sketch after this list)
  • CRIU can't checkpoint resources it can't reach. An X window or state stored on a GPU can't be dumped
  • Migration should probably only be attempted between (very) similar systems and hardware
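
To illustrate the point about RNG re-initialization: re-seeding could be as simple as mixing fresh entropy into an existing SecureRandom right after the restore point. This is a hypothetical sketch reusing the criu and image variables from the JCRIU example above - the re-seeding itself is my own idea, not a JCRIU feature:

        SecureRandom rng = new SecureRandom();

        criu.checkpoint(image); // execution resumes here after every restore

        // mix fresh entropy into the RNG so restored clones don't produce
        // the same random stream as the checkpointed original
        rng.setSeed(SecureRandom.getSeed(16));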

Instant Defrosting of Warmed-up JVMs

Let's take a look at what you can do with super-luminal, absolute-zero, instant-defrosting JCRIU (ok, I'll stop ;)) when applied to my favorite dusty Java web monolith: Apache Roller. I timed how long this blog takes to start on my workstation when loaded from an NVMe on JDK 16 + Jetty 9.4.34. (I consider it started when the website has loaded in the browser, not when the app server reports it started.)

classic start: ~6.5 s

(for comparison: it takes about a minute to start on a Raspberry Pi 3b+, which is serving this page you are reading right now)

Now let's try this again. But this time Roller will warm itself up, generate RSS feeds, populate the in-memory cache, give the JIT a chance to compile hot paths, compact the heap by calling System.gc() and finally shock-frost itself via criu.checkpoint(...).


        warmup();    // generates/caches landing page/RSS feeds and first 20 blog entries
        System.gc(); // give the GC a chance to clean up unused objects before checkpoint

        try (CRIUContext criu = CRIUContext.create()
                .logLevel(WARNING).leaveRunning(false).tcpEstablished(true)) {

            criu.checkpoint(imagePath);  // checkpoint + exit

        } catch (CRIUException ex) {
            jfrlog.warn("post warmup checkpoint failed", ex);
        }

(The uncompressed image size was between 500-600 MB during my tests; the heap was set to 1 GB with ParallelGC active.)

restore:


$ sudo time criu restore --shell-job --tcp-established -d -D blog_image/

real 0m0,204s
user 0m0,015s
sys  0m0,022s

instant defrosting: 204 ms

Note: -d detaches the shell after the restore operation has completed. An alternative way to measure defrosting time is to enable verbose logging with -v and compare the last timestamp; this is slightly slower (+20 ms) since CRIU tends to log a lot on lower log levels. Let me know if there is a better way of measuring this, but I double-checked everything, and the implied image loading speed would be well below the average read speed of my M.2 NVMe.

The blog is immediately reachable in the browser, served by a warmed-up JVM.

Conclusion && Discussion

CRIU is quite interesting for use cases where Java startup time matters. Quarkus, for example, moves slow framework initialization from startup to build time; native images with GraalVM further improve initialization by AOT-compiling the application into a single binary, but this also sacrifices a little bit of throughput. CRIU can be another tool in the toolbox to quickly map a running JVM with its application back into memory (no noteworthy code changes required).

The Foreign Linker API (JEP 389), a major part of Project Panama, is currently slated for inclusion in OpenJDK 16 as an incubator module. However, to use JCRIU on older JDKs, another implementation of CRIUContext would be needed. An implementation which communicates with CRIU via Google Protocol Buffers, for example, would avoid binding to the CRIU C-API entirely.

The JVM would be in an excellent position to aid CRIU in many ways. It already is an operating system for Java/bytecode-based programs (soon even with its own implementation of threads) and knows how to drive itself to safe points (checkpointing an application which is under load is probably a bad idea), how to compact or resize the heap, how to invalidate the code cache etc. - I see great potential there.

Let me know what you think.

Thanks a lot to Adrian Reber (@adrian__reber) who patiently answered all my questions about CRIU.


// Stopping Containers Correctly

Stopping a container with


$ podman stop container-name
or

$ docker stop container-name

will send SIGTERM to the first process (PID 1) and shut down the container when the process terminates. If this doesn't happen within a certain time frame (default is 10s), the runtime will send SIGKILL to the process and take the container down.

So far so good - things get interesting when your container process isn't PID 1.

This is already the case if the process is started via a shell script.


#!/bin/bash

... 

java $FLAGS $APP

Attempting to stop this container will terminate the script while the JVM keeps running. The container runtime is usually smart enough to notice that a process is still active after the script terminated and will wait out the grace period anyway before shutting down the container forcefully. The JVM, however, won't notice anything and won't have the opportunity to call shutdown hooks, write JFR dumps or finish transactions.
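
To see what is lost, consider this little demo (my own test program, not part of the setup discussed here): its shutdown hook runs when the JVM receives SIGTERM, but never on SIGKILL.

public class ShutdownHookDemo {
    public static void main(String[] args) throws InterruptedException {
        Runtime.getRuntime().addShutdownHook(new Thread(
                () -> System.out.println("shutdown hook: flushing buffers, closing files...")));
        System.out.println("PID: " + ProcessHandle.current().pid());
        Thread.currentThread().join(); // idle until a signal arrives
    }
}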

signal delegation

One way to solve this is by delegating the signal from the shell script to the main process:


... 
java $FLAGS $APP & # detach process from script
PID=$!             # remember process ID

trap 'kill -TERM $PID' INT TERM # delegate kill signal to JVM

wait $PID   # attach script to JVM again; note: TERM signal unblocks this wait
trap - TERM INT
wait $PID   # wait for JVM to exit after signal delegation
EXIT_STATUS=$?

The second wait prevents the script from exiting before the JVM has finished termination; it is required since the first wait is unblocked as soon as the script receives the signal.

it still didn't work

Interestingly, after implementing this (and trying out other variations of the same concept) it still didn't work for some reason - debugging showed the trap never fired.

It turned out that nothing was wrong with the signal delegation - the signals just never reached the script :). So I searched around a bit and found this article which basically describes the same async/wait/delegate method in greater detail (that's where I stole the EXIT_STATUS line from), so I knew it had to work. Another great article gave me the idea to check the Dockerfile again.


FROM ...
...
CMD ./start.sh

The sh shell interpreting the bash script was the first process!


$ podman ps
CONTAINER ID  IMAGE                     COMMAND             ...
de216106ff39  localhost/test:latest  /bin/sh -c ./start.sh  ...

htop (on the host) in tree view shows it more clearly:


$ htop
    1 root       ... /sbin/init
15643 podpilot   ...  - /usr/libexec/podman/conmon --api-version ...
15646 100996     ... |   - /bin/sh -c ./start.sh ...
15648 100996     ... |      - /bin/bash ./start.sh
15662 100996     ... |         - /home/jdk/bin/java -Xshare:on ...

To fix this a different CMD (or ENTRYPOINT) syntax is needed:


FROM ...
...
CMD [ "./start.sh" ]

Let's check again after rebuild + run:


$ podman ps
CONTAINER ID  IMAGE                     COMMAND         ...
72e3e60ed60b  localhost/test:latest  ./start.sh  ...

$ htop
    1 root       ... /sbin/init
15746 podpilot   ...  - /usr/libexec/podman/conmon --api-version ...
15749 100996     ... |   - /bin/bash ./start.sh ...
15771 100996     ... |      - /home/jdk/bin/java -Xshare:on ...

Much better. Since the script is now executed directly, it is able to receive and delegate the signals to the JVM. The Java Flight Recorder recordings also appeared in the volume, which means the JVM had enough time to convert the JFR repository into a single recording file. The podman stop command also returned within a fraction of a second.

Since the trap also listens to SIGINT, even the CTRL+C signal is properly propagated when the container is started in non-detached mode. A nice bonus for manual testing.

alternatives

Starting the JVM with


exec java $FLAGS $APP

will replace the shell process with the JVM process without changing the PID or process name. Disadvantages: no java command line in top, and the shell won't execute any lines after the exec line (because it basically doesn't exist anymore).

... and if you don't care too much about the container life cycle, you can always tell the JVM directly to shut down; this will close all parent shells bottom-up until PID 1 has terminated, which will finally stop the container.


podman exec -it container sh -c "kill \$(jps | grep -v Jps | cut -f1 -d' ')"

- - -

lessons learned:
sometimes two square brackets make the difference :)


// [Java in] Rootless Containers with Podman

I have always been a little surprised by how quickly it became acceptable to run applications wrapped in containers as root processes. Before docker became mainstream, nobody would have run a web server as root if there was some way to avoid it. But with docker it became OK to have the docker daemon and the container processes all running as root. The first item in most docker tutorials became how to elevate your user rights so that you don't have to type sudo before every docker command.

But this doesn't have to be the case, of course. One project I have had an eye on is podman, a container engine implementing the docker command-line interface with quite good support for rootless operation. With the release of Podman 2.0.x (and the fact that it is slowly making it into the debian repositories) I started to experiment with it a bit more. (For the experimental rootless mode of Docker check out this page.)

cgroups v2

Containers rely heavily on kernel namespaces and a feature called control groups. To properly run rootless containers, the kernel must support cgroups v2 and have them enabled. To check whether cgroups v2 are enabled, simply run:


ls /sys/fs/cgroup
cgroup.controllers  cgroup.max.depth  cgroup.max.descendants  cgroup.procs ...

If the files are prefixed with cgroup. you are running cgroups v2; if not, it's still v1.

Many distributions still run with cgroups v1 enabled by default for backwards compatibility, but you can enable cgroups v2 with the systemd kernel flag systemd.unified_cgroup_hierarchy=1. To do this with grub, for example:

  • edit /etc/default/grub and
  • add systemd.unified_cgroup_hierarchy=1 to the key GRUB_CMDLINE_LINUX_DEFAULT (space separated list)
  • then run sudo grub-mkconfig -o /boot/grub/grub.cfg and reboot.

... and make sure you are not running an ancient linux kernel.

crun

The underlying OCI implementation has to support cgroups v2 too. I tested mostly on crun, a super fast and lightweight alternative to runc. The runtime can be passed to podman via the --runtime flag


podman --runtime /usr/bin/crun <commands>

but it got picked up automatically in my case after I installed the package (manjaro linux, runc is still installed too).


podman info | grep -A5 ociRuntime
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 0.14.1

subordinate uid and gids

The last step required to set up rootless containers is configuring /etc/subuid and /etc/subgid. If the files don't exist yet, create them and add a mapping range from your user name to container users. For example, the line:

duke:100000:65536

gives duke the right to create 65536 users in container images, starting from UID 100000. Duke himself will be mapped to root (0) in the container by default. (The same must be done for groups in /etc/subgid.) The range should never overlap with UIDs on the host system. Details in man subuid. More on users later in the volumes section.

rootless containers

Some things to keep in mind:
  • rootless podman runs containers with fewer privileges than the user which started the container
    • some of these restrictions can be lifted (via --privileged, for example)
    • but rootless containers will never have more privileges than the user that launched them
    • root in the container is the user on the host
  • rootless containers have no IP or MAC address, because network device association requires root privileges
    • podman uses slirp4netns for user mode networking
    • pinging something from within a container won't work out of the box - but don't panic: it can be configured if desired

podman

Podman uses the same command-line interface as Docker and it also understands Dockerfiles. So if everything is configured correctly it should all look familiar:

$ podman version
Version:      2.0.2
API Version:  1
Go Version:   go1.14.4
Git Commit:   201c9505b88f451ca877d29a73ed0f1836bb96c7
Built:        Sun Jul 12 22:46:58 2020
OS/Arch:      linux/amd64

$ podman pull debian:stable-slim
...
$ podman images
REPOSITORY                       TAG          IMAGE ID      CREATED       SIZE
docker.io/library/debian         stable-slim  56fae066253c  4 days ago    72.5 MB
...
$ podman run --rm debian:stable-slim cat /etc/debian_version
10.4

Setting alias docker=podman allows existing scripts to be reused without modification - but I'll stick with podman here to avoid confusion.

container communication

Rootless containers don't have their own IP addresses, but you can bind them to ports (>1024). Host-to-container communication therefore works analogously to communicating with any service running on the host.


$ podman run --name wiki --rm -d -p 8443:8443 jspwiki-jetty
$ podman port -a
fd4c06b454ee	8443/tcp -> 0.0.0.0:8443
$ firefox https://localhost:8443/wiki

To set up quick and dirty container-to-container communication you can let containers talk over the host's IP address (or host name) and ports, if the firewall is OK with that. A more maintainable approach, however, are pods. Pods are groups of containers which belong together; a pod is basically an infrastructure container containing the actual containers. All containers in a pod share the same localhost and use it for pod-local communication, while the outside world is reached via ports opened on the pod.

Let's say we have a blog and a db. The blog requires the db, but all the host cares about is the https port of the blog container. So we can simply put the blog container and the db container into a blog pod and let both communicate via the pod-local localhost (podhost?). The https port is opened on the blog pod for the host, while the db isn't reachable from the outside.


$ podman pod create --name blogpod -p 8443:8443

# note: a pod starts out with one container already in it,
# it is the infrastructure container - basically the pod itself
$ podman pod list
POD ID        NAME     STATUS   CREATED        # OF CONTAINERS  INFRA ID
39ad88b8892f  blogpod  Created  7 seconds ago  1                af7baf0e7fde

$ podman run --pod blogpod --name blogdb --rm -d blog-db
$ podman run --pod blogpod --name apacheroller --rm -d roller-jetty

$ podman pod list
POD ID        NAME     STATUS   CREATED         # OF CONTAINERS  INFRA ID
39ad88b8892f  blogpod  Created  30 seconds ago  3                af7baf0e7fde

$ firefox https://localhost:8443/blog

Now we have two containers communicating with each other and a host able to communicate with a container in the pod - and no sudo in sight.

volumes and users

We already know that the user on the outside is root on the inside, but let's quickly check just to be sure:


$ whoami
duke
$ id -u
1000
$ mkdir /tmp/outside
$ podman run --rm -it -v /tmp/outside:/home/inside debian:stable-slim bash
root@2fbc9edaa0ee:/$ id -u
0
root@2fbc9edaa0ee:/$ touch /home/inside/hello_from_inside && exit
$ ls -l /tmp/outside
-rw-r--r-- 1 duke duke 0 31. Jul 06:55 hello_from_inside

Indeed, duke's UID of 1000 was mapped to 0 on the inside.

Since we are using rootless containers, and not half-rootless containers, we can let the blog and the db run under their own users inside their containers too. But what happens if they write logs to mounted volumes? That is when the subuid and subgid files we configured earlier come into play.

Let's say the blog-db container process should run in the namespace of the user dbduke. Since dbduke doesn't have root rights on the inside (as intended), dbduke also won't have the rights to write to the mounted volume, which is owned by duke. The easiest way to solve this problem is to simply chown the volume folder on the host to the mapped user of the container.


# scripts starts blog-db
# query the UID from the container and chown the volumes folder
UID_INSIDE=$(podman run --name UID_probe --rm blog-db /usr/bin/id -u)
podman unshare chown -R $UID_INSIDE /path/to/volume

podman run -v /path/to/volume:/path/inside ...  blog-db

Podman ships with a tool called unshare (the name is going to make less sense the longer you think about it) which lets you execute commands in the namespace of a different user. The command podman unshare allows us to use duke's rights to chown a folder to the internal UID of dbduke.

If we check the folder rights from both sides, we see that the UID was mapped from:


podman run --name UID_probe --rm blog-db /usr/bin/id -u
998
to

$ ls -l volume/
drwxr-xr-x 2 100997 100997 4096 31. Jul 07:54 logs
on the outside, which is within the range specified in the /etc/subuid file - so everything works as intended. This allows user-namespace isolation between containers (dbduke, wikiduke etc.) and also between containers and the user on the host who launched them (duke himself).
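
The arithmetic behind the two numbers: container UID 0 is mapped to duke himself, so the subuid range 100000-165535 covers the container UIDs 1-65536 - the container user 998 therefore lands at 100000 + 998 - 1 = 100997 on the host.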

And still no sudo in sight.

memory and cpu limits [and java]

Memory limits should work out of the box in rootless containers:

$ podman run -it --rm -m 256m blog-db java -Xlog:os+container -version
[0.003s][info][os,container] Memory Limit is: 268435456
...

This allows the JVM to make smarter choices without having to provide absolute -Xmx flags (but you still can).
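
A quick way to observe this from the Java side is a tiny probe (my own snippet, the class name is made up): run it with and without the -m limit and compare the numbers.

public class HeapProbe {
    public static void main(String[] args) {
        // without an explicit -Xmx, the JVM derives its default max heap
        // from the detected container memory limit instead of the host RAM
        System.out.println("max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
    }
}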

Setting CPU limits might not work out of the box without root (tested on Manjaro which is basically Arch), since the cgroups config might have user delegation turned off by default. But it is very easy to change:


# assuming your user id is 1000 like duke
$ sudo systemctl edit user@1000.service

# now modify the file so that it contains
[Service]
Delegate=yes

# and check if it worked
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.controllers
cpuset cpu io memory pids

You might have to reboot - it worked right away in my case.


# default CPU settings uses all cores
$ podman run -it --rm blog-db sh -c\
 "echo 'Runtime.getRuntime().availableProcessors();/exit' | jshell -q"
jshell> Runtime.getRuntime().availableProcessors()$1 ==> 4

# assign specific cores to container
$ podman run -it --rm --cpuset-cpus 1,2 blog-db sh -c\
 "echo 'Runtime.getRuntime().availableProcessors();/exit' | jshell -q"
jshell> Runtime.getRuntime().availableProcessors()$1 ==> 2

Container CPU core limits should become less relevant in the java world going forward, especially once projects like Loom [blog post] have been integrated. Since most things in java will run on virtual threads on top of a static carrier thread pool, it will be really easy to restrict the parallelism level of a JVM (basically -Djdk.defaultScheduler.parallelism=N and maybe another flag to limit max GC thread count).

But it works if you need it for rootless containers too.

class data sharing

Podman uses fuse-overlayfs for image management by default, which is overlayfs running in user mode.

$ podman info | grep -A5 overlay.mount_program
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: Unknown
      Version: |-
        fusermount3 version: 3.9.2
        fuse-overlayfs: version 1.1.0

This means that JVM class data sharing is also supported out of the box if the image containing the class data archive is shared in the image graph between multiple rootless containers.
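
(The class data archive itself can be baked into such a base image when it is built, for example with java -Xshare:dump, which generates the default archive next to the JDK's libraries.)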

The class data stored in debian-slim-jdk (a local image I created) will be mapped into memory only once and shared between all child containers - in the example below these are blog-db, roller-jetty and wiki-jetty.


$ podman image tree --whatrequires debian-slim-jdk
Image ID: 57c885825969
Tags:     [localhost/debian-slim-jdk:latest]
Size:     340.1MB
Image Layers
└── ID: ab5467b188d7 Size: 267.6MB Top Layer of: [localhost/debian-slim-jdk:latest]
 ├── ID: fa04d6485aa5 Size:  7.68kB
 │   └── ID: 649de1f63ecc Size: 11.53MB Top Layer of: [localhost/debian-slim-jdk-jetty:latest]
 │    ├── ID: 34ce3917399d Size: 8.192kB
 │    │   └── ID: d128b826459d Size:  56.7MB Top Layer of: [localhost/roller-jetty:latest]
 │    ├── ID: 9a9c51927e42 Size: 8.192kB
 │    └── ID: d5c7176c1073 Size: 27.56MB Top Layer of: [localhost/wiki-jetty:latest]
 ├── ID: 06acb45dd590 Size:  7.68kB
 └── ID: 529ff7e97882 Size: 1.789MB Top Layer of: [localhost/blog-db:latest]

stopping java containers cleanly

A little tip at the end: in my experience JVMs are quite often launched from scripts in containers. This automatically has the side effect that the JVM process won't be PID 1, since it isn't the entry point. Commands like podman stop <container> will post SIGTERM to the shell script (!), wait 10 s, then simply kill the container process without the JVM ever knowing what was going on. More on that in the stopping containers blog entry.

Shutdown actions like dumping the JFR file won't be executed and IO writes might not have completed. So unless you trap the signal in the script and forward it to the JVM somehow, use one of the more direct ways to stop a Java container:


# script to cleanly shut down a container
$ podman exec -it $(container) sh -c\
 "kill \$(/jdk/bin/jps | grep -v Jps | cut -f1 -d' ')"

#... or if killall is installed (it usually isn't)
$ podman exec -it $(container) killall java

Once the JVM has shut down cleanly, the launch script will finish, which the container will notice too - with PID 1 gone, it shuts down cleanly as well.

- - -

fly safe podman


// Running Debian Buster with 64bit mainline Kernel on a Raspberry Pi 3b+

The Raspberry Pi 3b+ runs on a Cortex-A53 quad-core processor which uses the Armv8 architecture. This allows it to run 64bit (AArch64) kernels while staying backwards compatible with Armv7 and 32bit (Armhf) kernels.

One of the most convenient ways to get started with a Raspberry Pi is to use the Debian based Raspbian distribution. Raspbian maintains its own customized linux kernel, drivers and firmware and provides 32bit OS images.

For my use case however, I required the following:

  • get a server friendly 64bit OS running on the Pi3b+ (for example Debian)
  • use a mainline kernel
  • avoid proprietary binary blobs if possible
  • boot from a SSD instead of a SD card or USB stick (connected via USB adapter)

It turned out most of it was fairly easy to achieve thanks to the RaspberryPi3 Debian port efforts. The raspi3-image-spec project on github contains instructions and the configuration for building a 64bit Debian Buster image which uses the mainline linux-image-arm64 kernels, ready to be booted from.

The image creation process is configured via raspi3.yaml, which I modified to leave out wifi packages (I don't need wifi) and to use netfilter tables (nft) as a replacement for the now deprecated iptables - but this could also have been changed after the initial boot if desired.

The image will contain two partitions: the first is formatted as vfat, labeled RASPIFIRM, and contains the kernel, firmware and raspi-specific boot configuration; the second, an ext4 partition labeled RASPIROOT, contains the OS. Since partition labels are used, the raspi should be able to find the partitions regardless of how they are connected or which drive type they are on (SSD, SD card reader, USB stick...).

Once the custom 64bit Debian image is built, simply write it to the SSD using dd, plug it in and power the raspi on. Once it has booted, don't forget to change the root password, create users, review ssh settings (disable root login), configure netfilter tables etc.

Some additional tweaks

I reduced the GPU memory partition to 32MB to have more RAM left while still being able to use a terminal via a 1080p HDMI screen: set cma=32M in /boot/firmware/cmdline.txt. I also added a swapfile to free up additional memory by writing inactive pages to the SSD; the default image does not have swap configured.

I added the

force_turbo=1

flag to /boot/firmware/config.txt. This locks the ARM cores to their max clock frequency (1.4GHz). Since the linux kernel doesn't ship with a Pi 3 compatible cpufreq implementation yet, the Pi3b+ would otherwise just boot with the idle clock (600MHz) and never change it based on load. This is just a workaround until the CPU frequency scaling drivers make it into the kernel (update: the commits made it into kernel 5.3). I didn't experience any CPU temperature issues so far, but I am also not running CPU intensive tasks.

These config files get overwritten on kernel updates, so I added most of the /etc folder to a local git repository, which makes diffs trivial and simplifies maintenance. If the raspi stops booting, you can simply plug the SSD USB adapter into a workstation and edit the config files from there - diff, revert etc. - very convenient.

USB Issues

I ran into an issue where the SSD connected via USB got disconnected after 5-30 days. This was very annoying to debug because the kernel log file would never finish writing - virtually everything was on that disconnected drive. After swapping out everything hardware-wise, compiling custom kernels (on the raspberry pi btw :) ), and even falling back to the raspberry kernel which uses different USB drivers (dwc_otg instead of dwc2), I could not get rid of the disconnects - only delay them. Newer kernels seemed to improve the situation, but what finally solved the issue for me was an old USB 2.0 hub I put in front of the SSD-USB adapter. Problem solved.


That's about it. In case you are wondering what I am using the raspi for: this weblog you are reading right now is running on it (besides other things like a local wiki), for over six months by now, while only using 2-5W - fairly stable :)


// Cleaning Bash History using a Java 11 Single-File Sourcecode Program

Java 11 adds, with JEP 330, the ability to launch a Java source file directly from the command line without having to compile it explicitly first. I can see this feature being convenient for simple scripting or quick tests. Although Java isn't the most concise language, it still has the benefit of being quite readable and having powerful utility APIs.

The following example reads the bash shell history from a file into a LinkedHashSet to retain ordering and keeps the most recently used line if duplicates exist, essentially cleaning up the history.


import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Stream;

public class HistoryCleaner {

    public static void main(String[] args) throws IOException {

        Path path = Path.of(System.getProperty("user.home")+"/.bash_history");

        Set<String> deduplicated = new LinkedHashSet<>(1024);
        try (Stream<String> lines = Files.lines(path)) {
            lines.forEach(line -> {
                deduplicated.remove(line); // drop any older occurrence
                deduplicated.add(line);    // re-insert so the line moves to the end (most recent wins)
            });
        }

        Files.write(path, deduplicated);
    }

}


duke@virtual1:~$ java HistoryCleaner.java

will compile and run the "script".


Shebangs are supported too


#!/usr/bin/java --source 11 -Xmx32m -XX:+UseSerialGC

import java.io....

However, since this is not part of the Java language specification, single-file Java programs (SFJP) with a #! (shebang) in the first line are technically not regular Java files and will not compile when fed to javac. This also means that the common file naming conventions are not required for pure SFJPs (i.e. the file name does not have to match the class name).

Using a different file extension is advised. We can't call it .js - so let's call it .sfjp :).


duke@virtual1:~$ ./HistoryCleaner.sfjp

This will compile and run the HistoryCleaner SFJP containing the shebang declaration in the first line. Keeping the .java file extension would result in a compiler error. As bonus we can also set JVM flags using the shebang mechanism. Keep in mind that -source [version] must be the first flag to tell the java launcher that it should expect a source file instead of bytecode.