// CRaC - Coordinated Restore at Checkpoint

Last year I experimented a little with instantly restoring started and warmed-up Java programs from disk, besides a couple of other potential use cases for checkpoints. To achieve this, I accessed a rootless build of CRIU directly from Java via its C/RPC API (using Panama as the binding layer). Although it worked surprisingly well, it quickly became clear that a proper implementation would require help from the JVM on a lower level, as well as an API to coordinate checkpoint/restore events between libraries.

I was pleased to see that there is a decent chance this might actually happen, since a new project named CRaC is currently in the voting stage to be officially started as an OpenJDK sub-project. Let's take a look at the prototype.

update: CRaC has been approved (OpenJDK project, github).

With a little Help from the JVM

Why would checkpoint/restore benefit from JVM and OpenJDK support? Several reasons. CRIU does not like it when files change between checkpoint and restore (C/R): a simple log file might spoil the fun if a JVM is restored, shut down and then restored again (the second restore will fail). A JVM is also in an excellent position to run heap cleanup and compaction prior to calling CRIU to dump the process to disk. Checkpointing could also be done after driving the JVM to a safepoint and making sure that everything has stopped.

The CRaC prototype covers all of that already and more:

  • CheckpointException is thrown if files or sockets are open at a checkpoint
  • a simple API allows coordination with C/R events
  • the heap is cleaned and compacted, and the checkpoint is made once the JVM has reached a safepoint
  • CRaC handles some JVM produced files automatically (no need to set -XX:-UsePerfData for example)
  • The jcmd tool can be used to checkpoint a JVM from a shell
  • CRIU is bundled in the JDK as a bonus - no need to have it installed

Since CRaC could potentially become part of OpenJDK one day, it could manage the files of JFR repositories automatically and help with other tasks, like re-seeding SecureRandom instances or updating SSL certificates in the future, which would be difficult (or impossible) to achieve as a third-party library.

Coordinated Restore at Checkpoint

The API is very simple and somewhat similar to what I wrote for JCRIU. The main difference is that the current implementation does not allow the JVM to continue running after a checkpoint is created (but I don't see why this couldn't change in the future).


Core.checkpointRestore();

currently serves both as checkpoint and program exit. At the same time, it is also the entry point for a restore.


Core.getGlobalContext().register(resource);

A global context is used to register resources which will be notified before a checkpoint is created and in reverse order after the process is restored.
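To make the notification order a bit more tangible, here is a minimal simulation. It does not use the CRaC prototype at all: SimpleResource and SimpleContext are hypothetical stand-ins, and the choice to notify the most recently registered resource first before a checkpoint is an assumption. The only thing taken from the description above is that the restore notifications run in the opposite order of the checkpoint notifications.

```java
import java.util.ArrayList;
import java.util.List;

class NotificationOrderDemo {

    // Hypothetical stand-ins for illustration only; these are not the jdk.crac types.
    interface SimpleResource {
        void beforeCheckpoint();
        void afterRestore();
    }

    static class SimpleContext {
        private final List<SimpleResource> resources = new ArrayList<>();

        void register(SimpleResource resource) {
            resources.add(resource);
        }

        // Assumption: the most recently registered resource is notified first.
        void notifyBeforeCheckpoint() {
            for (int i = resources.size() - 1; i >= 0; i--) {
                resources.get(i).beforeCheckpoint();
            }
        }

        // Restore walks the list in the opposite direction of the checkpoint.
        void notifyAfterRestore() {
            for (SimpleResource resource : resources) {
                resource.afterRestore();
            }
        }
    }

    // Registers two resources, simulates one checkpoint/restore cycle and
    // returns the recorded notifications in the order they happened.
    static List<String> run() {
        List<String> events = new ArrayList<>();
        SimpleContext context = new SimpleContext();
        for (String name : new String[] {"database", "cache"}) {
            context.register(new SimpleResource() {
                @Override public void beforeCheckpoint() { events.add("close " + name); }
                @Override public void afterRestore() { events.add("open " + name); }
            });
        }
        context.notifyBeforeCheckpoint();
        context.notifyAfterRestore();
        return events;
    }

    public static void main(String[] args) {
        // prints [close cache, close database, open database, open cache]
        System.out.println(run());
    }
}
```

The mirrored order means a resource which depends on another one (the cache on the database in this sketch) is shut down first and brought back last.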

Minimal Example

Let's say we have a class CRACTest which can write Strings to a file (like a logger). To coordinate with C/R events, it needs to close the file before a checkpoint and reopen it after a restore.


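import java.io.IOException;
import java.io.OutputStreamWriter;

// C/R API of the CRaC prototype JDK
import jdk.crac.CheckpointException;
import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;
import jdk.crac.RestoreException;

public class CRACTest implements Resource, AutoCloseable {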

    private OutputStreamWriter writer;

    public CRACTest() {
        writer = newWriter();
        Core.getGlobalContext().register(this); // register as resource
    }
...
...
    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        System.out.println("resource pre-checkpoint");
        writer.close();
        writer = null;
    }

    @Override 
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        System.out.println("resource post-restore");
        writer = newWriter();
    }
    
    public static void main(String[] args) throws IOException {
        System.out.println(Runtime.version());
        
        try (CRACTest writer = new CRACTest()) {
            writer.append("hello");
            try {
                System.out.println("pre-checkpoint PID: "+ProcessHandle.current().pid());
                Core.checkpointRestore();   // exit and restore point
                System.out.println("post-restore PID: "+ProcessHandle.current().pid());
            } catch (CheckpointException | RestoreException ex) {
                throw new RuntimeException("C/R failed", ex);
            }
            writer.append(" there!\n");
        }
    }
}

start + checkpoint + exit:


$CRaC/bin/java -XX:CRaCCheckpointTo=/tmp/cp -cp target/CRACTest-0.1-SNAPSHOT.jar dev.mbien.CRACTest
14-crac+0-adhoc..crac-jdk
pre-checkpoint PID: 12119
resource pre-checkpoint

restore at checkpoint:


$CRaC/bin/java -XX:CRaCRestoreFrom=/tmp/cp -cp target/CRACTest-0.1-SNAPSHOT.jar dev.mbien.CRACTest
resource post-restore
post-restore PID: 12119

let's see what we wrote to the file:


cat /tmp/test/CRACTest/out.txt
hello there!

restore 3 more times as a test:


./restore.sh
resource post-restore
post-restore PID: 12119
./restore.sh
resource post-restore
post-restore PID: 12119
./restore.sh
resource post-restore
post-restore PID: 12119

cat /tmp/test/CRACTest/out.txt
hello there!
 there!
 there!
 there!

works as expected.
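Why does every additional restore add another " there!" line? Each restore resumes from the same frozen state right after the checkpoint, and the file is reopened without discarding what is already in it. The following sketch simulates that without CRaC. ReopeningLog is a hypothetical class (the helper methods of CRACTest are not shown above), and opening the file in append mode is an assumption which happens to match the output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class AppendOnRestoreDemo {

    // Hypothetical logger, not the CRACTest class from the post.
    static class ReopeningLog {
        private final Path file;
        private Writer writer;

        ReopeningLog(Path file) {
            this.file = file;
            this.writer = newWriter();
        }

        private Writer newWriter() {
            try {
                // APPEND keeps whatever is already in the file when it is reopened
                return Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void append(String text) {
            try {
                writer.write(text);
                writer.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void close() {
            try {
                writer.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void beforeCheckpoint() { close(); }
        void afterRestore() { writer = newWriter(); }
    }

    // Simulates one checkpoint followed by several restores of that same frozen state.
    static String simulate(int restores) {
        try {
            Path file = Files.createTempFile("crac-demo", ".txt");
            ReopeningLog log = new ReopeningLog(file);
            log.append("hello");
            log.beforeCheckpoint();        // the image would be taken here
            for (int i = 0; i < restores; i++) {
                log.afterRestore();        // every restore resumes right after the checkpoint
                log.append(" there!\n");
                log.close();               // the program ends and closes the writer
            }
            String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(simulate(4));     // the first restore plus three more
    }
}
```

With four simulated restores the file content has the same shape as the listing above: one "hello there!" line followed by three " there!" lines.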

What happens when we leave an I/O stream open? Let's remove writer.close() from beforeCheckpoint() and attempt to run a fresh instance.


./run.sh
14-crac+0-adhoc..crac-jdk
pre-checkpoint PID: 12431
resource pre-checkpoint
resource post-restore
Exception in thread "main" java.lang.RuntimeException: C/R failed
	at dev.mbien.cractest.CRACTest.main(CRACTest.java:72)
Caused by: jdk.crac.CheckpointException
	at java.base/jdk.crac.Core.checkpointRestore1(Core.java:134)
	at java.base/jdk.crac.Core.checkpointRestore(Core.java:177)
	at dev.mbien.cractest.CRACTest.main(CRACTest.java:69)
	Suppressed: jdk.crac.impl.CheckpointOpenFileException: /tmp/test/CRACTest/out.txt
		at java.base/jdk.crac.Core.translateJVMExceptions(Core.java:76)
		at java.base/jdk.crac.Core.checkpointRestore1(Core.java:137)
		... 2 more

The JVM will detect and tell us which files are still open before a checkpoint is attempted. In this case no checkpoint is made and the JVM continues. By adding this restriction, CRaC avoids a big list of potential restore failures.
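CRaC performs this check inside the JVM. To get a feeling for what such a check looks at, the sketch below lists the regular files the current process holds open by reading /proc/self/fd. This is a Linux-only approximation written for illustration, not how the prototype implements it, and OpenFileCheck is a made-up name. Expect the list to also contain files the JDK itself keeps open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OpenFileCheck {

    // Returns the regular files this process currently has open.
    // Linux only: the list is empty on systems without /proc/self/fd.
    static List<Path> openRegularFiles() {
        Path fdDir = Paths.get("/proc/self/fd");
        List<Path> open = new ArrayList<>();
        if (!Files.isDirectory(fdDir)) {
            return open;
        }
        List<Path> descriptors;
        try (Stream<Path> entries = Files.list(fdDir)) {
            descriptors = entries.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (Path fd : descriptors) {
            try {
                // every entry is a symlink to the opened file, socket, pipe, ...
                Path target = Files.readSymbolicLink(fd);
                if (target.isAbsolute() && Files.isRegularFile(target)) {
                    open.add(target);
                }
            } catch (IOException e) {
                // descriptor was closed in the meantime (e.g. the one used for listing)
            }
        }
        return open;
    }

    // Opens a temp file, checks that it is reported while open and gone after close.
    static boolean selfCheck() {
        try {
            Path tmp = Files.createTempFile("open-file-check", ".txt").toRealPath();
            boolean procAvailable = Files.isDirectory(Paths.get("/proc/self/fd"));
            boolean seenWhileOpen;
            try (Writer w = Files.newBufferedWriter(tmp)) {
                seenWhileOpen = openRegularFiles().contains(tmp);
            }
            boolean seenAfterClose = openRegularFiles().contains(tmp);
            Files.delete(tmp);
            return !procAvailable || (seenWhileOpen && !seenAfterClose);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        openRegularFiles().forEach(System.out::println);
        System.out.println("self check passed: " + selfCheck());
    }
}
```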

Tool Integration

Checkpoints can also be triggered externally by using the jcmd tool.


jcmd 15119 JDK.checkpoint
15119:
Command executed successfully

Context and Resources

The Context itself implements Resource. This allows hierarchies of custom contexts to be registered to the global context. Since the context of a resource is passed to the beforeCheckpoint and afterRestore methods, it can be used to carry information to assist in C/R of specific resources.
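A small simulation of that idea, again without the real API: Res and Ctx are hypothetical miniature types, and the notification order (newest first on checkpoint, reversed on restore) is an assumption. The point of the sketch is only that a context can be registered like any other resource and that each resource gets to see the context which notified it.

```java
import java.util.ArrayList;
import java.util.List;

class ContextHierarchyDemo {

    // Hypothetical miniature types; the real Resource and Context live in jdk.crac.
    interface Res {
        void beforeCheckpoint(Ctx context);
        void afterRestore(Ctx context);
    }

    // A context is itself a resource, so it can be registered with a parent context.
    static class Ctx implements Res {
        final String name;
        private final List<Res> children = new ArrayList<>();

        Ctx(String name) {
            this.name = name;
        }

        void register(Res child) {
            children.add(child);
        }

        @Override
        public void beforeCheckpoint(Ctx parent) {
            // forward to the children and pass this context along (assumed order: newest first)
            for (int i = children.size() - 1; i >= 0; i--) {
                children.get(i).beforeCheckpoint(this);
            }
        }

        @Override
        public void afterRestore(Ctx parent) {
            for (Res child : children) {
                child.afterRestore(this);
            }
        }
    }

    // A leaf resource which records the name of the context that notified it.
    static Res leaf(String name, List<String> events) {
        return new Res() {
            @Override public void beforeCheckpoint(Ctx context) {
                events.add(context.name + ": close " + name);
            }
            @Override public void afterRestore(Ctx context) {
                events.add(context.name + ": open " + name);
            }
        };
    }

    static List<String> run() {
        List<String> events = new ArrayList<>();
        Ctx global = new Ctx("global");
        Ctx pool = new Ctx("pool");
        global.register(leaf("log", events));
        global.register(pool);             // a whole sub-tree registered as one resource
        pool.register(leaf("conn-1", events));
        pool.register(leaf("conn-2", events));

        global.beforeCheckpoint(null);     // the root has no parent in this sketch
        global.afterRestore(null);
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the pool context could carry pool-specific state (for example the address to reconnect to) which its connections read during afterRestore.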

Performance

As demonstrated with JCRIU, restoring initialized and warmed-up Java applications can be really fast. CRaC, however, can be even faster because the process image is much more compact. The average time to restore the JVM running this blog from a checkpoint using JCRIU was ~200 ms, while CRaC can restore JVMs in ~50 ms, although this will depend on the size of the process image and I/O read speed.

Potential use-cases besides instant restore

CRaC seems to be concentrating mainly on the use-case of restoring a started and warmed-up JVM as fast as possible. This of course makes sense: why would someone start a JVM in a container, on demand, when it could already have been started when the container image was built? The purpose of the container is most likely to run business logic, not to start programs.

However, if CRaC allowed programs to continue running after a checkpoint, it would open up many other possibilities. For example:

  • time traveling debuggers, stepping backwards to past breakpoints (checkpoints)
  • snapshotting of a production JVM to restore and test/inspect it locally, do heap dumps etc
  • maybe some niche use-cases of periodic checkpoints and automatic restoration on failure (incremental dumps)
  • instantly starting IDEs (although this won't be a small task)

in any case... exciting times :)

Thanks to Anton Kozlov from Azul for immediately fixing a bug I encountered during testing.


- - - sidenotes - - -

jdk14-crac/lib/criu and jdk14-crac/lib/action-script might require cap_sys_ptrace to be set on some systems to not fail during restore.

The rootless mode for CRIU hasn't made it into the master branch yet, which means that the JVM or criu has to be run with root privileges for now.

C/R of UI doesn't work at all, since disposing a window will still leave some cached resources behind (opened sockets, file descriptors etc) - but this is another aspect which could only be solved on the JDK level (although this won't be trivial).


// Defrosting Warmed-up Java [using Rootless CRIU and Project Panama]

I needed a toy project to experiment with JEP 389 of Project Panama (modern JNI) but wanted to take a better look at CRIU (Checkpoint/Restore In Userspace) too. So I thought, let's try to combine both, and created JCRIU. The immediate questions I had were: how fast can it defrost a warmed-up JVM, and can it make a program time travel?

Let's attempt to investigate the first question in this blog entry.

CRIU Crash Course

CRIU can dump process trees to disk (checkpoint) and restore them any time later (implemented in user space) - it's all in the name.

Let's run a minimal test first.


#!/bin/bash
echo my pid: $$
i=0
while true
do
    echo $i && ((i=i+1)) && sleep 1
done

The script above will print its PID initially and then continue to print and increment a number. It isn't important that this is a bash script, it could be any process.

shell 1:


$ sh test.sh 
my pid: 14255
0
1
...
9
Killed

shell 2:


$ criu dump -t 14255 --shell-job -v -D dump/
...
(00.021161) Dumping finished successfully

This command will let CRIU dump (checkpoint) the process with the specified PID and store its image in ./dump (overwriting any older image on the same path). The flag --shell-job tells CRIU that the process is attached to a console. Dumping a process will automatically kill it, like in this example, unless -R is specified.

shell 2:


$ criu restore --shell-job -D dump/
10
11
12
...

To restore, simply replace "dump" with "restore", without specifying the PID. As expected the program continues counting in shell 2, right where it was stopped in shell 1.
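If another process should be checkpointed from a Java program without binding to the CRIU API, the two commands above can simply be assembled and launched with ProcessBuilder. The sketch below only uses the flags shown in this section (-t, --shell-job, -D and -R). CriuCommands is a hypothetical helper, not part of JCRIU or CRIU, and it only prints the commands since actually running them needs an installed criu binary and sufficient privileges.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CriuCommands {

    // Builds: criu dump -t <pid> --shell-job -D <imageDir> [-R]
    static List<String> dump(long pid, Path imageDir, boolean leaveRunning) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "criu", "dump", "-t", Long.toString(pid),
                "--shell-job", "-D", imageDir.toString()));
        if (leaveRunning) {
            cmd.add("-R");   // keep the process alive after the dump
        }
        return cmd;
    }

    // Builds: criu restore --shell-job -D <imageDir>
    static List<String> restore(Path imageDir) {
        return new ArrayList<>(Arrays.asList(
                "criu", "restore", "--shell-job", "-D", imageDir.toString()));
    }

    // Runs the command with the console attached and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        Path images = Paths.get("dump");
        // only print the commands; pass them to run(...) to actually execute them
        System.out.println(String.join(" ", dump(14255, images, false)));
        System.out.println(String.join(" ", restore(images)));
    }
}
```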

Rootless CRIU

As of now (Nov. 2020) the CRIU commands above still require root permissions. But this might change soon. Linux 5.9 received cap_checkpoint_restore (patch) and CRIU is also already being prepared. To test rootless CRIU, simply build the non-root branch and set cap_checkpoint_restore on the resulting binary (no need to install, you can use criu directly).


sudo setcap cap_checkpoint_restore=eip /path/to/criu/binary

Note: Depending on your Linux distribution, you might have to set cap_sys_ptrace too. Some features might not work yet, for example restoring as --shell-job or using the CRIU API. Use a recent kernel (at least 5.9.8) before trying to restore a JVM.

CRIU + Java + Panama = JCRIU

JCRIU uses Panama's jextract tool at build time to generate a low-level (1:1) binding directly from the header of the CRIU API. The low-level binding isn't exposed through the public API however, it's just an implementation detail. Both jextract and the foreign function module are part of Project Panama, early access builds are available here. JEP 389: Foreign Linker API has been (today) accepted for inclusion as a JDK 16 incubator module - it might appear in mainline builds soon.

The main entry point is CRIUContext which implements AutoCloseable to cleanly dispose resources after use. Potential errors are mapped to CRIUExceptions. Checkpointing should be fairly robust since the communication is done over RPC with the actual CRIU process. Crashing CRIU most likely won't take the JVM down too.


    public static void main(String[] args) throws IOException, InterruptedException {
        
        // create empty dir for images
        Path image = Paths.get("checkpoint_test_image");

        if (!Files.exists(image))
            Files.createDirectory(image);
        
        // checkpoint the JVM every second
        try (CRIUContext criu = CRIUContext.create()
                .logLevel(WARNING).leaveRunning(true).shellJob(true)) {
            
            int n = 0;
            
            while(true) {
                Thread.sleep(1000);

                criu.checkpoint(image); // checkpoint and entry point for a restore

                long pid = ProcessHandle.current().pid();
                System.out.println("my PID: "+pid+" checkpont# "+n++);
            }
        }
    }

The above example is somewhat similar to the simple bash script. The main difference is that the Java program is checkpointing itself every second. This allows us to CTRL+C any time - if restored, the program will keep counting and checkpointing where it left off.


[mbien@longbow JCRIUTest]$ sudo sh start-demo.sh 
WARNING: Using incubator modules: jdk.incubator.foreign
my PID: 16195 checkpont# 0
my PID: 16195 checkpont# 1
my PID: 16195 checkpont# 2
my PID: 16195 checkpont# 3
my PID: 16195 checkpont# 4
my PID: 16195 checkpont# 5
CTRL+C
[mbien@longbow JCRIUTest]$ sudo criu restore --shell-job -D checkpoint_test_image/
my PID: 16195 checkpont# 5
my PID: 16195 checkpont# 6
my PID: 16195 checkpont# 7
my PID: 16195 checkpont# 8
my PID: 16195 checkpont# 9
CTRL+C
[mbien@longbow JCRIUTest]$ sudo criu restore --shell-job -D checkpoint_test_image/
my PID: 16195 checkpont# 9
my PID: 16195 checkpont# 10
my PID: 16195 checkpont# 11
my PID: 16195 checkpont# 12
my PID: 16195 checkpont# 13
my PID: 16195 checkpont# 14
CTRL+C

Note: start-demo.sh is just setting env variables to an early access JDK 16 panama build, enables jdk.incubator.foreign etc. The project README has the details.

Important Details and Considerations

  • CRIU restores images with the same PIDs the processes had during checkpoint. This won't cause much trouble in containers since the namespace should be quite empty, but might conflict from time to time on a workstation. If the same image should be restored multiple times concurrently, it will have to run in its own PID namespace. This can be achieved with sudo unshare -p -m -f [restore command]. See man unshare for details.
  • Opened files are not allowed to change (in size) between checkpoint and restore. If they do, the restore operation will fail. (watch out for log files, JFR repos, JVM perf data or temporary files)
  • If the application established TCP connections, you have to tell CRIU via the --tcp-established flag (or the similarly named method in CRIUContext). CRIU will try to restore all connections in their correct states. The CRIU wiki lists more options.
  • The first checkpoint or restore after system boot can take a few seconds because CRIU has to gather information about the system configuration first; this information is cached for subsequent uses
  • Some application dependent post-restore tasks might be required, for example keystore/cert replacement or RNG re-initialization (...)
  • CRIU can't checkpoint resources it can't reach. An X window or state stored on a GPU can't be dumped
  • Migration should probably only be attempted between (very) similar systems and hardware
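The RNG item in the list above is easy to underestimate: every process restored from the same image starts with exactly the same generator state. The sketch below models two restored copies as two generators created from the same seed and shows a post-restore hook which throws that state away. TokenService and its afterRestore() method are hypothetical names for illustration and are not part of JCRIU.

```java
import java.security.SecureRandom;
import java.util.Random;

class ReseedAfterRestoreDemo {

    // Hypothetical service handing out tokens from a generator created before the checkpoint.
    static class TokenService {
        private Random rng;

        TokenService(long seedCapturedInImage) {
            // stands in for generator state that was frozen into the image
            rng = new Random(seedCapturedInImage);
        }

        long nextToken() {
            return rng.nextLong();
        }

        // Post-restore task: replace the frozen state with a freshly seeded generator.
        void afterRestore() {
            rng = new SecureRandom();
        }
    }

    // Models two processes restored from the same image and compares their first token.
    static boolean sameFirstToken(boolean reseedAfterRestore) {
        TokenService first = new TokenService(42L);
        TokenService second = new TokenService(42L);
        if (reseedAfterRestore) {
            first.afterRestore();
            second.afterRestore();
        }
        return first.nextToken() == second.nextToken();
    }

    public static void main(String[] args) {
        System.out.println("identical without re-seeding: " + sameFirstToken(false));
        System.out.println("identical after re-seeding:   " + sameFirstToken(true));
    }
}
```

Without the hook both copies hand out the same token. With it, a collision is as unlikely as for any two independent SecureRandom instances.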

Instant Defrosting of Warmed-up JVMs

Let's take a look at what you can do with superluminal, absolute-zero, instant-defrosting JCRIU (ok, I'll stop ;)) when applied to my favorite dusty Java web monolith: Apache Roller. I timed how long this blog takes to start on my workstation when loaded from an NVMe drive on JDK 16 + Jetty 9.4.34. (I consider it started when the website has loaded in the browser, not when the app server reports it started.)

classic start: ~6.5 s

(for comparison: it takes about a minute to start on a Raspberry Pi 3b+, which is serving this page you are reading right now)

Now let's try this again. But this time Roller will warm itself up, generate RSS feeds, populate the in-memory cache, give the JIT a chance to compile hot paths, compact the heap by calling System.gc() and finally shock frost itself via criu.checkpoint(...).


        warmup();    // generates/caches landing page/RSS feeds and first 20 blog entries
        System.gc(); // give the GC a chance to clean up unused objects before checkpoint

        try (CRIUContext criu = CRIUContext.create()
                .logLevel(WARNING).leaveRunning(false).tcpEstablished(true)) {

            criu.checkpoint(imagePath);  // checkpoint + exit

        } catch (CRIUException ex) {
            jfrlog.warn("post warmup checkpoint failed", ex);
        }

(The uncompressed image size was between 500-600 MB during my tests, heap was set to 1 GB with ParallelGC active)

restore:


$ sudo time criu restore --shell-job --tcp-established -d -D blog_image/

real 0m0,204s
user 0m0,015s
sys  0m0,022s

instant defrosting: 204 ms

Note: -d detaches the shell after the restore operation has completed. An alternative way to measure defrosting time is to enable verbose logging with -v and compare the last timestamp; this is slightly slower (+20 ms) since CRIU tends to log a lot on lower log levels. Let me know if there is a better way of measuring this, but I double-checked everything, and the image loading speed would be well below the average read speed of my M.2 NVMe.
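For a rough plausibility check of such a measurement, the image size can be divided by the restore time and compared with what the drive can deliver. The sketch below does this for the numbers from this section (500-600 MB image, 204 ms). It assumes the whole uncompressed image has to be read from disk during the restore, which may overestimate the real I/O if parts of it are already in the page cache.

```java
import java.util.Locale;

class RestoreThroughput {

    // Read rate implied by loading imageMb megabytes within restoreMillis milliseconds.
    static double mbPerSecond(double imageMb, double restoreMillis) {
        return imageMb / (restoreMillis / 1000.0);
    }

    public static void main(String[] args) {
        double restoreMillis = 204;                 // measured restore time
        for (double imageMb : new double[] {500, 600}) {
            System.out.println(String.format(Locale.ROOT, "%.0f MB in %.0f ms -> %.0f MB/s",
                    imageMb, restoreMillis, mbPerSecond(imageMb, restoreMillis)));
        }
    }
}
```

Under that assumption the implied rate is somewhere between roughly 2.5 and 2.9 GB/s.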

The blog is immediately reachable in the browser, served by a warmed-up JVM.

Conclusion && Discussion

CRIU is quite interesting for use cases where Java startup time matters. Quarkus, for example, moves slow framework initialization from startup to build time. Native images with GraalVM further improve initialization by AOT-compiling the application into a single binary, but this also sacrifices a little bit of throughput. CRIU can be another tool in the toolbox to quickly map a running JVM with its application into memory (no noteworthy code changes required).

The Foreign Linker API (JEP 389), a major part of Project Panama, is currently proposed as an incubator module for OpenJDK 16. However, to use JCRIU on older JDKs, another implementation of CRIUContext would be needed. For example, an implementation which communicates with CRIU via Google protocol buffers would completely avoid binding to the CRIU C API.

The JVM would be in an excellent position to aid CRIU in many ways. It already is an operating system for Java/Bytecode based programs (soon even with its own implementation for threads) and knows how to drive itself to safe points (checkpointing an application which is under load is probably a bad idea), how to compact or resize the heap, invalidate code cache etc - I see great potential there.

Let me know what you think.

Thanks a lot to Adrian Reber (@adrian__reber) who patiently answered all my questions about CRIU.