// Defrosting Warmed-up Java [using Rootless CRIU and Project Panama]
I needed a toy project to experiment with JEP 389 of Project Panama (modern JNI) but also wanted to take a closer look at CRIU (Checkpoint/Restore In Userspace). So I thought, let's try to combine both, and created JCRIU. The immediate questions I had were: how fast can it defrost a warmed-up JVM, and can it make a program time travel?
Let's investigate the first question in this blog entry.
CRIU Crash Course
CRIU can dump process trees to disk (checkpoint) and restore them at any later time, all implemented in user space - it's all in the name.
Let's run a minimal test first.
#!/bin/bash
echo my pid: $$
i=0
while true
do
    echo $i && i=$((i+1)) && sleep 1
done
The script above prints its PID once and then keeps printing and incrementing a number. It isn't important that this is a bash script; it could be any process.
shell 1:
$ sh test.sh
my pid: 14255
0
1
...
9
Killed
shell 2:
$ criu dump -t 14255 --shell-job -v -D dump/
...
(00.021161) Dumping finished successfully
This command will let CRIU dump (checkpoint) the process with the specified PID and store its image in ./dump (overwriting any older image on the same path). The flag --shell-job tells CRIU that the process is attached to a console. Dumping a process will automatically kill it, like in this example, unless -R is specified.
shell 2:
$ criu restore --shell-job -D dump/
10
11
12
...
To restore, simply replace "dump" with "restore", without specifying the PID. As expected the program continues counting in shell 2, right where it was stopped in shell 1.
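The same two commands can also be assembled and launched from Java. The following is only a minimal sketch: CriuCli and its methods are hypothetical helper names (they are not part of CRIU or JCRIU), and it assumes a criu binary on the PATH and sufficient privileges. It only uses the flags shown above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that shells out to the criu command line tool. */
final class CriuCli {

    /** Builds: criu dump -t <pid> --shell-job [-R] -D <imageDir> */
    static List<String> dumpCommand(long pid, Path imageDir, boolean leaveRunning) {
        List<String> cmd = new ArrayList<>(
                List.of("criu", "dump", "-t", Long.toString(pid), "--shell-job"));
        if (leaveRunning) {
            cmd.add("-R"); // keep the process alive after the dump
        }
        cmd.add("-D");
        cmd.add(imageDir.toString());
        return cmd;
    }

    /** Builds: criu restore --shell-job -D <imageDir> (no PID needed) */
    static List<String> restoreCommand(Path imageDir) {
        return List.of("criu", "restore", "--shell-job", "-D", imageDir.toString());
    }

    /** Runs the command attached to the current console and returns its exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

For the example above, CriuCli.run(CriuCli.dumpCommand(14255, Path.of("dump"), false)) would correspond to the dump in shell 2, and CriuCli.run(CriuCli.restoreCommand(Path.of("dump"))) to the restore.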
Rootless CRIU
As of now (Nov. 2020) the CRIU commands above still require root permissions, but this might change soon. Linux 5.9 received cap_checkpoint_restore (patch) and CRIU is also already being prepared.
To test rootless CRIU, simply build the non-root branch and set cap_checkpoint_restore on the resulting binary (no need to install, you can use criu directly).
sudo setcap cap_checkpoint_restore=eip /path/to/criu/binary
Note: Depending on your Linux distribution you might have to set cap_sys_ptrace too. Some features might not work yet, for example restoring as --shell-job or using the CRIU API. Use a recent kernel (at least 5.9.8) before trying to restore a JVM.
CRIU + Java + Panama = JCRIU
JCRIU uses Panama's jextract tool at build time to generate a low-level (1:1) binding directly from the header of the CRIU API. The low-level binding isn't exposed through the public API, however; it's just an implementation detail. Both jextract and the foreign function module are part of Project Panama; early access builds are available here. JEP 389: Foreign Linker API has been accepted (today) for inclusion as a JDK 16 incubator module - it might appear in mainline builds soon.
The main entry point is CRIUContext, which implements AutoCloseable to cleanly dispose of resources after use. Potential errors are mapped to CRIUExceptions. Checkpointing should be fairly robust since the communication with the actual CRIU process is done over RPC. A crash in CRIU most likely won't take the JVM down with it.
// needs java.io.IOException and java.nio.file.{Files, Path, Paths}, plus the JCRIU classes
public static void main(String[] args) throws IOException, InterruptedException {
    // create empty dir for images
    Path image = Paths.get("checkpoint_test_image");
    if (!Files.exists(image))
        Files.createDirectory(image);
    // checkpoint the JVM every second
    try (CRIUContext criu = CRIUContext.create()
            .logLevel(WARNING).leaveRunning(true).shellJob(true)) {
        int n = 0;
        while (true) {
            Thread.sleep(1000);
            criu.checkpoint(image); // checkpoint and entry point for a restore
            long pid = ProcessHandle.current().pid();
            System.out.println("my PID: "+pid+" checkpont# "+n++);
        }
    }
}
The above example is somewhat similar to the simple bash script. The main difference is that the Java program checkpoints itself every second. This allows us to hit CTRL+C at any time - if restored, the program will keep counting and checkpointing where it left off.
[mbien@longbow JCRIUTest]$ sudo sh start-demo.sh
WARNING: Using incubator modules: jdk.incubator.foreign
my PID: 16195 checkpont# 0
my PID: 16195 checkpont# 1
my PID: 16195 checkpont# 2
my PID: 16195 checkpont# 3
my PID: 16195 checkpont# 4
my PID: 16195 checkpont# 5
CTRL+C
[mbien@longbow JCRIUTest]$ sudo criu restore --shell-job -D checkpoint_test_image/
my PID: 16195 checkpont# 5
my PID: 16195 checkpont# 6
my PID: 16195 checkpont# 7
my PID: 16195 checkpont# 8
my PID: 16195 checkpont# 9
CTRL+C
[mbien@longbow JCRIUTest]$ sudo criu restore --shell-job -D checkpoint_test_image/
my PID: 16195 checkpont# 9
my PID: 16195 checkpont# 10
my PID: 16195 checkpont# 11
my PID: 16195 checkpont# 12
my PID: 16195 checkpont# 13
my PID: 16195 checkpont# 14
CTRL+C
Note: start-demo.sh just sets env variables to point to an early access JDK 16 Panama build, enables jdk.incubator.foreign, etc. The project README has the details.
Important Details and Considerations
- CRIU restores images with the same PIDs the processes had during checkpoint. This won't cause much trouble in containers since the namespace should be quite empty, but might conflict from time to time on a workstation. If the same image should be restored multiple times concurrently, each restore has to run in its own PID namespace. This can be achieved with sudo unshare -p -m -f [restore command]. See man unshare for details.
- Opened files are not allowed to change (in size) between checkpoint and restore. If they do, the restore operation will fail (watch out for log files, JFR repos, JVM perf data or temporary files).
- If the application established TCP connections, you have to tell CRIU via the --tcp-established flag (or the similarly named method in CRIUContext). CRIU will try to restore all connections in their correct states. (wiki link to more options)
- The first checkpoint or restore after system boot can take a few seconds because CRIU has to gather information about the system configuration first; this information is cached for subsequent uses.
- Some application-dependent post-restore tasks might be required, for example keystore/cert replacement or RNG re-initialization (...)
- CRIU can't checkpoint resources it can't reach. An X window or state stored on a GPU can't be dumped.
- Migration should probably only be attempted between (very) similar systems and hardware.
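The rule about opened files is easy to trip over, so it can help to detect size changes before a restore is attempted. The following is a minimal sketch, not something JCRIU provides: OpenFileGuard and its methods are hypothetical names, and the list of files to watch has to be supplied by the application. To be useful across a restore, the snapshot would also have to be stored next to the image, which is omitted here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: remembers file sizes at checkpoint time and reports changes. */
final class OpenFileGuard {

    /** Records the current size of each file, e.g. right before checkpointing. */
    static Map<Path, Long> snapshot(Collection<Path> files) throws IOException {
        Map<Path, Long> sizes = new LinkedHashMap<>();
        for (Path file : files) {
            sizes.put(file, Files.size(file));
        }
        return sizes;
    }

    /** Returns all files which are missing or whose size differs from the snapshot. */
    static List<Path> changedSince(Map<Path, Long> snapshot) throws IOException {
        List<Path> changed = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : snapshot.entrySet()) {
            Path file = entry.getKey();
            if (!Files.exists(file) || Files.size(file) != entry.getValue()) {
                changed.add(file);
            }
        }
        return changed;
    }
}
```

If changedSince(...) returns a non-empty list, the restore is likely to fail for those files, so a wrapper script could report them instead of calling criu restore.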
Instant Defrosting of Warmed-up JVMs
Let's take a look at what you can do with superluminal, absolute zero, instant defrosting JCRIU (ok I'll stop ;)) when applied to my favorite dusty Java web monolith: Apache Roller. I measured the time this blog would need to start on my workstation when loaded from an NVMe drive on JDK 16 + Jetty 9.4.34. (I consider it started when the website has loaded in the browser, not when the app server reports it started.)
classic start: ~6.5 s
(for comparison: it takes about a minute to start on a Raspberry Pi 3b+, which is serving this page you are reading right now)
Now let's try this again. But this time Roller will warm itself up, generate RSS feeds, populate the in-memory cache, give the JIT a chance to compile hot paths, compact the heap by calling System.gc() and finally shock-frost itself via criu.checkpoint(...).
warmup();    // generates/caches landing page/RSS feeds and first 20 blog entries
System.gc(); // give the GC a chance to clean up unused objects before checkpoint
try (CRIUContext criu = CRIUContext.create()
        .logLevel(WARNING).leaveRunning(false).tcpEstablished(true)) {
    criu.checkpoint(imagePath); // checkpoint + exit
} catch (CRIUException ex) {
    jfrlog.warn("post warmup checkpoint failed", ex);
}
(The uncompressed image size was between 500-600 MB during my tests, heap was set to 1 GB with ParallelGC active)
restore:
$ sudo time criu restore --shell-job --tcp-established -d -D blog_image/
real 0m0,204s
user 0m0,015s
sys 0m0,022s
instant defrosting: 204 ms
Note: -d detaches the shell after the restore operation has completed. An alternative way to measure defrosting time is to enable verbose logging with -v and look at the last timestamp; this is slightly slower (+20 ms) since CRIU tends to log a lot at more verbose log levels. Let me know if there is a better way of measuring this, but I double-checked everything and the image loading speed would be well below the average read speed of my M.2 NVMe.
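Another option, closer to the "loaded in the browser" definition used for the classic start, would be to poll the blog over HTTP and stop the clock at the first successful response. This is only a sketch of that idea and not how the numbers above were measured: ReadyTimer is a hypothetical name, and the URL, the poll interval and the timeouts are assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: measures how long it takes until a URL answers with HTTP 200. */
final class ReadyTimer {

    static Duration timeUntilReady(URI url, Duration timeout) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .connectTimeout(Duration.ofMillis(200))
                .build();
        HttpRequest request = HttpRequest.newBuilder(url)
                .timeout(Duration.ofSeconds(1))
                .GET()
                .build();

        long start = System.nanoTime();
        long deadline = start + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 200) {
                    return Duration.ofNanos(System.nanoTime() - start);
                }
            } catch (IOException notUpYet) {
                // connection refused or timed out: the server isn't reachable yet
            }
            Thread.sleep(5); // short pause between attempts
        }
        throw new IllegalStateException("not ready within " + timeout);
    }
}
```

Started right before the restore command, something like ReadyTimer.timeUntilReady(URI.create("http://localhost:8080/"), Duration.ofSeconds(30)) would include the time the first page needs to be served (host and port are placeholders here). The result is only accurate to within the poll interval plus one request round trip.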
The blog is immediately reachable in the browser, served by a warmed-up JVM.
Conclusion && Discussion
CRIU is quite interesting for use cases where Java startup time matters. Quarkus, for example, moves slow framework initialization from startup to build time. Native images with GraalVM further improve initialization by AOT-compiling the application into a single binary, but this also sacrifices a little bit of throughput. CRIU can be another tool in the toolbox to quickly map a running JVM, including the application, into memory (no noteworthy code changes required).
The Foreign Linker API (JEP 389), a major part of Project Panama, is currently targeted as an incubator module for OpenJDK 16. To use JCRIU on older JDKs, another implementation of CRIUContext would be needed. An implementation which communicates with CRIU via Google Protocol Buffers, for example, would completely avoid binding to the CRIU C API.
The JVM would be in an excellent position to aid CRIU in many ways. It already is an operating system for Java/bytecode-based programs (soon even with its own implementation of threads) and knows how to drive itself to safepoints (checkpointing an application which is under load is probably a bad idea), how to compact or resize the heap, how to invalidate the code cache, etc. - I see great potential there.
Let me know what you think.
Thanks a lot to Adrian Reber (@adrian__reber) who patiently answered all my questions about CRIU.