WildFly Core / WFCORE-6657

[Experimental] Support for JVM Checkpoint/Restore


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Major
    • Component: Management

      NOTE: The following describes a research project and is not at this point a planned roadmap feature, even at the Experimental stability level.

      There are a couple of implementations out there that provide JVM checkpointing based on Linux's Checkpoint and Restore in Userspace (CRIU) functionality. This RFE is to provide Experimental stability level support for integrating WildFly with that JVM-level functionality.

      The JVM implementations are:

      • OpenJ9's CRIU support
      • OpenJDK CRaC
      The former is part of IBM Semeru and is the foundation of Open Liberty's Liberty InstantOn feature. The latter is not part of the main OpenJDK codebase, so it is more of an experimental feature. However, Azul supports a VM that uses this technology.

      Both provide a similar API to application code like WildFly – a mechanism to register application code with their checkpoint/restore feature that is invoked before a VM checkpoint begins and after the VM is restored. Because of this it is possible for WildFly to integrate with both, using a small amount of adapter code to bridge the bulk of its integration to the two similar APIs. (Note we may not in the end choose to support both, or may only progress one to higher stability levels, but at this stage this work is a research project, so I'm working with both.)
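      A minimal sketch of what such an adapter layer could look like. All names here are hypothetical (nothing below is from the WildFly codebase); the point is that the bulk of the integration codes against one small SPI, and only thin vendor adapters would touch OpenJ9's CRIU hooks or CRaC's Resource callbacks. A fake adapter stands in for either JVM so the sketch runs anywhere:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical SPI the bulk of the WildFly integration would code against.
interface CheckpointAdapter {
    void registerHooks(Runnable beforeCheckpoint, Runnable afterRestore);
}

/** Stand-in for a vendor adapter, so this sketch runs without either JVM. */
class FakeAdapter implements CheckpointAdapter {
    private final List<Runnable> before = new ArrayList<>();
    private final List<Runnable> after = new ArrayList<>();

    @Override
    public void registerHooks(Runnable beforeCheckpoint, Runnable afterRestore) {
        before.add(beforeCheckpoint);
        after.add(afterRestore);
    }

    /** Simulate one checkpoint/restore cycle. */
    void simulateCheckpointRestore() {
        before.forEach(Runnable::run);   // VM is about to be checkpointed
        after.forEach(Runnable::run);    // VM has been restored
    }
}

public class AdapterDemo {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        FakeAdapter adapter = new FakeAdapter();
        adapter.registerHooks(
                () -> events.add("before-checkpoint"),
                () -> events.add("after-restore"));
        adapter.simulateCheckpointRestore();
        System.out.println(events); // [before-checkpoint, after-restore]
    }
}
```

      A real build would ship one small adapter class per supported JVM, each registering these two runnables with the vendor API.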

      There are three possible approaches to this that I see. They involve different tradeoffs in terms of effort/risk/reward, where the reward is the fastest boot of a restored VM with the shortest time to being fully warmed up. The risks include the general risk that complexity increases the probability of bugs. But a more specific risk is the fact that different kinds of VM state bring different problems when creating or using a checkpoint, particularly the fact that security-sensitive values may be persisted to disk in the checkpoint. The greater the amount of state in a checkpoint, and the more varied the code managing that state, the greater the risk.

      The three approaches are:

      1) Provide a simple callback interface to the ServerService that it will use to signal when the main boot op has completed Stage.MODEL. The CR feature will start a reload of the server in the "before checkpoint" call from the VM and wait for that signal. When it comes, block the boot op thread and return from the 'before checkpoint', allowing the checkpoint to proceed. When the 'after restore' call comes, unblock the boot op thread.
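      The handshake above can be modeled with two latches – one the boot op thread uses to signal that Stage.MODEL is complete, one the restore hook releases to unblock it. This is only a sketch of the coordination, not the actual ServerService API; all names are hypothetical:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.CopyOnWriteArrayList;

public class ReloadCheckpointSketch {
    static final CountDownLatch modelStageDone = new CountDownLatch(1);
    static final CountDownLatch restored = new CountDownLatch(1);
    static final List<String> log = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        // Boot-op thread started by the reload that the 'before checkpoint'
        // hook requested.
        Thread bootOp = new Thread(() -> {
            log.add("Stage.MODEL complete");
            modelStageDone.countDown();      // signal the checkpoint code
            try {
                restored.await();            // block until 'after restore'
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            log.add("Stage.RUNTIME (post-restore boot)");
        });
        bootOp.start();

        modelStageDone.await();              // 'before checkpoint': wait for signal
        log.add("checkpoint taken");         // returning here lets the VM checkpoint proceed
        restored.countDown();                // 'after restore': unblock the boot op
        bootOp.join();
        System.out.println(log);
    }
}
```

      The boot op thread never runs Stage.RUNTIME work before the restore, which is what keeps the checkpointed VM state small.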

      This is the simplest and safest, as the JVM state that is checkpointed is simplest, since very few runtime services are started when the checkpoint happens. It definitely speeds boot from the restored VM a lot, but not down to the desired sub-second levels when the OOTB standalone.xml is used and a reasonably complex deployment like the Quickstart kitchensink.war is deployed.

      This approach also takes the checkpoint before most config model expression resolution is done, opening the possibility of users taking advantage of any API exposed by the checkpointing implementation to reset env vars and thus make use of restore-point-specific values. (We could also examine WF-specific features to reset system props, e.g. from a -p boot param properties file.)

      This approach assumes subsystems are not doing anything problematic for the checkpoint in Stage.MODEL, and that they clean up state (e.g. static var state) properly when services stop during a reload. This isn't an entirely safe assumption, but it seems to me to be a manageable one.

      2) Same idea as #1 (reload the server and wait for a signal that reload has reached a desired point), but try to better optimize deployments by doing as much deployment work as possible before the checkpoint. Create a new OperationContext.Stage DEPLOYMENT_INIT (real name TBD). Change all the add ops that register DUPs to register them in that stage, not in RUNTIME. (But they can't register other services.) During execution of that stage, install any DeploymentUnitServices and bring them through Phase.CONFIGURE_MODULE, or perhaps FIRST_MODULE_USE, or perhaps some new Phase at a similar point. Once that point has been reached, signal the callback interface.

      The key point here is to do as much deployment work as possible before the checkpoint, without doing parts that bring too much complexity to the VM state.

      This approach clearly requires more work and is more fragile, as it depends on subsystem authors doing things correctly and assumes that no DUPs in Phases before the checkpoint depend on services installed in Stage.RUNTIME.

      It also reduces the size of the set of runtime config values that could be resolved against post-restore-specific env vars or system props, since the DEPLOYMENT_INIT stage handlers would need to resolve values used by the DUPs.
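      Since enum declaration order doubles as execution order for a stage enum, the proposed stage only needs to sit between MODEL and RUNTIME. This is illustrative only – OperationContext.Stage is a real WildFly Core enum, but DEPLOYMENT_INIT is the hypothetical addition (real name TBD) and the constants here are a simplification:

```java
// Simplified stand-in for OperationContext.Stage; DEPLOYMENT_INIT is the
// proposed new stage, not something that exists in WildFly today.
enum Stage {
    MODEL,            // config model updates
    DEPLOYMENT_INIT,  // proposed: register DUPs, drive DeploymentUnitServices
                      // up to roughly Phase.CONFIGURE_MODULE, then signal
    RUNTIME,          // runtime service install – runs only after restore
    VERIFY,
    DONE
}

public class StageOrderDemo {
    public static void main(String[] args) {
        // compareTo on an enum follows declaration order, so DEPLOYMENT_INIT
        // handlers would execute before any Stage.RUNTIME handler.
        System.out.println(Stage.DEPLOYMENT_INIT.compareTo(Stage.RUNTIME) < 0); // true
    }
}
```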

      3) Instead of using reload, use a suspend/resume cycle. This would result in the best post-restore boot time by far, but means the checkpointed JVM has the maximum amount of unknowable state that could be problematic, including state managed by deployment code. We would very likely need to add a subsystem integration API to allow subsystems to prepare for checkpoint and re-establish state post-restore. We'd probably also need to expose an API to deployments. It also means changing runtime config values in the restored JVM by driving expression resolution using different env vars and system props would not be possible.
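      A hedged sketch of the kind of subsystem integration API approach #3 would need – none of these names exist in WildFly today. Each subsystem gets a chance to drop or externalize problematic state before the checkpoint and re-establish it after restore, bracketed by a server suspend/resume:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical integration contract; the method names are placeholders.
interface CheckpointIntegration {
    void prepareForCheckpoint(); // e.g. close sockets, drop pooled connections
    void afterRestore();         // e.g. reopen sockets, refresh cached secrets
}

public class SuspendResumeSketch {
    static final List<String> log = new ArrayList<>();

    public static void main(String[] args) {
        // A fake "datasource" subsystem standing in for real integrations.
        List<CheckpointIntegration> subsystems = List.of(new CheckpointIntegration() {
            public void prepareForCheckpoint() { log.add("datasource: close pool"); }
            public void afterRestore() { log.add("datasource: reopen pool"); }
        });

        log.add("server suspend");                 // stop accepting new requests
        subsystems.forEach(CheckpointIntegration::prepareForCheckpoint);
        log.add("checkpoint / restore");           // VM image written, later restored
        subsystems.forEach(CheckpointIntegration::afterRestore);
        log.add("server resume");                  // warmed-up deployment code retained
        System.out.println(log);
    }
}
```

      The payoff for this extra machinery is that deployment code, not just static module code, survives the checkpoint already warmed up.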

      Perhaps over time we could implement more than one of these strategies and allow the user to choose.

      Note that if this experiment proves fruitful there may be a fair amount of 'tooling' to go along with what I describe above, e.g. to help in producing and using images that contain a checkpointed WildFly instance. Initially all I contemplate is a new 'criu' subsystem with a single 'triggerCheckpoint' management op. (AFAICT OpenJ9/Semeru provides no JVM-level API for triggering a checkpoint, so the user would rely on this op to trigger a checkpoint. OpenJDK CRaC provides a jcmd command 'JDK.checkpoint', but they also have an internal API that our 'triggerCheckpoint' management op can trigger.)

      Note that all of the 3 options allow arbitrary 'warm up' of the JVM before the checkpoint happens, e.g. to get classes loaded and JIT compilation done. They all start with a normal WF instance that can be exercised as desired before the checkpoint happens. They differ though in what happens to 'warmed up' deployment code – the reload approaches would discard it; the suspend/resume approach would retain it. By deployment code I mean code packaged in the deployment archive. Warmed up static module code that is generally only used by deployments (say Hibernate or RESTEasy) would be retained. Note that simply reloading the server a few times before triggering the checkpoint is a beneficial form of warmup. (Try that on a normal WildFly install; you'll likely see the reported "started in" time go down relative to the first reload.)

            Reporter: Brian Stansberry (bstansbe@redhat.com)