Epic
Resolution: Unresolved
Nova, like many other distributed services, has many actors that work together within the system.
Graceful shutdown is important because k8s is free to kill Pods and recreate them at any time. However, Nova does not have proper graceful shutdown support.
Known issues:
- limited retry mechanism for long-running tasks
- in-memory task management in the conductor
The same graceful shutdown support is important for automated update and upgrade features as those also require k8s Pod and EDPM container restarts.
The high level solution could take many forms:
- a per-binary endpoint to ask for graceful shutdown
- a top-level API endpoint to query the state of running operations and disable the acceptance of new requests from the public endpoint
- better docs on how to manually do this
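To make the first two options more concrete, here is a minimal sketch of the bookkeeping such an endpoint would need: refuse new operations once shutdown is requested, report what is still running, and wait until everything has finished. `OperationTracker` and all of its method names are hypothetical and do not exist in Nova; a real implementation would also have to cover the RPC and conductor issues listed above.

```python
import threading


class OperationTracker:
    """Hypothetical in-process registry of running operations (not Nova code)."""

    def __init__(self):
        self._lock = threading.Lock()
        # The condition shares the lock so state changes and waits are atomic.
        self._idle = threading.Condition(self._lock)
        self._running = set()
        self._accepting = True

    def start(self, op_id):
        """Register an operation; return False once shutdown was requested."""
        with self._lock:
            if not self._accepting:
                return False
            self._running.add(op_id)
            return True

    def finish(self, op_id):
        """Mark an operation as done and wake up anyone waiting for idle."""
        with self._idle:
            self._running.discard(op_id)
            if not self._running:
                self._idle.notify_all()

    def request_shutdown(self):
        """Stop accepting new operations; running ones are left alone."""
        with self._lock:
            self._accepting = False

    def state(self):
        """What a query endpoint could return."""
        with self._lock:
            return {"accepting": self._accepting,
                    "running": sorted(self._running)}

    def wait_idle(self, timeout=None):
        """Block until no operation is running; False if the timeout expired."""
        with self._idle:
            return self._idle.wait_for(lambda: not self._running, timeout)


tracker = OperationTracker()
tracker.start("live-migration-1")
tracker.request_shutdown()
accepted = tracker.start("build-2")    # refused after the shutdown request
snapshot = tracker.state()
tracker.finish("live-migration-1")
drained = tracker.wait_idle(timeout=1)
```

An operator or the k8s preStop logic could poll `state()` and only stop the process once `wait_idle()` reports that nothing is running.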
For Dalmatian:
- one backlog spec for the high level picture about what we need to do to have graceful shutdown
- one spec to make oslo.messaging support selectively unsubscribing from RPC topics
From the Dalmatian PTG: https://etherpad.opendev.org/p/nova-dalmatian-ptg
(dansmith) Nova compute really can't do this without redesigning the RPC stuff a bit, since it's dependent on conductor. It can't stop listening and finish in-progress things because it still has to listen to responses from conductor. We could continue listening but actively reject new requests, but it will cause casts to be dropped on the floor and then task state confusion will ensue. We could also try to stop the senders instead of stop listening, but that's also less good. I'm +1 on graceful shutdown for sure, but it's definitely a thorny problem.
(fwiesel): Aren't the responses coming back on a direct reply queue? Yes, but I think we don't have knobs through o.msg to be able to unsubscribe from (say) the compute queue but still create and listen for reply queues
(fwiesel) Yes, that would be the part that needs to get done. I think though, it is fairly contained (well, in oslo_messaging and maybe oslo_service?)
(gibi): thanks for the pointers, this seems like a good idea to investigate further.
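The behaviour discussed above can be illustrated with a toy model: a listener that stops consuming new requests from its topic while still receiving replies on its own reply queue. This is only a sketch of the idea. `ToyRpcListener` and its methods are hypothetical and are not the oslo.messaging API, and what a real broker does with a request nobody consumes is outside the model.

```python
import queue


class ToyRpcListener:
    """Toy model only; this is not the oslo.messaging API."""

    def __init__(self, topic):
        self.topic = topic
        self.reply_queue = queue.Queue()   # direct replies, e.g. from conductor
        self._topic_queue = queue.Queue()  # new requests for this service
        self._subscribed = True

    def deliver_request(self, msg):
        """Route a new request; False means this listener did not consume it."""
        if not self._subscribed:
            return False
        self._topic_queue.put(msg)
        return True

    def deliver_reply(self, msg):
        """Replies are always accepted so in-progress work can finish."""
        self.reply_queue.put(msg)

    def unsubscribe_topic(self):
        """Stop consuming new requests but keep the reply queue alive."""
        self._subscribed = False

    def pending_requests(self):
        return self._topic_queue.qsize()


listener = ToyRpcListener("compute")
listener.deliver_request({"method": "build_instance"})
listener.unsubscribe_topic()
consumed = listener.deliver_request({"method": "resize_instance"})
listener.deliver_reply({"result": "ok"})   # still delivered after unsubscribing
```

The point of the model is that the request is not consumed and then dropped: it is simply never taken by the service that is shutting down, which avoids the task state confusion described for the "listen but reject" approach.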
The overall solution also depends on the eventlet removal work: if we start using thread pools for certain tasks, those pools can be used to stop accepting new tasks while the existing tasks still run to completion.
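As a small illustration of the thread pool idea, the standard library `concurrent.futures.ThreadPoolExecutor` already behaves this way: after `shutdown()` it refuses new submissions with `RuntimeError`, while tasks that were already submitted run to completion. The task function and names below are made up for the example; how Nova would actually structure its pools is still to be decided.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def long_running_task(name, seconds):
    """Stand-in for a long-running operation (hypothetical)."""
    time.sleep(seconds)
    return f"{name} done"


pool = ThreadPoolExecutor(max_workers=2)
in_flight = pool.submit(long_running_task, "resize", 0.2)

# Stop accepting new work without blocking; in-flight work keeps running.
pool.shutdown(wait=False)

try:
    pool.submit(long_running_task, "new-build", 0.2)
except RuntimeError:
    rejected = True    # new tasks are refused after shutdown
else:
    rejected = False

result = in_flight.result()    # blocks until the existing task has finished
```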
- is duplicated by: OSPRH-31 nova need to provied a way to quiese long running operations to enabel updates and upgrades (Closed)