Type: Epic
Resolution: Unresolved
Priority: Undefined
Fix Version/s: rhos-19.0.0, 2025.1 (Upstream E)
Affects Version/s: None
Component/s: openstack-nova
Labels:
None

Epic Name:
Graceful shutdown improvements
Blocked:
False
Blocked Reason:

None
Ready:
False
Parent Link:
OSPRH-120 Compute Engineering Backlog
Color Status:
Not Selected
Dev Approval:
Proposed
Docs Approval:
Proposed
Epic Status:
To Do
Feature Link:
OSPRH-120 - Compute Engineering Backlog
PM Approval:
Proposed
QE Approval:
Proposed
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Nova, like many other distributed services, has many actors that work together within the system.

Graceful shutdown is important because k8s is free to kill Pods and recreate them at any time, but nova does not have proper graceful shutdown support.
Known issues:

limited retry mechanism for long running tasks
in memory task management in the conductor

The same graceful shutdown support is important for automated update and upgrade features as those also require k8s Pod and EDPM container restarts.

The high level solution could take many forms:

a per binary endpoint to ask for graceful shutdown
a top level API endpoint to query the state of running operations and disable the acceptance of new requests from the public endpoint
better docs on how to manually do this
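
To make the "query the state and stop accepting new requests" option more concrete, here is a minimal sketch of the bookkeeping such an endpoint might sit on top of. It uses only the Python standard library. The class name `DrainGate` and all of its methods are hypothetical and are not existing nova code; how this would be exposed over an API is left open.

```python
import threading


class DrainGate:
    """Hypothetical helper: counts in-flight operations and refuses
    new ones once draining has been requested."""

    def __init__(self):
        self._lock = threading.Lock()
        # The condition shares the lock, so state changes and waits are
        # serialized through a single mutex.
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._draining = False

    def try_begin(self):
        """Register a new operation; return False if we are draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one operation as finished and wake waiters when idle."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def start_draining(self):
        """Stop accepting new operations; running ones are untouched."""
        with self._lock:
            self._draining = True

    def state(self):
        """Return what a status endpoint could report."""
        with self._lock:
            return {"draining": self._draining, "in_flight": self._in_flight}

    def wait_idle(self, timeout=None):
        """Block until no operation is in flight; False on timeout."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)
```

An operator or an orchestration tool could call `start_draining()`, poll `state()`, and only restart the service once `wait_idle()` returns True. This only covers a single process; it does not address the cross-service RPC problem discussed in the PTG notes.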

For Dalmatian:

one backlog spec for the high level picture about what we need to do to have graceful shutdown
one spec to make oslo.messaging support selectively unsubscribing from RPC topics

From the Dalmatian PTG: https://etherpad.opendev.org/p/nova-dalmatian-ptg

(dansmith) Nova compute really can't do this without redesigning the RPC stuff a bit, since it's dependent on conductor.. It can't stop listening and finish in-progress things because it still has to listen to responses from conductor. We could continue listening but actively reject new requests, but it will cause casts to be dropped on the floor and then task state confusion will ensue. We could also try to stop the senders instead of stop listening, but that's also less good. I'm +1 on graceful shutdown for sure, but it's definitely a thorny problem.
(fwiesel): Aren't the responses coming back on a direct reply queue? Yes, but I think we don't have knobs through o.msg to be able to unsubscribe from (say) the compute queue but still create and listen for reply queues
(fwiesel) Yes, that would be the part that needs to get done. I think though, it is fairly contained (well, in oslo_messaging and maybe oslo_service?)
(gibi): thanks for the pointers, this seems like a good idea to investigate further.
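
The idea from the discussion above (stop consuming from the compute topic but keep the reply queue alive) can be illustrated with a toy in-memory model. This is not the oslo.messaging API and does not model a real broker; `ToyBus`, the queue names and the message contents are all made up for illustration.

```python
import queue


class ToyBus:
    """In-memory stand-in for a message bus. NOT the oslo.messaging API."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        """Start listening on a named queue and return it."""
        self.queues[name] = queue.Queue()
        return self.queues[name]

    def unsubscribe(self, name):
        """Stop listening on a named queue; other queues are unaffected."""
        self.queues.pop(name, None)

    def send(self, name, message):
        """Deliver to a subscribed queue; return False if nobody listens."""
        target = self.queues.get(name)
        if target is None:
            return False
        target.put(message)
        return True


bus = ToyBus()
# Hypothetical names: one shared topic for incoming requests and one
# private queue for replies to calls this service made earlier.
requests = bus.subscribe("compute.host1")
replies = bus.subscribe("reply_abc123")

# Graceful shutdown, step 1: stop taking new requests only.
bus.unsubscribe("compute.host1")

new_request_delivered = bus.send("compute.host1", {"method": "build_instance"})
reply_delivered = bus.send("reply_abc123", {"result": "ok"})
```

In this toy model the new request is simply not delivered, while the reply still arrives so the in-progress operation can finish. What happens to the undelivered message in a real deployment depends on the broker and queue setup, which is exactly the "casts dropped on the floor" concern raised above.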

The overall solution also depends on the eventlet removal work: if we start using thread pools for certain tasks, those pools can be used to stop accepting new tasks while the existing tasks still run to completion.
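
As a rough illustration of the thread pool behaviour mentioned above, the standard library `concurrent.futures.ThreadPoolExecutor` already works this way: after `shutdown()` it rejects new submissions with `RuntimeError` but lets already submitted tasks finish. This is only a sketch with native threads; the task function and task names are hypothetical and it says nothing about how nova would structure its pools.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()
results = []


def long_running_task(name):
    """Stand-in for a long operation; blocks until released."""
    release.wait(timeout=5)
    results.append(name)
    return name


pool = ThreadPoolExecutor(max_workers=2)
in_flight = pool.submit(long_running_task, "live-migration")

# Begin shutdown without blocking: the pool stops accepting work,
# but the task submitted above keeps running.
pool.shutdown(wait=False)

try:
    pool.submit(long_running_task, "new-task")
    rejected = False
except RuntimeError:
    # Raised by the executor for any submit() after shutdown().
    rejected = True

# Let the in-flight task complete and collect its result.
release.set()
finished = in_flight.result(timeout=5)
```

A service could call `shutdown(wait=True)` instead at the end of its shutdown sequence to block until every in-flight task is done.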

is duplicated by

OSPRH-31 nova needs to provide a way to quiesce long running operations to enable updates and upgrades

Closed

links to

openstack-k8s-operators/edpm-ansible#774: Healthchecks

Assignee: Unassigned

Reporter: Balazs Gibizer

Team: rhos-dfg-compute

Votes: 0

Watchers: 1

Created: 2024/04/25 3:32 PM

Updated: 2024/11/14 5:41 PM
