Red Hat OpenStack Services on OpenShift / OSPRH-25715

Tune worker processes number inside the pod service depending on the scale


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: cinder-operator
    • Severity: Important

      Steps to reproduce the behavior:

      This use case was shared by the customer:
      We have encountered multiple times a situation, where compute expensive api calls are taking service down. This happened for cinder-api, nova-api and nova-metadata.
       
      What we observed is that in this case pod scaling does not help: we were still facing service deterioration with 21 replicas of nova-api, which is probably not the best approach anyway.
       
      Most, if not all, of the services run with 2 WSGI processes and 1 API worker (snippets from the nova operator below):
       
        ## In general we want nova-api to scale via k8s replicas but we need
  ## two processes per replica to always have room for a healthcheck query
        WSGIDaemonProcess {{ $endpt }} display-name={{ $endpt }} processes=2 threads=1 user=nova group=nova
       
      {{ if eq .service_name "nova-api" }}
      # scaling should be done by running more pods
      osapi_compute_workers=1
       
      When the single worker is busy handling a query it can fail the health check, which we have also observed. So not only is the service responding slowly because of the load, but k8s is also killing the pods, making the situation even worse.
       
      Based on these observations we believe these settings should be adjustable for bigger environments such as ours. I don't think that running more than 10 replicas is a valid pattern, especially when we end up consuming more resources instead of doing more work in each pod.
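       
      For illustration only, the kind of tuning being requested might look like the snippet below. The values are hypothetical and would have to be sized to the pod's CPU allocation; in particular, the WSGIDaemonProcess processes value is currently templated by the operator rather than user-tunable.
       
        ## Illustrative sketch, not the operator's current template:
        ## size the WSGI processes to the pod's CPU request instead of the hard-coded 2
        WSGIDaemonProcess {{ $endpt }} display-name={{ $endpt }} processes=4 threads=1 user=nova group=nova
       
        # illustrative: raise the worker count in step with the WSGI processes
        osapi_compute_workers=4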
      Expected behavior

      • Appropriate configuration of worker processes to absorb the load in large-scale deployments.
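       
      As a possible interim workaround (a sketch only, not a verified fix), worker options that are read from the service configuration file, such as cinder's osapi_volume_workers, could in principle be raised through the service's customServiceConfig in the OpenStackControlPlane CR. Whether this is sufficient while the WSGI processes value stays templated at 2 would need to be confirmed.
       
        # hypothetical override passed via customServiceConfig for cinder-api;
        # the value should be sized to the pod's CPU allocation
        [DEFAULT]
        osapi_volume_workers=4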

      Device Info:

      • RHOSO 18.0.7
      • Distributed Zones architecture, with 3 AZs, 3 Cells
      • 900+ Compute nodes
      • Stretched control plane

      Bug impact

      • With a large number of requests, the service slows down due to the load and k8s kills the pods.

              Assignee: Francesco Pantano (fpantano@redhat.com)
              Reporter: Soumaya Msallem (smsallem@redhat.com)
              Team: rhos-storage-integration