[OSPRH-12949] BZ#2327781 [OSP16.2] During FFU the overcloud upgrade run failed on networkers role node/s due to Error: invalid value all for cpuset cpus

Type: Bug
Resolution: Cannot Reproduce
Priority: Undefined
Fix Version/s: rhos-16.2.z
Affects Version/s: rhos-16.2.z
Component/s: puppet-ovn
Labels:
None

Blocked:
False
Blocked Reason:

None
Ready:
False
Bugzilla Bug:
RHBZ: 2327781
Regression:
None
Intelligence Requested:
Market:

Severity:
Moderate

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

During FFU, after running the overcloud upgrade run, the ovn_controller container fails to start with the error below:

~~~
"ERROR: Container ovn_controller exited with code 125 when runed\nstderr: Error: invalid value all for cpuset cpus\n"]}
~~~

It seems to come from here:
$ cat hashed-ovn_controller.json
{
"cpuset_cpus": "all", <======
"depends_on": [
"openvswitch.service"
],
"environment": {
"KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS"

Checking the customer's templates, the parameter OVNContainerCpusetCpus is defined for all Compute roles (there are several), but it is not defined for the Controller or Networker nodes (where the issue is happening).
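The template check described above can be sketched roughly as follows. This assumes role-specific overrides live under `<Role>Parameters` keys inside `parameter_defaults`; the helper name `roles_missing_parameter`, the role list, and the sample values are hypothetical and are not taken from the customer's templates.

```python
def roles_missing_parameter(parameter_defaults, roles, name="OVNContainerCpusetCpus"):
    """Return the roles whose <Role>Parameters block does not define `name`."""
    missing = []
    for role in roles:
        # A role with no <Role>Parameters block counts as not defining it.
        role_params = parameter_defaults.get(f"{role}Parameters") or {}
        if name not in role_params:
            missing.append(role)
    return missing


# Hypothetical parameter_defaults, already loaded into a dict.
parameter_defaults = {
    "ComputeParameters": {"OVNContainerCpusetCpus": "0-3"},
    "ControllerParameters": {},
}

print(roles_missing_parameter(parameter_defaults, ["Compute", "Controller", "Networker"]))
```

With this sample data the Controller and Networker roles are reported, which matches the situation described in this report.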

To work around the error, cpuset_cpus was manually set to "0" in the hashed ovn_controller file on all affected nodes, and the overcloud upgrade run was then executed again.
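For reference, the manual change amounts to something like the sketch below. It works on an in-memory fragment modelled on the hashed-ovn_controller.json excerpt above; the helper name `replace_cpuset` is hypothetical, and reading and writing the real file on each node is deliberately left out because its exact location is not stated here.

```python
import json


def replace_cpuset(config, bad_value="all", new_value="0"):
    """Replace cpuset_cpus in place; return True if the value was changed."""
    if config.get("cpuset_cpus") == bad_value:
        config["cpuset_cpus"] = new_value
        return True
    return False


# Fragment modelled on the excerpt in this report, not the full file.
sample = json.loads('{"cpuset_cpus": "all", "depends_on": ["openvswitch.service"]}')

changed = replace_cpuset(sample)
print(changed, json.dumps(sample, sort_keys=True))
```

This only illustrates the workaround that was applied; it does not address why the value was "all" in the first place.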

The customer needs to know where the "cpuset_cpus": "all" value came from.
How can this issue be fixed without manually changing this parameter, or through the templates?
Why did the error go away, and ovn_controller come up, after setting "cpuset_cpus": "0"?
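On the last question, the error text suggests that the literal string `all` reached the container runtime as a CPU list. Assuming the runtime expects the usual cpuset list syntax (comma-separated CPU indices and ranges such as `0`, `0,1` or `0-3`), `all` would be rejected while `0` would be accepted. A minimal sketch of that syntax check follows; `is_valid_cpuset` is a hypothetical helper, not the runtime's own parser.

```python
import re

# One CPU index ("0") or an inclusive range ("0-3").
_CPUSET_ITEM = re.compile(r"(\d+)(?:-(\d+))?")


def is_valid_cpuset(value):
    """Return True if value looks like a CPU list such as "0", "0,2" or "0-3,8"."""
    if not value:
        # An empty string is treated here as "no value", not as a CPU list.
        return False
    for item in value.split(","):
        match = _CPUSET_ITEM.fullmatch(item.strip())
        if match is None:
            return False
        start, end = match.group(1), match.group(2)
        # Reject descending ranges such as "3-0".
        if end is not None and int(end) < int(start):
            return False
    return True


for candidate in ("all", "0", "0-3,8"):
    print(candidate, is_valid_cpuset(candidate))
```

Under this assumption, only `all` fails the check, which is consistent with the container starting once the value was changed to `0`.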

Version-Release number of selected component (if applicable):
openstack-ovn-controller:16.2.6

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
After running the overcloud upgrade run, the ovn_controller container fails to start.

Expected results:
The ovn_controller container comes up with the default cpuset_cpus value after the overcloud upgrade run step.

external trackers

Red Hat Customer Portal 03990646

Juan Payno added a comment - 2025/02/28 12:40 PM

Not sure how this happens. I think this is a coincidence specific to this environment.

If that is not the case, do not hesitate to reopen this Jira or open a new one and link it to this one.


Dave Hill added a comment - 2025/01/28 2:04 AM

Fixed the issue by manually editing the /var/lib/tripleo-config ovn-controller file as well as the /etc/puppet/hieradata file. I'm not sure both are required, but given that this customer has been waiting for quite some time for a solution, I provided one. I also made sure NetworkParameters had the OVNContainerCpusetCpus: '' value, but I'm not sure that was required either. Somehow, none of the hiera/paunch files were updated, and the last update (according to .tripleo/history) was the last 16.2.6 deploy. Hopefully this is now fully resolved. I've suggested that the customer remove the "all" value from their templates before upgrading their prod environment.


Assignee: Juan Payno
Reporter: RH Bugzilla Integration
QA Contact: Archana Singh
Team: rhos-dfg-upgrades
Votes: 0
Watchers: 3
Created: 2024/11/21 11:39 AM
Updated: 2025/02/28 12:40 PM
Resolved: 2025/02/28 12:40 PM
