OSPRH-12949 (Red Hat OpenStack Services on OpenShift)

BZ#2327781 [OSP16.2] During FFU the overcloud upgrade run failed on networkers role node/s due to Error: invalid value all for cpuset cpus

    • Bug
    • Resolution: Cannot Reproduce
    • rhos-16.2.z
    • rhos-16.2.z
    • puppet-ovn
    • Moderate

      Description of problem:

      During FFU, after running the overcloud upgrade run step, the ovn_controller container fails to start with the following error:

      ~~~
      "ERROR: Container ovn_controller exited with code 125 when runed\nstderr: Error: invalid value all for cpuset cpus\n"]}
      ~~~

      The value seems to come from here:
      $ cat hashed-ovn_controller.json
      {
      "cpuset_cpus": "all", <======
      "depends_on": [
      "openvswitch.service"
      ],
      "environment": {
      "KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS"

      In the customer's templates, the parameter OVNContainerCpusetCpus is defined for all Compute roles (various roles), but it is not defined for the Controller or Networker roles (the Networker nodes are where the issue occurs).

      To work around the issue (get past the error), cpuset_cpus was manually set to "0" in the hashed ovn_controller file on all affected nodes, and the overcloud upgrade run was then executed.
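
The manual workaround described above could be scripted roughly as follows. This is a minimal sketch, not the procedure that was actually used: the function name `override_cpuset_cpus` is hypothetical, and it assumes the hashed file is a single JSON object with a top-level `cpuset_cpus` key, as in the excerpt shown in the description.

```python
import json


def override_cpuset_cpus(config_text: str, new_value: str = "0") -> str:
    """Return the container config JSON with cpuset_cpus replaced.

    Only a top-level "cpuset_cpus" key equal to "all" is touched; any
    other value (or a missing key) is left as it is.
    Assumption: the config is a JSON object, not a list.
    """
    config = json.loads(config_text)
    if config.get("cpuset_cpus") == "all":
        config["cpuset_cpus"] = new_value
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Illustrative input only; the real hashed file has more keys.
    sample = '{"cpuset_cpus": "all", "depends_on": ["openvswitch.service"]}'
    print(override_cpuset_cpus(sample))
```

Working on the text rather than on a file path keeps the sketch free of side effects; writing the result back to the node is left to whoever applies it.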

      • The customer needs to know where the "cpuset_cpus": "all" value came from.
      • How can this issue be fixed without manually changing this parameter, or through the templates?
      • Why did the error go away and ovn_controller come up after setting "cpuset_cpus": "0"?
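
On the last question, one plausible reading of the error message is that the container runtime parses `cpuset_cpus` as a CPU list (comma-separated indices and ranges such as `0-3,8`), so `"0"` parses and `"all"` does not. The sketch below encodes that assumption only; `is_valid_cpuset` is a hypothetical helper, not part of podman, paunch, or TripleO, and the runtime's real parser may differ in details.

```python
import re

# One CPU index or an inclusive range, e.g. "3" or "0-3".
_ITEM = re.compile(r"(\d+)(?:-(\d+))?")


def is_valid_cpuset(value: str) -> bool:
    """Check whether value looks like a CPU list such as "0", "0-3" or "0,2,4-7"."""
    if not value:
        return False
    for item in value.split(","):
        match = _ITEM.fullmatch(item.strip())
        if match is None:
            return False
        start, end = match.group(1), match.group(2)
        # Reject descending ranges such as "3-1".
        if end is not None and int(end) < int(start):
            return False
    return True


if __name__ == "__main__":
    for candidate in ("0", "0-3,8", "all"):
        print(candidate, is_valid_cpuset(candidate))
```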

      Version-Release number of selected component (if applicable):
      openstack-ovn-controller:16.2.6

      How reproducible:
      NA

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:
      After running the overcloud upgrade run step, the ovn_controller container fails to start.

      Expected results:
      The ovn_controller container comes up with the default cpuset_cpus value after running the overcloud upgrade run step.

            Juan Payno added a comment -

            Not sure how this happens. I think it is a coincidence specific to this environment.

            If that is not the case, do not hesitate to reopen this Jira or open a new one and link it to this one.

            Dave Hill added a comment -

            Fixed the issue by manually editing the ovn-controller file under /var/lib/tripleo-config as well as the /etc/puppet/hieradata file. I'm not sure both are required, but given that this customer has been waiting for quite some time for a solution, I provided one. I also made sure NetworkParameters had OVNContainerCpusetCpus: '' set, but I'm not sure that was required either. Somehow, none of the hiera/paunch files were updated, and the last update (according to .tripleo/history) was the last 16.2.6 deploy. Hopefully, this is now fully resolved. I've suggested that the customer remove the "all" value from their templates before upgrading their prod environment.
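
Before upgrading another environment, a check along these lines could flag container config files that still carry the "all" value. It is only a sketch: `find_cpuset_all` is a hypothetical helper, the default directory is the one named in the comment above, and it assumes the files are JSON objects with `cpuset_cpus` at the top level, as in the excerpt in the description.

```python
import json
from pathlib import Path
from typing import List


def find_cpuset_all(root: str = "/var/lib/tripleo-config") -> List[Path]:
    """List JSON files under root whose top-level cpuset_cpus is "all"."""
    hits = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable or non-JSON files are skipped, not treated as hits.
            continue
        if isinstance(data, dict) and data.get("cpuset_cpus") == "all":
            hits.append(path)
    return hits


if __name__ == "__main__":
    for hit in find_cpuset_all():
        print(hit)
```

The check is read-only, so it can be run on a node without changing any of the hiera or paunch files mentioned above.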

              jbadiapa@redhat.com Juan Payno
              jira-bugzilla-migration RH Bugzilla Integration
              Archana Singh Archana Singh
              rhos-dfg-upgrades