Red Hat OpenStack Services on OpenShift
OSPRH-3147

Nova is out of sync with ironic as hypervisor list in nova does not agree with ironic list after reboot of the nodes


    • openstack-nova-27.1.1-18.0.20230930093334.a869ab1.el9ost

      +++ This bug was initially created as a clone of Bug #1908405 +++

      Description of problem:
      In one of our phase 3 16.2 jobs, after the nodes are rebooted, nova hypervisor-list no longer shows the ironic hypervisor that had been created, and we see the inconsistencies below:

      TASK [ensure nova is caught up with node status - get uuids] *******************
      changed: [undercloud-0] => {
      "changed": true,
      "cmd": "source /home/stack/overcloudrc\nopenstack baremetal node list -f value -c UUID\n",
      "delta": "0:00:03.576614",
      "end": "2020-12-08 03:22:41.492741",
      "rc": 0,
      "start": "2020-12-08 03:22:37.916127"
      }

      STDOUT:

      4a0dab7d-a5e9-4956-a5cb-c3714be5813f

      TASK [get nova count of nodes] *************************************************
      FAILED - RETRYING: get nova count of nodes (120 retries left).
      FAILED - RETRYING: get nova count of nodes (119 retries left).
      FAILED - RETRYING: get nova count of nodes (118 retries left).
      FAILED - RETRYING: get nova count of nodes (117 retries left).
      FAILED - RETRYING: get nova count of nodes (116 retries left).
      FAILED - RETRYING: get nova count of nodes (115 retries left).
      ...
      FAILED - RETRYING: get nova count of nodes (1 retries left).
      fatal: [undercloud-0]: FAILED! => {
      "attempts": 120,
      "changed": true,
      "cmd": "source /home/stack/overcloudrc\nnova hypervisor-list |grep -F 4a0dab7d-a5e9-4956-a5cb-c3714be5813f |wc -l\n",
      "delta": "0:00:02.199633",
      "end": "2020-12-08 03:48:55.366401",
      "rc": 0,
      "start": "2020-12-08 03:48:53.166768"
      }

      STDOUT:

      0

      The node was clearly listed before the reboot of the nodes:

      $ nova hypervisor-list
      +--------------------------------------+--------------------------------------+-------+---------+
      | ID                                   | Hypervisor hostname                  | State | Status  |
      +--------------------------------------+--------------------------------------+-------+---------+
      | d2365afd-f5e8-4842-ae49-101e9b261b0b | compute-1.redhat.local               | up    | enabled |
      | 88f12987-e9d5-4d0d-966b-e24a16f294e7 | compute-0.redhat.local               | up    | enabled |
      | 4a0dab7d-a5e9-4956-a5cb-c3714be5813f | 4a0dab7d-a5e9-4956-a5cb-c3714be5813f | up    | enabled |
      | c2e8de2a-2d8c-4dcf-b0fa-ff3f70b4eecd | c2e8de2a-2d8c-4dcf-b0fa-ff3f70b4eecd | up    | enabled |
      +--------------------------------------+--------------------------------------+-------+---------+

      Version-Release number of selected component (if applicable):

      How reproducible:
      every time

      Steps to Reproduce:
      1. Create an ironic node.
      2. Add it to the hypervisor list.
      3. Reboot the nodes using virsh destroy/start.
      4. Validate the hypervisor list (a minimal polling sketch follows below).
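
      A minimal sketch of the step-4 validation, mirroring the ansible task above (it assumes overcloudrc is already sourced; the node UUID, retry count, and interval are just the values from this job and are otherwise arbitrary):

      import subprocess
      import time

      NODE_UUID = "4a0dab7d-a5e9-4956-a5cb-c3714be5813f"

      def hypervisor_listed(uuid):
          # "openstack hypervisor list -f value" prints one hostname per line;
          # for an ironic node the hypervisor hostname is the node UUID.
          out = subprocess.check_output(
              ["openstack", "hypervisor", "list", "-f", "value", "-c", "Hypervisor Hostname"],
              universal_newlines=True,
          )
          return uuid in out

      for _ in range(120):
          if hypervisor_listed(NODE_UUID):
              print("node is back in the hypervisor list")
              break
          time.sleep(15)
      else:
          raise SystemExit("nova never re-created the compute node for " + NODE_UUID)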

      Actual results:
      The nova hypervisor list does not agree with the ironic node list.

      Expected results:
      The nova hypervisor list should match the ironic node list.

      Additional info:

      — Additional comment from Paras Babbar on 2020-12-16 16:00:32 UTC —

      Logs:

      http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_2comp_2ironic-ipv4-geneve-IR-OC_Ironic_UEFI-20200831-1303/3/

      — Additional comment from melanie witt on 2020-12-16 18:13:59 UTC —

      Not sure what's going on here but I see the following trace in the ironic app.log on all 3 controllers:

      2020-12-08 03:02:20.303 12 DEBUG ironic.common.hash_ring [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Rebuilding cached hash rings ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:57
      2020-12-08 03:02:20.321 12 DEBUG ironic.common.hash_ring [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Finished rebuilding hash rings, available drivers are :fake-hardware, :ipmi ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
      2020-12-08 03:02:22.385 12 DEBUG wsme.api [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Client-side error: Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is locked by host controller-2.redhat.local, please retry after the current operation is completed.
      Traceback (most recent call last):
        File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
          return func(*args, **kwargs)
        File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 3423, in vif_detach
          purpose='detach vif') as task:
        File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 161, in acquire
          return TaskManager(context, *args, **kwargs)
        File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 243, in __init__
          self.release_resources()
        File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
          self.force_reraise()
        File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
          six.reraise(self.type_, self.value, self.tb)
        File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
          raise value
        File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 224, in __init__
          self._lock()
        File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 284, in _lock
          reserve_node()
        File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
          return Retrying(*dargs, **dkw).call(f, *args, **kw)
        File "/usr/lib/python3.6/site-packages/retrying.py", line 229, in call
          raise attempt.get()
        File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
          six.reraise(self.value[0], self.value[1], self.value[2])
        File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
          raise value
        File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
          attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
        File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 277, in reserve_node
          self.node_id)
        File "/usr/lib/python3.6/site-packages/ironic/objects/node.py", line 340, in reserve
          db_node = cls.dbapi.reserve_node(tag, node_id)
        File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper
          ectxt.value = e.inner_exc
        File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
          self.force_reraise()
        File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
          six.reraise(self.type_, self.value, self.tb)
        File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
          raise value
        File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper
          return f(*args, **kwargs)
        File "/usr/lib/python3.6/site-packages/ironic/db/sqlalchemy/api.py", line 450, in reserve_node
          host=node['reservation'])

      ironic.common.exception.NodeLocked: Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is locked by host controller-2.redhat.local, please retry after the current operation is completed.
      format_exception /usr/lib/python3.6/site-packages/wsme/api.py:222

      and there is no occurrence of the uuid 4a0dab7d-a5e9-4956-a5cb-c3714be5813f in either nova-compute log.

      — Additional comment from melanie witt on 2020-12-17 01:05:57 UTC —

      After spending more time digging into the logs I think this bug is a duplicate of https://bugs.launchpad.net/nova/+bug/1853009 upstream.

      The tl;dr is that when there are multiple nova-compute services using ironic and they start at the same time and race, it is possible they will take ownership of one another's ironic nodes during a hash ring rebalance, resulting in deletion of compute nodes from nova that are not restored. (One nova-compute will delete the compute node out from under another nova-compute).
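
      A toy illustration of the rebalance (not nova's or ironic's actual hash-ring code; the host names and ring parameters here are made up purely for the example): each nova-compute builds a consistent-hash ring from the set of compute services it currently sees, so as services come up or are reported down at different moments, the same ironic node UUID can map to different hosts, and two services can briefly disagree about who owns it.

      import bisect
      import hashlib

      def _hash(key):
          return int(hashlib.md5(key.encode()).hexdigest(), 16)

      class ToyHashRing:
          # Deliberately simplified; the real ring uses more partitions and
          # different hashing, but the failure mode is the same.
          def __init__(self, hosts, replicas=16):
              self.ring = sorted(
                  (_hash("%s-%d" % (host, i)), host)
                  for host in hosts
                  for i in range(replicas)
              )
              self.keys = [k for k, _ in self.ring]

          def get_host(self, node_uuid):
              idx = bisect.bisect(self.keys, _hash(node_uuid)) % len(self.ring)
              return self.ring[idx][1]

      node = "4a0dab7d-a5e9-4956-a5cb-c3714be5813f"
      # Each membership set below is what one nova-compute might see at a given
      # instant while the services race to start; the owner of the node can change
      # as the membership changes, which is what shuffles the compute node record around.
      for members in (["controller-0"],
                      ["controller-0", "controller-2"],
                      ["controller-0", "controller-1", "controller-2"]):
          print(members, "->", ToyHashRing(members).get_host(node))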

      I think the only possible workaround for this bug is to avoid starting multiple nova-computes using ironic at the same time and pause between the start of each one.
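
      A rough sketch of that workaround (the controller host names are from this environment, but the systemd unit name for the ironic nova-compute service and the 120-second pause are assumptions; adjust both to the actual deployment):

      import subprocess
      import time

      CONTROLLERS = ["controller-0", "controller-1", "controller-2"]
      # Hypothetical unit name; on a containerized deployment this may instead be a
      # tripleo_* unit or a podman container restart.
      NOVA_COMPUTE_UNIT = "tripleo_nova_compute.service"

      for host in CONTROLLERS:
          # Restart one nova-compute at a time...
          subprocess.run(["ssh", host, "sudo", "systemctl", "restart", NOVA_COMPUTE_UNIT],
                         check=True)
          # ...and give its resource tracker time to claim its ironic nodes before
          # the next service (re)joins the hash ring.
          time.sleep(120)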

      Details from this job run:

      At first, node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f was owned by controller-0:

      2020-12-08 02:31:20.179 8 WARNING nova.compute.resource_tracker [req-867b657a-3c5e-409f-85e1-39b780d206c3 - - - - -] No compute node record for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-0.redhat.local could not be found.
      2020-12-08 02:31:20.236 8 INFO nova.compute.resource_tracker [req-867b657a-3c5e-409f-85e1-39b780d206c3 - - - - -] Compute node record created for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f with uuid: 4a0dab7d-a5e9-4956-a5cb-c3714be5813f

      Then it moved to controller-2:

      2020-12-08 02:31:38.686 8 WARNING nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] No compute node record for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-2.redhat.local could not be found.
      2020-12-08 02:31:38.702 8 INFO nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-0.redhat.local to controller-2.redhat.local
      [...]
      2020-12-08 02:31:38.820 8 DEBUG nova.virt.ironic.driver [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is not ready for a deployment, reporting resources as reserved for it. Node's provision state is manageable, power state is power off and maintenance is False. update_provider_tree /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:944
      [...]
      2020-12-08 02:31:38.918 8 DEBUG nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] Compute_service record updated for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

      Then it moved to controller-1:

      2020-12-08 03:12:21.533 7 WARNING nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] No compute node record for controller-1.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-1.redhat.local could not be found.
      2020-12-08 03:12:21.562 7 INFO nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-2.redhat.local to controller-1.redhat.local
      [...]
      2020-12-08 03:12:22.883 7 DEBUG nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] Compute_service record updated for controller-1.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

      Then it moved to controller-2:

      2020-12-08 03:12:21.700 7 WARNING nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] No compute node record for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-2.redhat.local could not be found.
      2020-12-08 03:12:21.729 7 INFO nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-1.redhat.local to controller-2.redhat.local
      [...]
      2020-12-08 03:12:22.118 7 DEBUG nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] Compute_service record updated for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

      Controller-2 is seen as down.

      Then it moved to controller-0:

      2020-12-08 03:12:23.881 7 WARNING nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] No compute node record for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-0.redhat.local could not be found.
      2020-12-08 03:12:23.900 7 INFO nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-2.redhat.local to controller-0.redhat.local
      [...]
      2020-12-08 03:12:24.769 7 DEBUG nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] Compute_service record updated for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

      Then controller-0 decided the node was owned by controller-2 and deleted its own compute node record. controller-0 nova-compute.log:

      2020-12-08 03:12:31.096 7 DEBUG nova.virt.ironic.driver [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Hash ring members are {'controller-0.redhat.local', 'controller-2.redhat.local'} _refresh_hash_ring /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:795
      2020-12-08 03:12:31.174 7 DEBUG nova.virt.ironic.driver [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Returning 0 available node(s) get_available_nodes /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:885
      2020-12-08 03:12:31.192 7 INFO nova.compute.manager [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Deleting orphan compute node 8 hypervisor host is 4a0dab7d-a5e9-4956-a5cb-c3714be5813f, nodes are set()

      Finally, no nova-compute ever re-created the compute node, so it never appears in the hypervisor list again.

      I will work on refreshing the upstream patches and clone this rhbz for OSP17 where the master changes will land.

      — Additional comment from RHEL Program Management on 2020-12-17 01:06:00 UTC —

      This item has been properly Triaged and planned for the release, and Target Release is now set to match the release flag. For details, see https://mojo.redhat.com/docs/DOC-1195410

              mwitt@redhat.com melanie witt
              jira-bugzilla-migration RH Bugzilla Integration
              rhos-workloads-compute