Uploaded image for project: 'Red Hat Workload Availability'
  1. Red Hat Workload Availability
  2. RHWA-139

Reboot storm during cluster-restore caused by SNR

XMLWordPrintable

    • False
    • Hide

      None

      Show
      None
    • False
    • Hide
      When the API Server experiences issues, the Self Node Remediation (SNR) Operator might trigger unnecessary reboots of the nodes. (ECOPROJECT-2642)

      1) Cause: When the API server experiences issues, it might have caused the SNR peer check to fail. A request timeout on the peer server side did not always work as expected, and the peer’s response was sent too late. This might have caused the peer client on the node to think that it is isolated, and to trigger a reboot.
      2) Consequence: Nodes are unnecessarily rebooted when the API server experienced issues.
      3) Fix: The timeout handling on the peer check server was enhanced.
      4) Result: API server issues no longer result in unnecessary node reboots.
      Show
      When the API Server experiences issues, the Self Node Remediation (SNR) Operator might trigger unnecessary reboots of the nodes. ( ECOPROJECT-2642 ) 1) Cause: When the API server experiences issues, it might have caused the SNR peer check to fail. A request timeout on the peer server side did not always work as expected, and the peer’s response was sent too late. This might have caused the peer client on the node to think that it is isolated, and to trigger a reboot. 2) Consequence: Nodes are unnecessarily rebooted when the API server experienced issues. 3) Fix: The timeout handling on the peer check server was enhanced. 4) Result: API server issues no longer result in unnecessary node reboots.
    • Bug Fix
    • In Progress

      During a PoC we trying to proof a cluster-store and it failed because of a reboot storm caused by SNR.

       

      Cluster design/size:

      • 3 schedule able control plane nodes
      • 4 worker nodes

       

      Versions: OpenShift v4.17 

      Details follow with must-gather

       

      Our procedure:

      1. Power off / hard shutdown 2 of 3 control plane nodes
        => API is not available any more, as expected
      2. Start the restore based on our documentation: 5.3.2. Restoring to a previous cluster state 
      3. Etcd backup
      4. Start cluster-restore.sh
        SNR kicked in and rebooted the master node and all other nodes.

      => Test failed!

       

              slintes Marc Sluiter
              rbohne Robert Bohne
              Joachim von Thadden, Marc Schindler
              Votes:
              2 Vote for this issue
              Watchers:
              25 Start watching this issue

                Created:
                Updated:
                Resolved: