Red Hat Workload Availability / RHWA-13

SNR doesn't reboot the node when the network is shut down

      When a node has Self Node Remediation (SNR) and Node Health Check (NHC) enabled, the node will not reboot when the network is shut down. (RHWA-13)

      1) Cause: An empty Pod IP was included in the list of peers contacted for the api-check connectivity status.
      2) Consequence: The api-check request was sent to localhost (default, in case of an empty IP), and this resulted in the pod contacting itself.
      3) Fix: Ensure empty Pod IPs are excluded from the list of peers to be checked.
      4) Result: The pod can only contact other peers to verify connectivity.
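
      The fix described above can be illustrated with a minimal Go sketch. It is not the actual SNR code: the function name `filterPeerIPs` and the sample addresses are hypothetical, and the sketch only shows the idea of dropping empty Pod IPs before the peer list is used for the api-check.

```go
package main

import "fmt"

// filterPeerIPs is a hypothetical helper (not taken from the SNR source).
// It returns only the non-empty pod IPs, so that an empty entry can never
// be turned into a request aimed at the local host.
func filterPeerIPs(podIPs []string) []string {
	peers := make([]string, 0, len(podIPs))
	for _, ip := range podIPs {
		if ip == "" {
			// Skip pods that have no IP assigned (yet); they are not valid peers.
			continue
		}
		peers = append(peers, ip)
	}
	return peers
}

func main() {
	// Sample addresses are made up for illustration.
	fmt.Println(filterPeerIPs([]string{"10.128.0.12", "", "10.129.0.7"}))
	// prints: [10.128.0.12 10.129.0.7]
}
```

      With the empty entry removed, every remaining address refers to another pod, which matches the stated result that the pod can only contact other peers.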
    • Bug Fix
    • Done
    • Critical

      Hello Michael,

      As suggested, I'm opening this one.

      Our customer's node is not rebooted when they shut down the network on a node with NHC and SNR in place.

      Basically these are the steps:

      1. They shut down the interface on the node.
      2. Remediation seems to take place: the node is drained but NOT rebooted.
      3. Once connectivity is restored, the node is rebooted.

      What you found while analysing the SNR logs during troubleshooting is that the node is somehow still communicating with peers, so it is not marked as isolated and is not rebooted.

      The customer also tried to completely disable networking on the node, and the behaviour is the same.

      I'm attaching the last SNR agent log to this Jira.

      In that log, the interface was shut down on
      Wed Apr 23 08:32:15 UTC 2025.
      The customer noticed these lines:

       

      INFO api-check getting health status from peer {"IP": ""}
      INFO api-check.peerhealth client new peer client {"serveraddr": ":30001"}
      ...
      INFO api-check got response from peer {"IP": "", "status": 3}

      and is asking whether this communication attempt could be the reason the node is not marked unhealthy.
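
      As a hedged illustration of why the log shows `"serveraddr": ":30001"`, the Go sketch below joins a peer IP and a port with the standard library's `net.JoinHostPort`. How the SNR agent actually builds the address is an assumption here, and `peerServerAddr` and the sample IP are hypothetical; port 30001 is taken from the log lines above. With an empty IP the result has no host part, and Go's `net.Dial` documentation states that an empty host such as `":80"` is taken to mean the local system.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// peerServerAddr is a hypothetical helper (not taken from the SNR source)
// that builds the "host:port" address a peer health client would dial.
func peerServerAddr(podIP string, port int) string {
	return net.JoinHostPort(podIP, strconv.Itoa(port))
}

func main() {
	// A peer with a real (made-up) pod IP yields a normal remote address.
	fmt.Println(peerServerAddr("10.128.0.12", 30001)) // prints: 10.128.0.12:30001

	// An empty pod IP yields an address with no host part. Dialing such an
	// address targets the local system, not a remote peer.
	fmt.Println(peerServerAddr("", 30001)) // prints: :30001
}
```

      Under that assumption, a successful response from the peer with `"IP": ""` would come from the node's own agent, which is consistent with the node appearing to still reach a peer while its network is down.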

      Let me know if you need anything else from the customer end.

      We have already collected and shared a considerable amount of data with you.

       

              rh-ee-clobrano Carlo Lobrano
              rhn-support-ldavidde Luca Davidde