  1. OpenShift Bugs
  2. OCPBUGS-56488

knmstate liveness probe fails and causes a destructive NM rollback


    • Bug
    • Resolution: Duplicate
    • 4.20
    • Quality / Stability / Reliability
    • True
    • Caused by NMT-1617
    • Important
    • Rejected

      Observed in a CI job for the OVN-Kubernetes BGP integration feature:
      https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528

      One of the test cases deploys nmstate. During or shortly after deployment, the liveness probes of several nmstate handler pods fail. On one node in particular, this causes the handler pod to restart. An NM rollback is also performed on that node, destroying a lot of network configuration that was applied since the initial deployment and that was never managed with nmstate.

      From the API events https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/baremetalds-e2e-test/artifacts/junit/e2e-events_20250519-111636.json we can observe the liveness probe failing and the container being restarted:

              {
                  "level": "Warning",
                  "source": "KubeEvent",
                  "locator": {
                      "type": "Kind",
                      "keys": {
                          "hmsg": "b2ec70b04b",
                          "namespace": "openshift-nmstate",
                          "node": "worker-1.ostest.test.metalkube.org",
                          "pod": "nmstate-handler-xz77k"
                      }
                  },
                  "message": {
                      "reason": "Unhealthy",
                      "cause": "",
                      "humanMessage": "Liveness probe failed: command timed out",
                      "annotations": {
                          "count": "5",
                          "firstTimestamp": "2025-05-19T12:09:00Z",
                          "lastTimestamp": "2025-05-19T12:13:00Z",
                          "reason": "Unhealthy"
                      }
                  },
                  "from": "2025-05-19T12:13:00Z",
                  "to": "2025-05-19T12:13:01Z"
              },
      
              {
                  "level": "Info",
                  "source": "KubeEvent",
                  "locator": {
                      "type": "Kind",
                      "keys": {
                          "hmsg": "63aa48e85a",
                          "namespace": "openshift-nmstate",
                          "node": "worker-1.ostest.test.metalkube.org",
                          "pod": "nmstate-handler-xz77k"
                      }
                  },
                  "message": {
                      "reason": "Killing",
                      "cause": "",
                      "humanMessage": "Container nmstate-handler failed liveness probe, will be restarted",
                      "annotations": {
                          "container": "nmstate-handler",
                          "firstTimestamp": "2025-05-19T12:13:00Z",
                          "lastTimestamp": "2025-05-19T12:13:00Z",
                          "reason": "Killing"
                      }
                  },
                  "from": "2025-05-19T12:13:00Z",
                  "to": "2025-05-19T12:13:00Z"
              },
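
      To pull only the probe-related events out of a large events file, a filter along these lines may help. This is a minimal sketch: the sample data reuses the two excerpts above with fields trimmed, and because the top-level layout of the events file is not shown here, the hypothetical `probe_events` helper takes an already-extracted list of event objects.

```python
import json

# Two events shaped like the excerpts above (fields trimmed for brevity).
SAMPLE_EVENTS = json.loads("""
[
  {"level": "Warning",
   "locator": {"keys": {"namespace": "openshift-nmstate",
                        "node": "worker-1.ostest.test.metalkube.org",
                        "pod": "nmstate-handler-xz77k"}},
   "message": {"reason": "Unhealthy",
               "humanMessage": "Liveness probe failed: command timed out"},
   "from": "2025-05-19T12:13:00Z"},
  {"level": "Info",
   "locator": {"keys": {"namespace": "openshift-nmstate",
                        "node": "worker-1.ostest.test.metalkube.org",
                        "pod": "nmstate-handler-xz77k"}},
   "message": {"reason": "Killing",
               "humanMessage": "Container nmstate-handler failed liveness probe, will be restarted"},
   "from": "2025-05-19T12:13:00Z"}
]
""")


def probe_events(events, namespace="openshift-nmstate"):
    """Return (from, node, pod, reason, humanMessage) tuples for events in
    `namespace` whose reason is Unhealthy or Killing, ordered by `from`."""
    rows = []
    for ev in events:
        keys = ev.get("locator", {}).get("keys", {})
        msg = ev.get("message", {})
        if keys.get("namespace") != namespace:
            continue
        if msg.get("reason") not in ("Unhealthy", "Killing"):
            continue
        rows.append((ev.get("from"), keys.get("node"), keys.get("pod"),
                     msg.get("reason"), msg.get("humanMessage")))
    # Stable sort on the timestamp string; missing timestamps sort first.
    return sorted(rows, key=lambda r: r[0] or "")


for row in probe_events(SAMPLE_EVENTS):
    print(*row, sep=" | ")
```

      Grouping the resulting rows by node would show whether the probe failures cluster on worker-1 or are spread across the handler pods.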
      

      Restarted pod logs show no issue
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/gather-extra/artifacts/pods/openshift-nmstate_nmstate-handler-xz77k_nmstate-handler_previous.log
      but regardless, the liveness probe just runs "nmstatectl show", which strangely has nothing to do with the liveness or health of the nmstate handler pod itself.
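
      Since the events only report "command timed out", one way to narrow this down could be to time the probe command by hand under a deadline. The sketch below is an assumption-laden helper, not the probe's implementation: `time_command` is a hypothetical name, and the 5 second value in the example comment is a placeholder because the probe's configured timeout is not stated here.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout_s):
    """Run `cmd` with a deadline and return (timed_out, elapsed_seconds).

    Output is discarded; only the duration and whether the deadline
    was hit are reported.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout_s, check=False)
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        timed_out = True
    return timed_out, time.monotonic() - start


# Example, to be run where nmstatectl is available (placeholder timeout):
#   time_command(["nmstatectl", "show"], timeout_s=5.0)

# Self-contained demonstration with a command that sleeps past the deadline.
slow = [sys.executable, "-c", "import time; time.sleep(10)"]
print(time_command(slow, timeout_s=0.5))
```

      Running this repeatedly while the test suite is reconfiguring the network might show whether "nmstatectl show" is merely slow under load or actually hangs.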

      The way the liveness probe is set up makes it difficult to troubleshoot exactly why it failed. The node journal
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/gather-extra/artifacts/nodes/worker-1.ostest.test.metalkube.org/journal
      does not show any issue at the NM level.

      However, in that same journal we can see consequences more serious than a simple pod restart should have. It looks like a rollback is performed, which destroys configuration, including configuration not managed with nmstate:

      May 19 12:15:24.914695 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9146] checkpoint[0x55e1fec45950]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
      ...
      May 19 12:15:24.954682 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9546] device (extranet): detached VRF port enp3s0
      ...
      May 19 12:15:25.357489 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656925.3574] device (test-net-4dxzf): detached VRF port ovn-k8s-mp56
      

      enp3s0 is detached from the extranet VRF, which might make sense since we have an nmstate policy for that.
      But ovn-k8s-mp56 is also detached from the test-net-4dxzf VRF, and that attachment, while managed with NM, is not managed with nmstate.
      This interferes with other tests running in parallel.
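
      To list every such detachment in the full journal rather than reading it by eye, a small parser could be used. This is a sketch only: the regular expressions are derived from the three journal lines quoted above, and the set of nmstate-managed ports passed to the hypothetical `summarize` helper is an assumption the caller has to supply.

```python
import re

# Patterns derived from the NetworkManager journal lines quoted above.
ROLLBACK_RE = re.compile(r"checkpoint\[(0x[0-9a-f]+)\]: rollback of (\S+)")
DETACH_RE = re.compile(r"device \(([^)]+)\): detached VRF port (\S+)")

JOURNAL = """\
May 19 12:15:24.914695 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9146] checkpoint[0x55e1fec45950]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
May 19 12:15:24.954682 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9546] device (extranet): detached VRF port enp3s0
May 19 12:15:25.357489 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656925.3574] device (test-net-4dxzf): detached VRF port ovn-k8s-mp56
"""


def summarize(journal_text, nmstate_managed_ports):
    """Return (rollback checkpoint paths, [(vrf, port)] detachments whose
    port is NOT in nmstate_managed_ports)."""
    rollbacks, unexpected = [], []
    for line in journal_text.splitlines():
        m = ROLLBACK_RE.search(line)
        if m:
            rollbacks.append(m.group(2))
            continue
        m = DETACH_RE.search(line)
        if m and m.group(2) not in nmstate_managed_ports:
            unexpected.append((m.group(1), m.group(2)))
    return rollbacks, unexpected


# Assumption: enp3s0 is the only port covered by an nmstate policy.
print(summarize(JOURNAL, {"enp3s0"}))
```

      With enp3s0 treated as nmstate-managed, only the ovn-k8s-mp56 detachment from test-net-4dxzf is reported as unexpected.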

      So in this situation, I think a cluster operator would like to know, without too much complication:
      Why did the liveness probe fail?
      Why did a rollback happen?
      I am trying to find answers to these two questions, but I probably need help.

              mkowalsk@redhat.com Mat Kowalski
              jcaamano@redhat.com Jaime Caamaño Ruiz
              Ross Brattain