- Bug
- Resolution: Duplicate
- Undefined
- 4.20
- Quality / Stability / Reliability
- True
- Important
- Rejected
Observed in CI job for the OVN-Kubernetes BGP integration feature:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528
One of the test cases deploys nmstate. During or shortly after deployment, the liveness probes of several nmstate handler pods fail. On one node in particular, this causes a restart of the handler pod. A NetworkManager rollback is also observed on that node, destroying a lot of network configuration that was applied since initial deployment and that was not managed with nmstate whatsoever.
From the API events https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/baremetalds-e2e-test/artifacts/junit/e2e-events_20250519-111636.json we can observe the liveness probe failing and the subsequent restart:
{ "level": "Warning", "source": "KubeEvent", "locator": { "type": "Kind", "keys": { "hmsg": "b2ec70b04b", "namespace": "openshift-nmstate", "node": "worker-1.ostest.test.metalkube.org", "pod": "nmstate-handler-xz77k" } }, "message": { "reason": "Unhealthy", "cause": "", "humanMessage": "Liveness probe failed: command timed out", "annotations": { "count": "5", "firstTimestamp": "2025-05-19T12:09:00Z", "lastTimestamp": "2025-05-19T12:13:00Z", "reason": "Unhealthy" } }, "from": "2025-05-19T12:13:00Z", "to": "2025-05-19T12:13:01Z"
{ "level": "Info", "source": "KubeEvent", "locator": { "type": "Kind", "keys": { "hmsg": "63aa48e85a", "namespace": "openshift-nmstate", "node": "worker-1.ostest.test.metalkube.org", "pod": "nmstate-handler-xz77k" } }, "message": { "reason": "Killing", "cause": "", "humanMessage": "Container nmstate-handler failed liveness probe, will be restarted", "annotations": { "container": "nmstate-handler", "firstTimestamp": "2025-05-19T12:13:00Z", "lastTimestamp": "2025-05-19T12:13:00Z", "reason": "Killing" } }, "from": "2025-05-19T12:13:00Z", "to": "2025-05-19T12:13:00Z" },2025-05-16T16:21:33Z" },
Restarted pod logs show no issue
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/gather-extra/artifacts/pods/openshift-nmstate_nmstate-handler-xz77k_nmstate-handler_previous.log
But regardless, the liveness probe just runs "nmstatectl show", which strangely has nothing to do with the liveness or health of the nmstate handler pod itself.
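To double-check what the probe actually runs, and how long "nmstatectl show" really takes on that node, something like the following should work (assuming the handler pods come from a DaemonSet named nmstate-handler; the container name nmstate-handler is taken from the events above):

# Print the probe definition as deployed
oc -n openshift-nmstate get daemonset nmstate-handler \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="nmstate-handler")].livenessProbe}'

# Reproduce the probe command by hand and time it
time oc -n openshift-nmstate exec nmstate-handler-xz77k -- nmstatectl show > /dev/null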
The way the liveness probe is set up makes it difficult to troubleshoot exactly why it failed. The node journal
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/gather-extra/artifacts/nodes/worker-1.ostest.test.metalkube.org/journal
does not show any issue at the NetworkManager level.
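To narrow that down, it is easier to pull just the NetworkManager unit around the failure window (roughly 12:09 to 12:15 UTC) than to read the full journal dump, for example (against a live cluster; the same grep works on the journal file linked above):

oc adm node-logs worker-1.ostest.test.metalkube.org -u NetworkManager \
  | grep -E 'May 19 12:(09|1[0-5])'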
However, in that same journal we can see more serious consequences than what a simple pod restart should have. It looks like a rollback is being performed, which causes configuration, even configuration not managed with nmstate, to be destroyed:
May 19 12:15:24.914695 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info> [1747656924.9146] checkpoint[0x55e1fec45950]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
...
May 19 12:15:24.954682 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info> [1747656924.9546] device (extranet): detached VRF port enp3s0
...
May 19 12:15:25.357489 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info> [1747656925.3574] device (test-net-4dxzf): detached VRF port ovn-k8s-mp56
enp3s0 is detached from the extranet VRF, which might make sense since we have an nmstate policy for that.
But ovn-k8s-mp56 is also detached from the test-net-4dxzf VRF, which, while managed with NetworkManager, is not managed with nmstate.
This interferes with other tests running in parallel.
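The damage can be confirmed directly on the node (for example from an "oc debug node/worker-1.ostest.test.metalkube.org" shell) by checking VRF membership with iproute2; if the rollback really detached ovn-k8s-mp56, it will no longer list test-net-4dxzf as its master:

ip -d link show type vrf             # list the VRF devices (extranet, test-net-4dxzf, ...)
ip link show master test-net-4dxzf   # interfaces still enslaved to that VRF
ip link show ovn-k8s-mp56            # the "master ..." field names the VRF if still attached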
So in this situation, I think a cluster operator would like to know without too much complication:
- why did the liveness probe fail?
- why did a rollback happen?
I am trying, but I probably need help finding answers to these two questions.
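For the rollback question in particular, a starting point is to pull the checkpoint lifecycle out of the same node journal and out of the previous handler log, and correlate it with the handler restart at 12:13:

oc adm node-logs worker-1.ostest.test.metalkube.org -u NetworkManager \
  | grep -iE 'checkpoint|rollback'
oc -n openshift-nmstate logs nmstate-handler-xz77k --previous | grep -i checkpoint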
- duplicates: OCPBUGS-29266 nmstate-handler failed a probe (ON_QA)
- is caused by: RHEL-93154 nmstatectl show stuck forever (Release Pending)
- is depended on by: CORENET-6015 BGP External Issue tracker (In Progress)
- links to