Type: Bug
Resolution: Duplicate
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.20
Component/s: Networking / kubernetes-nmstate
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
True
Blocked Reason:

Caused by NMT-1617
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Observed in a CI job for the OVN-Kubernetes BGP integration feature:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528

One of the test cases deploys nmstate. During or shortly after deployment, the liveness probes of several nmstate handler pods fail. On one node in particular, this causes the handler pod to restart. A NetworkManager (NM) rollback is also performed on that node, destroying a lot of network configuration that was applied after the initial deployment and was not managed with nmstate at all.

From the API events https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/baremetalds-e2e-test/artifacts/junit/e2e-events_20250519-111636.json we can observe the liveness probe failing and the container being restarted:

        {
            "level": "Warning",
            "source": "KubeEvent",
            "locator": {
                "type": "Kind",
                "keys": {
                    "hmsg": "b2ec70b04b",
                    "namespace": "openshift-nmstate",
                    "node": "worker-1.ostest.test.metalkube.org",
                    "pod": "nmstate-handler-xz77k"
                }
            },
            "message": {
                "reason": "Unhealthy",
                "cause": "",
                "humanMessage": "Liveness probe failed: command timed out",
                "annotations": {
                    "count": "5",
                    "firstTimestamp": "2025-05-19T12:09:00Z",
                    "lastTimestamp": "2025-05-19T12:13:00Z",
                    "reason": "Unhealthy"
                }
            },
            "from": "2025-05-19T12:13:00Z",
            "to": "2025-05-19T12:13:01Z"
        },

        {
            "level": "Info",
            "source": "KubeEvent",
            "locator": {
                "type": "Kind",
                "keys": {
                    "hmsg": "63aa48e85a",
                    "namespace": "openshift-nmstate",
                    "node": "worker-1.ostest.test.metalkube.org",
                    "pod": "nmstate-handler-xz77k"
                }
            },
            "message": {
                "reason": "Killing",
                "cause": "",
                "humanMessage": "Container nmstate-handler failed liveness probe, will be restarted",
                "annotations": {
                    "container": "nmstate-handler",
                    "firstTimestamp": "2025-05-19T12:13:00Z",
                    "lastTimestamp": "2025-05-19T12:13:00Z",
                    "reason": "Killing"
                }
            },
            "from": "2025-05-19T12:13:00Z",
            "to": "2025-05-19T12:13:00Z"
        },
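To pull the same probe-related events out of the events file without scrolling through it, something like the following could be used. It is a minimal sketch: the field names are taken from the two events quoted above, while the function name `probe_events` and the way the list of events is obtained from the file are assumptions (the top-level layout of the file is not shown here).

```python
def probe_events(events, namespace="openshift-nmstate"):
    """Return (from, pod, reason, humanMessage) tuples for probe-related events.

    `events` is a list of dicts shaped like the two events quoted above.
    """
    hits = []
    for ev in events:
        keys = ev.get("locator", {}).get("keys", {})
        msg = ev.get("message", {})
        if keys.get("namespace") != namespace:
            continue
        # "Unhealthy" = probe failure, "Killing" = container restart decision
        if msg.get("reason") not in ("Unhealthy", "Killing"):
            continue
        hits.append((ev.get("from"), keys.get("pod"),
                     msg.get("reason"), msg.get("humanMessage")))
    # Stable sort by start time; events with equal timestamps keep input order
    return sorted(hits, key=lambda h: h[0] or "")


# Trimmed copies of the two events from this report. For the real file,
# load it with json.load() and pass the list it contains (the key that
# holds the list is not shown in this report, so check the file first).
sample = [
    {
        "level": "Warning",
        "locator": {"keys": {"namespace": "openshift-nmstate",
                             "node": "worker-1.ostest.test.metalkube.org",
                             "pod": "nmstate-handler-xz77k"}},
        "message": {"reason": "Unhealthy",
                    "humanMessage": "Liveness probe failed: command timed out"},
        "from": "2025-05-19T12:13:00Z",
    },
    {
        "level": "Info",
        "locator": {"keys": {"namespace": "openshift-nmstate",
                             "node": "worker-1.ostest.test.metalkube.org",
                             "pod": "nmstate-handler-xz77k"}},
        "message": {"reason": "Killing",
                    "humanMessage": "Container nmstate-handler failed liveness probe, will be restarted"},
        "from": "2025-05-19T12:13:00Z",
    },
]

result = probe_events(sample)
for hit in result:
    print(*hit)
```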

Restarted pod logs show no issue
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/gather-extra/artifacts/pods/openshift-nmstate_nmstate-handler-xz77k_nmstate-handler_previous.log
but regardless, the liveness probe just runs "nmstatectl show", which strangely has nothing to do with the liveness or health of the nmstate handler pod itself.
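Since the probe failure is reported as "command timed out", one way to check whether the command itself hangs is to run it by hand under a timeout and record how long it takes. The sketch below is only an illustration: the helper name `run_with_timeout` and the 10 second value are assumptions, not the probe's actual configuration, and it has to run somewhere "nmstatectl" is available (for example inside the handler container).

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok", "failed" (non-zero exit), "timeout" or "not-found".
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception
        status = "timeout"
    except FileNotFoundError:
        status = "not-found"
    return status, time.monotonic() - start


# Example call (10.0 is an assumed value, not the probe's real timeout):
#   status, elapsed = run_with_timeout(["nmstatectl", "show"], timeout_s=10.0)
#   print(status, round(elapsed, 2))
```

Running this repeatedly around the time nmstate is deployed would show whether "nmstatectl show" is slow or stuck, independently of the handler pod's own health.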

The way the liveness probe is set up makes it difficult to troubleshoot exactly why it failed. The node journal
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29803/pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/1924394090791702528/artifacts/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview/gather-extra/artifacts/nodes/worker-1.ostest.test.metalkube.org/journal
does not show any issue at the NM level.

However, in that same journal we can see more serious consequences than a simple pod restart should have. It looks like a rollback is being performed, which destroys configuration, including configuration not managed with nmstate:

May 19 12:15:24.914695 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9146] checkpoint[0x55e1fec45950]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
...
May 19 12:15:24.954682 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9546] device (extranet): detached VRF port enp3s0
...
May 19 12:15:25.357489 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656925.3574] device (test-net-4dxzf): detached VRF port ovn-k8s-mp56

enp3s0 is detached from the extranet VRF, which might make sense since we have an nmstate policy for that.
But ovn-k8s-mp56 is also detached from the test-net-4dxzf VRF, and that attachment, while managed with NM, is not managed with nmstate.
This interferes with other tests running in parallel.
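To list everything a rollback detached on a node, the journal can be scanned for the two message shapes quoted above. This is a minimal sketch: the regular expressions are written against those three journal lines only, and the function name `rollback_effects` is hypothetical.

```python
import re

# Patterns derived from the NetworkManager journal lines quoted in this report
ROLLBACK = re.compile(r"checkpoint\[0x[0-9a-f]+\]: rollback of (?P<path>\S+)")
DETACH = re.compile(r"device \((?P<vrf>[^)]+)\): detached VRF port (?P<port>\S+)")


def rollback_effects(lines):
    """Return (checkpoint_paths, [(vrf, port), ...]) found in journal lines."""
    checkpoints, detached = [], []
    for line in lines:
        if "NetworkManager[" not in line:
            continue  # only NetworkManager entries are of interest
        m = ROLLBACK.search(line)
        if m:
            checkpoints.append(m.group("path"))
            continue
        m = DETACH.search(line)
        if m:
            detached.append((m.group("vrf"), m.group("port")))
    return checkpoints, detached


journal = [
    "May 19 12:15:24.914695 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9146] checkpoint[0x55e1fec45950]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1",
    "May 19 12:15:24.954682 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656924.9546] device (extranet): detached VRF port enp3s0",
    "May 19 12:15:25.357489 worker-1.ostest.test.metalkube.org NetworkManager[1170]: <info>  [1747656925.3574] device (test-net-4dxzf): detached VRF port ovn-k8s-mp56",
]

checkpoints, detached = rollback_effects(journal)
print(checkpoints)
print(detached)
```

Comparing the detached ports against the interfaces named in the nmstate policies would then separate expected changes (enp3s0) from collateral ones (ovn-k8s-mp56).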

So in this situation, I think a cluster operator would want to know, without too much complication:
Why did the liveness probe fail?
Why did a rollback happen?
I am trying to answer these two questions but will probably need help.

duplicates

OCPBUGS-29266 nmstate-handler failed a probe

ON_QA

is caused by

RHEL-93154 nmstatectl show stuck forever

Closed

is depended on by

CORENET-6015 BGP External Issue tracker

In Progress

links to

nmstatectl show root cause issue

Assignee: Mat Kowalski

Reporter: Jaime Caamaño Ruiz

Need Info From: None

Contributors: None

QA Contact: Ross Brattain

Doc Contact: None

Votes: 0

Watchers: 4

Created: 2025/05/20 11:11 AM

Updated: 2025/07/13 1:08 PM

Resolved: 2025/05/22 3:22 PM
