OpenShift Bugs / OCPBUGS-8070

Egress router pods in pending state post upgrading cluster to 4.11

    • Critical
    • Yes
    • SDN Sprint 232, SDN Sprint 233, SDN Sprint 234, SDN Sprint 235
    • 4
    • Rejected
    • False
    • N/A
    • Release Note Not Required
    • Customer Escalated

      Description of problem:

      After upgrading the cluster from 4.10.47 to 4.11.25, an issue is observed with the egress router pods: they remain stuck in the Pending state.
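      A hedged way to observe the symptom (the namespace and the "egress-router" name pattern are assumptions for illustration):

      $ oc -n <namespace> get pods | grep -i egress-router          # pods shown as Pending
      $ oc -n <namespace> describe pod <egress-router-pod>          # events show the CNI sandbox error quoted below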

      Version-Release number of selected component (if applicable):

      4.11.25

      How reproducible:

       

      Steps to Reproduce:

      1. Upgrade the cluster from 4.10.47 to 4.11.25.
      2. Check whether the network cluster operator (co network) is in the Managed state (example commands after the error output below).
      3. Verify that egress router pods are not created, with errors like:
      55s         Warning   FailedCreatePodSandBox   pod/******     (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox *******_d6918859-a4e9-4e5b-ba44-acc70499fa7c_0(9c464935ebaeeeab7be0b056c3f7ed1b7279e21445b9febea29eb280f7ee7429): error adding pod ****** to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [ns/pod/d6918859-a4e9-4e5b-ba44-acc70499fa7c:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": unknown FS magic on "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": 1021994
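      A hedged sketch of the checks for steps 2 and 3; the managementState field on the network operator config may be empty if it was never set explicitly:

      $ oc get clusteroperator network
      $ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{"\n"}'
      $ oc get events -A | grep -i FailedCreatePodSandBox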
       
      
      

      Actual results:

      Egress router pods are in the Pending state, with error messages like the one below:
      $ omg get events 
      ...
      49s        Warning  FailedCreatePodSandBox  pod/xxxx  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_xxxx_379fa7ec-4702-446c-9162-55c2f76989f6_0(86f8c76e9724216143bef024996cb14a7614d3902dcf0d3b7ea858298766630c): error adding pod xxx to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [xxxx/xxxx/379fa7ec-4702-446c-9162-55c2f76989f6:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": unknown FS magic on "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": 1021994

      Expected results:

      Egress router pods are in the Running state.

      Additional info:

      The workaround from https://access.redhat.com/solutions/6986283 works:
      Edit the sdn DaemonSet in the openshift-sdn namespace so that the host-run-netns volume mount uses mountPath /host/var/run/netns instead of /var/run/netns:
      - mountPath: /host/var/run/netns
        mountPropagation: HostToContainer
        name: host-run-netns
        readOnly: true
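      A minimal sketch of applying and verifying that edit, assuming the DaemonSet is named sdn in the openshift-sdn namespace as in the solution article:

      $ oc -n openshift-sdn edit ds sdn            # change the host-run-netns mountPath as shown above
      $ oc -n openshift-sdn get ds sdn -o yaml | grep -B1 -A3 'host-run-netns'
      $ oc -n openshift-sdn rollout status ds/sdn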

            [OCPBUGS-8070] Egress router pods in pending state post upgrading cluster to 4.11

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:5006


            Jakub Chalamonski (Inactive) added a comment -

            I found that the solution provided in https://access.redhat.com/solutions/6986283 does not recover the egress-router pods in our cluster. The only valid way to recover the pods is to delete the SDN DaemonSet. After recreation by the CNO, the DaemonSet does not contain the "host-var-run-netns" mountPath at all. I would like to ask whether such a situation brings any additional risks.
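            Roughly what that comment describes, as a hedged sketch (deleting the DaemonSet so that the CNO recreates it):

            $ oc -n openshift-sdn delete ds sdn                                  # CNO recreates the DaemonSet
            $ oc -n openshift-sdn get ds sdn -o yaml | grep -c 'host-run-netns'  # 0 matches after recreation, per the comment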

            Jean Chen added a comment -

            OCP-63155 was created and automated to add test coverage for this bug.


            Martin Kennelly added a comment -

            rhn-support-sgurnale This will not get backported to 4.11 unless PMs say so, because it happens so rarely. An upgrade is required to consume the fix.

            Weibin Liang added a comment -

            jechen@redhat.com As your reference: https://github.com/openshift/openshift-tests-private/blob/release-4.11/test/extended/networking/egressrouter.go#L34

            Zhanqi Zhao added a comment -

            cc anusaxen jechen@redhat.com weliang1@redhat.com to see if you can help verify this during the China holiday, thanks


            Martin Kennelly added a comment -

            Update: Waiting on review.

            Martin Kennelly added a comment - edited

            I have isolated the cause of the issue. 

            OCP utilises server-side apply [1]. An object's fields are tracked through the "field management" [2] mechanism. When a field's value changes, ownership can be shared between its current manager and the manager making the change. The CNO manager name changes from "cluster-network-operator" in 4.10.47 to "cluster-network-operator/operconfig" in 4.11.25. So when we attempt to apply a new config that no longer contains some fields, those fields are not removed, because they are still owned by a manager not named "cluster-network-operator/operconfig"; therefore the fields are not removed by an SSA "Apply" operation.

             [1] https://kubernetes.io/docs/reference/using-api/server-side-apply/

             [2] https://kubernetes.io/docs/reference/using-api/server-side-apply/#field-management
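            A hedged way to inspect the field managers described above (the --show-managed-fields flag exists in recent oc/kubectl releases):

            $ oc -n openshift-sdn get ds sdn -o yaml --show-managed-fields | grep -E 'manager:|operation:'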

             


            Martin Kennelly added a comment -

            Talked to mika - agreed the safest workaround is to unmanage the CNO, edit the 'sdn' daemonset according to the solution article, and then immediately place the CNO back into the Managed state. This will allow the daemonset controller to roll out the changes and the security this brings.
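            One possible sequence for that workaround, as a hedged sketch; whether the CNO honors spec.managementState on the network.operator config is an assumption here and should be confirmed before use:

            $ oc patch network.operator.openshift.io cluster --type=merge -p '{"spec":{"managementState":"Unmanaged"}}'
            $ oc -n openshift-sdn edit ds sdn     # apply the mountPath change from the solution article
            $ oc patch network.operator.openshift.io cluster --type=merge -p '{"spec":{"managementState":"Managed"}}'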
