OpenShift Bugs / OCPBUGS-8070

Egress router pods in pending state post upgrading cluster to 4.11

    • Critical
    • Yes
    • SDN Sprint 232, SDN Sprint 233, SDN Sprint 234, SDN Sprint 235
    • 4
    • Rejected
    • False
    • N/A
    • Release Note Not Required
    • Customer Escalated

      Description of problem:

      After upgrading the cluster from 4.10.47 to 4.11.25, an issue is observed with the egress router pods: they remain stuck in the Pending state.
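      A hedged way to observe the symptom (the namespace and the "egress-router" name pattern are assumptions for illustration):

      $ oc -n <namespace> get pods | grep -i egress-router          # pods shown as Pending
      $ oc -n <namespace> describe pod <egress-router-pod>          # events show the CNI sandbox error quoted below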

      Version-Release number of selected component (if applicable):

      4.11.25

      How reproducible:

       

      Steps to Reproduce:

      1. Upgrade the cluster from 4.10.47 to 4.11.25.
      2. Check whether the network cluster operator (co network) is in the Managed state (example commands after the error output below).
      3. Verify that egress router pods are not created, with errors like:
      55s         Warning   FailedCreatePodSandBox   pod/******     (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox *******_d6918859-a4e9-4e5b-ba44-acc70499fa7c_0(9c464935ebaeeeab7be0b056c3f7ed1b7279e21445b9febea29eb280f7ee7429): error adding pod ****** to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [ns/pod/d6918859-a4e9-4e5b-ba44-acc70499fa7c:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": unknown FS magic on "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": 1021994
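      A hedged sketch of the checks for steps 2 and 3; the managementState field on the network operator config may be empty if it was never set explicitly:

      $ oc get clusteroperator network
      $ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{"\n"}'
      $ oc get events -A | grep -i FailedCreatePodSandBox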
       
      
      

      Actual results:

      Egress router pods are in the Pending state, with error messages like the one below:
      $ omg get events 
      ...
      49s        Warning  FailedCreatePodSandBox  pod/xxxx  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_xxxx_379fa7ec-4702-446c-9162-55c2f76989f6_0(86f8c76e9724216143bef024996cb14a7614d3902dcf0d3b7ea858298766630c): error adding pod xxx to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [xxxx/xxxx/379fa7ec-4702-446c-9162-55c2f76989f6:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": unknown FS magic on "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": 1021994

      Expected results:

      Egress router pods are in the Running state.

      Additional info:

      The workaround from https://access.redhat.com/solutions/6986283 works:
      Edit the sdn DaemonSet in the openshift-sdn namespace so that the host-run-netns volume mount uses mountPath /host/var/run/netns instead of /var/run/netns:
      - mountPath: /host/var/run/netns
        mountPropagation: HostToContainer
        name: host-run-netns
        readOnly: true
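      A minimal sketch of applying and verifying that edit, assuming the DaemonSet is named sdn in the openshift-sdn namespace as in the solution article:

      $ oc -n openshift-sdn edit ds sdn            # change the host-run-netns mountPath as shown above
      $ oc -n openshift-sdn get ds sdn -o yaml | grep -B1 -A3 'host-run-netns'
      $ oc -n openshift-sdn rollout status ds/sdn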

            [OCPBUGS-8070] Egress router pods in pending state post upgrading cluster to 4.11

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:5006


            Jakub Chalamonski (Inactive) added a comment -

            I found that the solution provided in https://access.redhat.com/solutions/6986283 does not recover the egress-router pods in our cluster. The only valid way to recover the pods is to delete the SDN DaemonSet. After recreation by the CNO, the DaemonSet does not contain the "host-var-run-netns" mountPath at all. I would like to ask whether such a situation brings any additional risks.
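            Roughly what that comment describes, as a hedged sketch (deleting the DaemonSet so that the CNO recreates it):

            $ oc -n openshift-sdn delete ds sdn                                  # CNO recreates the DaemonSet
            $ oc -n openshift-sdn get ds sdn -o yaml | grep -c 'host-run-netns'  # 0 matches after recreation, per the comment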

            Jean Chen added a comment -

            OCP-63155 was created and automated to add test coverage for this bug.


            Martin Kennelly added a comment -

            rhn-support-sgurnale This will not get backported to 4.11 unless PMs say so, because it happens so rarely. An upgrade is required to consume the fix.

            Weibin Liang added a comment -

            jechen@redhat.com As your reference: https://github.com/openshift/openshift-tests-private/blob/release-4.11/test/extended/networking/egressrouter.go#L34

            Zhanqi Zhao added a comment -

            cc anusaxen jechen@redhat.com weliang1@redhat.com to see if you can help verify this during the China holiday, thanks


            Martin Kennelly added a comment -

            Update: Waiting on review.

            Martin Kennelly added a comment - edited

            I have isolated the cause of the issue. 

            OCP utilises server-side apply [1]. An object's fields are tracked through the "field management" [2] mechanism. When a field's value changes, ownership can be shared between its current manager and the manager making the change. The CNO manager name changes from "cluster-network-operator" in 4.10.47 to "cluster-network-operator/operconfig" in 4.11.25. So when we attempt to apply a new config that no longer contains some fields, those fields are not removed, because they are still owned by a manager not named "cluster-network-operator/operconfig"; therefore the fields are not removed by an SSA "Apply" operation.

             [1] https://kubernetes.io/docs/reference/using-api/server-side-apply/

             [2] https://kubernetes.io/docs/reference/using-api/server-side-apply/#field-management
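            A hedged way to inspect the field managers described above (the --show-managed-fields flag exists in recent oc/kubectl releases):

            $ oc -n openshift-sdn get ds sdn -o yaml --show-managed-fields | grep -E 'manager:|operation:'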

             


            Martin Kennelly added a comment -

            Talked to mika - agreed the safest workaround is to unmanage the CNO, edit the 'sdn' daemonset according to the solution article, and then immediately place the CNO back into the Managed state. This will allow the daemonset controller to roll out the changes and the security this brings.
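            One possible sequence for that workaround, as a hedged sketch; whether the CNO honors spec.managementState on the network.operator config is an assumption here and should be confirmed before use:

            $ oc patch network.operator.openshift.io cluster --type=merge -p '{"spec":{"managementState":"Unmanaged"}}'
            $ oc -n openshift-sdn edit ds sdn     # apply the mountPath change from the solution article
            $ oc patch network.operator.openshift.io cluster --type=merge -p '{"spec":{"managementState":"Managed"}}'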
