OCPBUGS-31456

During an OCP 4.12.45 to 4.13.34 upgrade with OpenshiftSDN, namespaces with NetworkPolicies are not reachable via the router


    • Important
    • No
    • SDN Sprint 251, SDN Sprint 252
    • 2
    • False

      Description of problem:

      The customer is updating a cluster running OpenshiftSDN from 4.12.45 to 4.13.34. During the upgrade, Routes for namespaces with NetworkPolicies no longer work as expected (timeouts, Router returns HTTP 503) and traffic is blocked. After the upgrade finishes (Nodes are restarted with the new version), traffic works as expected again. The customer cluster is `platform: vsphere`.

      The symptoms are the same as in OCPBUGS-28920. Accessing a Route in a namespace with NetworkPolicies fails with HTTP 503:

      $ curl -I https://nginx-unprivileged-2-poi-user-walds-dev.apps.cl1.ocp4-sandbox.example.com
      HTTP/1.1 503 Service Unavailable
      Content-Type: text/html
      Connection: close
      pragma: no-cache
      cache-control: private, max-age=0, no-cache, no-store
      Strict-Transport-Security: max-age=31536000

      However, after the upgrade completes, traffic works as expected again. During the upgrade, the labels on the "openshift-host-network" namespace appear to be correct:

      $ oc get namespace openshift-host-network --show-labels
      NAME                     STATUS   AGE   LABELS
      openshift-host-network   Active   23h   kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=
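The label check above can be scripted so it is easy to repeat while the upgrade is in progress. This is a minimal sketch: the helper names `host_network_labels` and `has_label` are hypothetical, and the labels to look for are simply the ones shown in the output above, not a confirmed list of what OpenshiftSDN requires.

```shell
#!/bin/sh
# host_network_labels: print the comma-separated label string of the
# openshift-host-network namespace (last column of the --show-labels output).
# Hypothetical helper; requires a logged-in `oc` session.
host_network_labels() {
  oc get namespace openshift-host-network --show-labels --no-headers | awk '{print $NF}'
}

# has_label LABELS ENTRY: succeed if the comma-separated LABELS string
# contains ENTRY (key=value) as an exact element.
has_label() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_label "$(host_network_labels)" 'network.openshift.io/policy-group=ingress' && echo present` reports whether that label is currently set.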

      The issue appears to be related to Node configuration. The workaround is to apply the labels described in https://access.redhat.com/solutions/7055050 for OpenshiftSDN.

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.12.45
      OpenShift Container Platform 4.13.34

      How reproducible:

      So far, the issue could not be reproduced on AWS; the customer cluster is on vSphere. The customer can reproduce it consistently.

      When the `worker` MachineConfigPool is paused before the upgrade, the issue can be reproduced and the cluster stays in the non-working state.

      Steps to Reproduce:

      1. Install a cluster with OCP 4.12.45 with OpenshiftSDN on vSphere
      2. Create an application and create a NetworkPolicy allowing traffic from OpenShift Ingress ("allow-from-openshift-ingress") using the "network.openshift.io/policy-group: ingress" label
      3. Observe that the application is reachable via the application Route
      4. Pause the MachineConfigPool for workers: `oc patch mcp/worker --type merge --patch '{"spec":{"paused":true}}'`
      5. Start the upgrade to OCP 4.13.34
      6. Wait until the Cluster Network Operator is updated

      Actual results:

      During the upgrade and while the "worker" MCP is paused, traffic to the Route results in HTTP 503.

      Expected results:

      A short period during which traffic does not work is expected, but it should last less than a minute. Traffic should not be blocked once all Cluster Operators have finished updating.
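To compare the observed outage with the "less than a minute" expectation, the Route can be polled during the upgrade and the longest failure window computed afterwards. This is a rough sketch under assumptions: the function names `probe_route` and `longest_outage` are hypothetical, a one-second polling interval is chosen arbitrarily, and any response outside 2xx/3xx (including curl timeouts, reported as `000`) is counted as an outage.

```shell
#!/bin/sh
# probe_route URL: print "<epoch-seconds> <http_code>" about once per second
# until interrupted. curl reports 000 when no HTTP response was received.
probe_route() {
  while :; do
    code=$(curl -s -I -o /dev/null --max-time 5 -w '%{http_code}' "$1")
    echo "$(date +%s) $code"
    sleep 1
  done
}

# longest_outage: read "<epoch-seconds> <http_code>" lines on stdin and print
# the longest outage in seconds, measured from the first failing sample to the
# next successful one (or to the last sample if the log ends while failing).
longest_outage() {
  awk '
    {
      ok = ($2 >= 200 && $2 < 400)
      if (!ok && !down) { down = 1; start = $1 }
      if (ok && down)   { d = $1 - start; if (d > max) max = d; down = 0 }
      last = $1
    }
    END {
      if (down) { d = last - start; if (d > max) max = d }
      print max + 0
    }'
}
```

A possible workflow is to run `probe_route https://nginx-unprivileged-2-poi-user-walds-dev.apps.cl1.ocp4-sandbox.example.com > probe.log` for the duration of the upgrade, stop it with Ctrl-C, and then run `longest_outage < probe.log`.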

      Additional info:

      • must-gather from a broken cluster is available in Support Case 03770009 (comment #22)
      • sosreport before the upgrade for a node is available in Support Case 03770009 (comment #24)
      • sosreport after the upgrade for a node is available in Support Case 03770009 (comment #25)

              npinaeva@redhat.com Nadia Pinaeva
              rhn-support-skrenger Simon Krenger
              Anurag Saxena