- Bug
- Resolution: Won't Do
- Major
- None
- 4.13.z
- Important
- No
- SDN Sprint 251, SDN Sprint 252
- 2
- False
Description of problem:
Customer is updating a cluster running OpenshiftSDN from 4.12.45 to 4.13.34. During the upgrade, we observe that Routes for namespaces with NetworkPolicies no longer work as expected (timeouts, Router returns HTTP 503) and traffic is blocked. After the upgrade finishes (Nodes are restarted with the new version), traffic works as expected again. Customer cluster is `platform: vsphere`.
The symptoms are the same as in OCPBUGS-28920. If a Route in a namespace with NetworkPolicies is accessed, this fails with HTTP 503:
$ curl -I https://nginx-unprivileged-2-poi-user-walds-dev.apps.cl1.ocp4-sandbox.example.com
HTTP/1.1 503 Service Unavailable
Content-Type: text/html
Connection: close
pragma: no-cache
cache-control: private, max-age=0, no-cache, no-store
Strict-Transport-Security: max-age=31536000
However, after the upgrade completes, traffic works again as expected. During the upgrade, we observe that the labels on the "openshift-host-network" namespace seem to be correct:
$ oc get namespace openshift-host-network --show-labels
NAME                     STATUS   AGE   LABELS
openshift-host-network   Active   23h   kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=
It looks like the issue is related to the Node configuration. The workaround is to apply the labels described in https://access.redhat.com/solutions/7055050 for OpenshiftSDN.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.45
OpenShift Container Platform 4.13.34
How reproducible:
So far, the issue could not be reproduced on AWS; the customer cluster is on vSphere. The customer can reproduce it consistently.
When pausing the `worker` MachineConfigPool before the upgrade, the issue can be reproduced and the cluster can be kept in the non-working state.
Steps to Reproduce:
1. Install a cluster with OCP 4.12.45 with OpenshiftSDN on vSphere
2. Create an application and create a NetworkPolicy allowing traffic from OpenShift Ingress ("allow-from-openshift-ingress") using the "network.openshift.io/policy-group: ingress" label
3. Observe that the application is reachable via the application Route
4. Pause the MachineConfigPool for workers: `oc patch mcp/worker --type merge --patch '{"spec":{"paused":true}}'`
5. Start the upgrade to OCP 4.13.34
6. Wait until the Cluster Network Operator is updated
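For reference, the NetworkPolicy from step 2 can be sketched as follows. This is a minimal sketch based on the commonly documented "allow-from-openshift-ingress" policy, not the customer's actual manifest. The file name and the empty `podSelector` (which selects all pods in the namespace) are assumptions; only the policy name and the "network.openshift.io/policy-group: ingress" label come from this report.

```shell
#!/bin/sh
# Write a NetworkPolicy manifest that allows ingress from namespaces
# labeled network.openshift.io/policy-group=ingress.
# Assumption: the file name is illustrative and podSelector {} selects
# every pod in the target namespace.
cat > allow-from-openshift-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
EOF

# Apply it to the application namespace (placeholder, not taken from the report):
#   oc apply -n <application-namespace> -f allow-from-openshift-ingress.yaml
```

The manifest is only written to disk here; applying it is left as a commented command because the target namespace depends on the test application.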
Actual results:
During the upgrade and while the "worker" MCP is paused, traffic to the Route results in HTTP 503.
Expected results:
A short timeframe where traffic does not work is expected. However, this should be less than a minute. Traffic should not be blocked once all Cluster Operators have finished updating.
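To compare the actual outage with the expected window of less than a minute, the Route can be probed at a fixed interval while the upgrade runs. The following is a rough sketch, not a tool used in the support case. The function names `probe_once` and `probe_route`, the 5-second timeout, and the default interval are assumptions chosen for illustration.

```shell
#!/bin/sh
# probe_once URL
# Send a HEAD request and print only the HTTP status code.
# curl prints 000 when no HTTP response was received (for example on a timeout).
probe_once() {
    curl -s -I -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# probe_route URL [INTERVAL_SECONDS]
# Print one UTC-timestamped status code per interval until interrupted.
probe_route() {
    url=$1
    interval=${2:-5}
    while true; do
        code=$(probe_once "$url")
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$code"
        sleep "$interval"
    done
}
```

Running `probe_route https://<route-host> 5 | tee route-probe.log` before starting the upgrade yields a timeline in which the first and last 503 (or 000) entries bound the period during which traffic was blocked.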
Additional info:
- must-gather from a broken cluster is available in Support Case 03770009 (comment #22)
- sosreport before the upgrade for a node is available in Support Case 03770009 (comment #24)
- sosreport after the upgrade for a node is available in Support Case 03770009 (comment #25)