Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.16.0
Affects Version/s: 4.13.z
Component/s: Networking / cluster-network-operator
Labels:

Activity Type:
Incidents & Support
Blocked:
False
Blocked Reason:

None

Story Points:
None
Severity:
Critical
Regression:
No

Target Backport Versions:

4.13.z, 4.12.z, 4.14.z, 4.15.z
Target Version:

4.16.0
Release Blocker:
Rejected
Sprint:
SDN Sprint 249
sprint_count:
1

Customer Impact:

Customer Escalated, Customer Facing
Products:

Red Hat OpenShift Container Platform

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Test Coverage:

+

PX Review Complete:
PX Priority Data:
PX Impact Score:
PX Technical Impact:
PX Impact Range:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
Escape Impact:
Corrective Measures:
SDLC stage when should've been found:

Description of problem:

- After upgrading from 4.13.24 to 4.13.30, traffic from HostNetworked pods (router-default) to backends intermittently times out or fails to reach its destination. This occurs on all nodes/projects and was replicated on two clusters that underwent the same upgrade.

This manifests as the router pods marking backends as DOWN and dropping traffic, but the behavior can also be replicated outside of the HAProxy pods by entering a debug shell on a host node (or using SSH) and curling the pod IP directly. A significant percentage of calls to the target backend time out intermittently.
We narrowed the behavior down to the moment we applied the `allow-from-ingress` NetworkPolicy outlined below: the namespace immediately began to drop packets in a curl loop running from an infra node directly against the pod IP (some 2-3% of all calls timed out).

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: testing
spec:
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
    - Ingress

Version-Release number of selected component (if applicable):

How reproducible:

Every time, in all namespaces with this network policy on this cluster version (replicated on two clusters that underwent the same upgrade).

Steps to Reproduce:

1. Upgrade cluster to 4.13.30

2. Deploy a test pod running a basic HTTP instance on a random port

3. Apply the allow-from-ingress NetworkPolicy and begin a curl loop against the target pod directly from an ingress node (or other worker node) at host chroot level (node IP).

4. Observe that curls time out intermittently. The replicator curl loop is below (note the --connect-timeout flag, which lets the loop continue more rapidly instead of waiting for the full 2m connect timeout on a typical SYN failure).

$ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done
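To put a number on the intermittent failures (the report mentions roughly 2-3% of calls), the loop above can be bounded and summarized. The following is a minimal sketch, not part of the original reproducer: the helper names `probe_once` and `summarize` are hypothetical, and it relies on curl writing `000` for `%{response_code}` when no HTTP response was received (for example on a connect timeout).

```shell
# probe_once URL: make one attempt and print only the HTTP response code.
# Uses the same curl flags as the reproducer loop above.
probe_once() {
  curl --connect-timeout 5 --noproxy '*' -k -s -o /dev/null \
       -w '%{response_code}\n' "$1"
}

# summarize: read one response code per line on stdin and print the totals.
# A code of 000 is counted as a failed attempt.
summarize() {
  awk '
    NF { total++; if ($1 == "000") failed++ }
    END {
      if (total == 0) { print "attempts=0 failed=0 rate=n/a"; exit }
      printf "attempts=%d failed=%d rate=%.1f%%\n", total, failed, 100 * failed / total
    }'
}

# Example (200 attempts is an arbitrary choice):
#   for i in $(seq 1 200); do probe_once "https://<POD>:<PORT>"; done | summarize
```

Running the same bounded loop before and after applying the NetworkPolicy should make the change in failure rate easier to compare than watching an endless loop.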

Actual results:

- Traffic to all backends is dropped or degraded: the intermittent connection failures cause valid, healthy pods to be marked as unavailable.

Expected results:

- Traffic should not be impeded, especially when a NetworkPolicy that explicitly allows said traffic is applied.

Additional info:

This behavior began immediately after the upgrade from 4.13.24 to 4.13.30 completed and has been replicated on two separate clusters.
The customer has been forced to reinstall a cluster at the downgraded version to ensure stability/deliverables for their user base; this is a critical-impact outage scenario for them.

– additional required template details in first comment below.

RCA UPDATE:
The problem is that the host-network namespace is not labeled by the ingress controller, so if router pods are hostNetworked, a network policy with the `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, run the ingress controller with `EndpointPublishingStrategy=HostNetwork` (https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html) and then check the host-network namespace labels with

oc get ns openshift-host-network --show-labels
# expected output
kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=

# but before the fix you will see
kubernetes.io/metadata.name=openshift-host-network,policy-group.network.openshift.io/host-network=
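The comparison of the two label strings above can be scripted. This is a small sketch under stated assumptions: `has_label` is a hypothetical helper name, and it only does text matching on the comma-separated string printed by `--show-labels`; it does not query the cluster itself.

```shell
# has_label LABELS KEY: succeed if the comma-separated LABELS string
# (as printed by --show-labels) contains the label KEY, with any value.
# A leading comma is added so that KEY only matches a whole label key.
has_label() {
  case ",$1," in
    *",$2="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example against a live cluster (the labels are the last output column):
#   labels=$(oc get ns openshift-host-network --no-headers --show-labels | awk '{print $NF}')
#   if has_label "$labels" 'policy-group.network.openshift.io/ingress'; then
#     echo "ingress policy-group label present"
#   else
#     echo "ingress policy-group label MISSING"
#   fi
```

If the label is reported missing on a cluster whose ingress controller uses HostNetwork, that matches the pre-fix output shown above.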

Another way to verify that this is the same problem (disruptive, only recommended for test environments) is to make CNO unmanaged:

oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment network-operator -n openshift-network-operator --replicas=0

and then label the openshift-host-network namespace manually with the expected labels shown above, and see whether the problem disappears.
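The manual labeling step might look like the sketch below. It assumes the two labels to add are the ones present in the expected output but absent from the pre-fix output (`network.openshift.io/policy-group=ingress` and `policy-group.network.openshift.io/ingress=`). The function name `label_host_network_ns` and the `OC` override variable are hypothetical and exist only so the command can be previewed without a cluster. As with the scale-down commands, this is meant for test environments only.

```shell
# label_host_network_ns: add the two ingress policy-group labels to the
# openshift-host-network namespace. Set OC=echo to print the command
# instead of running it.
label_host_network_ns() {
  ${OC:-oc} label namespace openshift-host-network \
    network.openshift.io/policy-group=ingress \
    policy-group.network.openshift.io/ingress= \
    --overwrite
}

# Preview:  OC=echo label_host_network_ns
# Apply:    label_host_network_ns
```

After applying the labels, rerunning the curl loop from a node should show whether the timeouts stop.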

Potentially affected versions (may need to reproduce to confirm)

4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com//browse/OCPBUGS-8070

4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293

4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039

Mitigation/support KCS:
https://access.redhat.com/solutions/7055050

is blocked by

CORENET-4484 Impact of: allow-from-ingress NetworkPolicy does not consistently allow traffic from HostNetworked pods or from node IP's (packet timeout)

Closed

is caused by

OCPBUGS-22293 [4.13] CNO fails to apply ovnkube-master daemonset during upgrade

Closed

OCPBUGS-8070 Egress router pods in pending state post upgrading cluster to 4.11

Closed

is cloned by

OCPBUGS-29299 [4.15] OCP 4.13.30 - allow-from-ingress NetworkPolicy does not consistently allow traffic from HostNetworked pods or from node IP's (packet timeout)

Closed

is depended on by

OCPBUGS-29299 [4.15] OCP 4.13.30 - allow-from-ingress NetworkPolicy does not consistently allow traffic from HostNetworked pods or from node IP's (packet timeout)

Closed

is related to

OCPBUGS-29288 Ensure proper deprecation for the default field manager in CNO

Closed

is triggering

CORENET-3841 Corrective Measure for OCPBUGS-28920: OCP 4.13.30 - allow-from-ingress NetworkPolicy does not consistently allow traffic from HostNetworked pods or from node IP's (packet timeout)

Closed

links to

KCS 7055050: Intermittent 503 from all routes or timeouts when calling to backends directly when `allow-from-openshift-ingress` network policy is applied

KCS https://access.redhat.com/solutions/7054994

openshift/cluster-network-operator#2259: OCPBUGS-28920: Update ingressconfig_controller to use field Manager

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update


Assignee: Nadia Pinaeva (Inactive)

Reporter: Will Russell

Need Info From: None

Contributors: None

QA Contact: Jean Chen

Doc Contact: None

Votes: 15

Watchers: 71

Created: 2024/02/02 6:16 PM

Updated: 2025/09/23 2:25 PM

Resolved: 2024/06/27 11:34 AM
