Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-28920

OCP 4.13.30 - allow-from-ingress NetworkPolicy does not consistently allow traffic from HostNetworked pods or from node IP's (packet timeout)


    • adding 'allow-host-network' policy is a work around

      Description of problem:

       - Observed that after upgrade to 4.13.30 (from 4.13.24) On all nodes/projects (replicated on two clusters that underwent the same upgrade) - traffic routed from HostNetworked pods (router-default) calling to backends intermittently timeout/fail to reach their destination.

      • This manifests as the router pods marking backends as DOWN and dropping traffic; but The behavior can be replicated with curl outside of the HAProxy pods via entering a debug shell to a host node (or SSH) and curling the pod IP directly. A significant percentage of packets time out to the target backend on intermittent subsequent calls.
      • We narrowed the behavior down to the moment we applied the NetworkPolicy for `allow-from-ingress` as outlined below - immediately the namespace began to drop packets on a curl loop running from an infra node directly against the pod IP (some 2-3% of all calls timed out).
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
          name: allow-from-openshift-ingress
          namespace: testing
          - from:
            - namespaceSelector:
                   policy-group.network.openshift.io/ingress: ""
          podSelector: {}
          - Ingress

      Version-Release number of selected component (if applicable):


      How reproducible:

      • every time, all namespaces with this network policy on this clusterversion (replicated on two clusters that underwent the same upgrade).

      Steps to Reproduce:

      1. Upgrade cluster to 4.13.30

      2. Apply test pod running basic HTTP instance at random port

      3. Apply networkpolicy to allow-from-ingress and begin curl loop against target pod directly from ingressnode (or other worker node) at host chroot level (nodeIP).

      4. Observe that curls time out intermittently --> replicator curl loop is below (note inclusion of --connect-timeout flag to help allow loop to continue more rapidly without waiting for full 2m connect timeout on typical syn failure).

      $ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done

      Actual results:

       - Traffic to all backends is dropped/degraded as a result of this intermittent failure marking valid/healthy pods as unavailable due to the connection failure to the backends.

      Expected results:

       - traffic should not be iimpeded, especially when the application of the networkpolicy to allow said traffic is implemented.

      Additional info:

      • This behavior began immediately after completed upgrade from 4.13.24 to 4.13.30 and has been replicated on two separate clusters.
      • Customer has been forced to reinstall a cluster at downgraded version to ensure stability/deliverables for their user-base and this is a critical impact outage scenario for them

      – additional required template details in first comment below.

      So the problem is that host-network namespace is not labeled by ingress controller and if router pods are hostNetworked, network policy with `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, we need to run ingress controller with `EndpointPublishingStrategy=HostNetwork` https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html and then check host-network namespace labels with

      oc get ns openshift-host-network --show-labels
      # expected this
      # but before the fix you will see 

      Another way to verify this is the same problem (disruptive, only recommended for test environments) is to make CNO unmanaged

      oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
      oc scale deployment network-operator -n openshift-network-operator --replicas=0

      and then label openshift-host-network namespace manually based on expected labels ^ and see if the problem disappears

      Potentially affected versions (may need to reproduce to confirm)

      4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com//browse/OCPBUGS-8070

      4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293

      4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039

      Mitigation/support KCS:

            npinaeva@redhat.com Nadia Pinaeva
            rhn-support-wrussell Will Russell
            Jean Chen Jean Chen
            15 Vote for this issue
            69 Start watching this issue