OpenShift Bugs / OCPBUGS-11423

[4.10] Pods in same deployment will have different ability to query services in same namespace from one another; ocp 4.10


    • Bug
    • Resolution: Done
    • Critical
    • None
    • 4.10
    • None
    • Important
    • No
    • Rejected
    • False

      This is a clone of issue OCPBUGS-11035. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-10314. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-8741. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-5889. The following is the description of the original issue:

      Description of problem:

      The customer is running a cluster with the following configuration:
      4.10.23
      AWS/IPI
      OVNKubernetes
      
      In a namespace with NetworkPolicy rules enabled and an allow-from-same-namespace policy in place, pods behave differently from one another when calling service IPs hosted in that same namespace.
      
      Example:
      Deployment1 with two pods (A/B) exists in namespace <EXAMPLE>
      Deployment2 with 1 pod hosting a service and route exists in same namespace
      Pod A will unexpectedly stop being able to call service IP of deployment2; Pod B will never lose access to calling service IP of deployment2.
      
      Pod A remains able to call out through the br-ex interface, hit the route address, and reach the deployment2 pod via HAProxy (this never breaks).
      
      Pod A remains able to reach the local gateway on the node
      
      Host node for Pod A is able to reach the service IP of deployment2 and remains able to do so, even while pod A is impacted.
      
      Issue can be mitigated by applying a label or annotation to pod A, which immediately allows it to reach internal service IPs again within the namespace.
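
The mitigation above can be sketched as a small helper. This is a minimal sketch, not a supported procedure: the function names and the annotation key `debug.example/refreshed-at` are hypothetical, and it assumes a logged-in `oc` CLI. The report only states that adding any label or annotation to the affected pod restores access.

```python
import subprocess
import time


def annotate_cmd(namespace, pod, key="debug.example/refreshed-at", value=None):
    """Build the `oc annotate` argv that touches the pod's metadata.

    The annotation key is a hypothetical placeholder; any label or
    annotation update is reported to have the same effect.
    """
    if value is None:
        # Use the current epoch time so every call changes the value.
        value = str(int(time.time()))
    return ["oc", "annotate", "pod", pod, "-n", namespace,
            f"{key}={value}", "--overwrite"]


def refresh_pod(namespace, pod):
    """Run the annotate command; requires a logged-in `oc` CLI."""
    return subprocess.run(annotate_cmd(namespace, pod), check=True)
```

Calling `refresh_pod("<EXAMPLE>", "<pod A name>")` would apply the annotation. It works around the symptom only and does not address the underlying cause.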
      
      I suspect the issue is that the NetworkPolicy rules fail to stay updated for the pod object, so the pod needs to be 'refreshed' (by adding a label or making some other update) to force it to 'remember' that it is allowed to call peers within the namespace.
      
      Additional relevant data:
      - pods are affected throughout the cluster; the problem is not tied to a specific project/service/deployment/application
      - affected pods run on different nodes each time (no single node is implicated)
      - pods in the failed state share a node with other pods that have no issue
      - multiple namespaces see this problem
      - all namespaces are using similar networkpolicy isolation and allow-from-same-namespace ruleset (which matches our documentation on syntax).
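
For reference, an allow-from-same-namespace policy of the kind mentioned above can be sketched as follows. The customer's actual policies are not included in this report, so this is an assumption that follows the commonly documented pattern. The policy name `allow-same-namespace` and the function names are illustrative.

```python
import json


def allow_same_namespace_policy(namespace):
    """Return a NetworkPolicy admitting ingress from any pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # Empty podSelector: the policy applies to every pod in the namespace.
            "podSelector": {},
            # A podSelector peer without a namespaceSelector matches pods
            # in the policy's own namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


def to_manifest(policy):
    """Serialize the policy as JSON, which `oc apply -f -` accepts on stdin."""
    return json.dumps(policy, indent=2)
```

Printing `to_manifest(allow_same_namespace_policy("<EXAMPLE>"))` and piping the output to `oc apply -f -` would create the policy in that namespace.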
      
      
      
      

      Version-Release number of selected component (if applicable):

      4.10.23

      How reproducible:

      Every time, although the trigger is unclear: pods are functional and then, several hours or days later, stop being able to talk to peer services.

      Steps to Reproduce:

      1. Deploy a pod with at least two replicas in a namespace with an allow-from-same-namespace network policy.
      2. Deploy a different service and route (for example, an httpd instance) in the same namespace.
      3. Observe that one of the two pods may fail to reach the service IP after some time.
      4. Apply an annotation to the affected pod; it is immediately able to reach services again.
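
Step 3 can be checked from inside each replica with a probe along these lines. This is a hedged sketch: it assumes `curl` is present in the pod's image, that the service speaks HTTP, and that `oc` is logged in. The function names are hypothetical.

```python
import subprocess


def probe_cmd(namespace, pod, service_ip, port, timeout_s=5):
    """Build an `oc exec` argv that curls the service IP from inside the pod."""
    return ["oc", "exec", "-n", namespace, pod, "--",
            "curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
            "--max-time", str(timeout_s), f"http://{service_ip}:{port}/"]


def classify(returncode, stdout):
    """Treat any HTTP status as reachable; curl prints 000 when it cannot connect."""
    code = stdout.strip()
    if returncode == 0 and code.isdigit() and code != "000":
        return "reachable"
    return "unreachable"


def probe(namespace, pod, service_ip, port):
    """Run the probe against a live cluster and classify the outcome."""
    result = subprocess.run(probe_cmd(namespace, pod, service_ip, port),
                            capture_output=True, text=True)
    return classify(result.returncode, result.stdout)
```

Running `probe` periodically for both replicas would show which one loses access to the service IP and when. Any HTTP status, including an error status, counts as reachable, because the failure described here is at the connection level.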

      Actual results:

      Pods intermittently fail to reach internal service addresses, but can otherwise be interacted with and can reach upstream/external addresses, including routes on the cluster.

      Expected results:

      Pods should not lose access to service network peers.

      Additional info:

      See the next comments for relevant uploads, sosreports, and inspects.

            apanatto@redhat.com Andrea Panattoni
            openshift-crt-jira-prow OpenShift Prow Bot
            Anurag Saxena Anurag Saxena