OpenShift Bugs / OCPBUGS-16818

4.13.0 - certain services are un-routable from operator pods within the cluster, including coreDNS and ingress-canary clusterIPs.

    • Moderate
    • No
    • SDN Sprint 240, SDN Sprint 241
    • 2
    • Rejected
    • False

      Description of problem:

      Issue: The customer deployed a new cluster and observed that the console/auth routes are degraded. The openshift-ingress-operator is firing an alert on reachability of the canary route.
      
      - Observed that the canary route pods are up/serving traffic
      - Observed that the router pods CAN reach the route/pods for canary traffic
      - Observed that the openshift-ingress-operator pod CANNOT communicate with the CoreDNS service IP at 172.30.0.10 (so the route lookup and the subsequent call both fail)
      - Observed that the openshift-ingress-operator pod CANNOT communicate with the service for ingress-canary
      - Observed that a test pod in a new namespace ALSO cannot call the service for ingress-canary
      - Observed that the service is not automatically recreated after it is deleted; rebuilding a fresh service with a new clusterIP allows the test pod to call the new ingress-canary service successfully, but the operator pod is still unable to call the service IP.
      - Observed that the target lr-lb-list entry for the service exists on all hosts and is valid/plumbed as expected to the correct canary backends.
      - Rebuilt the OVNKUBE database (the rebuild succeeded) --> no change.
      
      - No firewall rules are in the way (single switch interconnect for all VMs on the cluster)
      - The Geneve port is unblocked, and we can reach specific pod IPs throughout the cluster
      - The default Kubernetes API service clusterIP (172.30.0.1) is reachable from all pods and is not blocked, which implies that OVN flows work in general but CERTAIN flows are obstructed
      
      - All nodes are on the same VLAN/subnet, and the network plane is flat for the platform.
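
      The reachability observations above could be captured as a repeatable check run from each source (operator pod, test pod, node). The following is a minimal sketch, not the tooling used in this case: the two clusterIPs come from this report, while the port numbers (443 for the API service, 53/TCP for cluster DNS), the target names, and the helper names are assumptions for illustration.

```python
import socket

# Hypothetical target list: the clusterIPs are taken from the report;
# the names and ports are assumptions for illustration.
TARGETS = [
    ("kubernetes-api", "172.30.0.1", 443),
    ("cluster-dns", "172.30.0.10", 53),
]


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, connection refused, and no-route errors alike.
        return False


def probe(targets, timeout=3.0):
    """Map each target name to True (reachable) or False (unreachable)."""
    return {name: tcp_reachable(ip, port, timeout) for name, ip, port in targets}
```

      Calling probe(TARGETS) from each source gives a small matrix of results that can be compared side by side. Note that this sketch treats a refused connection the same as a timeout; separating the two would help distinguish "no listener" from "traffic dropped in transit".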

      Version-Release number of selected component (if applicable):

      4.13.0, VMware, UPI, OVN-Kubernetes

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Spin up a test pod using a generic ubi8 image with IP tools from quay.io/rhn_support_wrussell/iputils-container:latest
      2. curl the target services. Attempt dig against a service name and observe a timeout from the CoreDNS service. Specify an upstream DNS server with @<upstreamIP> and observe that dig succeeds immediately.
      3. Observe that curl to the service IP fails from the node and from a pod on the node. Recreated test services work (at least for the service in the ingress-canary namespace that we deleted/rebuilt).
      4. Observe that the openshift-ingress-operator pod can NOT curl the ingress-canary service even after the service is recreated.
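
      The dig comparison in step 2 (cluster DNS service IP versus an upstream server) can be approximated without dig. The following is a hedged sketch using only the Python standard library: it sends a single A-record query over UDP and reports whether any matching reply arrives before the timeout. The function names, the fixed query ID, and the default timeout are assumptions for illustration; it does not parse the answer section.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def dns_answers(server, name, timeout=3.0, port=53):
    """Send one UDP query to server; True on a matching reply, False otherwise."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeout or a socket-level error: treated as "no answer".
            return False
    # Accept the reply if it echoes the query ID and has the QR (response) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)
```

      Run from the test pod, dns_answers("172.30.0.10", name) would be expected to return False in the failing state described here, while the same call against the upstream server IP would be expected to return True, mirroring the dig behavior in step 2.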

      Actual results:

      Communication failure in the cluster

      Expected results:

      Pods and services that are not obstructed by a NetworkPolicy, on a flat network, should be able to communicate.

      Additional info:

      Data uploads are attached in the next comment on this issue.

              ffernand@redhat.com Flavio Fernandes (Inactive)
              rhn-support-wrussell Will Russell
              Anurag Saxena Anurag Saxena