OpenShift Bugs / OCPBUGS-48780

IBM Cloud E2E DNS Failures

    • Low
    • None
    • NE Sprint 265
    • 1
    • Rejected
    • False
    • N/A
    • Release Note Not Required
    • In Progress

      Description of problem:

       

      The e2e-ibmcloud-operator presubmit job for the cluster-ingress-operator repo, introduced in https://github.com/openshift/release/pull/56785, consistently fails due to DNS. Note that this job has `always_run: false` and `optional: true`, so it only appears on a PR after someone calls /test e2e-ibmcloud-operator. These failures are not blocking any PRs from merging. Example failure.

      The root cause is that IBM Cloud has DNS propagation issues, similar to the AWS DNS issues (OCPBUGS-14966), except:

      1. There isn't a way to adjust the IBM Cloud DNS SOA TTL because IBM Cloud DNS is managed by a third party (Cloudflare, I think; slack ref).
      2. Our AWS E2E tests run on AWS test runner clusters, and our IBM Cloud E2E tests run on those same AWS test runner clusters, where DNS resolution of IBM Cloud DNS names is less reliable.

      The PR https://github.com/openshift/cluster-ingress-operator/pull/1164 was an attempt at fixing the issue by both resolving the DNS name inside the cluster and allowing a "warmup" interval of a couple of minutes to avoid negative caching. I found (via https://github.com/openshift/cluster-ingress-operator/pull/1132) that the SOA TTL is ~30 minutes, so if you trigger negative caching, you have to wait 30 minutes for the IBM DNS resolver to refresh the DNS name.

      However, I found that if you wait ~7 minutes for the DNS record to propagate and don't query the DNS name during that time, resolution works once the 7-minute wait is over (I call it the "warmup" period).

      The tests affected are any tests that use a DNS name (wildcard or load balancer record):

      • TestManagedDNSToUnmanagedDNSIngressController
      • TestUnmanagedDNSToManagedDNSIngressController
      • TestUnmanagedDNSToManagedDNSInternalIngressController
      • TestConnectTimeout

      The two paths I can think of are:

      1. Continue https://github.com/openshift/cluster-ingress-operator/pull/1164 and increase the warmup time to 7+ minutes
      2. Or just skip these tests on IBM Cloud (and admit we can't use IBM Cloud DNS records in testing)
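
      For the second path, the skip could look roughly like the sketch below. It is an assumption-laden illustration, not existing operator code: `skipReasonForDNSTest` and `platformIBMCloud` are hypothetical names, and the platform string "IBMCloud" is assumed to match the platform type reported by the cluster's Infrastructure config.

```go
package main

import "fmt"

// platformIBMCloud is the platform type string assumed for IBM Cloud clusters.
const platformIBMCloud = "IBMCloud"

// skipReasonForDNSTest returns a non-empty reason when a DNS-dependent test
// should be skipped on the given platform, and "" when it should run.
//
// Inside a test it could be used as:
//
//	if reason := skipReasonForDNSTest(platform, t.Name()); reason != "" {
//		t.Skip(reason)
//	}
func skipReasonForDNSTest(platform, testName string) string {
	if platform == platformIBMCloud {
		return fmt.Sprintf("skipping %s on %s: DNS records are unreliable in CI (OCPBUGS-48780)", testName, platform)
	}
	return ""
}
```

      Each of the four affected tests would call the helper at the top, with `platform` taken from wherever the e2e suite already reads the cluster's platform type.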

      Version-Release number of selected component (if applicable):

      4.19    

      How reproducible:

      90-100%    

      Steps to Reproduce:

          1. Run /test e2e-ibmcloud-operator

      Actual results:

          Tests are flaky

      Expected results:

          Tests should work reliably

      Additional info:

          

              gspence@redhat.com Grant Spence
              Hongan Li