OCPBUGS-37163

DNS resolution instability in CI build* clusters

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.15, 4.16
    • Networking / DNS
    • Moderate
    • None
    • Rejected
    • False

      Description of problem

      CI suites running on test-platform build clusters are having trouble reliably resolving DNS for cluster-under-test resources, causing CI failures. The symptoms seem to be distributed among the build clusters over the past day:

      $ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=build-log&context=0&search=dial+tcp:+lookup+api.*on+172.30.0.10:53:+no+such+host&search=Using+namespace' | jq -r 'to_entries[].value | select(length > 1)["Using namespace"][].context[]' | sed 's/.*\(build[0-9]*\).*/\1/' | sort | uniq -c
            2 build01
            4 build02
           10 build03
            2 build04
           11 build05
            8 build09
      

      Those build clusters are mostly running 4.15 and 4.16. I'm not sure whether this is an issue with the in-cluster DNS component, with SDN/OVN networking, or with DNS external to the cluster. Debugging assistance welcome.
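The last stages of the search pipeline above can also be run offline against saved "Using namespace" context lines, which makes it easier to compare tallies from day to day. This is a minimal sketch, not part of the original report: the function name `tally_build_clusters` is hypothetical, and unlike the original `sed` expression it requires at least one digit after `build` and drops lines that do not match.

```shell
# Hypothetical helper: read "Using namespace" context lines on stdin and
# count how many mention each build cluster (build01, build02, ...).
tally_build_clusters() {
  # -n plus the trailing p prints only lines where a buildNN name was found.
  sed -n 's/.*\(build[0-9][0-9]*\).*/\1/p' | sort | uniq -c
}
```

For example, `tally_build_clusters < saved-contexts.txt`, where `saved-contexts.txt` is an illustrative file holding the output of the `jq` stage.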

      Version-Release number of selected component (if applicable)

      The version of the cluster-under-test does not seem relevant, but the build clusters seeing the issue are mostly recent 4.15 and 4.16.

      How reproducible

      A few dozen hits per day out of thousands of CI runs. That is a low rate, but still high enough to cause Component Readiness issues.

      Steps to Reproduce

      Unclear.
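Because the failure rate is low, one possible way to look for the instability is to repeat a lookup many times from a pod on a build cluster and count the failures. The following is only a sketch under stated assumptions: the function name `probe_dns` and the `RESOLVER` variable are hypothetical, it assumes `getent` is present in the pod image, and it assumes the pod's resolver configuration points at the cluster DNS service (172.30.0.10 in the logs above).

```shell
# Hypothetical probe: resolve one name repeatedly and report the failure count.
# RESOLVER is an illustrative override; by default the lookup uses "getent hosts",
# which follows the pod's own resolver configuration.
probe_dns() {
  name=$1
  attempts=${2:-100}   # number of lookups to try; defaults to 100
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # RESOLVER is left unquoted on purpose so "getent hosts" splits into two words.
    if ! ${RESOLVER:-getent hosts} "$name" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures/$attempts lookups failed for $name"
}
```

A call such as `probe_dns api.cluster-under-test.example 500` (an illustrative hostname) would print a single summary line. Running the same probe on several nodes might help show whether failures follow particular nodes or DNS pods.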

      Actual results

      Occasional DNS-resolution attempts for cluster-under-test resources fail, causing the CI run to fail, presumably because of some kind of DNS instability biting the test pod running on the build cluster.

      Expected results

      Reliable DNS for CI pods running on build clusters.

      Additional info

      Recent changes in managed-cluster-config#2158 and release#54210 have returned build clusters to stock dns-default tolerations, but that does not seem to have resolved the issue.

            Miciah Masters (mmasters1@redhat.com)
            W. Trevor King (trking)
            Hongan Li