-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.15, 4.16
-
Moderate
-
None
-
Rejected
-
False
-
Description of problem
CI suites running on test-platform build clusters are having trouble reliably resolving DNS for cluster-under-test resources, causing CI failures. The symptoms seem to be distributed among the build clusters over the past day:
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=build-log&context=0&search=dial+tcp:+lookup+api.*on+172.30.0.10:53:+no+such+host&search=Using+namespace' | jq -r 'to_entries[].value | select(length > 1)["Using namespace"][].context[]' | sed 's/.*\(build[0-9]*\).*/\1/' | sort | uniq -c
      2 build01
      4 build02
     10 build03
      2 build04
     11 build05
      8 build09
and those build clusters are mostly 4.15 and 4.16. It is not clear to me whether this is an in-cluster DNS component issue, an SDN/OVN networking issue, an external-to-the-cluster DNS issue, or something else. Debugging assistance welcome.
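For re-running the tally offline (for example against context lines saved from an earlier search), the last three stages of the pipeline above can be wrapped in a small function. This is only a sketch: the function name tally_build_clusters is made up for illustration, and it assumes each input line mentions the build cluster as "build" followed by digits, as the sed expression above does.

```shell
# tally_build_clusters: read "Using namespace" context lines on stdin and
# print one "COUNT CLUSTER" line per build cluster, sorted by cluster name.
# Hypothetical helper; lines that mention no buildNN name are skipped.
tally_build_clusters() {
  # -n plus the trailing p prints only lines where the substitution matched.
  sed -n 's/.*\(build[0-9][0-9]*\).*/\1/p' |
    sort |
    uniq -c |
    # Normalise the padded "uniq -c" output to "COUNT NAME".
    awk '{print $1, $2}'
}
```

Usage would be along the lines of "curl ... | jq -r ... | tally_build_clusters", or "tally_build_clusters < saved-contexts.txt" with a locally saved file.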
Version-Release number of selected component (if applicable)
The version of the cluster-under-test does not seem relevant, but the build clusters seeing the issue are mostly recent 4.15 and 4.16.
How reproducible
A few dozen hits per day out of thousands of CI runs, so a low rate, but still high enough to cause Component Readiness issues.
Steps to Reproduce
Unclear.
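One possible way to try to catch the failure in the act is to run a repeated-lookup loop from a pod on an affected build cluster and count how many lookups fail. The following is a rough sketch, not a confirmed reproducer: probe_dns and RESOLVE_CMD are names invented here, it assumes getent is available in the pod image, and the hostname to probe has to be supplied by whoever runs it.

```shell
# Resolver command. "getent hosts" uses the pod's normal resolver path
# (/etc/resolv.conf, so the in-cluster DNS service). It is overridable so
# that the loop itself can be exercised without network access.
RESOLVE_CMD="${RESOLVE_CMD:-getent hosts}"

# probe_dns HOST ATTEMPTS [DELAY_SECONDS]
# Resolves HOST ATTEMPTS times, logs each failure with a UTC timestamp on
# stderr, and prints "FAILURES/ATTEMPTS" on stdout when done.
probe_dns() {
  pd_host="$1"
  pd_attempts="$2"
  pd_delay="${3:-1}"
  pd_failures=0
  pd_i=1
  while [ "$pd_i" -le "$pd_attempts" ]; do
    # RESOLVE_CMD is intentionally unquoted so "getent hosts" splits in two.
    if ! $RESOLVE_CMD "$pd_host" >/dev/null 2>&1; then
      pd_failures=$((pd_failures + 1))
      printf '%s lookup %d failed for %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$pd_i" "$pd_host" >&2
    fi
    pd_i=$((pd_i + 1))
    sleep "$pd_delay"
  done
  printf '%d/%d\n' "$pd_failures" "$pd_attempts"
}
```

For example, "probe_dns api.<cluster-under-test-domain> 600 1" would attempt one lookup per second for ten minutes. The timestamps of any failures could then be compared against dns-default pod restarts or node events on the build cluster.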
Actual results
Occasional DNS-resolution attempts for cluster-under-test resources fail, causing the CI run to fail, presumably because of some kind of DNS instability biting the test pod running on the build cluster.
Expected results
Reliable DNS for CI pods running on build clusters.
Additional info
Recent changes in managed-cluster-config#2158 and release#54210 have returned build clusters to stock dns-default tolerations, but that does not seem to have resolved the issue.
- is related to OCPBUGS-39580 clusteroperator/console: unexpected state transitions during e2e test run (ASSIGNED)