-
Bug
-
Resolution: Done
-
Critical
-
4.12
-
Critical
-
None
-
Sprint 226
-
1
-
Proposed
-
False
-
-
N/A
-
Release Note Not Required
Description of problem:
Since openenshift/cluster-ingress-operator#817 merged, the e2e-aws-operator CI job has been failing for multiple PRs in the cluster-ingress-operator repository. In particular, the TestScopeChange test has been consistently failing. Example failures:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/837/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1580571113170145280
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/779/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1580597541211213824
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/836/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1580641046826586112
The operator is repeatedly logging errors like the following in those failing CI jobs:
ERROR operator.dns_controller controller/controller.go:121 failed to delete dnsrecord; will retry \{"dnsrecord": {"metadata":{"name":"scope-wildcard","namespace":"openshift-ingress-operator","uid":"2cb9936f-d6a0-4377-b3ed-c5167c5e9e4d","resourceVersion":"42217","generation":2,"creationTimestamp":"2022-10-13T16:19:23Z","deletionTimestamp":"2022-10-13T16:20:27Z","deletionGracePeriodSeconds":0,"labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"scope"},"ownerReferences":[\{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"scope","uid":"713ac1c5-451b-42d1-89fd-c3910eb80fe3","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[\{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":\{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":\{".":{},"k:\{\"uid\":\"713ac1c5-451b-42d1-89fd-c3910eb80fe3\"}":{}}},"f:spec":\{".":{},"f:dnsManagementPolicy":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}}}},\{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}},"subresource":"status"}]},"spec":\{"dnsName":"*.scope.ci-op-x1j7dsgt-43abb.origin-ci-int-aws.dev.rhcloud.com.","targets":["af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"},"status":\{"zones":[{"dnsZone":{"tags":{"Name":"ci-op-x1j7dsgt-43abb-45zhd-int","kubernetes.io/cluster/ci-op-x1j7dsgt-43abb-45zhd":"owned"}},"conditions":[\{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:23Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]},\{"dnsZone":{"id":"Z2GYOLTZHS5VK"},"conditions":[\{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:24Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]}],"observedGeneration":1}}, "error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com", "errorCauses": [\{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}, \{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}]}}}
The scope-wildcard dnsrecord is created for the TestScopeChange test.
Using search.ci, it seems that the failures occurred many times on #817 before it merged and then started occurring for the other PRs after #817 merged.
I filed a PR, openshift/cluster-ingress-operator#838, that reverts #817. I have run the e2e-aws-operator CI job on this PR twice. While the job has failed both times, the TestScopeChange test did not fail either time.
At this point, we have strong evidence that #817 is causing TestScopeChange to fail.
gspence@redhat.com did some testing and determined that there is some interaction between TestAllowedSourceRangesStatus and TestScopeChange. It may suffice to serialize some tests (TestScopeChanged is currently a parallel test, as is TestAllowedSourceRangesStatus and two other tests that #817 adds).
If the problem cannot be resolved by serializing tests, it may be necessary to revert #817 to unblock CI.
Note that this issue is blocking NE-942, NE-1072, and NE-682, as well as any bugfix PRs for the master branch in openshift/cluster-ingress-operator.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Consistently.
Steps to Reproduce:
1. Run CI on a PR against the master branch of cluster-ingress-operator.
Actual results:
The TestScopeChange test fails as described.
Expected results:
TestScopeChange should not fail.