Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.12.z
Affects Version/s: 4.12
Component/s: Networking / router
Labels:
- cluster-ingress-operator

Severity: Critical
Regression: None
Sprint: Sprint 226
sprint_count: 1
Release Blocker: Proposed
Blocked: False
Blocked Reason: None
Release Note Text: N/A
Release Note Type: Release Note Not Required
Target Version: 4.12.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Since openshift/cluster-ingress-operator#817 merged, the e2e-aws-operator CI job has been failing for multiple PRs in the cluster-ingress-operator repository. In particular, the TestScopeChange test has been failing consistently. Example failures:

The operator is repeatedly logging errors like the following in those failing CI jobs:

ERROR    operator.dns_controller    controller/controller.go:121    failed to delete dnsrecord; will retry    {"dnsrecord": {"metadata":{"name":"scope-wildcard","namespace":"openshift-ingress-operator","uid":"2cb9936f-d6a0-4377-b3ed-c5167c5e9e4d","resourceVersion":"42217","generation":2,"creationTimestamp":"2022-10-13T16:19:23Z","deletionTimestamp":"2022-10-13T16:20:27Z","deletionGracePeriodSeconds":0,"labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"scope"},"ownerReferences":[{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"scope","uid":"713ac1c5-451b-42d1-89fd-c3910eb80fe3","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"713ac1c5-451b-42d1-89fd-c3910eb80fe3\"}":{}}},"f:spec":{".":{},"f:dnsManagementPolicy":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}}}},{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}},"subresource":"status"}]},"spec":{"dnsName":"*.scope.ci-op-x1j7dsgt-43abb.origin-ci-int-aws.dev.rhcloud.com.","targets":["af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"},"status":{"zones":[{"dnsZone":{"tags":{"Name":"ci-op-x1j7dsgt-43abb-45zhd-int","kubernetes.io/cluster/ci-op-x1j7dsgt-43abb-45zhd":"owned"}},"conditions":[{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:23Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]},{"dnsZone":{"id":"Z2GYOLTZHS5VK"},"conditions":[{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:24Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]}],"observedGeneration":1}}, "error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com", "errorCauses": [{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}, {"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}]}}}

The scope-wildcard dnsrecord is created for the TestScopeChange test.

Using search.ci, it seems that the failures occurred many times on #817 before it merged and then started occurring for the other PRs after #817 merged.

I filed a PR, openshift/cluster-ingress-operator#838, that reverts #817. I have run the e2e-aws-operator CI job on this PR twice. While the job has failed both times, the TestScopeChange test did not fail either time.

At this point, we have strong evidence that #817 is causing TestScopeChange to fail.

gspence@redhat.com did some testing and determined that there is some interaction between TestAllowedSourceRangesStatus and TestScopeChange. It may suffice to serialize some tests (TestScopeChange is currently a parallel test, as are TestAllowedSourceRangesStatus and two other tests that #817 adds).

If the problem cannot be resolved by serializing tests, it may be necessary to revert #817 to unblock CI.

Note that this issue is blocking ~~NE-942~~, ~~NE-1072~~, and ~~NE-682~~, as well as any bugfix PRs for the master branch in openshift/cluster-ingress-operator.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Consistently.

Steps to Reproduce:

1. Run CI on a PR against the master branch of cluster-ingress-operator.

Actual results:

The TestScopeChange test fails as described.

Expected results:

TestScopeChange should not fail.

links to

openshift/cluster-ingress-operator#839: OCPBUGS-2334: Added nil check for service object on load balancer scope change
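
The linked PR title mentions a nil check for the service object on load balancer scope change. The following is only a sketch of what such a guard can look like, not the code from #839: the Service type and the scopeChanged function are hypothetical, and the assumption that scope is read from the service.beta.kubernetes.io/aws-load-balancer-internal annotation is made purely for illustration.

```go
package main

import "fmt"

// awsInternalLBAnnotation marks an AWS load balancer service as internal.
// Using it to detect scope is an assumption made for this sketch.
const awsInternalLBAnnotation = "service.beta.kubernetes.io/aws-load-balancer-internal"

// Service is a hypothetical, trimmed-down stand-in for a Kubernetes
// Service object; only annotations matter here.
type Service struct {
	Annotations map[string]string
}

// scopeChanged reports whether the existing service's scope differs
// from the desired one. A nil service means nothing has been created
// yet, so there is no scope to compare and no change to report.
func scopeChanged(current *Service, wantInternal bool) bool {
	if current == nil {
		// Guard: without this check, reading current.Annotations
		// would dereference a nil pointer and panic.
		return false
	}
	_, isInternal := current.Annotations[awsInternalLBAnnotation]
	return isInternal != wantInternal
}

func main() {
	external := &Service{Annotations: map[string]string{}}
	internal := &Service{Annotations: map[string]string{awsInternalLBAnnotation: "true"}}

	fmt.Println(scopeChanged(nil, true))      // no service yet
	fmt.Println(scopeChanged(external, true)) // external, want internal
	fmt.Println(scopeChanged(internal, true)) // already internal
}
```

Treating a missing service as "no change" is one possible design choice. A caller could instead return an error or requeue until the service exists.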
