Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-2334

NE-956: Configurable LB Source Ranges breaks TestScopeChange

    XMLWordPrintable

Details

    • Critical
    • Sprint 226
    • Proposed
    • Hide

      None

      Show
      None
    • NA

    Description

      Description of problem:

      Since openenshift/cluster-ingress-operator#817 merged, the e2e-aws-operator CI job has been failing for multiple PRs in the cluster-ingress-operator repository.  In particular, the TestScopeChange test has been consistently failing. Example failures:

      The operator is repeatedly logging errors like the following in those failing CI jobs:

      ERROR    operator.dns_controller    controller/controller.go:121    failed to delete dnsrecord; will retry    \{"dnsrecord": {"metadata":{"name":"scope-wildcard","namespace":"openshift-ingress-operator","uid":"2cb9936f-d6a0-4377-b3ed-c5167c5e9e4d","resourceVersion":"42217","generation":2,"creationTimestamp":"2022-10-13T16:19:23Z","deletionTimestamp":"2022-10-13T16:20:27Z","deletionGracePeriodSeconds":0,"labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"scope"},"ownerReferences":[\{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"scope","uid":"713ac1c5-451b-42d1-89fd-c3910eb80fe3","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[\{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":\{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":\{".":{},"k:\{\"uid\":\"713ac1c5-451b-42d1-89fd-c3910eb80fe3\"}":{}}},"f:spec":\{".":{},"f:dnsManagementPolicy":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}}}},\{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}},"subresource":"status"}]},"spec":\{"dnsName":"*.scope.ci-op-x1j7dsgt-43abb.origin-ci-int-aws.dev.rhcloud.com.","targets":["af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"},"status":\{"zones":[{"dnsZone":{"tags":{"Name":"ci-op-x1j7dsgt-43abb-45zhd-int","kubernetes.io/cluster/ci-op-x1j7dsgt-43abb-45zhd":"owned"}},"conditions":[\{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:23Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]},\{"dnsZone":{"id":"Z2GYOLTZHS5VK"},"conditions":[\{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:24Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]}],"observedGeneration":1}}, "error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com", "errorCauses": [\{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}, \{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}]}}}
      

      The scope-wildcard dnsrecord is created for the TestScopeChange test.

      Using search.ci, it seems that the failures occurred many times on #817 before it merged and then started occurring for the other PRs after #817 merged.

      I filed a PR, openshift/cluster-ingress-operator#838, that reverts #817. I have run the e2e-aws-operator CI job on this PR twice. While the job has failed both times, the TestScopeChange test did not fail either time.

      At this point, we have strong evidence that #817 is causing TestScopeChange to fail.

      gspence@redhat.com did some testing and determined that there is some interaction between TestAllowedSourceRangesStatus and TestScopeChange. It may suffice to serialize some tests (TestScopeChanged is currently a parallel test, as is TestAllowedSourceRangesStatus and two other tests that #817 adds).

      If the problem cannot be resolved by serializing tests, it may be necessary to revert #817 to unblock CI.

      Note that this issue is blocking NE-942, NE-1072, and NE-682, as well as any bugfix PRs for the master branch in openshift/cluster-ingress-operator.

      Version-Release number of selected component (if applicable):

      4.12

      How reproducible:

      Consistently.

      Steps to Reproduce:

      1. Run CI on a PR against the master branch of cluster-ingress-operator.

      Actual results:

      The TestScopeChange test fails as described.

      Expected results:

      TestScopeChange should not fail.

       

       

      Attachments

        Activity

          People

            mmasters1@redhat.com Miciah Masters
            mmasters1@redhat.com Miciah Masters
            Melvin Joseph Melvin Joseph
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: