OCPBUGS-23228: hosted-cluster-config-operator-manager should throttle creation attempts

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Target Version: 4.16.0
    • Affects Version: 4.14
    • Component: HyperShift
    • Severity: Moderate
    • No
    • Sprint: Hypershift Sprint 246, Hypershift Sprint 247
    • Story Points: 2
    • False
    • Release Note Text:
      Cause: The operator cannot find a resource in its cache, which triggers repeated recreation attempts.
      Consequence: A large number of 409 response codes in the hosted-cluster-config-operator logs.
      Fix: Add the specific resources to the cache (see the sketch after this list).
      Result: The hosted-cluster-config-operator no longer tries to recreate existing resources.
    • Release Note Type: Bug Fix
    • Status: In Progress
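
      A minimal sketch, assuming a controller-runtime manager, of what "add
      specific resources to cache" could look like: explicitly registering an
      informer for the operator.openshift.io/v1 Storage kind so that later
      reads are served from the local cache instead of falling through to
      blind Create calls. The package layout and the warmStorageInformer
      helper are hypothetical, not HyperShift's actual wiring.

      package hcco

      import (
          "context"

          operatorv1 "github.com/openshift/api/operator/v1"
          ctrl "sigs.k8s.io/controller-runtime"
      )

      // warmStorageInformer asks the manager's cache for a Storage informer.
      // The first call creates the shared informer; once the manager is
      // running, mgr.GetClient() then serves Storage reads from that cache.
      func warmStorageInformer(ctx context.Context, mgr ctrl.Manager) error {
          _, err := mgr.GetCache().GetInformer(ctx, &operatorv1.Storage{})
          return err
      }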

    Description

      Description of problem:

      Release controller > 4.14.2 > HyperShift conformance run > gathered assets:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c
           65 hosted-cluster-config-operator-manager
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5
      2023-11-09T17:17:15.130454Z 409 AlreadyExists
      2023-11-09T17:17:15.163256Z 409 AlreadyExists
      2023-11-09T17:17:15.198908Z 409 AlreadyExists
      2023-11-09T17:17:15.230532Z 409 AlreadyExists
      2023-11-09T17:17:22.899579Z 409 AlreadyExists
      

      That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very low re-creation latency, perhaps an informing watch (sketched below)? If it can tolerate some re-creation latency, perhaps a quieter poll?
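
      A minimal sketch of the watch-based option, assuming a controller-runtime
      cache-backed client: read the Storage object from the informer cache and
      only issue a Create when it is genuinely absent, tolerating AlreadyExists
      races. The ensureStorage helper and the "cluster" object name are
      assumptions for illustration, not HyperShift's actual code.

      package hcco

      import (
          "context"

          operatorv1 "github.com/openshift/api/operator/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // ensureStorage creates the cluster-scoped Storage resource only when
      // the cache-backed client cannot see it, instead of blindly retrying
      // Create on every reconcile.
      func ensureStorage(ctx context.Context, c client.Client) error {
          existing := &operatorv1.Storage{}
          err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, existing)
          if err == nil {
              return nil // already present in the cache; nothing to do
          }
          if !apierrors.IsNotFound(err) {
              return err
          }
          desired := &operatorv1.Storage{}
          desired.Name = "cluster"
          if err := c.Create(ctx, desired); err != nil && !apierrors.IsAlreadyExists(err) {
              return err // losing a creation race to another writer is fine
          }
          return nil
      }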

      Version-Release number of selected component (if applicable):

      4.14.2. I haven't checked other releases.

      How reproducible:

      Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.

      Steps to Reproduce:

      1. Dump a hosted cluster.
      2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.

      Actual results:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c
          130 create 409
      

      Expected results:

      Zero or rare 409 creation requests from this user agent.

      Additional info:

      The user agent seems to be defined here, so the fix will likely involve changes to that manager (a throttling sketch follows).
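
      If some recreation churn is unavoidable, the "throttle creation attempts"
      in the title could also be handled client-side. A minimal sketch using
      client-go's token-bucket limiter; the 0.2 QPS figure and the
      createStorage stand-in are illustrative assumptions, not values from the
      HyperShift codebase.

      package main

      import (
          "context"
          "fmt"

          "k8s.io/client-go/util/flowcontrol"
      )

      // createStorage stands in for the manager's real creation path.
      func createStorage(ctx context.Context) error {
          _ = ctx
          return nil
      }

      func main() {
          // Allow roughly one create attempt every five seconds with a burst
          // of one, so a hot reconcile loop cannot hammer the apiserver with
          // creates that will just keep 409ing.
          limiter := flowcontrol.NewTokenBucketRateLimiter(0.2, 1)
          ctx := context.Background()
          for i := 0; i < 3; i++ {
              limiter.Accept() // blocks until a token is available
              if err := createStorage(ctx); err != nil {
                  fmt.Println("create attempt failed:", err)
              }
          }
      }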

    People

      Assignee: Patryk Stefanski (pstefans@redhat.com)
      Reporter: W. Trevor King (trking)
      QA Contact: Jie Zhao
      Votes: 0
      Watchers: 6