OCPBUGS-23228: hosted-cluster-config-operator-manager should throttle creation attempts

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Target Version: 4.16.0
    • Affects Version: 4.14
    • Component: HyperShift
    • Severity: Moderate
    • No
    • Sprint: Hypershift Sprint 246, Hypershift Sprint 247
    • Story Points: 2
    • False
    • Release Note Text:
      Cause: The operator cannot find a resource in its cache, which triggers repeated recreation attempts.
      Consequence: A large number of 409 response codes in the hosted-cluster-config-operator logs.
      Fix: Add the specific resources to the cache (see the sketch after this list).
      Result: The hosted-cluster-config-operator no longer tries to recreate existing resources.
    • Release Note Type: Bug Fix
    • Status: In Progress
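
      A minimal sketch, assuming a controller-runtime manager, of what "add
      specific resources to cache" could look like: explicitly registering an
      informer for the operator.openshift.io/v1 Storage kind so that later
      reads are served from the local cache instead of falling through to
      blind Create calls. The package layout and the warmStorageInformer
      helper are hypothetical, not HyperShift's actual wiring.

      package hcco

      import (
          "context"

          operatorv1 "github.com/openshift/api/operator/v1"
          ctrl "sigs.k8s.io/controller-runtime"
      )

      // warmStorageInformer asks the manager's cache for a Storage informer.
      // The first call creates the shared informer; once the manager is
      // running, mgr.GetClient() then serves Storage reads from that cache.
      func warmStorageInformer(ctx context.Context, mgr ctrl.Manager) error {
          _, err := mgr.GetCache().GetInformer(ctx, &operatorv1.Storage{})
          return err
      }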

    Description

      Description of problem:

      Release controller > 4.14.2 > HyperShift conformance run > gathered assets:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c
           65 hosted-cluster-config-operator-manager
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5
      2023-11-09T17:17:15.130454Z 409 AlreadyExists
      2023-11-09T17:17:15.163256Z 409 AlreadyExists
      2023-11-09T17:17:15.198908Z 409 AlreadyExists
      2023-11-09T17:17:15.230532Z 409 AlreadyExists
      2023-11-09T17:17:22.899579Z 409 AlreadyExists
      

      That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very low re-creation latency, perhaps an informing watch (sketched below)? If it can tolerate some re-creation latency, perhaps a quieter poll?
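
      A minimal sketch of the watch-based option, assuming a controller-runtime
      cache-backed client: read the Storage object from the informer cache and
      only issue a Create when it is genuinely absent, tolerating AlreadyExists
      races. The ensureStorage helper and the "cluster" object name are
      assumptions for illustration, not HyperShift's actual code.

      package hcco

      import (
          "context"

          operatorv1 "github.com/openshift/api/operator/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // ensureStorage creates the cluster-scoped Storage resource only when
      // the cache-backed client cannot see it, instead of blindly retrying
      // Create on every reconcile.
      func ensureStorage(ctx context.Context, c client.Client) error {
          existing := &operatorv1.Storage{}
          err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, existing)
          if err == nil {
              return nil // already present in the cache; nothing to do
          }
          if !apierrors.IsNotFound(err) {
              return err
          }
          desired := &operatorv1.Storage{}
          desired.Name = "cluster"
          if err := c.Create(ctx, desired); err != nil && !apierrors.IsAlreadyExists(err) {
              return err // losing a creation race to another writer is fine
          }
          return nil
      }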

      Version-Release number of selected component (if applicable):

      4.14.2. I haven't checked other releases.

      How reproducible:

      Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.

      Steps to Reproduce:

      1. Dump a hosted cluster.
      2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.

      Actual results:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c
          130 create 409
      

      Expected results:

      Zero or rare 409 creation requests from this user agent.

      Additional info:

      The user agent seems to be defined here, so the fix will likely involve changes to that manager (a throttling sketch follows).
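
      If some recreation churn is unavoidable, the "throttle creation attempts"
      in the title could also be handled client-side. A minimal sketch using
      client-go's token-bucket limiter; the 0.2 QPS figure and the
      createStorage stand-in are illustrative assumptions, not values from the
      HyperShift codebase.

      package main

      import (
          "context"
          "fmt"

          "k8s.io/client-go/util/flowcontrol"
      )

      // createStorage stands in for the manager's real creation path.
      func createStorage(ctx context.Context) error {
          _ = ctx
          return nil
      }

      func main() {
          // Allow roughly one create attempt every five seconds with a burst
          // of one, so a hot reconcile loop cannot hammer the apiserver with
          // creates that will just keep 409ing.
          limiter := flowcontrol.NewTokenBucketRateLimiter(0.2, 1)
          ctx := context.Background()
          for i := 0; i < 3; i++ {
              limiter.Accept() // blocks until a token is available
              if err := createStorage(ctx); err != nil {
                  fmt.Println("create attempt failed:", err)
              }
          }
      }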

    People

      Assignee: Patryk Stefanski (pstefans@redhat.com)
      Reporter: W. Trevor King (trking)
      QA Contact: Jie Zhao
      Votes: 0
      Watchers: 6