Bug · Resolution: Obsolete · Affects Version: 4.15 · Quality / Stability / Reliability · Severity: Moderate · Sprint: CLOUD Sprint 249
Description
First seen in build02 after updating to 4.15.0-ec.2, and reproduced in a ClusterBot launch 4.15.0-ec.1 gcp,techpreview run (logs) after updating to 4.15.0-0.nightly-2023-11-25-110147: the capi-controller-manager and capi-operator-controller-manager Deployments are churning (repeatedly being deleted and recreated every few minutes), which seems like unexpected behavior.
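On a live cluster, the churn should show up as those Deployments' AGE repeatedly resetting to a few minutes as they are deleted and recreated; a quick check (assuming access to an affected cluster) is:
$ oc -n openshift-cluster-api get deployments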
Releases
Seen in updates from ec.1 to ec.2 and from ec.1 to recent 4.15 nightlies. So far just on GCP. Other providers and/or releases might also be exposed; I'm not sure.
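The same events.json query used in the reproducer below should also work against gather-extra artifacts from other providers' or other releases' CI runs, assuming the same artifact layout (the job/run path here is a placeholder):
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/<job>/<run-id>/artifacts/launch/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-api" and .involvedObject.kind == "Deployment" and .reason == "ScalingReplicaSet") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name' | sort | tail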
Reproducer
With a ClusterBot launch 4.15.0-ec.1 gcp,techpreview run (logs), post-install, pods look pretty stable/old in that namespace, which is great:
$ oc -n openshift-cluster-api get pods
NAME                                                READY   STATUS    RESTARTS      AGE
capg-controller-manager-78b8c46c7-h7drk             1/1     Running   0             53m
capi-controller-manager-8586f8d645-wnlbf            1/1     Running   0             54m
capi-operator-controller-manager-6c69b65955-zdgdd   2/2     Running   2 (53m ago)   70m
cluster-capi-operator-567ff84d9-gv5dv               1/1     Running   1 (51m ago)   70m
Then kicking off an update to a recent 4.15 nightly to pick up the fix for OCPBUGS-23467:
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-11-25-110147
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-11-25-110147
And checking back in on gathered assets later, the update completed:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2023-11-27T17:49:45Z 2023-11-27T18:48:09Z Completed 4.15.0-0.nightly-2023-11-25-110147
2023-11-27T16:36:20Z 2023-11-27T17:09:36Z Completed 4.15.0-ec.1
And the two Deployments are churning:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-api" and .involvedObject.kind == "Deployment") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | tail
2023-11-27T20:01:04Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:01:24Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:04:54Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:05:14Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:08:45Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:09:04Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:12:35Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:12:53Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:16:26Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:16:43Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
Dropping into audit logs:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r .verb | sort | uniq -c
372 create
348 delete
1338 get
261 list
35 patch
934 update
205 watch
$ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -c 'select((.verb == "create" or .verb == "delete") and .objectRef.resource == "deployments") | {verb, objectRef, username: .user.username}' | sort | uniq -c | sort -n | tail
4 {"verb":"create","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-operator-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-version:default"}
16 {"verb":"create","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
16 {"verb":"delete","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
16 {"verb":"delete","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-operator-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
$ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r 'select(.objectRef.resource == "deployments" and (.objectRef.name == "capi-operator-controller-manager" or .objectRef.name == "capi-controller-manager") and .verb == "delete") | .stageTimestamp + " " + .objectRef.name + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | sort | tail
2023-11-27T20:01:24.397215Z capi-controller-manager capi-operator-controller-manager-758844bdfb-dbvgw
2023-11-27T20:01:24.414671Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-dbvgw
2023-11-27T20:09:04.585004Z capi-controller-manager capi-operator-controller-manager-758844bdfb-cvtql
2023-11-27T20:09:04.604897Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-cvtql
2023-11-27T20:12:53.359244Z capi-controller-manager capi-operator-controller-manager-758844bdfb-t9x2c
2023-11-27T20:12:53.376019Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-t9x2c
2023-11-27T20:16:43.492054Z capi-controller-manager capi-operator-controller-manager-758844bdfb-lkghh
2023-11-27T20:16:43.507115Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-lkghh
2023-11-27T20:20:34.492637Z capi-controller-manager capi-operator-controller-manager-758844bdfb-vhfq2
2023-11-27T20:20:34.513301Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-vhfq2
So that's:
1. The cluster-version operator creating the capi-operator-controller-manager Deployment as requested by the cluster-API operator's manifest.
2. The Deployment's pod asking to delete its own Deployment. This bug is about understanding this step.
3. Return to step 1.
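To watch the loop on a live cluster, and to see what the deleting pod thinks it is doing, something like the following should help (a rough sketch; the exact operator log lines will vary):
$ oc -n openshift-cluster-api get events --watch --field-selector involvedObject.kind=Deployment,reason=ScalingReplicaSet
$ oc -n openshift-cluster-api logs deployment/capi-operator-controller-manager --all-containers --since=15m | grep -i delete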