Bug · Resolution: Obsolete · Affects Version: 4.15 · Quality / Stability / Reliability · Severity: Moderate · Sprint: CLOUD Sprint 249
Description
First seen in build02 after updating to 4.15.0-ec.2, and reproduced in a ClusterBot launch 4.15.0-ec.1 gcp,techpreview run (logs) after updating to 4.15.0-0.nightly-2023-11-25-110147: the capi-controller-manager and capi-operator-controller-manager Deployments are churning (repeatedly being deleted and recreated every few minutes), which seems like unexpected behavior.
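On a live cluster, the churn should show up as those Deployments' AGE repeatedly resetting to a few minutes as they are deleted and recreated; a quick check (assuming access to an affected cluster) is:
$ oc -n openshift-cluster-api get deployments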
Releases
Seen in updates from ec.1 to ec.2 and from ec.1 to recent 4.15 nightlies. So far just on GCP. Other providers and/or releases might also be exposed; I'm not sure.
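The same events.json query used in the reproducer below should also work against gather-extra artifacts from other providers' or other releases' CI runs, assuming the same artifact layout (the job/run path here is a placeholder):
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/<job>/<run-id>/artifacts/launch/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-api" and .involvedObject.kind == "Deployment" and .reason == "ScalingReplicaSet") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name' | sort | tail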
Reproducer
With a ClusterBot launch 4.15.0-ec.1 gcp,techpreview run (logs), post-install, pods look pretty stable/old in that namespace, which is great:
$ oc -n openshift-cluster-api get pods
NAME                                                READY   STATUS    RESTARTS      AGE
capg-controller-manager-78b8c46c7-h7drk             1/1     Running   0             53m
capi-controller-manager-8586f8d645-wnlbf            1/1     Running   0             54m
capi-operator-controller-manager-6c69b65955-zdgdd   2/2     Running   2 (53m ago)   70m
cluster-capi-operator-567ff84d9-gv5dv               1/1     Running   1 (51m ago)   70m
Then kicking off an update to a recent 4.15 nightly to pick up the fix for OCPBUGS-23467:
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-11-25-110147
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-11-25-110147
And checking back in on gathered assets later, the update completed:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2023-11-27T17:49:45Z 2023-11-27T18:48:09Z Completed 4.15.0-0.nightly-2023-11-25-110147
2023-11-27T16:36:20Z 2023-11-27T17:09:36Z Completed 4.15.0-ec.1
And the two Deployments are churning:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-api" and .involvedObject.kind == "Deployment") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | tail
2023-11-27T20:01:04Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:01:24Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:04:54Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:05:14Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:08:45Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:09:04Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:12:35Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:12:53Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
2023-11-27T20:16:26Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
2023-11-27T20:16:43Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
Dropping into audit logs:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r .verb | sort | uniq -c
372 create
348 delete
1338 get
261 list
35 patch
934 update
205 watch
$ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -c 'select((.verb == "create" or .verb == "delete") and .objectRef.resource == "deployments") | {verb, objectRef, username: .user.username}' | sort | uniq -c | sort -n | tail
4 {"verb":"create","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-operator-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-version:default"}
16 {"verb":"create","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
16 {"verb":"delete","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
16 {"verb":"delete","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-operator-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
$ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r 'select(.objectRef.resource == "deployments" and (.objectRef.name == "capi-operator-controller-manager" or .objectRef.name == "capi-controller-manager") and .verb == "delete") | .stageTimestamp + " " + .objectRef.name + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | sort | tail
2023-11-27T20:01:24.397215Z capi-controller-manager capi-operator-controller-manager-758844bdfb-dbvgw
2023-11-27T20:01:24.414671Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-dbvgw
2023-11-27T20:09:04.585004Z capi-controller-manager capi-operator-controller-manager-758844bdfb-cvtql
2023-11-27T20:09:04.604897Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-cvtql
2023-11-27T20:12:53.359244Z capi-controller-manager capi-operator-controller-manager-758844bdfb-t9x2c
2023-11-27T20:12:53.376019Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-t9x2c
2023-11-27T20:16:43.492054Z capi-controller-manager capi-operator-controller-manager-758844bdfb-lkghh
2023-11-27T20:16:43.507115Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-lkghh
2023-11-27T20:20:34.492637Z capi-controller-manager capi-operator-controller-manager-758844bdfb-vhfq2
2023-11-27T20:20:34.513301Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-vhfq2
So that's:
1. The cluster-version operator creating the capi-operator-controller-manager Deployment as requested by the cluster-API operator's manifest.
2. The Deployment's pod asking to delete its own Deployment. This bug is about understanding this step.
3. Return to step 1.
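To watch the loop on a live cluster, and to see what the deleting pod thinks it is doing, something like the following should help (a rough sketch; the exact operator log lines will vary):
$ oc -n openshift-cluster-api get events --watch --field-selector involvedObject.kind=Deployment,reason=ScalingReplicaSet
$ oc -n openshift-cluster-api logs deployment/capi-operator-controller-manager --all-containers --since=15m | grep -i delete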