OCPBUGS-23988: 4.15 tech-preview GCP cluster churns capi-controller-manager and capi-operator-controller-manager Deployments

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Undefined
    • Affects Version: 4.15
    • Quality / Stability / Reliability
    • Severity: Moderate
    • Sprint: CLOUD Sprint 249

      Description

      First seen in build02 after updating to 4.15.0-ec.2, and reproduced in a ClusterBot launch 4.15.0-ec.1 gcp,techpreview run (logs) after updating to 4.15.0-0.nightly-2023-11-25-110147: the capi-controller-manager and capi-operator-controller-manager Deployments are churning, which seems like unexpected behavior.
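
      For anyone trying to catch this on a live cluster (rather than in gathered CI assets), watching Deployment-scoped events in the namespace should surface the churn as it happens; this is just a generic oc/kubectl pattern, not anything specific to this bug:

      $ oc -n openshift-cluster-api get events --field-selector involvedObject.kind=Deployment --watch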

      Releases

      Seen in updates from ec.1 to ec.2 and from ec.1 to recent 4.15 nightlies. So far just on GCP. Other providers and/or releases might also be exposed; I'm not sure.
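
      When comparing against other clusters, it may help to confirm the feature set and platform directly; the outputs shown below are what I'd expect for the affected clusters, not something I've captured from this reproducer:

      $ oc get featuregate cluster -o jsonpath='{.spec.featureSet}'
      TechPreviewNoUpgrade
      $ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}'
      GCP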

      Reproducer

      With a ClusterBot launch 4.15.0-ec.1 gcp,techpreview run (logs), post-install, pods in that namespace look stable and relatively old, which is great:

      $ oc -n openshift-cluster-api get pods
      NAME                                                READY   STATUS    RESTARTS      AGE
      capg-controller-manager-78b8c46c7-h7drk             1/1     Running   0             53m
      capi-controller-manager-8586f8d645-wnlbf            1/1     Running   0             54m
      capi-operator-controller-manager-6c69b65955-zdgdd   2/2     Running   2 (53m ago)   70m
      cluster-capi-operator-567ff84d9-gv5dv               1/1     Running   1 (51m ago)   70m
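
      Another way to eyeball the pre-update steady state is Deployment age; since the churn below turns out to be delete-and-recreate, a recently recreated Deployment would show a suspiciously young AGE here (this is just a generic check, not something captured in this run):

      $ oc -n openshift-cluster-api get deployments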
      

      Then kicking off an update to a recent 4.15 nightly to pick up the fix for OCPBUGS-23467:

      $ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-11-25-110147
      warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
      warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
      warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
      Requested update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-11-25-110147
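
      While the update runs, progress can be followed with the usual ClusterVersion tooling:

      $ oc adm upgrade
      $ oc get clusterversion version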
      

      And checking back in on the gathered assets later, the update completed:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
      2023-11-27T17:49:45Z 2023-11-27T18:48:09Z Completed 4.15.0-0.nightly-2023-11-25-110147
      2023-11-27T16:36:20Z 2023-11-27T17:09:36Z Completed 4.15.0-ec.1
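
      For clusters that are still reachable, the same history is available directly, without the gathered assets (guarding completionTime, which is unset while an update is still in flight):

      $ oc get clusterversion version -o json | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'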
      

      And the two Deployments are churning:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-api" and .involvedObject.kind == "Deployment") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | tail
      2023-11-27T20:01:04Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
      2023-11-27T20:01:24Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
      2023-11-27T20:04:54Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
      2023-11-27T20:05:14Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
      2023-11-27T20:08:45Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
      2023-11-27T20:09:04Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
      2023-11-27T20:12:35Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
      2023-11-27T20:12:53Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
      2023-11-27T20:16:26Z capi-operator-controller-manager ScalingReplicaSet: Scaled up replica set capi-operator-controller-manager-758844bdfb to 1
      2023-11-27T20:16:43Z capi-controller-manager ScalingReplicaSet: Scaled up replica set capi-controller-manager-7bbd8689f4 to 1
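
      The same event filter should also work against a live cluster, for anyone reproducing without the CI artifacts:

      $ oc -n openshift-cluster-api get events -o json | jq -r '.items[] | select(.involvedObject.kind == "Deployment") | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort | tail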
      

      Dropping into the audit logs:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1729173441141018624/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
      $ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r .verb | sort | uniq -c
          372 create
          348 delete
         1338 get
          261 list
           35 patch
          934 update
          205 watch
      $ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -c 'select((.verb == "create" or .verb == "delete") and .objectRef.resource == "deployments") | {verb, objectRef, username: .user.username}' | sort | uniq -c | sort -n | tail
            4 {"verb":"create","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-operator-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-version:default"}
           16 {"verb":"create","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
           16 {"verb":"delete","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
           16 {"verb":"delete","objectRef":{"resource":"deployments","namespace":"openshift-cluster-api","name":"capi-operator-controller-manager","apiGroup":"apps","apiVersion":"v1"},"username":"system:serviceaccount:openshift-cluster-api:default"}
      $ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r 'select(.objectRef.resource == "deployments" and (.objectRef.name == "capi-operator-controller-manager" or .objectRef.name == "capi-controller-manager") and .verb == "delete")  | .stageTimestamp + " " + .objectRef.name + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | sort | tail
      2023-11-27T20:01:24.397215Z capi-controller-manager capi-operator-controller-manager-758844bdfb-dbvgw
      2023-11-27T20:01:24.414671Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-dbvgw
      2023-11-27T20:09:04.585004Z capi-controller-manager capi-operator-controller-manager-758844bdfb-cvtql
      2023-11-27T20:09:04.604897Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-cvtql
      2023-11-27T20:12:53.359244Z capi-controller-manager capi-operator-controller-manager-758844bdfb-t9x2c
      2023-11-27T20:12:53.376019Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-t9x2c
      2023-11-27T20:16:43.492054Z capi-controller-manager capi-operator-controller-manager-758844bdfb-lkghh
      2023-11-27T20:16:43.507115Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-lkghh
      2023-11-27T20:20:34.492637Z capi-controller-manager capi-operator-controller-manager-758844bdfb-vhfq2
      2023-11-27T20:20:34.513301Z capi-operator-controller-manager capi-operator-controller-manager-758844bdfb-vhfq2
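
      A couple of additional views on the same audit data may be useful: interleaving the creates and deletes in one stream to see the alternation (and who performs each), and checking the userAgent of the deleting client. These are just extra slices of the logs above, not new evidence:

      $ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r 'select(.objectRef.resource == "deployments" and (.verb == "create" or .verb == "delete")) | .stageTimestamp + " " + .verb + " " + .objectRef.name + " " + .user.username' | sort | tail
      $ zgrep -h '"openshift-cluster-api"' kube-apiserver/*audit.log.gz | jq -r 'select(.objectRef.resource == "deployments" and .verb == "delete") | .userAgent' | sort | uniq -c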
      

      So that's:

      1. The cluster-version operator creating the capi-operator-controller-manager Deployment as requested by the cluster-API operator's manifest.
      2. The Deployment's pod asking to delete its own Deployment. This bug is about understanding this step; a log-inspection sketch follows this list.
      3. Return to step 1.
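
      A reasonable next step for understanding step 2 would be to look at what the operator pod logs around the time it issues the delete; the exact log content is not known yet, so this is only a starting point:

      $ oc -n openshift-cluster-api logs deployment/capi-operator-controller-manager --all-containers --since=30m | grep -i -e delete -e deployment | tail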

              Assignee: Damiano Donati (ddonati@redhat.com)
              Reporter: W. Trevor King (trking)
              QA Contact: Milind Yadav