Zero Trust Workload Identity Manager / SPIRE-68

Updating the operand CR spec sometimes does not trigger reconciliation


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Sprint: OAPE Sprint 279

      This has been observed for quite some time. It is more likely to occur in patch/recreation scenarios: the CR was created earlier, and a day-2 operation such as a patch does not trigger the reconciliation loop.

      For example, when fields in the SpireServer CR spec such as resources, affinity, and other common configuration are updated, the operator sometimes does not propagate the new values to the underlying StatefulSet. The update does not trigger a reconciliation, so the changes to the CR spec are ignored.

      Example e2e log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_zero-trust-workload-identity-manager/30/pull-ci-openshift-zero-trust-workload-identity-manager-main-e2e-operator/1949849381674946560/artifacts/e2e-operator/test/build-log.txt

        STEP: Patching SpireServer object with resource specifications @ 07/28/25 15:19:54.923
        STEP: Waiting for SPIRE Server StatefulSet rolling update to start @ 07/28/25 15:19:54.942
        statefulset 'zero-trust-workload-identity-manager/spire-server' no rolling update in progress (current=spire-server-744fc6bfbd, update=spire-server-744fc6bfbd)
        statefulset 'zero-trust-workload-identity-manager/spire-server' no rolling update in progress (current=spire-server-744fc6bfbd, update=spire-server-744fc6bfbd) 
      ...
      [timeout after 2mins]
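The e2e step above appears to poll until the StatefulSet's current and update revisions diverge. A rough manual equivalent is sketched below, assuming the standard StatefulSet `.status.currentRevision` and `.status.updateRevision` fields; the function names, the 120-second default timeout, and the 5-second poll interval are illustrative and not taken from the e2e suite.

```shell
#!/usr/bin/env bash
# Sketch only: function names, default timeout, and poll interval are hypothetical,
# not taken from the e2e suite.

# Succeeds (exit 0) when a rolling update is in progress, i.e. the StatefulSet's
# update revision is set and differs from its current revision.
rollout_in_progress() {
  local current="$1" update="$2"
  [ -n "$update" ] && [ "$current" != "$update" ]
}

# Polls the StatefulSet until a rolling update starts or the timeout (seconds) expires.
wait_for_rollout_start() {
  local ns="$1" name="$2" timeout="${3:-120}" elapsed=0 cur="" upd=""
  while [ "$elapsed" -lt "$timeout" ]; do
    cur=$(oc get statefulset "$name" -n "$ns" -o jsonpath='{.status.currentRevision}')
    upd=$(oc get statefulset "$name" -n "$ns" -o jsonpath='{.status.updateRevision}')
    if rollout_in_progress "$cur" "$upd"; then
      echo "rolling update started (current=$cur, update=$upd)"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "no rolling update in progress (current=$cur, update=$upd)" >&2
  return 1
}
```

For example, after patching the CR, run `wait_for_rollout_start zero-trust-workload-identity-manager spire-server`. Once a rollout completes, the current revision catches up with the update revision, so a very fast rollout could finish between polls; comparing the StatefulSet's `.metadata.generation` with `.status.observedGeneration` is a complementary check.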

      The operator pod log shows nothing suspicious: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_zero-trust-workload-identity-manager/30/pull-ci-openshift-zero-trust-workload-identity-manager-main-e2e-operator/1949849381674946560/artifacts/e2e-operator/gather-extra/artifacts/pods/zero-trust-workload-identity-manager_zero-trust-workload-identity-manager-controller-manager-ccbxbv6_manager.log

      Additionally, if I manually delete the operator controller-manager pod to force a restart, everything appears to return to normal and reconciliation starts.

      oc delete po -n zero-trust-workload-identity-manager zero-trust-workload-identity-manager-controller-manager-5crjdcz
      
      # or
      oc rollout restart deployment/zero-trust-workload-identity-manager-controller-manager -n zero-trust-workload-identity-manager 

      Expected behavior

      Updating the operand CR (e.g., changing spec.resources) should trigger reconciliation and cause the operator to update the corresponding StatefulSet/DaemonSet with the new values, without requiring a manual restart of the operator.

      Steps to reproduce

      • Apply a SpireServer/SpireAgent CR with spec.resources unset.
      export APP_DOMAIN=apps.$(oc get dns cluster -o jsonpath='{ .spec.baseDomain }')
      export CLUSTER_NAME=test01
      oc apply -f - <<EOF
      apiVersion: operator.openshift.io/v1alpha1
      kind: SpireServer
      metadata:
        name: cluster
      spec:
        trustDomain: $APP_DOMAIN
        clusterName: $CLUSTER_NAME
        caSubject:
          commonName: $APP_DOMAIN
          country: "US"
          organization: "RH"
        persistence:
          type: pvc
          size: "2Gi"
          accessMode: ReadWriteOncePod
        datastore:
          databaseType: sqlite3
          connectionString: "/run/spire/data/datastore.sqlite3"
          maxOpenConns: 100
          maxIdleConns: 2
          connMaxLifetime: 3600
      EOF
      
      oc get po -l app.kubernetes.io/name=server -n zero-trust-workload-identity-manager
      NAME             READY   STATUS    RESTARTS       AGE
      spire-server-0   2/2     Running   1 (108s ago)   111s 
      • Update the CR's spec.resources (or any other optional field).
      oc patch SpireServer cluster --type=merge -p="
      spec:
        resources:
          limits:
            cpu: 200m
            memory: 64Mi
          requests:
            cpu: 50m
            memory: 32Mi
      "
      
      # Vice versa: if spec.resources is already set, setting it to null can also trigger the issue.
      • Observe that the StatefulSet remains unchanged.
      oc get statefulset spire-server -n zero-trust-workload-identity-manager -o=jsonpath="{.spec.template.spec.containers[*].resources}"
      
      oc get pod -l app.kubernetes.io/name=server -n zero-trust-workload-identity-manager -o yaml 
      • Check the operator logs – reconciliation is not triggered.
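The check in step 3 can be scripted. This sketch assumes the operator copies `spec.resources.limits.cpu` from the SpireServer CR into at least one container of the `spire-server` StatefulSet; the helper names and the 60-second default wait are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, the default wait, and the assumption that the operator
# copies spec.resources.limits.cpu into the StatefulSet containers are hypothetical.

# Succeeds if the expected value appears among the space-separated values in $2.
value_present() {
  local expected="$1" values="$2" v
  for v in $values; do
    [ "$v" = "$expected" ] && return 0
  done
  return 1
}

# Waits up to $3 seconds for the spire-server StatefulSet to show the expected CPU limit.
check_cpu_limit_propagated() {
  local ns="$1" expected="$2" timeout="${3:-60}" elapsed=0 actual=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # Space-separated CPU limits of all containers in the pod template.
    actual=$(oc get statefulset spire-server -n "$ns" \
      -o jsonpath='{.spec.template.spec.containers[*].resources.limits.cpu}')
    if value_present "$expected" "$actual"; then
      echo "propagated: cpu limit $expected"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "not propagated after ${timeout}s (containers report: '${actual}')" >&2
  return 1
}
```

For example, run `check_cpu_limit_propagated zero-trust-workload-identity-manager 200m` right after the `oc patch` above. Because the problem is intermittent, repeating the patch and the check a few times may make it easier to catch.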

      More context: I first noticed this while testing CR recreation (https://redhat-internal.slack.com/archives/C08TX9H7CCF/p1749219928264279?thread_ts=1749103562.892369&cid=C08TX9H7CCF); the second time was while debugging the e2e test for CR common configs, which patches the CR. It might be related to a cache issue in the operator controller-manager.

              rh-ee-aagnihot Anirudh Agnihotri
              rh-ee-yuewu Yuedong Wu