OCPBUGS-30970

Upgrade from 4.15 to 4.16 fails because of kubelet reporting "Failed to register CRI auth plugins" error

    • Important
    • No
    • CLOUD Sprint 252
    • 1
    • Rejected
    • False
      * {product-title} {product-version} includes `gcr` and `acr` {op-system-base-full} credential providers so that future upgrades to later versions of {product-title} that require {op-system-base} compute nodes deployed on a cluster do not result in a failed installation. (link:https://issues.redhat.com/browse/OCPBUGS-30970[*OCPBUGS-30970*])
    • Enhancement
    • Done

      Description of problem:

      The upgrade from 4.15 to 4.16 fails because kubelet exits at startup with this error:
      
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411346    7755 kubelet.go:308] "Adding static pod path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411380    7755 file.go:69] "Watching path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411406    7755 kubelet.go:319] "Adding apiserver pod source"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411426    7755 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.414274    7755 kuberuntime_manager.go:257] "Container runtime initialized" containerRuntime="cri-o" version="1.28.4-4.rhaos4.15.git92d1839.el8" apiVersion="v1"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: E0315 17:03:31.414963    7755 kuberuntime_manager.go:273] "Failed to register CRI auth plugins" err="plugin binary executable /usr/libexec/kubelet-image-credential-provider-plugins/acr-credential-provider did not exist"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: Failed to start Kubernetes Kubelet.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Consumed 155ms CPU time
      
      
      
      
      
      We have seen this issue in the Prow job periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-workers-rhel8-f28 (a cluster with RHEL workers) and in manual upgrades of IPI clusters on GCP (clusters with CoreOS workers).
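When triaging several nodes, it can help to pull the missing binary path out of the journal automatically. The following is a minimal sketch, assuming the journal has been saved as plain text lines; the function name find_missing_plugins is hypothetical, and the pattern is based only on the error message quoted above.

```python
import re

# Pattern derived from the kubelet error quoted in this report.
# It captures the path of the plugin binary that kubelet could not find.
MISSING_PLUGIN_RE = re.compile(
    r'"Failed to register CRI auth plugins" '
    r'err="plugin binary executable (?P<path>\S+) did not exist"'
)


def find_missing_plugins(journal_lines):
    """Return the distinct plugin binary paths reported as missing, in order."""
    missing = []
    for line in journal_lines:
        match = MISSING_PLUGIN_RE.search(line)
        if match and match.group("path") not in missing:
            missing.append(match.group("path"))
    return missing


# Two lines copied from the journal excerpt in this report.
sample = [
    'Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.414274    7755 kuberuntime_manager.go:257] "Container runtime initialized" containerRuntime="cri-o" version="1.28.4-4.rhaos4.15.git92d1839.el8" apiVersion="v1"',
    'Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: E0315 17:03:31.414963    7755 kuberuntime_manager.go:273] "Failed to register CRI auth plugins" err="plugin binary executable /usr/libexec/kubelet-image-credential-provider-plugins/acr-credential-provider did not exist"',
]

missing_paths = find_missing_plugins(sample)
print(missing_paths)
```

The same function can be fed the lines of a saved journal file, one string per line.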
      
          

      Version-Release number of selected component (if applicable):

       Upgrade from 4.15.3 to 4.16.0-0.nightly-2024-03-13-061822
      
      oc get clusterversion -o yaml
      ...
          history:
          - acceptedRisks: |-
              Target release version="" image="registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b against keyrings: verifier-public-key-redhat
              [2024-03-15T15:33:11Z: prefix sha256-da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b in config map signatures-managed: no more signatures to check, 2024-03-15T15:33:11Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b/signature-1: no more signatures to check, 2024-03-15T15:33:11Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b/signature-1: no more signatures to check, 2024-03-15T15:33:11Z: parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check, 2024-03-15T15:33:11Z: serial signature store wrapping ClusterVersion signatureStores unset, falling back to default stores, parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check, 2024-03-15T15:33:11Z: serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", serial signature store wrapping ClusterVersion signatureStores unset, falling back to default stores, parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check]
              Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=True (), so the update from 4.15.3 to 4.16.0-0.nightly-2024-03-13-061822 is probably neither recommended nor supported.
            completionTime: null
            image: registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b
            startedTime: "2024-03-15T15:33:28Z"
            state: Partial
            verified: false
            version: 4.16.0-0.nightly-2024-03-13-061822
          - completionTime: "2024-03-15T13:33:08Z"
            image: registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:8e8c6c2645553e6df8eb7985d8cb322f333a4152453e2aa85fff24ac5e0755b0
            startedTime: "2024-03-15T13:02:04Z"
            state: Completed
            verified: false
            version: 4.15.3
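The history above can be read programmatically to confirm which update is stuck. This is a rough sketch, assuming the history list has already been loaded into Python dictionaries and that entries are ordered newest first, as they appear in the output above; summarize_latest_update is a hypothetical helper, not part of any OpenShift tooling.

```python
def summarize_latest_update(history):
    """Describe the most recent update attempt in a ClusterVersion history list.

    Assumes the list is ordered newest first, as in the output shown above.
    """
    latest = history[0]
    # The source version is taken to be the most recent earlier entry that completed.
    previous = next(
        (entry for entry in history[1:] if entry.get("state") == "Completed"),
        None,
    )
    return {
        "from": previous["version"] if previous else None,
        "to": latest["version"],
        "state": latest["state"],
        "finished": latest.get("completionTime") is not None,
    }


# Values copied from the clusterversion output in this report (other fields omitted).
history = [
    {
        "completionTime": None,
        "startedTime": "2024-03-15T15:33:28Z",
        "state": "Partial",
        "verified": False,
        "version": "4.16.0-0.nightly-2024-03-13-061822",
    },
    {
        "completionTime": "2024-03-15T13:33:08Z",
        "startedTime": "2024-03-15T13:02:04Z",
        "state": "Completed",
        "verified": False,
        "version": "4.15.3",
    },
]

summary = summarize_latest_update(history)
print(summary)
```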
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Upgrade from 4.15 to 4.16 using the Prow job periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-workers-rhel8-f28 or an IPI cluster on GCP.
          
          

      Actual results:

      Worker nodes do not join the cluster when they are rebooted:
      
      sh-4.4$ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-b566c3af4e215e2a77e6f9d9e5a988de   True      False      False      3              3                   3                     0                      3h59m
      worker   rendered-worker-21862c92d0f14a4842f6093f65571bd1   False     True       False      3              0                   0                     0                      3h59m
      
      sh-4.4$ oc get nodes
      NAME                                  STATUS                        ROLES                  AGE     VERSION
      ci-op-wb5fkm5k-e450c-s6m96-master-0   Ready                         control-plane,master   4h5m    v1.29.2+a0beecc
      ci-op-wb5fkm5k-e450c-s6m96-master-1   Ready                         control-plane,master   4h6m    v1.29.2+a0beecc
      ci-op-wb5fkm5k-e450c-s6m96-master-2   Ready                         control-plane,master   4h6m    v1.29.2+a0beecc
      ci-op-wb5fkm5k-e450c-s6m96-rhel-1     NotReady,SchedulingDisabled   worker                 3h17m   v1.28.7+6e2789b
      ci-op-wb5fkm5k-e450c-s6m96-rhel-2     Ready                         worker                 3h17m   v1.28.7+6e2789b
      ci-op-wb5fkm5k-e450c-s6m96-rhel-3     Ready                         worker                 3h17m   v1.28.7+6e2789b
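To spot the affected workers without reading the table by eye, the node listing can be filtered by status. The sketch below assumes the default whitespace-separated table printed by oc get nodes, as shown above; not_ready_nodes is a hypothetical helper name.

```python
def not_ready_nodes(table_text):
    """Return the names of nodes whose STATUS column does not include Ready."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    result = []
    for row in body:
        # STATUS can hold several comma-separated values,
        # for example NotReady,SchedulingDisabled.
        conditions = row[status_idx].split(",")
        if "Ready" not in conditions:
            result.append(row[name_idx])
    return result


# Table copied from the oc get nodes output in this report.
nodes_output = """
NAME                                  STATUS                        ROLES                  AGE     VERSION
ci-op-wb5fkm5k-e450c-s6m96-master-0   Ready                         control-plane,master   4h5m    v1.29.2+a0beecc
ci-op-wb5fkm5k-e450c-s6m96-master-1   Ready                         control-plane,master   4h6m    v1.29.2+a0beecc
ci-op-wb5fkm5k-e450c-s6m96-master-2   Ready                         control-plane,master   4h6m    v1.29.2+a0beecc
ci-op-wb5fkm5k-e450c-s6m96-rhel-1     NotReady,SchedulingDisabled   worker                 3h17m   v1.28.7+6e2789b
ci-op-wb5fkm5k-e450c-s6m96-rhel-2     Ready                         worker                 3h17m   v1.28.7+6e2789b
ci-op-wb5fkm5k-e450c-s6m96-rhel-3     Ready                         worker                 3h17m   v1.28.7+6e2789b
"""

stuck = not_ready_nodes(nodes_output)
print(stuck)
```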
      
      On the NotReady node, the kubelet journal shows this error:
      
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411346    7755 kubelet.go:308] "Adding static pod path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411380    7755 file.go:69] "Watching path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411406    7755 kubelet.go:319] "Adding apiserver pod source"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411426    7755 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.414274    7755 kuberuntime_manager.go:257] "Container runtime initialized" containerRuntime="cri-o" version="1.28.4-4.rhaos4.15.git92d1839.el8" apiVersion="v1"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: E0315 17:03:31.414963    7755 kuberuntime_manager.go:273] "Failed to register CRI auth plugins" err="plugin binary executable /usr/libexec/kubelet-image-credential-provider-plugins/acr-credential-provider did not exist"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: Failed to start Kubernetes Kubelet.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Consumed 155ms CPU time
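Because the error says a plugin binary is absent, a quick check on the node is to confirm whether the expected files exist and are executable. The following is a small sketch under stated assumptions: the directory and the acr-credential-provider name come from the error message in this report, the list of required plugin names must be supplied by the caller, and missing_plugin_binaries is a hypothetical helper.

```python
import os

# Directory taken from the kubelet error message in this report.
PLUGIN_DIR = "/usr/libexec/kubelet-image-credential-provider-plugins"


def missing_plugin_binaries(names, plugin_dir=PLUGIN_DIR):
    """Return full paths of plugin binaries that are absent or not executable."""
    missing = []
    for name in names:
        path = os.path.join(plugin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(path)
    return missing


# Example call for the binary named in the error; the result depends on the host.
print(missing_plugin_binaries(["acr-credential-provider"]))
```

An empty list means every named binary is present and executable in the given directory.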
      
          

      Expected results:

      The upgrade should complete without failures.
      
          

      Additional info:

      The must-gather file and the journal logs are available in the first comment.
      
          

              rh-ee-tbarberb Theo Barber-Bany
              sregidor@redhat.com Sergio Regidor de la Rosa
              Zhaohua Sun Zhaohua Sun