OpenShift Bugs / OCPBUGS-30970

Upgrade from 4.15 to 4.16 fails because kubelet reports a "Failed to register CRI auth plugins" error


Details

    • Important
    • No
    • CLOUD Sprint 252
    • 1
    • Rejected
    • False
    • Adds gcr and acr credential providers to 4.15 for RHEL. This allows the upgrade to complete on IPI OCP clusters with RHEL workers. The ordering of the upgrade process could lead to a state where the cluster upgrade had completed but the upgrade of the RHEL workers had not, meaning the required packages were not present and kubelet failed to start.
    • Bug Fix
    • In Progress

    Description

      Description of problem:

      Upgrade from 4.15 to 4.16 is failing because kubelet reports this error:
      
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411346    7755 kubelet.go:308] "Adding static pod path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411380    7755 file.go:69] "Watching path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411406    7755 kubelet.go:319] "Adding apiserver pod source"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411426    7755 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.414274    7755 kuberuntime_manager.go:257] "Container runtime initialized" containerRuntime="cri-o" version="1.28.4-4.rhaos4.15.git92d1839.el8" apiVersion="v1"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: E0315 17:03:31.414963    7755 kuberuntime_manager.go:273] "Failed to register CRI auth plugins" err="plugin binary executable /usr/libexec/kubelet-image-credential-provider-plugins/acr-credential-provider did not exist"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: Failed to start Kubernetes Kubelet.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Consumed 155ms CPU time
      
      
      
      
      
      We have seen this issue in the prow job periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-workers-rhel8-f28 (a cluster with RHEL workers) and in manual upgrades of IPI on GCP clusters (clusters with CoreOS workers).
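      
      For context on the error above: kubelet only registers CRI auth plugins for the providers listed in the CredentialProviderConfig file passed via --image-credential-provider-config, and each listed provider must have an executable of the same name in the directory passed via --image-credential-provider-bin-dir (here /usr/libexec/kubelet-image-credential-provider-plugins). A minimal sketch of such a config is shown below; the matchImages patterns and cache duration are illustrative only, not the exact configuration shipped on these nodes, and the gcr provider name is assumed to mirror the acr one from the error:
      
      apiVersion: kubelet.config.k8s.io/v1
      kind: CredentialProviderConfig
      providers:
      - name: acr-credential-provider     # binary kubelet could not find on the failing node
        apiVersion: credentialprovider.kubelet.k8s.io/v1
        matchImages:                      # illustrative registry patterns
        - "*.azurecr.io"
        defaultCacheDuration: "10m"
      - name: gcr-credential-provider     # assumed counterpart for the gcr provider
        apiVersion: credentialprovider.kubelet.k8s.io/v1
        matchImages:
        - "gcr.io"
        - "*.gcr.io"
        defaultCacheDuration: "10m"
      
      As the systemd lines above show, when a configured provider binary is absent from the bin dir, kubelet exits during initialization, so the node never becomes Ready.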
      
          

      Version-Release number of selected component (if applicable):

       Upgrade from 4.15.3 to 4.16.0-0.nightly-2024-03-13-061822
      
      oc get clusterversion -o yaml
      ...
          history:
          - acceptedRisks: |-
              Target release version="" image="registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b against keyrings: verifier-public-key-redhat
              [2024-03-15T15:33:11Z: prefix sha256-da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b in config map signatures-managed: no more signatures to check, 2024-03-15T15:33:11Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b/signature-1: no more signatures to check, 2024-03-15T15:33:11Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b/signature-1: no more signatures to check, 2024-03-15T15:33:11Z: parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check, 2024-03-15T15:33:11Z: serial signature store wrapping ClusterVersion signatureStores unset, falling back to default stores, parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check, 2024-03-15T15:33:11Z: serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", serial signature store wrapping ClusterVersion signatureStores unset, falling back to default stores, parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check]
              Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=True (), so the update from 4.15.3 to 4.16.0-0.nightly-2024-03-13-061822 is probably neither recommended nor supported.
            completionTime: null
            image: registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b
            startedTime: "2024-03-15T15:33:28Z"
            state: Partial
            verified: false
            version: 4.16.0-0.nightly-2024-03-13-061822
          - completionTime: "2024-03-15T13:33:08Z"
            image: registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:8e8c6c2645553e6df8eb7985d8cb322f333a4152453e2aa85fff24ac5e0755b0
            startedTime: "2024-03-15T13:02:04Z"
            state: Completed
            verified: false
            version: 4.15.3
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Upgrade from 4.15 to 4.16 using prow job periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-workers-rhel8-f28 or an IPI on GCP cluster.
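          
          For a manual reproduction outside the prow job, the forced upgrade to the nightly payload can be triggered roughly as sketched below (an illustrative invocation; the clusterversion history above only shows that the update was forced, not the exact command used):
          
          # illustrative: force the update to the 4.16 nightly payload used by the job
          oc adm upgrade --to-image registry.build04.ci.openshift.org/ci-op-wb5fkm5k/release@sha256:da22f0582a13f19aae1792c6de2e3cc348c3ed1af67c1fbb5a9960833931341b \
            --allow-explicit-upgrade --force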
          
          

      Actual results:

      Worker nodes do not rejoin the cluster after they are rebooted:
      
      sh-4.4$ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-b566c3af4e215e2a77e6f9d9e5a988de   True      False      False      3              3                   3                     0                      3h59m
      worker   rendered-worker-21862c92d0f14a4842f6093f65571bd1   False     True       False      3              0                   0                     0                      3h59m
      
      sh-4.4$ oc get nodes
      NAME                                  STATUS                        ROLES                  AGE     VERSION
      ci-op-wb5fkm5k-e450c-s6m96-master-0   Ready                         control-plane,master   4h5m    v1.29.2+a0beecc
      ci-op-wb5fkm5k-e450c-s6m96-master-1   Ready                         control-plane,master   4h6m    v1.29.2+a0beecc
      ci-op-wb5fkm5k-e450c-s6m96-master-2   Ready                         control-plane,master   4h6m    v1.29.2+a0beecc
      ci-op-wb5fkm5k-e450c-s6m96-rhel-1     NotReady,SchedulingDisabled   worker                 3h17m   v1.28.7+6e2789b
      ci-op-wb5fkm5k-e450c-s6m96-rhel-2     Ready                         worker                 3h17m   v1.28.7+6e2789b
      ci-op-wb5fkm5k-e450c-s6m96-rhel-3     Ready                         worker                 3h17m   v1.28.7+6e2789b
      
      On the NotReady node we can see this error in the kubelet logs:
      
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411346    7755 kubelet.go:308] "Adding static pod path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411380    7755 file.go:69] "Watching path" path="/etc/kubernetes/manifests"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411406    7755 kubelet.go:319] "Adding apiserver pod source"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.411426    7755 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: I0315 17:03:31.414274    7755 kuberuntime_manager.go:257] "Container runtime initialized" containerRuntime="cri-o" version="1.28.4-4.rhaos4.15.git92d1839.el8" apiVersion="v1"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 kubenswrapper[7755]: E0315 17:03:31.414963    7755 kuberuntime_manager.go:273] "Failed to register CRI auth plugins" err="plugin binary executable /usr/libexec/kubelet-image-credential-provider-plugins/acr-credential-provider did not exist"
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: Failed to start Kubernetes Kubelet.
      Mar 15 17:03:31 ci-op-wb5fkm5k-e450c-s6m96-rhel-1 systemd[1]: kubelet.service: Consumed 155ms CPU time
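      
      To confirm the root cause on the NotReady node, it is enough to check whether the provider binaries referenced by kubelet actually exist (the directory path is taken from the error above):
      
      # on the affected RHEL worker
      ls -l /usr/libexec/kubelet-image-credential-provider-plugins/
      # show where the credential-provider flags are set (they may live in a drop-in or in the kubelet config file)
      systemctl cat kubelet.service | grep -i image-credential-provider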
      
          

      Expected results:

      The upgrade should complete without failures.
      
          

      Additional info:

      In the first comment you can find the must-gather file and the journal logs.
      
          

          People

            rh-ee-tbarberb Theo Barber-Bany
            sregidor@redhat.com Sergio Regidor de la Rosa
            Zhaohua Sun Zhaohua Sun
            Votes: 0
            Watchers: 12