OpenShift Bugs: OCPBUGS-5944

SNOs blocked on upgrade because "the cluster operator monitoring has not yet successfully rolled out"

    Description

      Description of problem:

      Attempted to upgrade 3423 SNOs from 4.10.32 to 4.11.5 in a large-scale ACM/ZTP environment. 9 clusters did not upgrade because their clusterversion objects were stuck on "Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out"

      Version-Release number of selected component (if applicable):

      SNO OCP 4.10.32 (Clusters with issue) attempting to be upgraded to 4.11.5
      Hub OCP 4.11.19
      ACM Version - 2.7.0-DOWNSTREAM-2023-01-12-20-55-01

      How reproducible:

      9 of the 84 upgrade failures (~11% of the failures)
      9 of the 3423 clusters on which an upgrade was attempted (~0.26%)

      Additional info:

      # cat platform_nonattempt_monitoring | xargs -I % sh -c "echo -n '% '; oc --kubeconfig=/root/hv-vm/sno/manifests/%/kubeconfig get clusterversion --no-headers"
      sno00269 version   4.10.32   True   False   3d14h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno00339 version   4.10.32   True   False   3d15h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno00585 version   4.10.32   True   False   3d13h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno00740 version   4.10.32   True   False   3d13h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno01839 version   4.10.32   True   False   3d12h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno02881 version   4.10.32   True   False   3d9h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno02986 version   4.10.32   True   False   3d9h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno03030 version   4.10.32   True   False   3d8h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
      sno03053 version   4.10.32   True   False   3d8h   Error while reconciling 4.10.32: the cluster operator monitoring has not yet successfully rolled out
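The listing above can be filtered mechanically when the input file contains a mix of healthy and stuck clusters. The following is a minimal sketch, not part of the original report: `stuck_on_monitoring` is a hypothetical helper name, and it assumes each input line has the cluster name as its first field followed by the clusterversion row, as produced by the loop above.

```shell
# Hypothetical helper (not from the original report).
# Reads "<cluster> <clusterversion row>" lines on stdin and prints the
# names of clusters whose status message blames the monitoring operator.
stuck_on_monitoring() {
  awk '/the cluster operator monitoring has not yet successfully rolled out/ { print $1 }'
}
```

Piping the output of the xargs loop above into `stuck_on_monitoring` would print only the affected cluster names, one per line.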

      Output of oc describe for the monitoring cluster operator (sno00269):

       

      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00269/kubeconfig describe co monitoring
      Name:         monitoring
      Namespace:
      Labels:       <none>
      Annotations:  include.release.openshift.io/ibm-cloud-managed: true
                    include.release.openshift.io/self-managed-high-availability: true
                    include.release.openshift.io/single-node-developer: true
      API Version:  config.openshift.io/v1
      Kind:         ClusterOperator
      Metadata:
        Creation Timestamp:  2023-01-14T03:38:25Z
        Generation:          1
        Managed Fields:
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                .:
                f:include.release.openshift.io/ibm-cloud-managed:
                f:include.release.openshift.io/self-managed-high-availability:
                f:include.release.openshift.io/single-node-developer:
              f:ownerReferences:
                .:
                k:{"uid":"5a53fb27-4659-406c-b8ea-ce6b4ba103cf"}:
            f:spec:
          Manager:      Go-http-client
          Operation:    Update
          Time:         2023-01-14T03:38:25Z
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:status:
              .:
              f:conditions:
              f:extension:
              f:relatedObjects:
              f:versions:
          Manager:      Go-http-client
          Operation:    Update
          Subresource:  status
          Time:         2023-01-14T04:12:26Z
        Owner References:
          API Version:     config.openshift.io/v1
          Kind:            ClusterVersion
          Name:            version
          UID:             5a53fb27-4659-406c-b8ea-ce6b4ba103cf
        Resource Version:  1376758
        UID:               4049a1c8-0c22-4401-8a59-2a6b49ebcc89
      Spec:
      Status:
        Conditions:
          Last Transition Time:  2023-01-17T19:13:08Z
          Message:               Rolling out the stack.
          Reason:                RollOutInProgress
          Status:                True
          Type:                  Progressing
          Last Transition Time:  2023-01-14T04:41:39Z
          Message:               Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 1 replicas, got 0 updated replicas          
          Reason:                UpdatingPrometheusK8SFailed
          Status:                True
          Type:                  Degraded
          Last Transition Time:  2023-01-14T04:12:26Z
          Status:                True
          Type:                  Upgradeable
          Last Transition Time:  2023-01-14T04:41:39Z
          Message:               Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
          Reason:                UpdatingPrometheusK8SFailed
          Status:                False
          Type:                  Available
        Extension:               <nil>
        Related Objects:
          Group:
          Name:      openshift-monitoring
          Resource:  namespaces
          Group:
          Name:      openshift-user-workload-monitoring
          Resource:  namespaces
          Group:     monitoring.coreos.com
          Name:
          Resource:  servicemonitors
          Group:     monitoring.coreos.com
          Name:
          Resource:  podmonitors
          Group:     monitoring.coreos.com
          Name:
          Resource:  prometheusrules
          Group:     monitoring.coreos.com
          Name:
          Resource:  alertmanagers
          Group:     monitoring.coreos.com
          Name:
          Resource:  prometheuses
          Group:     monitoring.coreos.com
          Name:
          Resource:  thanosrulers
          Group:     monitoring.coreos.com
          Name:
          Resource:  alertmanagerconfigs
        Versions:
          Name:     operator
          Version:  4.10.32
      Events:       <none>
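When the same describe output has to be compared across several clusters, it helps to reduce the Conditions block to one line per condition. The sketch below is an illustration only: `summarize_conditions` is a hypothetical helper, and it assumes the field order shown above (Last Transition Time first, Type last within each condition). A condition without a Reason, such as Upgradeable here, is printed with "-".

```shell
# Hypothetical helper (not from the original report).
# Reads `oc describe co <name>` output on stdin and prints one line per
# condition in the form "<Type>=<Status> (<Reason>)".
summarize_conditions() {
  awk '
    BEGIN { reason = "-"; status = "-" }
    # Each condition starts with its "Last Transition Time" line.
    $1 == "Last" && $2 == "Transition" { reason = "-"; status = "-" }
    $1 == "Reason:" { reason = $2 }
    $1 == "Status:" { status = $2 }
    # "Type" is the last field of a condition, so emit the summary here.
    $1 == "Type:"   { print $2 "=" status " (" reason ")" }
  '
}
```

For the output above, this would be fed the describe text on stdin, for example by piping the `oc describe co monitoring` command into `summarize_conditions`.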
      

      Pods, Deployments, and StatefulSets in the openshift-monitoring namespace of each affected cluster:

       

      # cat platform_nonattempt_monitoring | xargs -I % sh -c "echo '% '; oc --kubeconfig=/root/hv-vm/sno/manifests/%/kubeconfig get po,deploy,sts -n openshift-monitoring"
      sno00269
      NAME                                               READY   STATUS        RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-lsb57   2/2     Running       0          3d15h
      pod/kube-state-metrics-65f656cd75-h2sff            3/3     Running       0          3d15h
      pod/node-exporter-dbn7n                            2/2     Running       0          3d15h
      pod/openshift-state-metrics-7bc54ff57d-mwn9t       3/3     Running       0          3d15h
      pod/prometheus-adapter-56885c749b-cv7cn            0/1     Terminating   0          3d15h
      pod/prometheus-adapter-5b8f744487-f96tk            1/1     Running       0          2d15h
      pod/prometheus-k8s-0                               5/6     Running       0          3d14h
      pod/prometheus-operator-5bcc58f4c6-p9gjl           2/2     Running       0          3d15h
      pod/thanos-querier-654c96c58c-jdp2f                6/6     Running       0          3d14h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d15h
      deployment.apps/kube-state-metrics            1/1     1            1           3d15h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d15h
      deployment.apps/prometheus-adapter            1/1     1            1           3d15h
      deployment.apps/prometheus-operator           1/1     1            1           3d15h
      deployment.apps/thanos-querier                1/1     1            1           3d15h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d15h
      sno00339
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-96vfp   2/2     Running   0          3d15h
      pod/kube-state-metrics-65f656cd75-xtb25            3/3     Running   0          3d15h
      pod/node-exporter-6qqsc                            2/2     Running   0          3d15h
      pod/openshift-state-metrics-7bc54ff57d-vdgn8       3/3     Running   0          3d15h
      pod/prometheus-adapter-7fbcfd64cb-k9w9r            1/1     Running   0          2d15h
      pod/prometheus-k8s-0                               5/6     Running   0          3d14h
      pod/prometheus-operator-5bcc58f4c6-m65d4           2/2     Running   0          3d15h
      pod/thanos-querier-574d6b9d65-qhq2s                6/6     Running   0          3d14h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d15h
      deployment.apps/kube-state-metrics            1/1     1            1           3d15h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d15h
      deployment.apps/prometheus-adapter            1/1     1            1           3d15h
      deployment.apps/prometheus-operator           1/1     1            1           3d15h
      deployment.apps/thanos-querier                1/1     1            1           3d15h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d15h
      sno00585
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-4xdkf   2/2     Running   0          3d14h
      pod/kube-state-metrics-65f656cd75-cnzg5            3/3     Running   0          3d14h
      pod/node-exporter-nkjr4                            2/2     Running   0          3d14h
      pod/openshift-state-metrics-7bc54ff57d-4c5mb       3/3     Running   0          3d14h
      pod/prometheus-adapter-5cbf8c999c-phts8            1/1     Running   0          2d14h
      pod/prometheus-k8s-0                               5/6     Running   0          3d13h
      pod/prometheus-operator-5bcc58f4c6-rsdm8           2/2     Running   0          3d14h
      pod/thanos-querier-86bd4c9689-xpwbm                6/6     Running   0          3d13h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d14h
      deployment.apps/kube-state-metrics            1/1     1            1           3d14h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d14h
      deployment.apps/prometheus-adapter            1/1     1            1           3d14h
      deployment.apps/prometheus-operator           1/1     1            1           3d14h
      deployment.apps/thanos-querier                1/1     1            1           3d14h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d14h
      sno00740
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-d4d2l   2/2     Running   0          3d14h
      pod/kube-state-metrics-65f656cd75-n5d4d            3/3     Running   0          3d13h
      pod/node-exporter-gld9s                            2/2     Running   0          3d13h
      pod/openshift-state-metrics-7bc54ff57d-zq84w       3/3     Running   0          3d13h
      pod/prometheus-adapter-76685f6975-xrxj4            1/1     Running   0          2d14h
      pod/prometheus-k8s-0                               5/6     Running   0          3d13h
      pod/prometheus-operator-5bcc58f4c6-svqvl           2/2     Running   0          3d13h
      pod/thanos-querier-6df9c5d9d8-q9mcx                6/6     Running   0          3d13h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d14h
      deployment.apps/kube-state-metrics            1/1     1            1           3d13h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d13h
      deployment.apps/prometheus-adapter            1/1     1            1           3d13h
      deployment.apps/prometheus-operator           1/1     1            1           3d13h
      deployment.apps/thanos-querier                1/1     1            1           3d13h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d13h
      sno01839
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-fg4jn   2/2     Running   0          3d12h
      pod/kube-state-metrics-65f656cd75-f7ssb            3/3     Running   0          3d12h
      pod/node-exporter-rqjxb                            2/2     Running   0          3d12h
      pod/openshift-state-metrics-7bc54ff57d-rc4l8       3/3     Running   0          3d12h
      pod/prometheus-adapter-d69ddd58f-czns5             1/1     Running   0          2d12h
      pod/prometheus-k8s-0                               5/6     Running   0          3d11h
      pod/prometheus-operator-5bcc58f4c6-vrlgn           2/2     Running   0          3d12h
      pod/thanos-querier-59d66f4fbc-p2j88                6/6     Running   0          3d11h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d12h
      deployment.apps/kube-state-metrics            1/1     1            1           3d12h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d12h
      deployment.apps/prometheus-adapter            1/1     1            1           3d12h
      deployment.apps/prometheus-operator           1/1     1            1           3d12h
      deployment.apps/thanos-querier                1/1     1            1           3d12h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d12h
      sno02881
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-ct7wn   2/2     Running   0          3d10h
      pod/kube-state-metrics-65f656cd75-v6fwk            3/3     Running   0          3d9h
      pod/node-exporter-94dfj                            2/2     Running   0          3d9h
      pod/openshift-state-metrics-7bc54ff57d-vflml       3/3     Running   0          3d9h
      pod/prometheus-adapter-86d8779cf5-2nnjk            1/1     Running   0          2d10h
      pod/prometheus-k8s-0                               5/6     Running   0          3d9h
      pod/prometheus-operator-5bcc58f4c6-hlxvl           2/2     Running   0          3d9h
      pod/thanos-querier-5868669ccc-xtgxl                6/6     Running   0          3d9h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d10h
      deployment.apps/kube-state-metrics            1/1     1            1           3d9h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d9h
      deployment.apps/prometheus-adapter            1/1     1            1           3d9h
      deployment.apps/prometheus-operator           1/1     1            1           3d9h
      deployment.apps/thanos-querier                1/1     1            1           3d9h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d9h
      sno02986
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-w45s9   2/2     Running   0          3d9h
      pod/kube-state-metrics-65f656cd75-mc4hf            3/3     Running   0          3d9h
      pod/node-exporter-trpm9                            2/2     Running   0          3d9h
      pod/openshift-state-metrics-7bc54ff57d-mj2dq       3/3     Running   0          3d9h
      pod/prometheus-adapter-7d49897cbc-kp5gq            1/1     Running   0          2d10h
      pod/prometheus-k8s-0                               5/6     Running   0          3d8h
      pod/prometheus-operator-5bcc58f4c6-6qh77           2/2     Running   0          3d9h
      pod/thanos-querier-975b69457-55rkg                 6/6     Running   0          3d8h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d10h
      deployment.apps/kube-state-metrics            1/1     1            1           3d9h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d9h
      deployment.apps/prometheus-adapter            1/1     1            1           3d9h
      deployment.apps/prometheus-operator           1/1     1            1           3d9h
      deployment.apps/thanos-querier                1/1     1            1           3d9h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d9h
      sno03030
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-bn94z   2/2     Running   0          3d9h
      pod/kube-state-metrics-65f656cd75-42lkj            3/3     Running   0          3d8h
      pod/node-exporter-48xz2                            2/2     Running   0          3d8h
      pod/openshift-state-metrics-7bc54ff57d-q28cm       3/3     Running   0          3d8h
      pod/prometheus-adapter-5bcbfdc959-wrhgv            1/1     Running   0          2d9h
      pod/prometheus-k8s-0                               5/6     Running   0          3d7h
      pod/prometheus-operator-5bcc58f4c6-ql2hw           2/2     Running   0          3d8h
      pod/thanos-querier-69dbf8c49b-mkcf5                6/6     Running   0          3d7h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d9h
      deployment.apps/kube-state-metrics            1/1     1            1           3d8h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d8h
      deployment.apps/prometheus-adapter            1/1     1            1           3d8h
      deployment.apps/prometheus-operator           1/1     1            1           3d8h
      deployment.apps/thanos-querier                1/1     1            1           3d8h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d8h
      sno03053
      NAME                                               READY   STATUS    RESTARTS   AGE
      pod/cluster-monitoring-operator-556f6847dd-gbgxv   2/2     Running   0          3d9h
      pod/kube-state-metrics-65f656cd75-8k4hc            3/3     Running   0          3d8h
      pod/node-exporter-wgntq                            2/2     Running   0          3d8h
      pod/openshift-state-metrics-7bc54ff57d-mhfrd       3/3     Running   0          3d8h
      pod/prometheus-adapter-ffb468559-9kxgh             1/1     Running   0          2d9h
      pod/prometheus-k8s-0                               5/6     Running   0          3d7h
      pod/prometheus-operator-5bcc58f4c6-b265g           2/2     Running   0          3d8h
      pod/thanos-querier-84bf9c8645-gv6bz                6/6     Running   0          3d7h
      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/cluster-monitoring-operator   1/1     1            1           3d9h
      deployment.apps/kube-state-metrics            1/1     1            1           3d8h
      deployment.apps/openshift-state-metrics       1/1     1            1           3d8h
      deployment.apps/prometheus-adapter            1/1     1            1           3d8h
      deployment.apps/prometheus-operator           1/1     1            1           3d8h
      deployment.apps/thanos-querier                1/1     1            1           3d8h
      NAME                              READY   AGE
      statefulset.apps/prometheus-k8s   0/1     3d8h
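In every listing above, prometheus-k8s-0 reports 5/6 containers ready. A small filter makes such pods stand out in long output. This is a sketch under the assumption that the input is the tabular `oc get po,deploy,sts` output shown above; `not_fully_ready` is a hypothetical helper name, not something used in the original investigation.

```shell
# Hypothetical helper (not from the original report).
# Reads `oc get po,...` table output on stdin and prints name, READY and
# STATUS for every pod whose ready count is below its container count.
not_fully_ready() {
  awk '$1 ~ /^pod\// && $2 ~ /^[0-9]+\/[0-9]+$/ {
         split($2, r, "/")                  # r[1] = ready, r[2] = total
         if (r[1] + 0 < r[2] + 0) print $1, $2, $3
       }'
}
```

Only rows whose name starts with `pod/` are considered, so the Deployment and StatefulSet rows are ignored even though their READY column has the same n/m form.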
      

       

      Attachments

        1. must-gather-sno00269.tar.gz
          64.75 MB
          Alex Krzos
        2. must-gather-sno00339.tar.gz
          64.23 MB
          Alex Krzos
        3. sosreport-sno00269-2023-01-17-ketangs.tar.xz
          31.56 MB
          Alex Krzos
        4. sosreport-sno00339-2023-01-17-cuexesl.tar.xz
          31.61 MB
          Alex Krzos

          People

            rphillip@redhat.com Ryan Phillips
            akrzos@redhat.com Alex Krzos
            Junqi Zhao