OpenShift Hive / HIVE-2322

Cluster failed to resume from hibernation


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined

      While resuming from hibernation, an internal production cluster fired a ClusterHasGoneMissing (CHGM) alert. Upon investigation, the only nodes in a Ready state were the master nodes:

      [tnierman@tnierman-thinkpadp1gen5] >> oc get no
      E1003 15:38:47.246605  414447 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:38:54.604838  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:38:54.722493  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:38:54.840969  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      NAME                                         STATUS     ROLES                  AGE     VERSION
      ip-10-0-129-232.us-west-2.compute.internal   Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
      ip-10-0-129-7.us-west-2.compute.internal     NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
      ip-10-0-144-47.us-west-2.compute.internal    NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
      ip-10-0-156-168.us-west-2.compute.internal   Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
      ip-10-0-161-53.us-west-2.compute.internal    Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
      ip-10-0-169-246.us-west-2.compute.internal   NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
      
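For triage, the NotReady workers can be isolated from the `oc get nodes` listing with a quick filter. This is just a sketch; it assumes the current kubeconfig context points at the affected cluster:

```shell
# Print the names of nodes whose STATUS column reads NotReady.
# $2 is the STATUS field in the default `oc get nodes` table output.
oc get nodes --no-headers | awk '$2 == "NotReady" { print $1 }'
```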

      This resulted in several ClusterOperators reporting a degraded status, including monitoring, which triggered the CHGM alert:

      [tnierman@tnierman-thinkpadp1gen5] >> oc get co
      E1003 15:39:51.739154  414499 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:39:52.199079  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:39:52.318105  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:39:52.438763  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.11   False       False         True       70m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.bchandra-ore.j526.p1.openshiftapps.com/healthz": EOF
      baremetal                                  4.13.11   True        False         False      7d14h   
      cloud-controller-manager                   4.13.11   True        False         False      7d14h   
      cloud-credential                           4.13.11   True        False         False      7d14h   
      cluster-autoscaler                         4.13.11   True        False         False      7d14h   
      config-operator                            4.13.11   True        False         False      7d14h   
      console                                    4.13.11   False       True          False      70m     DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.13.11   True        False         False      7d13h   
      csi-snapshot-controller                    4.13.11   True        False         False      7d14h   
      dns                                        4.13.11   True        True          False      7d14h   DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
      etcd                                       4.13.11   True        False         False      7d14h   
      image-registry                             4.13.11   False       True          True       70m     Available: The deployment does not have available replicas...
      ingress                                    4.13.11   False       True          True       73m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
      insights                                   4.13.11   True        False         False      7d14h   
      kube-apiserver                             4.13.11   True        False         False      7d14h   
      kube-controller-manager                    4.13.11   True        False         False      7d14h   
      kube-scheduler                             4.13.11   True        False         False      7d14h   
      kube-storage-version-migrator              4.13.11   True        False         False      73m     
      machine-api                                4.13.11   True        False         False      7d14h   
      machine-approver                           4.13.11   True        False         False      7d14h   
      machine-config                             4.13.11   False       False         True       61m     Cluster not available for [{operator 4.13.11}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 3, unavailable: 3)]
      marketplace                                4.13.11   True        False         False      7d14h   
      monitoring                                 4.13.11   False       True          True       63m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
      network                                    4.13.11   True        True          False      7d14h   DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 3 nodes)...
      node-tuning                                4.13.11   True        True          False      7d1h    Waiting for 3/6 Profiles to be applied
      openshift-apiserver                        4.13.11   True        False         False      7d14h   
      openshift-controller-manager               4.13.11   True        False         False      7d14h   
      openshift-samples                          4.13.11   True        False         False      7d1h    
      operator-lifecycle-manager                 4.13.11   True        False         False      7d14h   
      operator-lifecycle-manager-catalog         4.13.11   True        False         False      7d14h   
      operator-lifecycle-manager-packageserver   4.13.11   True        False         False      70m     
      service-ca                                 4.13.11   True        False         False      7d14h   
      storage                                    4.13.11   True        True          False      7d14h   AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      
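The ClusterOperators currently reporting Degraded=True can also be listed directly rather than scanned by eye. A sketch, assuming `jq` is available and the context points at the affected cluster:

```shell
# List ClusterOperators with a Degraded condition whose status is True.
oc get clusteroperators -o json \
  | jq -r '.items[]
           | select(any(.status.conditions[]?; .type == "Degraded" and .status == "True"))
           | .metadata.name'
```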

              Assignee: Eric Fried (efried.openshift)
              Reporter: Trevor Nierman (tnierman.openshift)
              Votes: 0
              Watchers: 3