Bug
Resolution: Unresolved
While resuming from hibernation, an internal production cluster fired a ClusterHasGoneMissing (CHGM) alert. Upon investigation, the only nodes in a Ready state were the master nodes:
    [tnierman@tnierman-thinkpadp1gen5] >> oc get no
    E1003 15:38:47.246605  414447 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E1003 15:38:54.604838  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E1003 15:38:54.722493  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E1003 15:38:54.840969  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    NAME                                         STATUS     ROLES                  AGE     VERSION
    ip-10-0-129-232.us-west-2.compute.internal   Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
    ip-10-0-129-7.us-west-2.compute.internal     NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
    ip-10-0-144-47.us-west-2.compute.internal    NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
    ip-10-0-156-168.us-west-2.compute.internal   Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
    ip-10-0-161-53.us-west-2.compute.internal    Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
    ip-10-0-169-246.us-west-2.compute.internal   NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
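For reference, the NotReady nodes can be isolated mechanically rather than by eyeballing the table. This is a minimal sketch that filters on the STATUS column of `oc get nodes --no-headers`; the heredoc below stands in for live cluster output, and in practice the same awk filter would be fed directly from the `oc` pipe:

```shell
# Print only nodes whose STATUS column is not "Ready".
# Live usage (assumes an authenticated oc session):
#   oc get nodes --no-headers | awk '$2 != "Ready" {print $1}'
# The heredoc is a stand-in copied from the listing above.
cat <<'EOF' | awk '$2 != "Ready" {print $1}'
ip-10-0-129-232.us-west-2.compute.internal Ready control-plane,master 7d14h v1.26.7+0ef5eae
ip-10-0-129-7.us-west-2.compute.internal NotReady infra,worker 7d14h v1.26.7+0ef5eae
ip-10-0-144-47.us-west-2.compute.internal NotReady infra,worker 7d14h v1.26.7+0ef5eae
ip-10-0-156-168.us-west-2.compute.internal Ready control-plane,master 7d14h v1.26.7+0ef5eae
ip-10-0-161-53.us-west-2.compute.internal Ready control-plane,master 7d14h v1.26.7+0ef5eae
ip-10-0-169-246.us-west-2.compute.internal NotReady infra,worker 7d14h v1.26.7+0ef5eae
EOF
# Prints the three NotReady infra/worker node names.
```

From there, `oc describe node <name>` on each printed node shows the kubelet-reported conditions and last heartbeat, which is the usual next step when nodes fail to rejoin after a hibernation resume.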
The NotReady nodes caused several ClusterOperators to report a degraded status, including monitoring, which triggered the CHGM alert:
    [tnierman@tnierman-thinkpadp1gen5] >> oc get co
    E1003 15:39:51.739154  414499 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E1003 15:39:52.199079  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E1003 15:39:52.318105  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E1003 15:39:52.438763  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.13.11   False       False         True       70m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.bchandra-ore.j526.p1.openshiftapps.com/healthz": EOF
    baremetal                                  4.13.11   True        False         False      7d14h
    cloud-controller-manager                   4.13.11   True        False         False      7d14h
    cloud-credential                           4.13.11   True        False         False      7d14h
    cluster-autoscaler                         4.13.11   True        False         False      7d14h
    config-operator                            4.13.11   True        False         False      7d14h
    console                                    4.13.11   False       True          False      70m     DeploymentAvailable: 0 replicas available for console deployment...
    control-plane-machine-set                  4.13.11   True        False         False      7d13h
    csi-snapshot-controller                    4.13.11   True        False         False      7d14h
    dns                                        4.13.11   True        True          False      7d14h   DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
    etcd                                       4.13.11   True        False         False      7d14h
    image-registry                             4.13.11   False       True          True       70m     Available: The deployment does not have available replicas...
    ingress                                    4.13.11   False       True          True       73m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
    insights                                   4.13.11   True        False         False      7d14h
    kube-apiserver                             4.13.11   True        False         False      7d14h
    kube-controller-manager                    4.13.11   True        False         False      7d14h
    kube-scheduler                             4.13.11   True        False         False      7d14h
    kube-storage-version-migrator              4.13.11   True        False         False      73m
    machine-api                                4.13.11   True        False         False      7d14h
    machine-approver                           4.13.11   True        False         False      7d14h
    machine-config                             4.13.11   False       False         True       61m     Cluster not available for [{operator 4.13.11}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 3, unavailable: 3)]
    marketplace                                4.13.11   True        False         False      7d14h
    monitoring                                 4.13.11   False       True          True       63m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
    network                                    4.13.11   True        True          False      7d14h   DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 3 nodes)...
    node-tuning                                4.13.11   True        True          False      7d1h    Waiting for 3/6 Profiles to be applied
    openshift-apiserver                        4.13.11   True        False         False      7d14h
    openshift-controller-manager               4.13.11   True        False         False      7d14h
    openshift-samples                          4.13.11   True        False         False      7d1h
    operator-lifecycle-manager                 4.13.11   True        False         False      7d14h
    operator-lifecycle-manager-catalog         4.13.11   True        False         False      7d14h
    operator-lifecycle-manager-packageserver   4.13.11   True        False         False      70m
    service-ca                                 4.13.11   True        False         False      7d14h
    storage                                    4.13.11   True        True          False      7d14h   AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
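The degraded operators can also be extracted programmatically, which is handy when the table is this wide. A sketch using a jq filter over `oc get co -o json` (assumes jq is available; the heredoc is a trimmed stand-in for real API output, showing only the fields the filter touches):

```shell
# Print names of ClusterOperators whose Degraded condition is True.
# Live usage (assumes an authenticated oc session):
#   oc get co -o json | jq -r '<same filter>'
# The heredoc below is a trimmed, hypothetical stand-in for cluster output.
cat <<'EOF' | jq -r '.items[]
    | select(.status.conditions[] | select(.type=="Degraded" and .status=="True"))
    | .metadata.name'
{"items": [
  {"metadata": {"name": "authentication"},
   "status": {"conditions": [{"type": "Degraded", "status": "True"}]}},
  {"metadata": {"name": "etcd"},
   "status": {"conditions": [{"type": "Degraded", "status": "False"}]}}
]}
EOF
# Prints: authentication
```

Against the real cluster at the time of the incident, this filter would have returned authentication, image-registry, ingress, machine-config, and monitoring, matching the DEGRADED column above.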