OpenShift Hive / HIVE-2322

Cluster failed to resume from hibernation


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined

      While resuming from hibernation, an internal production cluster fired a ClusterHasGoneMissing (CHGM) alert. Upon investigation, the only nodes in a Ready state were the master nodes:

      [tnierman@tnierman-thinkpadp1gen5] >> oc get no
      E1003 15:38:47.246605  414447 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:38:54.604838  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:38:54.722493  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:38:54.840969  414447 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      NAME                                         STATUS     ROLES                  AGE     VERSION
      ip-10-0-129-232.us-west-2.compute.internal   Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
      ip-10-0-129-7.us-west-2.compute.internal     NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
      ip-10-0-144-47.us-west-2.compute.internal    NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
      ip-10-0-156-168.us-west-2.compute.internal   Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
      ip-10-0-161-53.us-west-2.compute.internal    Ready      control-plane,master   7d14h   v1.26.7+0ef5eae
      ip-10-0-169-246.us-west-2.compute.internal   NotReady   infra,worker           7d14h   v1.26.7+0ef5eae
      
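For triage, the NotReady workers can be isolated from the `oc get nodes` listing with a quick filter. This is just a sketch; it assumes the current kubeconfig context points at the affected cluster:

```shell
# Print the names of nodes whose STATUS column reads NotReady.
# $2 is the STATUS field in the default `oc get nodes` table output.
oc get nodes --no-headers | awk '$2 == "NotReady" { print $1 }'
```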

      This resulted in several ClusterOperators reporting a degraded status, including monitoring, which triggered the CHGM alert:

      [tnierman@tnierman-thinkpadp1gen5] >> oc get co
      E1003 15:39:51.739154  414499 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:39:52.199079  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:39:52.318105  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      E1003 15:39:52.438763  414499 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.11   False       False         True       70m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.bchandra-ore.j526.p1.openshiftapps.com/healthz": EOF
      baremetal                                  4.13.11   True        False         False      7d14h   
      cloud-controller-manager                   4.13.11   True        False         False      7d14h   
      cloud-credential                           4.13.11   True        False         False      7d14h   
      cluster-autoscaler                         4.13.11   True        False         False      7d14h   
      config-operator                            4.13.11   True        False         False      7d14h   
      console                                    4.13.11   False       True          False      70m     DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.13.11   True        False         False      7d13h   
      csi-snapshot-controller                    4.13.11   True        False         False      7d14h   
      dns                                        4.13.11   True        True          False      7d14h   DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
      etcd                                       4.13.11   True        False         False      7d14h   
      image-registry                             4.13.11   False       True          True       70m     Available: The deployment does not have available replicas...
      ingress                                    4.13.11   False       True          True       73m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
      insights                                   4.13.11   True        False         False      7d14h   
      kube-apiserver                             4.13.11   True        False         False      7d14h   
      kube-controller-manager                    4.13.11   True        False         False      7d14h   
      kube-scheduler                             4.13.11   True        False         False      7d14h   
      kube-storage-version-migrator              4.13.11   True        False         False      73m     
      machine-api                                4.13.11   True        False         False      7d14h   
      machine-approver                           4.13.11   True        False         False      7d14h   
      machine-config                             4.13.11   False       False         True       61m     Cluster not available for [{operator 4.13.11}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 3, unavailable: 3)]
      marketplace                                4.13.11   True        False         False      7d14h   
      monitoring                                 4.13.11   False       True          True       63m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
      network                                    4.13.11   True        True          False      7d14h   DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 3 nodes)...
      node-tuning                                4.13.11   True        True          False      7d1h    Waiting for 3/6 Profiles to be applied
      openshift-apiserver                        4.13.11   True        False         False      7d14h   
      openshift-controller-manager               4.13.11   True        False         False      7d14h   
      openshift-samples                          4.13.11   True        False         False      7d1h    
      operator-lifecycle-manager                 4.13.11   True        False         False      7d14h   
      operator-lifecycle-manager-catalog         4.13.11   True        False         False      7d14h   
      operator-lifecycle-manager-packageserver   4.13.11   True        False         False      70m     
      service-ca                                 4.13.11   True        False         False      7d14h   
      storage                                    4.13.11   True        True          False      7d14h   AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      
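The ClusterOperators currently reporting Degraded=True can also be listed directly rather than scanned by eye. A sketch, assuming `jq` is available and the context points at the affected cluster:

```shell
# List ClusterOperators with a Degraded condition whose status is True.
oc get clusteroperators -o json \
  | jq -r '.items[]
           | select(any(.status.conditions[]?; .type == "Degraded" and .status == "True"))
           | .metadata.name'
```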

              Assignee: Eric Fried (efried.openshift)
              Reporter: Trevor Nierman (tnierman.openshift)
              Votes: 0
              Watchers: 3