OCPBUGS-15433

Cluster nodes are all unhealthy after stopped for > 48 hours leading to unusable state


    • Critical
    • No
    • OCPNODE Sprint 238 (Blue), OCPNODE Sprint 239 (Blue)
    • 2
    • Rejected
    • False

      Description of problem:

      When running chaos tests that stop all the nodes for > 24 hours and then start them again, to simulate customers who have limited connectivity (for example on ships) or who want to turn nodes off for a few days to save money, we observed that all the nodes end up in NotReady state with the kubelet not posting status. We do have a jump host with a public IP through which the nodes can be reached over SSH, but the nodes show 100% packet loss, so we had to rely on events and pod logs and could not capture kubelet/dmesg logs to dig deeper into the issue.
      
      [root@ip-172-31-53-156 ~]# oc get nodes
      NAME                                         STATUS     ROLES                  AGE    VERSION
      ip-10-0-109-81.us-west-2.compute.internal    NotReady   workload               5d1h   v1.26.5+7d22122
      ip-10-0-142-72.us-west-2.compute.internal    NotReady   control-plane,master   5d1h   v1.26.5+7d22122
      ip-10-0-147-174.us-west-2.compute.internal   NotReady   worker                 5d1h   v1.26.5+7d22122
      ip-10-0-177-222.us-west-2.compute.internal   NotReady   control-plane,master   5d1h   v1.26.5+7d22122
      ip-10-0-190-14.us-west-2.compute.internal    NotReady   worker                 5d1h   v1.26.5+7d22122
      ip-10-0-198-10.us-west-2.compute.internal    NotReady   worker                 5d1h   v1.26.5+7d22122
      ip-10-0-207-195.us-west-2.compute.internal   NotReady   control-plane,master   5d1h   v1.26.5+7d22122
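
      Since the nodes cannot be reached over SSH, the Ready condition and the last kubelet heartbeat can still be read from the API. A minimal sketch (the jsonpath expression is illustrative, not taken from this report):

      # List each node with its Ready status and the last time the kubelet posted a heartbeat
      oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"\n"}{end}'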
      
      Looking at the events, the nodes seem to have the right specifications set after the restart, and oc describe node/<node-name> shows the CPU and zone set, yet the following warning is still reported:

      openshift-dns                          28s         Warning   TopologyAwareHintsDisabled            service/dns-default                                                      Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4
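
      To confirm whether allocatable CPU and the zone label are actually present on each node, something like the following can be used (illustrative command, not captured in this report):

      # Print node name, allocatable CPU, and the topology.kubernetes.io/zone label
      oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'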
      
      It might be related to certificate rotation issues. 
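
      If kubelet client certificates expired while the nodes were stopped, pending CSRs would need to be approved before the kubelets can post status again. A hedged check using standard oc commands (not captured in this report):

      # List certificate signing requests; Pending entries indicate kubelets waiting on approval
      oc get csr
      # Blanket approval, shown only for illustration
      oc get csr -o name | xargs oc adm certificate approve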
      
      Logs: Must-gather is not working because all the nodes are NotReady, so the must-gather pod cannot be scheduled. Captured the events, oc adm inspect output, and node info: https://drive.google.com/drive/folders/1DElHnU-VsjhtUi75_w0nTqqGt9jTE3Yy?usp=sharing

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install a 4.13 cluster on AWS
      2. Run the Kraken power outage scenario to stop the cluster for > 48 hours (see the sketch after these steps): https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md
      3. Check the health of the nodes and the cluster
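
      A sketch of the scenario invocation, assuming the containerized krkn-hub workflow from the linked documentation; the image tag, kubeconfig mount path, and environment variable are assumptions that should be checked against that doc:

      # Illustrative only: run the power-outages scenario against the cluster
      podman run --net=host --env-host=true \
        -v ~/.kube/config:/root/.kube/config:Z \
        -e SHUTDOWN_DURATION=172800 \
        quay.io/redhat-chaos/krkn-hub:power-outages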
      

      Actual results:

      Cluster nodes are unhealthy

      Expected results:

      Cluster is healthy

      Additional info:

      Logs: https://drive.google.com/drive/folders/1DElHnU-VsjhtUi75_w0nTqqGt9jTE3Yy?usp=sharing

              joelspeed Joel Speed
              nelluri Naga Ravi Chaitanya Elluri
              Sunil Choudhary Sunil Choudhary