Type: Bug
Resolution: Not a Bug
Priority: Major
Severity: Critical
Affects Version: 4.13
Sprint: OCPNODE Sprint 238 (Blue), OCPNODE Sprint 239 (Blue)
Status: Rejected
Description of problem:
When running chaos tests that stop all the nodes for > 24 hours and then start them again, to simulate customer use cases such as limited connectivity (for example on ships) or shutting nodes down for a few days to save money, we observed that all the nodes end up in NotReady state with the kubelet not posting status. We do have a jump host with a public IP through which the nodes can be reached over SSH, but the nodes are seeing 100% packet loss, so we had to rely on events and pod logs and could not get kubelet/dmesg logs to dig deeper into the issue.

[root@ip-172-31-53-156 ~]# oc get nodes
NAME                                         STATUS     ROLES                  AGE    VERSION
ip-10-0-109-81.us-west-2.compute.internal    NotReady   workload               5d1h   v1.26.5+7d22122
ip-10-0-142-72.us-west-2.compute.internal    NotReady   control-plane,master   5d1h   v1.26.5+7d22122
ip-10-0-147-174.us-west-2.compute.internal   NotReady   worker                 5d1h   v1.26.5+7d22122
ip-10-0-177-222.us-west-2.compute.internal   NotReady   control-plane,master   5d1h   v1.26.5+7d22122
ip-10-0-190-14.us-west-2.compute.internal    NotReady   worker                 5d1h   v1.26.5+7d22122
ip-10-0-198-10.us-west-2.compute.internal    NotReady   worker                 5d1h   v1.26.5+7d22122
ip-10-0-207-195.us-west-2.compute.internal   NotReady   control-plane,master   5d1h   v1.26.5+7d22122

Looking at the events, the nodes appear to have the right specifications set after the restart, and oc describe node/<node-name> shows the allocatable CPU and zone as set, yet the following warning keeps being emitted:

openshift-dns   28s   Warning   TopologyAwareHintsDisabled   service/dns-default   Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4

This might be related to certificate rotation issues.

Logs: must-gather is not working because none of the nodes are Ready for its pod to be scheduled. Captured the events, oc adm inspect output, and node info here: https://drive.google.com/drive/folders/1DElHnU-VsjhtUi75_w0nTqqGt9jTE3Yy?usp=sharing
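As a quick check of the certificate-rotation theory, something like the following can be run from the jump host once the API server responds. This is only a sketch: it assumes the kubeconfig in use still carries a valid admin certificate (not guaranteed after a multi-day outage), and <node-name> is a placeholder for any of the NotReady nodes.

# list CSRs left pending after the outage (kubelet client/serving certificate renewals)
oc get csr
# approve everything that is pending so the kubelets can renew their certificates and rejoin
oc get csr -o name | xargs oc adm certificate approve
# re-check the fields the TopologyAwareHintsDisabled warning complains about
oc get nodes -L topology.kubernetes.io/zone
oc get node <node-name> -o jsonpath='{.status.allocatable.cpu}'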
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.13 cluster on AWS
2. Run the Kraken power outage scenario to stop the cluster for > 48 hours: https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md
3. Check the health of the nodes and the cluster (see the sketch after this list)
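For step 3, a minimal health check could look like the following (a sketch; the exact set of checks depends on what the scenario is expected to recover):

oc get nodes             # every node should report Ready
oc get clusteroperators  # no operator should be Degraded or unavailable
oc get csr               # pending CSRs here point at the certificate rotation theory above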
Actual results:
All cluster nodes remain NotReady after the power outage, with the kubelet not posting node status.
Expected results:
The cluster recovers once the nodes are started again, with all nodes returning to Ready.
Additional info:
Logs: https://drive.google.com/drive/folders/1DElHnU-VsjhtUi75_w0nTqqGt9jTE3Yy?usp=sharing