OCPBUGS-15433

Cluster nodes are all unhealthy after stopped for > 48 hours leading to unusable state


    • Critical
    • No
    • OCPNODE Sprint 238 (Blue), OCPNODE Sprint 239 (Blue)
    • 2
    • Rejected
    • False

      Description of problem:

      When running chaos tests that stop all the nodes for > 24 hours and then start them again, to simulate customers who have limited connectivity (for example on ships) or who want to turn nodes off for a few days to save money, we observed that all the nodes end up in NotReady state with the kubelet not posting status. We do have a jump host with a public IP through which the nodes can be reached over SSH, but the nodes show 100% packet loss, so we had to rely on events and pod logs and could not capture kubelet/dmesg logs to dig deeper into the issue.
      
      [root@ip-172-31-53-156 ~]# oc get nodes
      NAME                                         STATUS     ROLES                  AGE    VERSION
      ip-10-0-109-81.us-west-2.compute.internal    NotReady   workload               5d1h   v1.26.5+7d22122
      ip-10-0-142-72.us-west-2.compute.internal    NotReady   control-plane,master   5d1h   v1.26.5+7d22122
      ip-10-0-147-174.us-west-2.compute.internal   NotReady   worker                 5d1h   v1.26.5+7d22122
      ip-10-0-177-222.us-west-2.compute.internal   NotReady   control-plane,master   5d1h   v1.26.5+7d22122
      ip-10-0-190-14.us-west-2.compute.internal    NotReady   worker                 5d1h   v1.26.5+7d22122
      ip-10-0-198-10.us-west-2.compute.internal    NotReady   worker                 5d1h   v1.26.5+7d22122
      ip-10-0-207-195.us-west-2.compute.internal   NotReady   control-plane,master   5d1h   v1.26.5+7d22122
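
      Since the nodes cannot be reached over SSH, the Ready condition and the last kubelet heartbeat can still be read from the API. A minimal sketch (the jsonpath expression is illustrative, not taken from this report):

      # List each node with its Ready status and the last time the kubelet posted a heartbeat
      oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"\n"}{end}'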
      
      Looking at the events, the nodes seem to have the right specifications set after the restart, and oc describe node/<node-name> shows the CPU and zone set, yet the following warning is still reported:

      openshift-dns                          28s         Warning   TopologyAwareHintsDisabled            service/dns-default                                                      Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4
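
      To confirm whether allocatable CPU and the zone label are actually present on each node, something like the following can be used (illustrative command, not captured in this report):

      # Print node name, allocatable CPU, and the topology.kubernetes.io/zone label
      oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'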
      
      It might be related to certificate rotation issues. 
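
      If kubelet client certificates expired while the nodes were stopped, pending CSRs would need to be approved before the kubelets can post status again. A hedged check using standard oc commands (not captured in this report):

      # List certificate signing requests; Pending entries indicate kubelets waiting on approval
      oc get csr
      # Blanket approval, shown only for illustration
      oc get csr -o name | xargs oc adm certificate approve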
      
      Logs: Must-gather is not working because all the nodes are NotReady, so the must-gather pod cannot be scheduled. Captured the events, oc adm inspect output, and node info: https://drive.google.com/drive/folders/1DElHnU-VsjhtUi75_w0nTqqGt9jTE3Yy?usp=sharing

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install a 4.13 cluster on AWS
      2. Run the Kraken power outage scenario to stop the cluster for > 48 hours (see the sketch after these steps): https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md
      3. Check the health of the nodes and the cluster
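
      A sketch of the scenario invocation, assuming the containerized krkn-hub workflow from the linked documentation; the image tag, kubeconfig mount path, and environment variable are assumptions that should be checked against that doc:

      # Illustrative only: run the power-outages scenario against the cluster
      podman run --net=host --env-host=true \
        -v ~/.kube/config:/root/.kube/config:Z \
        -e SHUTDOWN_DURATION=172800 \
        quay.io/redhat-chaos/krkn-hub:power-outages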
      

      Actual results:

      Cluster nodes are unhealthy

      Expected results:

      Cluster is healthy

      Additional info:

      Logs: https://drive.google.com/drive/folders/1DElHnU-VsjhtUi75_w0nTqqGt9jTE3Yy?usp=sharing

              joelspeed Joel Speed
              nelluri Naga Ravi Chaitanya Elluri
              Sunil Choudhary Sunil Choudhary