OpenShift Bugs / OCPBUGS-14827

All worker nodes unable to successfully create pods


      Description of problem:

      Containers fail to start and end up with this status:
      
          Last State:     Terminated
            Reason:       Error
            Exit Code:    139
            Started:      Fri, 09 Jun 2023 10:01:37 -0400
            Finished:     Fri, 09 Jun 2023 10:01:37 -0400
      
      and exit without showing any logs.
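
      For reference, exit code 139 normally corresponds to 128 + 11, i.e. the container process received SIGSEGV before it could write any output, which matches the empty logs. The terminated state and previous-container logs can be pulled with something like the following (pod and namespace names are placeholders):
      
      oc -n <namespace> get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
      oc -n <namespace> describe pod <pod-name>
      oc -n <namespace> logs <pod-name> --previous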
      
      For example, restarting a previously "Running" pod puts it into this state:
      
      ❯ k get po -n hypershift   
      NAME                            READY   STATUS             RESTARTS         AGE
      external-dns-7cc4b775d9-t558s   1/1     Running            0                2d20h
      operator-58f644b4bb-2rbt4       0/1     CrashLoopBackOff   17 (4m53s ago)   66m
      operator-58f644b4bb-w4mkf       1/1     Running            0                2d16h
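
      One way to trigger the restart described above is simply deleting the running pod and letting its Deployment recreate it (pod name taken from the listing above; the exact restart method may differ):
      
      oc -n hypershift delete pod operator-58f644b4bb-2rbt4
      oc -n hypershift get pods -o wide -w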

      Version-Release number of selected component (if applicable):

      4.12.12

      How reproducible:

      Unsure

      Steps to Reproduce:

      Unsure
      

      Actual results:

      What we ended up doing on this cluster was replacing the worker machines, which created new nodes. After forcing pods to reschedule onto the new nodes, all pods were able to start successfully.
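
      For context, a replacement like this typically goes through the Machine API, roughly as follows; node and machine names are placeholders, and the exact commands used on this cluster were not captured:
      
      oc -n openshift-machine-api get machines
      oc adm cordon <affected-node>
      oc adm drain <affected-node> --ignore-daemonsets --delete-emptydir-data
      # deleting the machine lets its MachineSet create a replacement node
      oc -n openshift-machine-api delete machine <affected-machine>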
      
      This behavior was confusingly consistent for some pods, but not all. For example, I was able to run
      
      kubectl run ubuntu --image ubuntu --rm -it
      
      and that worked just fine. Deleting and recreating an ovnkube-node pod scheduled on the same worker node also succeeded, as did "oc debug"ing onto an affected worker node and running
      
      podman run --rm -it --entrypoint=bash ${CONTAINER_IMAGE}
      
      for the images of pods that would not start up.
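
      For live investigation on an affected (cordoned) node, something like the following can be used to look at the CRI-O side of the failure; node name and container ID are placeholders:
      
      oc debug node/<affected-node>
      # inside the debug shell:
      chroot /host
      crictl ps -a
      crictl inspect <container-id>
      crictl logs <container-id>
      journalctl -u crio --since "1 hour ago"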

      Expected results:

      On one hand, that pods are able to start successfully (we do not understand what caused this bug).
      
      On the other hand, that the new and old nodes have the same configuration: it is strange that the new replacement machines are viable while the existing ones were not.

      Additional info:

      Must gather link: https://drive.google.com/file/d/10m7TpJEdmBbLec35PD9vpHqoIV8bYzW-/view?usp=sharing
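
      For anyone pulling fresh data from the affected clusters, a must-gather can be collected with, for example:
      
      oc adm must-gather --dest-dir=./must-gather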

       

      We have live clusters where this bug is still occurring and have cordoned nodes for live investigation if needed. Please feel free to reach out if you would like to; we can screenshare!
