OpenShift Virtualization / CNV-48220

Disrupting a worker node leads to 15 minutes of node and VMI downtime after the node is back


    • Bug
    • Resolution: Done
    • Major
    • CNV v4.16.0
    • CNV Infrastructure

      Description of problem:

      When a worker node is disrupted (stopped and started), it takes about 15 minutes for the node to return to Ready after the disrupted node is back online on the infrastructure side. During those 15 minutes, all the VMIs/user workloads running on the node are not Ready, i.e. downtime for the user.
      
      Node downtime log:
        Normal   NodeNotReady             20m (x2 over 70m)       node-controller      Node cc37-h35-000-r750 status is now: NodeNotReady
        Normal   NodeUnresponsive         17m                     node-controller      virt-handler is not responsive, marking node as unresponsive
        Normal   Starting                 83s                     kubelet              Starting kubelet.
        Normal   Starting                 73s                     kubelet              Starting kubelet.
        Normal   NodeAllocatableEnforced  72s                     kubelet              Updated Node Allocatable limit across pods
        Normal   NodeHasSufficientMemory  72s (x2 over 72s)       kubelet              Node cc37-h35-000-r750 status is now: NodeHasSufficientMemory
        Normal   NodeHasNoDiskPressure    72s (x2 over 72s)       kubelet              Node cc37-h35-000-r750 status is now: NodeHasNoDiskPressure
        Normal   NodeHasSufficientPID     72s (x2 over 72s)       kubelet              Node cc37-h35-000-r750 status is now: NodeHasSufficientPID
        Warning  Rebooted                 72s                     kubelet              Node cc37-h35-000-r750 has been rebooted, boot id: d9486e1d-2e00-4337-ba07-2ec85917952e
        Normal   NodeNotReady             72s                     kubelet              Node cc37-h35-000-r750 status is now: NodeNotReady
        Normal   NodeReady                33s                     kubelet              Node cc37-h35-000-r750 status is now: NodeReady
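
      For reference, the node transition and the events above can be watched with standard oc commands; a minimal sketch using the node name from this report:

        # Watch the node go NotReady -> Ready after the disruption
        oc get node cc37-h35-000-r750 -w

        # Node events (the source of the log above) are listed at the end of the describe output
        oc describe node cc37-h35-000-r750

        # Alternatively, sort all events for this node by time to measure the recovery window
        oc get events -A --field-selector involvedObject.name=cc37-h35-000-r750 --sort-by=.lastTimestamp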
      
      VMI status:
      
      NAME                      AGE   PHASE     IP             NODENAME            READY
      windows-vm-8a4120d4-0     24m   Running   10.128.2.19    cc37-h35-000-r750   False
      windows-vm-8a4120d4-1     24m   Running   10.128.2.20    cc37-h35-000-r750   False
      windows-vm-8a4120d4-10    24m   Running   10.128.2.31    cc37-h35-000-r750   False
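
      The VMI readiness shown above comes from the VMI list; a minimal sketch (grepping on the node name is just one way to narrow the output):

        # List VMIs in all namespaces with phase, node name and READY condition
        oc get vmi -A

        # Only the VMIs on the disrupted node
        oc get vmi -A -o wide | grep cc37-h35-000-r750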
      
      
      After the node gets back to the Ready state in OpenShift, all the VMIs get migrated/moved to other worker nodes. This leaves the targeted node underutilized while the other worker nodes can become overloaded, depending on the user load. It might be better to migrate them during the outage instead, to avoid extended downtime.
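
      Whether the VMIs were live-migrated or simply rescheduled after recovery can be cross-checked from KubeVirt's migration objects; a minimal sketch, assuming the vmim short name for VirtualMachineInstanceMigration is available on the cluster:

        # Migration objects and their phases; compare their creation timestamps with the NodeReady event
        oc get vmim -A

        # Where each VMI ended up afterwards
        oc get vmi -A -o wide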
      
      

      Version-Release number of selected component (if applicable):

      4.16.6

      How reproducible:

        Always

      Steps to Reproduce:

      1. Install CNV on a 4.16 bare-metal cluster with user workloads/VMIs running
      2. Use https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-scenarios.md to disrupt (stop and start) one of the worker nodes (a hedged command sketch follows this list)
      3. Observe the status of the node from both the Kubernetes/OpenShift side and the infrastructure side
      4. Observe the status of the VMIs running on the targeted node
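
      A hedged sketch of step 2; the image tag, environment variables, and kubeconfig mount path below are assumptions based on the linked node-scenarios document and should be verified against it (steps 3 and 4 use the oc commands shown in the description above):

        # Stop and start one worker node via krkn-hub (values are assumptions; see the linked docs)
        podman run --net=host \
          -v "$HOME/.kube/config:/home/krkn/.kube/config:Z" \
          -e NODE_NAME=cc37-h35-000-r750 \
          -e ACTION=node_stop_start_scenario \
          quay.io/redhat-chaos/krkn-hub:node-scenarios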

      Actual results:

      1. It takes about 15 minutes for the node to reach the Ready phase on the Kubernetes/OpenShift side, and all the VMIs are not Ready during that period.
      2. The VMIs get migrated only after the node is Ready again, leaving the targeted node underutilized while the other worker nodes get overloaded.

      Expected results:

      1. The node should recover quickly to avoid VMI downtime for the user.
      2. Avoid migrating the VMIs after the node is back to the Ready phase; instead, migrate them during the outage, where possible, to avoid extended VMI downtime.

      Additional info:
