-
Bug
-
Resolution: Done
-
Major
-
None
-
CNV v4.16.0
-
0.42
-
False
-
-
False
-
---
-
---
-
-
Moderate
-
None
Description of problem:
When a worker node is disrupted (stopped and then started), it takes 15 minutes to return to Ready after it is back online on the infrastructure side. During those 15 minutes, none of the VMIs (user workloads) running on the node are Ready, i.e. the user sees downtime.

Node downtime log:
Normal   NodeNotReady             20m (x2 over 70m)  node-controller  Node cc37-h35-000-r750 status is now: NodeNotReady
Normal   NodeUnresponsive         17m                node-controller  virt-handler is not responsive, marking node as unresponsive
Normal   Starting                 83s                kubelet          Starting kubelet.
Normal   Starting                 73s                kubelet          Starting kubelet.
Normal   NodeAllocatableEnforced  72s                kubelet          Updated Node Allocatable limit across pods
Normal   NodeHasSufficientMemory  72s (x2 over 72s)  kubelet          Node cc37-h35-000-r750 status is now: NodeHasSufficientMemory
Normal   NodeHasNoDiskPressure    72s (x2 over 72s)  kubelet          Node cc37-h35-000-r750 status is now: NodeHasNoDiskPressure
Normal   NodeHasSufficientPID     72s (x2 over 72s)  kubelet          Node cc37-h35-000-r750 status is now: NodeHasSufficientPID
Warning  Rebooted                 72s                kubelet          Node cc37-h35-000-r750 has been rebooted, boot id: d9486e1d-2e00-4337-ba07-2ec85917952e
Normal   NodeNotReady             72s                kubelet          Node cc37-h35-000-r750 status is now: NodeNotReady
Normal   NodeReady                33s                kubelet          Node cc37-h35-000-r750 status is now: NodeReady

VMIs status:
NAME                     AGE   PHASE     IP            NODENAME            READY
windows-vm-8a4120d4-0    24m   Running   10.128.2.19   cc37-h35-000-r750   False
windows-vm-8a4120d4-1    24m   Running   10.128.2.20   cc37-h35-000-r750   False
windows-vm-8a4120d4-10   24m   Running   10.128.2.31   cc37-h35-000-r750   False

After the node returns to the Ready state in OpenShift, all the VMIs are migrated/moved to other worker nodes. This leaves the target node underutilized while the other worker nodes may be overloaded, depending on the user load. It might be better to migrate the VMIs during the outage instead, to avoid extended downtime.
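The event ages in the node downtime log can be turned into a rough duration. The sketch below is only an illustration: the helper names `age_to_seconds` and `gap_seconds` are hypothetical, and it assumes ages use the compact `d`/`h`/`m`/`s` form seen in the log. Because ages such as `20m` are rounded, the result is approximate, and it covers the whole window between the last NodeNotReady event from node-controller and the NodeReady event, which may include time the node was actually stopped.

```python
import re

# Seconds per unit for compact age strings such as "20m", "83s" or "1h5m".
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert a compact age string (e.g. "20m", "1h5m") to seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject anything that is not fully made of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognized age string: {age!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def gap_seconds(earlier_age: str, later_age: str) -> int:
    """Seconds elapsed between an older event and a more recent one."""
    return age_to_seconds(earlier_age) - age_to_seconds(later_age)


if __name__ == "__main__":
    # NodeNotReady was last seen 20m ago, NodeReady 33s ago (from the log above).
    gap = gap_seconds("20m", "33s")
    print(f"approximate NotReady window: {gap // 60}m{gap % 60}s")
```

A precise measurement would need the event timestamps rather than the relative ages shown here.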
Version-Release number of selected component (if applicable):
4.16.6
How reproducible:
Always
Steps to Reproduce:
- Install CNV on a 4.16 baremetal cluster with user workload/VMIs running
- Use https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-scenarios.md to disrupt/stop and start one of the worker nodes
- Observe the status of the node from both Kubernetes/OpenShift side as well as the infrastructure side
- Observe the status of the VMIs running on the targeted node
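The last step can be partly automated. As a minimal sketch, assuming the VMI listing has been captured as text with the same columns as in the description (NAME, AGE, PHASE, IP, NODENAME, READY), the following parses it and lists the VMIs on the targeted node that are not Ready. The names `Vmi`, `parse_vmi_table` and `not_ready_on` are hypothetical, and the sample rows are copied from this report.

```python
from typing import List, NamedTuple


class Vmi(NamedTuple):
    name: str
    phase: str
    node: str
    ready: bool


def parse_vmi_table(text: str) -> List[Vmi]:
    """Parse a whitespace-separated VMI listing with a header row.

    Assumes every column is populated; a row with an empty IP column
    would shift the remaining fields and needs different handling.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    vmis = []
    for line in lines[1:]:
        fields = dict(zip(header, line.split()))
        vmis.append(
            Vmi(
                name=fields["NAME"],
                phase=fields["PHASE"],
                node=fields["NODENAME"],
                ready=fields["READY"] == "True",
            )
        )
    return vmis


def not_ready_on(vmis: List[Vmi], node: str) -> List[str]:
    """Names of VMIs scheduled on `node` whose READY column is not True."""
    return [v.name for v in vmis if v.node == node and not v.ready]


SAMPLE = """\
NAME                     AGE   PHASE     IP            NODENAME            READY
windows-vm-8a4120d4-0    24m   Running   10.128.2.19   cc37-h35-000-r750   False
windows-vm-8a4120d4-1    24m   Running   10.128.2.20   cc37-h35-000-r750   False
windows-vm-8a4120d4-10   24m   Running   10.128.2.31   cc37-h35-000-r750   False
"""

if __name__ == "__main__":
    for name in not_ready_on(parse_vmi_table(SAMPLE), "cc37-h35-000-r750"):
        print(name)
```

Running this repeatedly against fresh listings during the disruption would show when the VMIs on the node stop and resume being Ready.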
Actual results:
1. The node takes 15 minutes to reach the Ready phase on the Kubernetes/OpenShift side, and none of the VMIs are Ready during that period.
2. The VMIs are migrated after the node becomes Ready, leaving the targeted node underutilized while other worker nodes are overloaded.
Expected results:
Additional info:
1. The node should recover quickly, to avoid VMI downtime for the user.
2. Avoid migrating the VMIs after the node is back in the Ready phase. Instead, migrate them during the outage where possible, to avoid extended VMI downtime.
- relates to
-
CNV-44356 RnD what we need to validate before running VMs
- New