Bug
Resolution: Unresolved
Major
0.42
High
Description of problem:
During a node outage on a CNV cluster, ssp-operator encounters 5 minutes of downtime because of the controller defaults:

Tolerations:
    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
    node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

ssp-operator log:
    Error from server: Get "https://198.18.10.9:10250/containerLogs/openshift-cnv/ssp-operator-fb4ff67db-cfv29/manager": dial tcp 198.18.10.9:10250: connect: no route to host

Impact of the outage:
- Dependent components might not be deployed.
- Changes in the components might not be reconciled.
- As a result, the common templates and/or the Template Validator might not be updated or reset if they fail.
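The 5-minute window follows from the 300s values on the two NoExecute tolerations quoted above. As a rough illustration, the sketch below models how a toleration's tolerationSeconds bounds the time a pod stays on a tainted node. It is a simplified model, not the Kubernetes implementation: the `Taint` and `Toleration` classes and the `eviction_delay` function are hypothetical names, only the `Exists` operator with an exact key and effect match is handled, and the time the cluster needs to detect the outage and apply the taint is not included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str
    effect: str
    # None means the taint is tolerated indefinitely.
    toleration_seconds: Optional[int] = None


def eviction_delay(taints: List[Taint], tolerations: List[Toleration]) -> Optional[int]:
    """Seconds a pod stays bound to the node once the given taints are present.

    Returns 0 if some NoExecute taint is not tolerated (immediate eviction),
    None if every NoExecute taint is tolerated indefinitely, and otherwise
    the smallest tolerationSeconds among the tolerations that were used.
    """
    used = []
    for taint in taints:
        if taint.effect != "NoExecute":
            continue  # only NoExecute taints evict running pods
        match = next(
            (t for t in tolerations if t.key == taint.key and t.effect == taint.effect),
            None,
        )
        if match is None:
            return 0
        used.append(match)
    bounded = [t.toleration_seconds for t in used if t.toleration_seconds is not None]
    return min(bounded) if bounded else None


# Tolerations as listed in this report for the ssp-operator pod.
SSP_TOLERATIONS = [
    Toleration("node.kubernetes.io/memory-pressure", "NoSchedule"),
    Toleration("node.kubernetes.io/not-ready", "NoExecute", 300),
    Toleration("node.kubernetes.io/unreachable", "NoExecute", 300),
]

# Taint applied to a node that has become unreachable.
UNREACHABLE = [Taint("node.kubernetes.io/unreachable", "NoExecute")]

delay = eviction_delay(UNREACHABLE, SSP_TOLERATIONS)
print(f"pod stays bound for {delay}s after the taint is applied")
```

Under this model, the only way to shorten the window for a single-replica pod is a smaller tolerationSeconds value; the 30s value used in the accompanying test is purely illustrative and is not a recommendation from this report.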
Version-Release number of selected component (if applicable):
[root@cc37-h25-000-r750 ssp]# cat clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.22   True        False         20h     Cluster version is 4.15.22

[root@cc37-h25-000-r750 ssp]# cat cnv-version
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.15.5   OpenShift Virtualization   4.15.5    kubevirt-hyperconverged-operator.v4.15.4   Installing
How reproducible:
Always
Steps to Reproduce:
1. Install a baremetal OCP cluster with CNV.
2. Using krkn, inject an outage on the node where ssp-operator is running - https://github.com/krkn-chaos/krkn-hub/blob/main/docs/node-scenarios.md
3. Observe the impact on ssp-operator.
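For step 2, the node to target has to be identified first. The sketch below is one possible way to do that, assuming the pod list is obtained as JSON (for example from `oc get pods -n openshift-cnv -o json`) and that the operator pod name starts with `ssp-operator-`, as in the log line in the description. The function name `ssp_operator_nodes` and the node name `worker-0` in the sample are hypothetical.

```python
import json
from typing import Dict, Optional


def ssp_operator_nodes(pod_list_json: str, name_prefix: str = "ssp-operator-") -> Dict[str, Optional[str]]:
    """Map each matching pod name to the node it is scheduled on.

    The input is a Kubernetes PodList serialized as JSON. A pod that has
    not been scheduled yet has no spec.nodeName and maps to None.
    """
    pods = json.loads(pod_list_json).get("items", [])
    return {
        pod["metadata"]["name"]: pod.get("spec", {}).get("nodeName")
        for pod in pods
        if pod["metadata"]["name"].startswith(name_prefix)
    }


# Minimal hand-written sample standing in for real cluster output.
SAMPLE = json.dumps({
    "kind": "PodList",
    "items": [
        {
            "metadata": {"name": "ssp-operator-fb4ff67db-cfv29", "namespace": "openshift-cnv"},
            "spec": {"nodeName": "worker-0"},  # hypothetical node name
        },
        {
            "metadata": {"name": "some-other-pod", "namespace": "openshift-cnv"},
            "spec": {"nodeName": "worker-1"},
        },
    ],
})

print(ssp_operator_nodes(SAMPLE))
```

The node name returned for the operator pod is the one to pass to the krkn node scenario.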
Actual results:
ssp-operator is down until the controller reschedules it on another node, which happens after 5 minutes.
Expected results:
HA for ssp-operator, so that a node outage does not cause extended downtime and the resulting impact: dependent components might not be deployed, and changes in the components might not be reconciled. As a result, the common templates and/or the Template Validator might not be updated or reset if they fail.
Additional info: