Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: CNV Infrastructure
Labels:
- chaos

Story Points:
0.42
Blocked:
False
Blocked Reason:

None
Ready:
False
Component Fix Version(s):
None
[QE] How to address?:
---
[QE] Why QE missed?:
---
Market:

Severity:
Important

Regression:
None

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Description of problem:

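During a node outage on a CNV cluster, ssp-operator is unavailable for about 5 minutes because its pod carries the default tolerations, which keep it bound to a not-ready or unreachable node for 300 seconds before it is evicted: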

Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
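
To make the link between these tolerations and the 5-minute window explicit, here is a minimal Python sketch that extracts the eviction delay from the toleration lines quoted above. The regular expression and the `eviction_delays` helper are illustrative assumptions, not part of any OpenShift tooling, and the sketch only covers the `describe` text format shown in this report. The observed downtime is likely somewhat longer than the parsed value, since the node controller first has to detect the outage and apply the taint.

```python
import re

# Toleration lines copied from the pod description in this report.
DESCRIBE_OUTPUT = """\
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
"""

# Assumed line shape: <key>:<effect> op=<operator> [for <N>s]
TOLERATION_RE = re.compile(
    r"(?P<key>[\w./-]+):(?P<effect>\w+)\s+op=(?P<op>\w+)"
    r"(?:\s+for\s+(?P<seconds>\d+)s)?"
)


def eviction_delays(describe_text):
    """Return {taint key: seconds} for NoExecute tolerations that have a time limit."""
    delays = {}
    for match in TOLERATION_RE.finditer(describe_text):
        if match.group("effect") == "NoExecute" and match.group("seconds"):
            delays[match.group("key")] = int(match.group("seconds"))
    return delays


delays = eviction_delays(DESCRIBE_OUTPUT)
for key, seconds in delays.items():
    print(f"{key}: pod is evicted {seconds}s after the taint is applied")
```

Both `NoExecute` tolerations carry a 300-second limit, which matches the roughly 5 minutes during which the pod stays on the failed node.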

Attempting to fetch the ssp-operator log during the outage fails:
Error from server: Get "https://198.18.10.9:10250/containerLogs/openshift-cnv/ssp-operator-fb4ff67db-cfv29/manager": dial tcp 198.18.10.9:10250: connect: no route to host

Impact of the outage:
Dependent components might not be deployed. Changes in the components might not be reconciled. As a result, the common templates and/or the Template Validator might not be updated or reset if they fail.

Version-Release number of selected component (if applicable):

[root@cc37-h25-000-r750 ssp]# cat clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.22   True        False         20h     Cluster version is 4.15.22

[root@cc37-h25-000-r750 ssp]# cat cnv-version 
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.15.5   OpenShift Virtualization   4.15.5    kubevirt-hyperconverged-operator.v4.15.4   Installing

How reproducible:

Always

Steps to Reproduce:

1. Install a baremetal OCP cluster with CNV
2. Inject an outage on the node where ssp-operator is running, using krkn - https://github.com/krkn-chaos/krkn-hub/blob/main/docs/node-scenarios.md
3. Observe the impact on ssp-operator
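
One way to quantify step 3 is to sample the number of available ssp-operator replicas at regular intervals during the scenario and compute the longest window with none available. The Python sketch below assumes such samples have already been collected; the `longest_outage` helper and the sample values are hypothetical and are not measurements from this cluster.

```python
def longest_outage(samples):
    """Return the longest span in seconds during which no replica was available.

    samples: list of (timestamp_seconds, available_replicas), sorted by time.
    An outage starts at the first sample with 0 available replicas and ends at
    the next sample with at least one. An outage still open at the end of the
    samples is not counted.
    """
    longest = 0
    outage_start = None
    for timestamp, available in samples:
        if available == 0:
            if outage_start is None:
                outage_start = timestamp
        elif outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


# Made-up samples for illustration only: the replica disappears at t=30s
# and a replacement becomes available at t=360s.
samples = [(0, 1), (30, 0), (60, 0), (330, 0), (360, 1)]
print(longest_outage(samples))
```

The resolution of the result is limited by the sampling interval, so a shorter interval gives a tighter estimate of the actual downtime.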

Actual results:

ssp-operator stays down for about 5 minutes, until its pod is evicted from the failed node and rescheduled on another node.

Expected results:

ssp-operator should be highly available (HA), so that a node outage does not cause extended downtime and the impact described above: dependent components not being deployed, component changes not being reconciled, and the common templates and/or the Template Validator not being updated or reset if they fail.

Additional info:

Assignee: Andrej Krejcir

Reporter: Naga Ravi Chaitanya Elluri

QA Contact: Geetika Kapoor

Votes: 0

Watchers: 5

Created: 2024/08/29 4:38 PM

Updated: 2024/12/18 8:17 AM
