OpenShift Virtualization / CNV-47258

Replicate ssp-operator for higher availability to avoid 5 minutes downtime during a node outage


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • CNV Infrastructure
    • 0.42
    • High

      Description of problem:

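      During a node outage on a CNV cluster, ssp-operator is unavailable for 5 minutes because its pod carries the default NoExecute tolerations, which keep it bound to the failed node for 300 seconds before it is evicted and rescheduled: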
      
      Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
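The 5-minute figure follows directly from the tolerations listed above. A minimal sketch of that arithmetic, assuming a simplified model of NoExecute eviction (only `Exists` tolerations are handled, and the function name `eviction_delay_seconds` is hypothetical, not part of ssp-operator or Kubernetes):

```python
from typing import Dict, List, Optional

# Tolerations copied from the pod description in this report.
SSP_OPERATOR_TOLERATIONS: List[Dict] = [
    {"key": "node.kubernetes.io/memory-pressure", "operator": "Exists",
     "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]


def eviction_delay_seconds(tolerations: List[Dict], taint_key: str) -> Optional[int]:
    """Seconds a pod stays bound to a node after a NoExecute taint appears.

    Simplified model (assumption): only operator "Exists" is considered.
    Returns 0 if nothing tolerates the taint (immediate eviction) and
    None if the taint is tolerated with no time limit.
    """
    matching = [
        t for t in tolerations
        if t.get("operator") == "Exists"
        and t.get("effect") in (None, "", "NoExecute")
        and t.get("key") in (None, "", taint_key)
    ]
    if not matching:
        return 0
    bounded = [t["tolerationSeconds"] for t in matching if "tolerationSeconds" in t]
    if not bounded:
        return None
    # The shortest bounded toleration decides when the pod is evicted.
    return max(0, min(bounded))


delay = eviction_delay_seconds(SSP_OPERATOR_TOLERATIONS, "node.kubernetes.io/unreachable")
print(f"ssp-operator stays on an unreachable node for {delay} seconds")
```

Under this model, the pod is not rescheduled for 300 seconds after the node is marked unreachable, which matches the downtime reported here.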
      
      Fetching the ssp-operator log during the outage fails because the kubelet on the affected node is unreachable:
      Error from server: Get "https://198.18.10.9:10250/containerLogs/openshift-cnv/ssp-operator-fb4ff67db-cfv29/manager": dial tcp 198.18.10.9:10250: connect: no route to host
      
      Impact of the outage:
      Dependent components might not be deployed. Changes in the components might not be reconciled. As a result, the common templates and/or the Template Validator might not be updated or reset if they fail.

      Version-Release number of selected component (if applicable):

      [root@cc37-h25-000-r750 ssp]# cat clusterversion 
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.22   True        False         20h     Cluster version is 4.15.22
      
      [root@cc37-h25-000-r750 ssp]# cat cnv-version 
      NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
      kubevirt-hyperconverged-operator.v4.15.5   OpenShift Virtualization   4.15.5    kubevirt-hyperconverged-operator.v4.15.4   Installing

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install a bare-metal OCP cluster with CNV.
      2. Use krkn to inject an outage on the node where ssp-operator is running: https://github.com/krkn-chaos/krkn-hub/blob/main/docs/node-scenarios.md
      3. Observe the impact on ssp-operator.
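One way to quantify the impact in step 3 is to poll the pod's readiness during the outage and compute the longest not-ready window afterwards. The helper below is a hypothetical sketch of that calculation only; it assumes the samples have already been collected as `(timestamp, ready)` pairs, and the sample values are illustrative, not measurements from this cluster.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_seconds(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest span between a first not-ready sample and the next ready one.

    `samples` must be sorted by timestamp. An outage that is still open
    at the end of the samples is ignored, because its length is unknown.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ready in samples:
        if not ready and outage_start is None:
            # First not-ready observation opens an outage window.
            outage_start = timestamp
        elif ready and outage_start is not None:
            # First ready observation after an outage closes the window.
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


# Illustrative samples (seconds since the start of the test, ready flag).
example = [(0, True), (10, False), (160, False), (310, True), (320, True)]
print(longest_outage_seconds(example))
```

The measured value is only as precise as the polling interval, so a short interval relative to the expected 5-minute window gives a more useful number.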
      

      Actual results:

      ssp-operator is down for 5 minutes, until its pod is evicted and rescheduled onto another node.

      Expected results:

      ssp-operator is highly available, so a node outage does not cause extended downtime. This avoids the impact described above: dependent components not being deployed, component changes not being reconciled, and the common templates and/or the Template Validator not being updated or reset if they fail.

      Additional info:

       

              Andrej Krejcir (akrejcir@redhat.com)
              Naga Ravi Chaitanya Elluri (nelluri)
              Geetika Kapoor