OpenShift Bugs / OCPBUGS-33562

Cluster API is not accessible after all nodes are stopped and restarted during chaos testing

    Priority: Critical
      Description of problem:

      There are various use cases where customers prefer to turn their clusters off and on depending on load, or to be cost-effective when they run multiple clusters.

      During the chaos testing (https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md), the cluster API is not accessible after the nodes are stopped and started, irrespective of the cluster installation or shutdown time frame. This is a regression that we are able to reproduce multiple times on 4.16; the problem does not exist in previous releases (tested on 4.14 and 4.15).

       

      The issue might be caused by the nodes not registering properly after the restart. Logs are not accessible because the API is down. We will try to add a node with a public IP to the cluster using a custom MachineSet, so that we can SSH in and look at the logs.
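      For reference, the debug MachineSet could look roughly like the fragment below. This is a sketch, not a tested manifest: `<infra-id>` is a placeholder for the cluster's infrastructure ID, and every field not shown (instance type, AMI, subnet, security groups, the cluster label, and the rest of the providerSpec) would be copied from an existing worker MachineSet. The relevant knob is the AWS provider's `publicIp` field.

      ```yaml
      # Hypothetical debug MachineSet fragment (AWS); <infra-id> is a placeholder.
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      metadata:
        name: <infra-id>-debug-a
        namespace: openshift-machine-api
      spec:
        replicas: 1
        selector:
          matchLabels:
            machine.openshift.io/cluster-api-machineset: <infra-id>-debug-a
        template:
          metadata:
            labels:
              machine.openshift.io/cluster-api-machineset: <infra-id>-debug-a
          spec:
            providerSpec:
              value:
                # publicIp: true asks the AWS machine provider to attach a
                # public address, so the node stays reachable over SSH even
                # while the cluster API is down.
                publicIp: true
      ```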

       

      Version-Release number of selected component (if applicable):

      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-ec.6   True        False         7h27m   Cluster version is 4.16.0-ec.6

      How reproducible:

      Always

      Steps to Reproduce:

          1. Install a 4.16 cluster on the AWS cloud provider (IPI) using one of the nightly builds or dev-preview releases: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/
       
          2. Run the power-outage chaos test using the following commands after setting up an AWS profile for aws-cli to access the AWS APIs; alternatively, log in to the console and turn off the nodes manually.
             $ export SHUTDOWN_DURATION=60 
             $ export CHECK_CRITICAL_ALERTS=True
             $ podman run --name=outage --net=host --env-host=true -v /root/.kube/config:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
             $ podman logs -f outage
      
          3. Try to access the cluster after the nodes are back online at the end of the scenario, e.g. $ oc get nodes or any other command
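
      Step 3 amounts to polling the API until the nodes are back. A small helper along these lines can automate the check (a sketch; `wait_for_api` and `oc_probe` are hypothetical names, not part of krkn):

      ```python
      import subprocess
      import time

      def wait_for_api(probe, timeout=600.0, interval=5.0):
          """Poll `probe` (a zero-arg callable that returns True once the API
          answers) until it succeeds or `timeout` seconds elapse.
          Returns True on success, False on timeout."""
          deadline = time.monotonic() + timeout
          while time.monotonic() < deadline:
              if probe():
                  return True
              time.sleep(interval)
          return False

      def oc_probe():
          """One probe attempt: `oc get --raw /readyz` exits 0 when the
          kube-apiserver is serving."""
          return subprocess.run(
              ["oc", "get", "--raw", "/readyz"],
              capture_output=True,
          ).returncode == 0
      ```

      On a healthy cluster `wait_for_api(oc_probe)` returns True within a few minutes of the nodes coming back; on the affected 4.16 clusters it times out.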
          

      Actual results:

        Cluster is not accessible - The connection to the server api.ravicluster.aws.rhperfscale.org:6443 was refused - did you specify the right host or port?
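
      Note that "connection refused" means the TCP handshake reached a host with nothing listening on 6443 (kube-apiserver or its load balancer target down), as opposed to a timeout, which would suggest the node itself is unreachable. A minimal probe (a hypothetical helper, not part of the reproducer) can tell the cases apart:

      ```python
      import socket

      def probe_api(host, port=6443, timeout=3.0):
          """Classify why an API endpoint is unreachable: 'open' (listening),
          'refused' (host reachable, nothing listening on the port), or
          'timeout' (host down, or traffic filtered/unroutable)."""
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return "open"
          except ConnectionRefusedError:
              return "refused"
          except (TimeoutError, OSError):
              return "timeout"
      ```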

      Expected results:

       Cluster APIs are accessible and healthy. 

      Additional info:

          

              Vadim Rutkovsky (vrutkovs@redhat.com)
              Naga Ravi Chaitanya Elluri (nelluri)
              Ke Wang