- Bug
- Resolution: Duplicate
- Major
- None
- 4.16
- Critical
- Yes
- Approved
- False
Description of problem:
There are various use cases where customers prefer to turn their clusters off and on depending on load, or to stay cost-effective when running multiple clusters.
During chaos testing with https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md, the cluster API is not accessible after the nodes are stopped and started, irrespective of the cluster installation or shutdown time frame. This is a regression and we are able to reproduce it multiple times on 4.16. The problem does not exist in previous releases; we tested 4.14 and 4.15.
The issue might be caused by the nodes not registering properly after the restart. Logs are not accessible because the API is down. We will try to add a node with a public IP to the cluster using a custom MachineSet so that we can SSH in and look at the logs.
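While the API is down, the AWS serial console output of the stopped/started instances is one log source that remains reachable. A minimal sketch using the AWS CLI (the helper function and the cluster infra tag argument are ours, not part of the bug report; OpenShift tags instances with kubernetes.io/cluster/<infra-id>=owned):

```shell
#!/usr/bin/env bash
# Sketch: fetch serial console output for every instance of the cluster
# via the AWS CLI, for use while the Kubernetes API is unreachable.
# collect_console_logs is a hypothetical helper; the tag value passed in
# must be the cluster's infra ID.
collect_console_logs() {
  local cluster_tag=$1 id
  for id in $(aws ec2 describe-instances \
      --filters "Name=tag:kubernetes.io/cluster/${cluster_tag},Values=owned" \
      --query 'Reservations[].Instances[].InstanceId' --output text); do
    # One log file per instance, e.g. console-i-0abc123.log
    aws ec2 get-console-output --instance-id "$id" --output text \
      > "console-${id}.log"
  done
}

# Usage (after configuring AWS credentials for the cluster account):
# collect_console_logs <infra-id>
```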
Version-Release number of selected component (if applicable):
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.6   True        False         7h27m   Cluster version is 4.16.0-ec.6
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.16 cluster using one of the nightly builds or dev-preview releases (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/) on the AWS cloud provider (IPI).
2. Run the power-outage chaos test with the following commands, after setting up an AWS profile so aws-cli can access the AWS APIs (or log in to the console and turn off the nodes manually):
   $ export SHUTDOWN_DURATION=60
   $ export CHECK_CRITICAL_ALERTS=True
   $ podman run --name=outage --net=host --env-host=true -v /root/.kube/config:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
   $ podman logs -f outage
3. Try to access the cluster after the nodes are back online at the end of the scenario, e.g. $ oc get nodes or any other command.
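The check in step 3 can be scripted so the API is polled for a while rather than probed once. A minimal sketch (the wait_for helper, retry count, and interval are ours, not part of the krkn-hub tooling):

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or a retry budget is exhausted.
wait_for() {
  local retries=$1 interval=$2
  shift 2
  local i
  for ((i = 1; i <= retries; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0   # command succeeded on attempt $i
    fi
    sleep "$interval"
  done
  return 1       # still failing after retries * interval seconds
}

# Usage: wait up to 10 minutes for the cluster API to answer.
# wait_for 60 10 oc get nodes
```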
Actual results:
Cluster is not accessible:
The connection to the server api.ravicluster.aws.rhperfscale.org:6443 was refused - did you specify the right host or port?
Expected results:
Cluster APIs are accessible and healthy.
Additional info:
- duplicates
  - OCPBUGS-33614 "When Performed Node stop and start operation on ibmcloud and AWS deployment cluster becomes inaccessible" - Closed