- Bug
- Resolution: Unresolved
- Major
- None
- 4.16
- None
- False
Description of problem:
During disaster recovery testing with a stretch cluster, both router-default pods were scheduled to the surviving zone (datacenter 1) during the outage. When the failed zone (datacenter 2) was restored, the pods were NOT rebalanced, so a subsequent test in which datacenter 1 was taken down resulted in an unexpected outage.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always, when following the steps outlined below.
Steps to Reproduce:
1. Set up a stretch cluster as described here: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.16/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/disaster-recovery-subscriptions_common#disaster-recovery-subscriptions_common
2. Simulate an outage by taking down datacenter 2.
3. Wait approximately 8 minutes for eviction to occur and note that both router-default pods are now running in datacenter 1.
4. Bring datacenter 2 back up.
5. Take down datacenter 1.
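The observation in step 3 can be checked by counting router pods per zone. The sketch below is only an illustration: the node names, pod names, and the helper pods_per_zone are hypothetical, and in a real cluster the two mappings would have to be collected by hand, for example from `oc get pods -n openshift-ingress -o wide` and `oc get nodes -L topology.kubernetes.io/zone`.

```python
from collections import Counter


def pods_per_zone(pod_to_node, node_to_zone):
    """Count pods in each zone, including zones that currently hold no pods."""
    # Start every known zone at zero so an empty zone is still reported.
    counts = Counter({zone: 0 for zone in set(node_to_zone.values())})
    for node in pod_to_node.values():
        counts[node_to_zone[node]] += 1
    return dict(counts)


# Hypothetical state after step 3: both router pods sit on datacenter 1 nodes.
node_to_zone = {
    "worker-a": "datacenter1",
    "worker-b": "datacenter1",
    "worker-c": "datacenter2",
}
pod_to_node = {
    "router-default-1": "worker-a",
    "router-default-2": "worker-b",
}

placement = pods_per_zone(pod_to_node, node_to_zone)
print(placement)
```

If the count for datacenter 2 is still zero after step 4, the pods have not been rebalanced and step 5 will take down every router pod.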
Actual results:
Outage occurs
Expected results:
HA applications remain available with minimal (if any) outage, because datacenter 2 is up.
Additional info:
The topologySpreadConstraints for deployment/router-default in the openshift-ingress namespace specify whenUnsatisfiable: ScheduleAnyway, and pods are NOT rebalanced after zone outages.

topologySpreadConstraints:
- labelSelector:
    matchExpressions:
    - key: ingresscontroller.operator.openshift.io/hash
      operator: In
      values:
      - 7d6bdccc5
  maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
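To make the reported behavior concrete, here is a minimal sketch of how a zone spread constraint with maxSkew: 1 plays out in this scenario. It is a simplified model and not the scheduler's actual implementation: the function pick_zone and the zone names are hypothetical, skew is taken as the target zone's pod count (including the new pod) minus the current minimum across all zones, and the down zone is assumed to still count as a zone with zero pods.

```python
def pick_zone(counts, schedulable, max_skew, when_unsatisfiable):
    """Return the zone for one new pod, or None if the pod would stay Pending."""
    if not schedulable:
        return None
    global_min = min(counts.values())
    # Skew if the pod lands in a zone: its new count minus the current global minimum.
    skew = {zone: counts[zone] + 1 - global_min for zone in schedulable}
    best = min(skew, key=skew.get)
    if skew[best] > max_skew and when_unsatisfiable == "DoNotSchedule":
        return None  # hard constraint: refuse to place the pod
    return best      # soft constraint: least-bad zone is used anyway


# Datacenter 2 is down: its router pod was evicted and only datacenter 1 can take pods.
counts = {"datacenter1": 1, "datacenter2": 0}
zone = pick_zone(counts, ["datacenter1"], 1, "ScheduleAnyway")
counts[zone] += 1

# The same situation under a hard constraint would leave the replacement pod Pending.
strict = pick_zone({"datacenter1": 1, "datacenter2": 0}, ["datacenter1"], 1, "DoNotSchedule")

# Datacenter 2 comes back. No pod is being scheduled, so nothing is re-evaluated.
current_skew = max(counts.values()) - min(counts.values())
print(zone, counts, strict, current_skew)
```

In this model the constraint is consulted only when a pod is placed, which matches what was observed: the skew of 2 persists after datacenter 2 returns, because no running pod is moved to correct it.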