Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: ACM 2.10.2
Component/s: Application Lifecycle
Labels:
- ODF

Blocked:
False
Blocked Reason:
None
Ready:
False
Intelligence Requested:
Market:

Severity:
Important

Regression:
No

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Description of problem:

During failover of a subscription-based workload (e.g. from cluster c1), ramen restores the workload PVC on the failover cluster (e.g. c2). When the PVC is ready, ramen changes the PlacementDecision status to the failover cluster (e.g. c2). At this point ACM should deploy the subscription on the managed cluster (e.g. c2).

When this rare bug occurs, this never happens: a manifestwork for the subscription is not created and no progress is made for days.

If we scale the multicluster-operators-hub-subscription deployment down and up again, the manifestwork is created after a few minutes and the application is deployed on the managed cluster.

Version-Release number of selected component (if applicable):

% oc get csv -n open-cluster-management 
NAME                                               DISPLAY                                      VERSION             REPLACES                                          PHASE
advanced-cluster-management.v2.10.3                Advanced Cluster Management for Kubernetes   2.10.3              advanced-cluster-management.v2.10.2               Succeeded
odf-multicluster-orchestrator.v4.16.0-100.stable   ODF Multicluster Orchestrator                4.16.0-100.stable   odf-multicluster-orchestrator.v4.16.0-86.stable   Succeeded
odr-hub-operator.v4.16.0-100.stable                Openshift DR Hub Operator                    4.16.0-100.stable   odr-hub-operator.v4.16.0-86.stable                Succeeded
openshift-gitops-operator.v1.12.1                  Red Hat OpenShift GitOps                     1.12.1              openshift-gitops-operator.v1.12.0                 Succeeded

How reproducible:

Random; it happened about 2 times in the last year.

Steps to Reproduce:

We don't know how to reproduce this.

Actual results:

The subscription manifestwork is not created and the workload is not deployed on the failover cluster:

% oc get manifestwork -n c01-mdr-c2 | grep vm16-datavol-sub-02
vm16-datavol-sub-02-placement-1-drpc-vm16-datavol-sub-02-ns-mw    8d
vm16-datavol-sub-02-placement-1-drpc-vm16-datavol-sub-02-vrg-mw   8d

Expected results:

The subscription manifestwork is created and the workload is deployed on the failover cluster:

% oc get manifestwork -n c01-mdr-c2 | grep vm16-datavol-sub-02
vm16-datavol-sub-02-placement-1-drpc-vm16-datavol-sub-02-ns-mw    8d
vm16-datavol-sub-02-placement-1-drpc-vm16-datavol-sub-02-vrg-mw   8d
vm16-datavol-sub-02-vm16-datavol-sub-02-subscription-1            77m
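
The difference between the actual and expected listings can be checked with a small script. This is a minimal sketch, not part of the original report: the function name `has_subscription_mw` is hypothetical, and it assumes the subscription manifestwork name always contains the application name followed by `-subscription-`, as in the expected output above.

```shell
# Return 0 if the manifestwork listing on stdin contains a subscription
# manifestwork for the given application, non-zero otherwise.
# Assumption: the manifestwork name contains "<app>-subscription-",
# as in the expected output in this report.
has_subscription_mw() {
    app="$1"
    # -F: match the name as a fixed string, -q: report via exit status only.
    grep -qF -- "${app}-subscription-"
}

# Usage (needs a session logged in to the hub cluster):
#   oc get manifestwork -n c01-mdr-c2 | has_subscription_mw vm16-datavol-sub-02 \
#       || echo "subscription manifestwork is missing"
```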

Additional info:

To fix the issue we scaled down and up the hub-subscription operator:

% oc scale deployment -n open-cluster-management multicluster-operators-hub-subscription --replicas=0
% oc scale deployment -n open-cluster-management multicluster-operators-hub-subscription --replicas=1
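
The workaround can be wrapped in a function that also waits for the operator to come back. This is a sketch under assumptions, not a tested procedure: the function name `restart_hub_subscription` and the 120-second timeout are illustrative choices, and the deployment is assumed to run a single replica, as in the commands above.

```shell
# Restart the hub-subscription operator by scaling it down and up again.
# Assumption: the deployment runs one replica, as in this report.
restart_hub_subscription() {
    ns=open-cluster-management
    deploy=multicluster-operators-hub-subscription

    oc scale deployment -n "$ns" "$deploy" --replicas=0 || return 1
    oc scale deployment -n "$ns" "$deploy" --replicas=1 || return 1

    # Wait for the new pod to become available. The timeout value is an
    # arbitrary choice for this sketch.
    oc rollout status deployment "$deploy" -n "$ns" --timeout=120s
}
```

In the case reported here, the manifestwork appeared a few minutes after the restart, so a successful rollout does not mean the subscription has already been deployed.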

Attached files:

gather.acm.tar.xz - kubectl gather of all ACM namespaces (open-cluster-management* and managed cluster namespaces) in all clusters, taken when the system was broken
gather.acm.fixed.tar.xz - the same, taken after scaling the hub-subscription operator down and up

Links:

ODF bug: https://bugzilla.redhat.com/2291343

gather.acm.fixed.tar.xz   2024/06/17 6:18 PM   14.29 MB   Nir Soffer
gather.acm.tar.xz         2024/06/17 6:18 PM   13.32 MB   Nir Soffer
