Data Foundation Bugs / DFBUGS-542

[2271666] [RDR] [Hub recovery] [Co-situated] Slowness observed in the data replication even when the cluster isn't heavily loaded


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.15
    • Component: odf-dr/ramen

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):
      ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
      OCP 4.15.0-0.nightly-2024-03-05-113700
      ODF 4.15.0-157
      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
      Submariner brew.registry.redhat.io/rh-osbs/iib:680159
      VolSync 0.8.0

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      ****Active hub co-situated with primary managed cluster****

      1. After a site failure (the active hub and the primary managed cluster go down) and the move to the passive hub post hub recovery, all the workloads (RBD and CephFS) of both Subscription and AppSet types, in the states Deployed, FailedOver, and Relocated (1 each), that had been running on the primary managed cluster were failed over to the failover cluster (secondary), and the failover operation was successful.

      The workloads ran successfully on the failover cluster (secondary), and both VRG states were marked as Primary for all these workloads.

      We also had 4 workloads in the Deployed state on C2 (RBD and CephFS, AppSet and Subscription types, 1 each), and they remained as they were.

      2. Now recover the older primary managed cluster and ensure it is successfully imported into the RHACM console (if not, create an auto-import-secret for this cluster on the passive hub; see the sketch after this list).
      3. Monitor the drpc cleanup status and lastGroupSyncTime for all the failed-over workloads.
      4. Assuming cleanup happens successfully (though it doesn't; apply workarounds if needed, as we have bugs open for both RBD and CephFS), data sync should resume for all the workloads. Let IOs continue for a few days (3-4), during which sync should progress as expected without firing VolSync.DelayAlert (a critical-level alert that fires when the difference between lastGroupSyncTime for any workload (per namespace) and the current time in UTC exceeds 3x the sync interval set in the DRPolicy assigned to that workload; a rough check is sketched below).
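
      For reference, a minimal sketch of the auto-import-secret mentioned in step 2, assuming the kubeconfig form documented for RHACM; <cluster-name> (the managed cluster namespace on the hub) and the kubeconfig path are placeholders, not values taken from this environment:

      # Run against the passive hub.
      oc create secret generic auto-import-secret \
        -n <cluster-name> \
        --from-literal=autoImportRetry=5 \
        --from-file=kubeconfig=./recovered-cluster.kubeconfig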
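
      To spot workloads breaching the 3x threshold described in step 4, a rough check against the hub could look like the following. This is only a sketch, not a supported tool: it assumes GNU date and jq are available, that the DRPC status exposes lastGroupSyncTime (as seen in the outputs below), and that the DRPolicy scheduling interval is 5m; adjust the threshold to the interval actually in use.

      THRESHOLD=900   # 3 x 300s (5m) scheduling interval; adjust to the drpolicy in use
      now=$(date -u +%s)
      oc get drpc -A -o json | jq -r '.items[] |
        [.metadata.namespace, .metadata.name, .status.lastGroupSyncTime] | @tsv' |
      while IFS=$'\t' read -r ns name last; do
        [ -z "$last" ] && continue                       # skip workloads with no sync time yet
        lag=$(( now - $(date -u -d "$last" +%s) ))       # seconds since the last group sync
        [ "$lag" -gt "$THRESHOLD" ] && echo "LAGGING: $ns/$name lastGroupSyncTime=$last (${lag}s behind)"
      done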

      Actual results: Slowness in data replication was observed even though the cluster was not heavily loaded and the total workload count was within the desired limit.

      (Note: Later, after a week or so, it was found that osd-2 of cluster C1 was down and unable to recover; it is unclear whether this contributed to the slowness.)

      rook-ceph-osd-2-d6855dd9b-fhx62 0/2 Init:0/4 0 5d7h <none> compute-1 <none> <none>

      amanagrawal@Amans-MacBook-Pro hub % oc describe pod rook-ceph-osd-2-d6855dd9b-fhx62
      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning FailedAttachVolume 78s (x3146 over 4d10h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-864f7514-df95-4e4f-bdc9-e7a663a9070e" : volume attachment is being deleted

      Then bmekhiss@redhat.com followed this KB article to address the volume attachment issue with osd-2:
      https://kb.vmware.com/s/article/85213

      but it probably didn't help.
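
      For anyone triaging this, one way to inspect the attachment stuck in "being deleted" is sketched below; the PVC name comes from the event above, <attachment-name> is a placeholder, and clearing finalizers is a last-resort step, not the procedure that was followed here (the VMware KB above was).

      oc get volumeattachment | grep pvc-864f7514-df95-4e4f-bdc9-e7a663a9070e
      # Last resort only, if the attachment object is orphaned and blocking deletion:
      # oc patch volumeattachment <attachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'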

      After this, we collected the must-gather, which can be found here:
      http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/26mar24/
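
      For reference, an ODF must-gather of this kind is typically collected with something like the following (the image reference and destination directory are placeholders, not the exact command used here):

      oc adm must-gather --image=<odf-must-gather-image> --dest-dir=./must-gather-26mar24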

      Passive hub-

      amanagrawal@Amans-MacBook-Pro ~ % drpc
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:02Z 45m29.999316502s True
      busybox-workloads-14 cephfs-sub-busybox14-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:08Z 45m24.522406539s True
      busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:15Z 45m17.399723614s True
      busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:15Z 51m16.600450889s True
      busybox-workloads-5 rbd-sub-busybox5-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:01Z 1h49m3.070653116s True
      busybox-workloads-6 rbd-sub-busybox6-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:01Z 1h48m52.893419572s True
      busybox-workloads-7 rbd-sub-busybox7-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:23:49Z 1h49m5.411490083s True
      busybox-workloads-8 rbd-sub-busybox8-placement-1-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:14Z 724.385424ms True
      openshift-gitops cephfs-appset-busybox10-placement-drpc 18d amagrawa-c1 amagrawa-c2 Relocate Relocated Completed 2024-03-26T17:29:48Z 29m16.062088831s True
      openshift-gitops cephfs-appset-busybox11-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:36Z 44m56.299638791s True
      openshift-gitops cephfs-appset-busybox12-placement-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:14Z 51m14.942082281s True
      openshift-gitops cephfs-appset-busybox9-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:31Z 44m58.19020063s True
      openshift-gitops rbd-appset-busybox1-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:26Z 3h25m19.310948353s True
      openshift-gitops rbd-appset-busybox2-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:25Z 436h13m52.709973551s True
      openshift-gitops rbd-appset-busybox3-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:13Z 436h13m39.761048611s True
      openshift-gitops rbd-appset-busybox4-placement-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:14Z 297.972857ms True

      amanagrawal@Amans-MacBook-Pro ~ % date -u
      Tue Mar 26 19:33:39 UTC 2024

      amanagrawal@Amans-MacBook-Pro ~ % group
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
      namespace: busybox-workloads-13
      namespace: busybox-workloads-13
      lastGroupSyncTime: "2024-03-26T15:51:26Z"
      namespace: busybox-workloads-13
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
      namespace: busybox-workloads-14
      namespace: busybox-workloads-14
      lastGroupSyncTime: "2024-03-26T17:35:43Z"
      namespace: busybox-workloads-14
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
      namespace: busybox-workloads-15
      namespace: busybox-workloads-15
      lastGroupSyncTime: "2024-03-26T17:35:24Z"
      namespace: busybox-workloads-15
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-16
      namespace: busybox-workloads-16
      namespace: busybox-workloads-16
      lastGroupSyncTime: "2024-03-26T17:36:42Z"
      namespace: busybox-workloads-16
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-5
      namespace: busybox-workloads-5
      namespace: busybox-workloads-5
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-5
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
      namespace: busybox-workloads-6
      namespace: busybox-workloads-6
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-6
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
      namespace: busybox-workloads-7
      namespace: busybox-workloads-7
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-7
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
      namespace: busybox-workloads-8
      namespace: busybox-workloads-8
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-8
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:11:40Z"
      namespace: busybox-workloads-10
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T17:37:57Z"
      namespace: busybox-workloads-11
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:27:18Z"
      namespace: busybox-workloads-12
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T16:27:29Z"
      namespace: busybox-workloads-9
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-1
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-3
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-4

      Roughly 6-8 workloads have been firing the alert for the past week because their lastGroupSyncTime is delayed.
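
      For instance, busybox-workloads-13 above shows lastGroupSyncTime 2024-03-26T15:51:26Z against the 19:33:39 UTC timestamp, i.e. roughly 3h42m behind; assuming a scheduling interval on the order of minutes (e.g. 5m or 15m), that is far beyond the 3x threshold, so the alert fires.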

      C1-

      amanagrawal@Amans-MacBook-Pro c1 % cephdf
      — RAW STORAGE —
      CLASS SIZE AVAIL USED RAW USED %RAW USED
      ssd 1.5 TiB 1.4 TiB 133 GiB 133 GiB 8.64
      TOTAL 1.5 TiB 1.4 TiB 133 GiB 133 GiB 8.64

      — POOLS —
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      ocs-storagecluster-cephblockpool 1 64 255 GiB 118.47k 83 GiB 6.81 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.otp 2 8 0 B 0 0 B 0 381 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 3 8 0 B 0 0 B 0 381 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.index 4 8 124 KiB 11 248 KiB 0 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.meta 5 8 10 KiB 17 156 KiB 0 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.log 6 8 944 KiB 340 3.7 MiB 0 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.control 7 8 0 B 8 0 B 0 571 GiB
      .rgw.root 8 8 8.7 KiB 16 180 KiB 0 571 GiB
      ocs-storagecluster-cephfilesystem-metadata 9 32 3.5 GiB 1.98k 7.0 GiB 0.61 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.data 10 32 195 KiB 111 1.3 MiB 0 571 GiB
      .mgr 11 1 1.0 MiB 2 2.1 MiB 0 571 GiB
      ocs-storagecluster-cephfilesystem-data0 12 32 17 GiB 1.17M 34 GiB 2.90 571 GiB

      C2-

      amanagrawal@Amans-MacBook-Pro c2 % cephdf
      — RAW STORAGE —
      CLASS SIZE AVAIL USED RAW USED %RAW USED
      ssd 1.5 TiB 1.4 TiB 80 GiB 80 GiB 5.19
      TOTAL 1.5 TiB 1.4 TiB 80 GiB 80 GiB 5.19

      — POOLS —
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      ocs-storagecluster-cephblockpool 1 64 13 GiB 5.57k 38 GiB 3.03 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 2 8 0 B 0 0 B 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.otp 3 8 0 B 0 0 B 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.control 4 8 0 B 8 0 B 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.log 5 8 692 KiB 340 3.9 MiB 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.index 6 8 874 KiB 11 2.6 MiB 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.meta 7 8 4.2 KiB 17 148 KiB 0 408 GiB
      .rgw.root 8 8 5.8 KiB 16 180 KiB 0 408 GiB
      ocs-storagecluster-cephfilesystem-metadata 9 32 1.4 GiB 614 4.3 GiB 0.35 408 GiB
      .mgr 10 1 769 KiB 2 2.3 MiB 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.data 11 32 1.8 GiB 1.61k 5.4 GiB 0.44 408 GiB
      ocs-storagecluster-cephfilesystem-data0 12 32 8.5 GiB 1.11M 26 GiB 2.06 408 GiB

      The utilization shown above is after running IOs for more than a week.

      Expected results: Slowness shouldn't be observed in the data replication when the cluster isn't heavily loaded.

      Additional info:

              pdonnell@redhat.com Patrick Donnelly
              amagrawa@redhat.com Aman Agrawal
        Krishnaram Karthick Ramdoss