
[2189547] [IBM Z] [MDR]: Failover of application stuck in "Failing over" state


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.12.z
    • odf-4.12
    • 4.12.5

      Description of problem (please be as detailed as possible and provide log snippets):

      managed cluster 1 - mc1
      managed cluster 2 - mc2

      Application failover from mc1 to mc2 is stuck in the "FailingOver" state because restoring the PVs to mc2 failed due to a NooBaa S3 communication failure.
      Only the application's namespace was created on mc2 during the failover operation.

      Before initiating the failover operation, the NooBaa status was Ready on both mc1 and mc2. The must-gather logs of mc1 and mc2 taken before the failover operation are attached to this bug.
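
      For reference, a minimal sketch of how the NooBaa/S3 state can be spot-checked on each managed cluster; the resource names below are the ODF defaults and the "s3" route name is an assumption for this environment:

      # Sketch: verify the MCG/NooBaa S3 path on a managed cluster (default names assumed)
      oc -n openshift-storage get noobaa noobaa -o jsonpath='{.status.phase}{"\n"}'   # expect Ready
      oc -n openshift-storage get backingstore,bucketclass                            # backing stores should be Ready
      oc -n openshift-storage get route s3 -o jsonpath='{.spec.host}{"\n"}'           # S3 endpoint exposed by NooBaa (route name assumed)
      oc get obc -A                                                                   # ObjectBucketClaims across namespaces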

      Hub:

      [root@a3e25001 ~]# oc get drpc -n busybox-sample
      NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
      busybox-placement-1-drpc   20h   ocsm4205001        ocpm4202001       Failover       FailingOver
      [root@a3e25001 ~]#

      1. oc get drpc busybox-placement-1-drpc -n busybox-sample -o yaml
        ...
        status:
          actionStartTime: "2023-04-24T18:08:50Z"
          conditions:
          - lastTransitionTime: "2023-04-24T18:08:50Z"
            message: Started failover to cluster "ocpm4202001"
            observedGeneration: 3
            reason: NotStarted
            status: "False"
            type: PeerReady
          - lastTransitionTime: "2023-04-24T18:08:50Z"
            message: Waiting for PV restore to complete...)
            observedGeneration: 3
            reason: FailingOver
            status: "False"
            type: Available
          lastUpdateTime: "2023-04-25T14:34:01Z"
          phase: FailingOver
          preferredDecision:
            clusterName: ocsm4205001
            clusterNamespace: ocsm4205001
          progression: WaitingForPVRestore
          resourceConditions:
            conditions:
            - lastTransitionTime: "2023-04-24T17:58:02Z"
              message: PVCs in the VolumeReplicationGroup are ready for use
              observedGeneration: 1
              reason: Ready
              status: "True"
              type: DataReady
            - lastTransitionTime: "2023-04-24T17:58:02Z"
              message: VolumeReplicationGroup is replicating
              observedGeneration: 1
              reason: Replicating
              status: "False"
              type: DataProtected
            - lastTransitionTime: "2023-04-24T17:58:01Z"
              message: Restored PV cluster data
              observedGeneration: 1
              reason: Restored
              status: "True"
              type: ClusterDataReady
            - lastTransitionTime: "2023-04-25T14:02:42Z"
              message: VRG Kube object protect error
              observedGeneration: 1
              reason: UploadError
              status: "False"
              type: ClusterDataProtected
            resourceMeta:
              generation: 1
              kind: VolumeReplicationGroup
              name: busybox-placement-1-drpc
              namespace: busybox-sample
              protectedpvcs:
              - busybox-pvc
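
      The progression WaitingForPVRestore together with the ClusterDataProtected/UploadError condition points at the S3 upload/restore path. A minimal sketch of follow-up checks on the failover cluster; the namespace and deployment names are assumptions based on ODF 4.12 DR defaults:

      # Sketch: inspect the VRG and the DR cluster operator on mc2 (names assumed)
      oc -n busybox-sample get volumereplicationgroup -o yaml                          # VRG created for the failover, if any
      oc -n openshift-dr-system logs deploy/ramen-dr-cluster-operator | grep -i -e s3 -e restore
      oc -n busybox-sample get events --sort-by=.metadata.creationTimestamp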

      MC2:

      [root@m4202001 ~]# oc get ns busybox-sample
      NAME STATUS AGE
      busybox-sample Active 20h

      [root@m4202001 ~]# oc get all,pvc -n busybox-sample
      No resources found in busybox-sample namespace.
      [root@m4202001 ~]#

      [root@m4202001 ~]# oc get po -n openshift-storage
      NAME READY STATUS RESTARTS AGE
      csi-addons-controller-manager-6bb96f77b6-fcb22 2/2 Running 0 22h
      csi-cephfsplugin-8h6td 2/2 Running 2 25h
      csi-cephfsplugin-9nwpf 2/2 Running 2 25h
      csi-cephfsplugin-provisioner-6c7d889599-25knr 5/5 Running 0 22h
      csi-cephfsplugin-provisioner-6c7d889599-cn6kg 5/5 Running 0 22h
      csi-cephfsplugin-sbx2r 2/2 Running 2 25h
      csi-rbdplugin-484rx 3/3 Running 3 25h
      csi-rbdplugin-5qpsx 3/3 Running 3 25h
      csi-rbdplugin-k7qkv 3/3 Running 3 25h
      csi-rbdplugin-provisioner-d46b79bbb-868p8 6/6 Running 0 22h
      csi-rbdplugin-provisioner-d46b79bbb-frgq8 6/6 Running 0 22h
      noobaa-core-0 1/1 Running 0 22h
      noobaa-db-pg-0 1/1 Running 0 22h
      noobaa-endpoint-5bdc586b7d-v97bf 1/1 Running 0 22h
      noobaa-operator-66fb78dd94-m7lbh 1/1 Running 0 22h
      ocs-metrics-exporter-6b96597864-sbrtd 1/1 Running 0 22h
      ocs-operator-5598965945-pkmgw 1/1 Running 0 22h
      odf-console-55f8c5f6dd-7fhxc 1/1 Running 0 22h
      odf-operator-controller-manager-5cbb545ddc-h72wf 2/2 Running 0 22h
      rook-ceph-operator-64bb84d64f-z5fs9 1/1 Running 0 22h
      token-exchange-agent-7fd47f9bd8-m6465 1/1 Running 0 21h

      [root@m4202001 ~]# oc get noobaa -n openshift-storage noobaa -o yaml

      ....
        conditions:
        - ...
          status: "False"
          type: Available
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T18:05:12Z"
          message: 'could not open file "base/16385/2601": Read-only file system'
          reason: TemporaryError
          status: "True"
          type: Progressing
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T12:57:13Z"
          message: 'could not open file "base/16385/2601": Read-only file system'
          reason: TemporaryError
          status: "False"
          type: Degraded
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T18:05:12Z"
          message: 'could not open file "base/16385/2601": Read-only file system'
          reason: TemporaryError
          status: "False"
          type: Upgradeable
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T12:57:13Z"
          status: k8s
          type: KMS-Type
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T12:58:15Z"
          status: Sync
          type: KMS-Status
        endpoints:
          readyCount: 1
          virtualHosts:
          - s3.openshift-storage.svc
        observedGeneration: 2
        phase: Configuring
        readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck
          out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl
          -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa
          -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou
          can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait
          noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version:
          \ master-20220913\n\tNooBaa Operator Version: 5.12.0\n"
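
      The 'Read-only file system' error from the NooBaa DB suggests the Postgres volume on mc2 was remounted read-only, which would be consistent with the S3 communication failure. A minimal sketch of checks on mc2; pod and PVC names are taken from this cluster and the mount check is an assumption:

      # Sketch: check the NooBaa DB pod and its volume on mc2 (names from this report)
      oc -n openshift-storage logs noobaa-db-pg-0 --tail=50
      oc -n openshift-storage describe pvc db-noobaa-db-pg-0
      oc -n openshift-storage get events --field-selector involvedObject.name=noobaa-db-pg-0
      oc -n openshift-storage rsh noobaa-db-pg-0 mount | grep ' ro,'                   # look for a volume remounted read-only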

      RHCS:

      [root@rhcs01 ~]# ceph -s
        cluster:
          id:     778d5284-ddf7-11ed-a790-525400c41d12
          health: HEALTH_OK

        services:
          mon: 5 daemons, quorum rhcs01,rhcs02,rhcs04,rhcs05,rhcs07 (age 5d)
          mgr: rhcs01.ipckaw(active, since 6d), standbys: rhcs04.kfpmco
          mds: 1/1 daemons up, 1 standby
          osd: 6 osds: 6 up (since 6d), 6 in (since 6d)
          rgw: 2 daemons active (2 hosts, 1 zones)

        data:
          volumes: 1/1 healthy
          pools:   10 pools, 289 pgs
          objects: 1.18k objects, 1.8 GiB
          usage:   9.5 GiB used, 2.9 TiB / 2.9 TiB avail
          pgs:     289 active+clean
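
      Assuming the MCG default backing store in this external-mode setup sits on the RHCS object gateway (an assumption for this environment), a quick cross-check on the RHCS side could look like this sketch:

      # Sketch: confirm the RGW service backing the object path is healthy (RHCS 5 cephadm CLI)
      ceph health detail
      ceph orch ls rgw                   # both RGW daemons should be running
      radosgw-admin bucket list | head   # RGW answers admin queries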

      Version of all relevant components (if applicable):
      OCP: 4.12.11
      odf-operator.v4.12.2-rhodf
      RHCS: 5.3.z2

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Yes

      Is there any workaround available to the best of your knowledge?
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Configure a Metro-DR environment with mc1, mc2, and a hub cluster
      2. Deploy the sample application busybox
      3. Apply fencing to mc1 from the hub cluster and verify that the fencing is successful
      4. Initiate failover of the application from mc1 to mc2 (see the CLI sketch after these steps)
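
      One way to drive steps 3 and 4 from the hub via the CLI is sketched below; cluster and resource names are taken from the outputs above, and the DRCluster/DRPlacementControl field names are assumptions based on the Ramen Metro-DR API:

      # Sketch: fence mc1 (ocsm4205001) and fail the app over to mc2 (ocpm4202001)
      oc patch drcluster ocsm4205001 --type merge -p '{"spec":{"clusterFence":"Fenced"}}'
      oc get drcluster ocsm4205001 -o jsonpath='{.status.phase}{"\n"}'      # expect Fenced
      oc -n busybox-sample patch drpc busybox-placement-1-drpc --type merge \
        -p '{"spec":{"action":"Failover","failoverCluster":"ocpm4202001"}}'
      oc -n busybox-sample get drpc busybox-placement-1-drpc -o wide        # watch CURRENTSTATE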

      Actual results:
      Failover is stuck in the "FailingOver" state

      Expected results:
      Application failover should be successful

      Additional info:

      Must-gather logs of mc1 and mc2 before the failover operation:

      https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link

      https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

      Must-gather logs of mc1, mc2, and the hub after the failover operation was initiated:

      https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link

      https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link

      https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link
