- Bug
- Resolution: Unresolved
- Critical
- odf-4.12
- None
Description of problem (please be as detailed as possible and provide log
snippets):
managed cluster1 - mc1
managed cluster2 - mc2
Application failover from mc1 to mc2 is stuck in the "FailingOver" state because restoring the PVs to mc2 failed due to a NooBaa S3 communication failure.
Only the namespace of the application got created on mc2 during the failover operation.
Before initiating the failover operation, the NooBaa status was Ready on both mc1 and mc2; the must-gather logs of mc1 and mc2 collected before the failover are attached to the BZ.
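NooBaa readiness on each managed cluster can be confirmed with a quick status query before failover (a minimal check, assuming the default openshift-storage namespace; these exact commands are illustrative and not part of the captured output):
  oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}'
  oc get cephcluster -n openshift-storage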
Hub:
[root@a3e25001 ~]# oc get drpc -n busybox-sample
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE
busybox-placement-1-drpc 20h ocsm4205001 ocpm4202001 Failover FailingOver
[root@a3e25001 ~]#
- oc get drpc busybox-placement-1-drpc -n busybox-sample -oyaml
...
status:
  actionStartTime: "2023-04-24T18:08:50Z"
  conditions:
  - lastTransitionTime: "2023-04-24T18:08:50Z"
    message: Started failover to cluster "ocpm4202001"
    observedGeneration: 3
    reason: NotStarted
    status: "False"
    type: PeerReady
  - lastTransitionTime: "2023-04-24T18:08:50Z"
    message: Waiting for PV restore to complete...)
    observedGeneration: 3
    reason: FailingOver
    status: "False"
    type: Available
  lastUpdateTime: "2023-04-25T14:34:01Z"
  phase: FailingOver
  preferredDecision:
    clusterName: ocsm4205001
    clusterNamespace: ocsm4205001
  progression: WaitingForPVRestore
  resourceConditions:
    conditions:
    - lastTransitionTime: "2023-04-24T17:58:02Z"
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2023-04-24T17:58:02Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-04-24T17:58:01Z"
      message: Restored PV cluster data
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2023-04-25T14:02:42Z"
      message: VRG Kube object protect error
      observedGeneration: 1
      reason: UploadError
      status: "False"
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox-placement-1-drpc
      namespace: busybox-sample
      protectedpvcs:
      - busybox-pvc
MC2:
[root@m4202001 ~]# oc get ns busybox-sample
NAME STATUS AGE
busybox-sample Active 20h
[root@m4202001 ~]# oc get all,pvc -n busybox-sample
No resources found in busybox-sample namespace.
[root@m4202001 ~]#
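Since the DRPC on the hub reports progression WaitingForPVRestore and a ClusterDataProtected UploadError, the next thing to inspect on mc2 is the VRG (if it was created) and any restored PVs; suggested follow-up commands, not taken from the original triage:
  oc get volumereplicationgroup -n busybox-sample -o yaml
  oc get pv | grep busybox-sample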
[root@m4202001 ~]# oc get po -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-6bb96f77b6-fcb22 2/2 Running 0 22h
csi-cephfsplugin-8h6td 2/2 Running 2 25h
csi-cephfsplugin-9nwpf 2/2 Running 2 25h
csi-cephfsplugin-provisioner-6c7d889599-25knr 5/5 Running 0 22h
csi-cephfsplugin-provisioner-6c7d889599-cn6kg 5/5 Running 0 22h
csi-cephfsplugin-sbx2r 2/2 Running 2 25h
csi-rbdplugin-484rx 3/3 Running 3 25h
csi-rbdplugin-5qpsx 3/3 Running 3 25h
csi-rbdplugin-k7qkv 3/3 Running 3 25h
csi-rbdplugin-provisioner-d46b79bbb-868p8 6/6 Running 0 22h
csi-rbdplugin-provisioner-d46b79bbb-frgq8 6/6 Running 0 22h
noobaa-core-0 1/1 Running 0 22h
noobaa-db-pg-0 1/1 Running 0 22h
noobaa-endpoint-5bdc586b7d-v97bf 1/1 Running 0 22h
noobaa-operator-66fb78dd94-m7lbh 1/1 Running 0 22h
ocs-metrics-exporter-6b96597864-sbrtd 1/1 Running 0 22h
ocs-operator-5598965945-pkmgw 1/1 Running 0 22h
odf-console-55f8c5f6dd-7fhxc 1/1 Running 0 22h
odf-operator-controller-manager-5cbb545ddc-h72wf 2/2 Running 0 22h
rook-ceph-operator-64bb84d64f-z5fs9 1/1 Running 0 22h
token-exchange-agent-7fd47f9bd8-m6465 1/1 Running 0 21h
[root@m4202001 ~]# oc get noobaa -n openshift-storage noobaa -o yaml
....
    phase: Configuring
    status: "False"
    type: Available
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T18:05:12Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:57:13Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T18:05:12Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "False"
    type: Upgradeable
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:57:13Z"
    status: k8s
    type: KMS-Type
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:58:15Z"
    status: Sync
    type: KMS-Status
  endpoints:
    readyCount: 1
    virtualHosts:
    - s3.openshift-storage.svc
  observedGeneration: 2
  phase: Configuring
  readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck
    out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl
    -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa
    -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou
    can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait
    noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version:
    \ master-20220913\n\tNooBaa Operator Version: 5.12.0\n"
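The noobaa-db errors above ('could not open file "base/16385/2601": Read-only file system', a PostgreSQL data file) point to the noobaa-db-pg volume having been remounted read-only, which is consistent with the S3 communication failure reported during PV restore. A way to confirm this on mc2 (suggested commands, assuming the default mount path under /var/lib/pgsql and the default PVC name db-noobaa-db-pg-0; not taken from the original triage):
  oc -n openshift-storage logs noobaa-db-pg-0 --tail=50
  oc -n openshift-storage rsh noobaa-db-pg-0 sh -c 'mount | grep pgsql'
  oc -n openshift-storage get pvc db-noobaa-db-pg-0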
RHCS:
[root@rhcs01 ~]# ceph -s
  cluster:
    id:     778d5284-ddf7-11ed-a790-525400c41d12
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum rhcs01,rhcs02,rhcs04,rhcs05,rhcs07 (age 5d)
    mgr: rhcs01.ipckaw(active, since 6d), standbys: rhcs04.kfpmco
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 6d), 6 in (since 6d)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 289 pgs
    objects: 1.18k objects, 1.8 GiB
    usage:   9.5 GiB used, 2.9 TiB / 2.9 TiB avail
    pgs:     289 active+clean
Version of all relevant components (if applicable):
OCP: 4.12.11
odf-operator.v4.12.2-rhodf
RHCS: 5.3.z2
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Configure a Metro-DR environment with MC1, MC2, and a hub cluster
2. Deploy the sample busybox application
3. Fence mc1 from the hub cluster and verify that the fencing is successful
4. Initiate failover of the application from mc1 to mc2 (see the sketch after this list for the CR-level commands)
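For steps 3 and 4, fencing and failover are driven from the hub through the DRCluster and DRPlacementControl CRs; a minimal sketch using the cluster names from this report (the exact patches below are illustrative, not copied from the test run):
  # fence mc1 (ocsm4205001) and verify its phase
  oc patch drcluster ocsm4205001 --type merge -p '{"spec":{"clusterFence":"Fenced"}}'
  oc get drcluster ocsm4205001 -o jsonpath='{.status.phase}'
  # trigger failover of the busybox application to mc2 (ocpm4202001)
  oc patch drpc busybox-placement-1-drpc -n busybox-sample --type merge -p '{"spec":{"action":"Failover","failoverCluster":"ocpm4202001"}}'
  oc get drpc -n busybox-sample -w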
Actual results:
Failover is stuck in the "FailingOver" state
Expected results:
Application failover should complete successfully
Additional info:
Must-gather logs of mc1 and mc2 before the failover operation:
https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link
https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link
Must-gather logs of mc1, mc2, and hub after the failover operation was initiated:
https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link
https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link
https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link