Data Foundation Bugs / DFBUGS-173

[2303490] Ceph reports PG_DEGRADED after recovering the active MDS node from a node drain; Ceph health never returns to OK


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: odf-4.17
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log snippets):

      I observed the warning below in Ceph health immediately after draining the node on which the active MDS was running.

      Degraded data redundancy: 1263285/8784987 objects degraded (14.380%), 1 pg degraded, 1 pg undersized

      ceph status:
      sh-5.1$ ceph status
      cluster:
      id: 994259aa-5177-4411-bb6d-5f41e6d2bde0
      health: HEALTH_WARN
      Degraded data redundancy: 1263285/8784987 objects degraded (14.380%), 1 pg degraded, 1 pg undersized

      services:
      mon: 3 daemons, quorum a,b,c (age 36m)
      mgr: a(active, since 37m), standbys: b
      mds: 1/1 daemons up, 1 hot standby
      osd: 3 osds: 3 up (since 36m), 3 in (since 5h); 1 remapped pgs

      data:
      volumes: 1/1 healthy
      pools: 4 pools, 4 pgs
      objects: 2.93M objects, 4.2 GiB
      usage: 50 GiB used, 250 GiB / 300 GiB avail
      pgs: 1263285/8784987 objects degraded (14.380%)
      3 active+clean
      1 active+undersized+degraded+remapped+backfilling

      io:
      client: 1.8 KiB/s rd, 107 KiB/s wr, 2 op/s rd, 109 op/s wr
      recovery: 2.7 KiB/s, 147 objects/s
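
      For reference, a minimal sketch of how this state can be inspected, assuming the standard Rook/Ceph toolbox pod (label app=rook-ceph-tools) in the openshift-storage namespace; names may differ in other deployments:

      # Open a shell in the toolbox pod
      oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

      # Inside the toolbox: see which health checks are firing and which PG is stuck
      ceph health detail
      ceph pg ls degraded
      ceph osd tree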

      -------------------------------------------------------------------------------
      ceph -w (cluster log):
      ---------------

      2024-08-07T15:31:58.474864+0000 mon.a [INF] osd.0 marked itself down and dead
      2024-08-07T15:31:59.446878+0000 mon.a [WRN] Health check failed: 1 osds down (OSD_DOWN)
      2024-08-07T15:31:59.446909+0000 mon.a [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
      2024-08-07T15:31:59.446917+0000 mon.a [WRN] Health check failed: 1 zone (1 osds) down (OSD_ZONE_DOWN)
      2024-08-07T15:32:08.455715+0000 mon.c [INF] mon.c calling monitor election
      2024-08-07T15:32:08.463852+0000 mon.a [INF] mon.a calling monitor election
      2024-08-07T15:32:13.472954+0000 mon.a [INF] mon.a is new leader, mons a,c in quorum (ranks 0,2)
      2024-08-07T15:32:13.504953+0000 mon.a [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)
      2024-08-07T15:32:13.507412+0000 mon.a [INF] osd.0 failed (root=default,region=us-south,zone=us-south-2,host=ocs-deviceset-1-data-0gbnf7) (connection refused reported by osd.2)
      2024-08-07T15:32:13.507697+0000 mon.a [INF] Active manager daemon a restarted
      2024-08-07T15:32:13.508133+0000 mon.a [WRN] Health check failed: 1 osds down (OSD_DOWN)
      2024-08-07T15:32:13.508154+0000 mon.a [WRN] Health check failed: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set (OSD_FLAGS)
      2024-08-07T15:32:13.508161+0000 mon.a [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
      2024-08-07T15:32:13.508172+0000 mon.a [WRN] Health check failed: 1 zone (1 osds) down (OSD_ZONE_DOWN)
      2024-08-07T15:32:13.508438+0000 mon.a [INF] Activating manager daemon a
      2024-08-07T15:32:13.524015+0000 mon.a [WRN] Health detail: HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1/3 mons down, quorum a,c
      2024-08-07T15:32:13.524031+0000 mon.a [WRN] [WRN] FS_DEGRADED: 1 filesystem is degraded
      2024-08-07T15:32:13.524037+0000 mon.a [WRN] fs ocs-storagecluster-cephfilesystem is degraded
      2024-08-07T15:32:13.524041+0000 mon.a [WRN] [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
      2024-08-07T15:32:13.524046+0000 mon.a [WRN] have 0; want 1 more
      2024-08-07T15:32:13.524050+0000 mon.a [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum a,c
      2024-08-07T15:32:13.524056+0000 mon.a [WRN] mon.b (rank 1) addr v2:172.30.111.99:3300/0 is down (out of quorum)
      2024-08-07T15:32:13.546974+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:13.571798+0000 mon.a [INF] Manager daemon a is now available
      2024-08-07T15:32:14.500214+0000 mon.a [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)
      2024-08-07T15:32:15.139718+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:15.602506+0000 mon.a [WRN] Health check failed: Degraded data redundancy: 2122080/6366240 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:32:15.725995+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:16.713176+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:17.720229+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:18.754681+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:19.759228+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:20.821000+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:21.829013+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:21.846112+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2129714/6389142 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:32:22.854516+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:23.858571+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:24.913407+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:25.919793+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:26.969597+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:27.970493+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-b restarted
      2024-08-07T15:32:30.343628+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2129618/6388854 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:32:34.242144+0000 mon.a [INF] daemon mds.ocs-storagecluster-cephfilesystem-a is now active in filesystem ocs-storagecluster-cephfilesystem as rank 0
      2024-08-07T15:32:34.622939+0000 mon.a [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
      2024-08-07T15:32:35.347915+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2129053/6387159 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:32:40.350852+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2128471/6385413 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:32:42.287909+0000 mon.b [INF] mon.b calling monitor election
      2024-08-07T15:32:42.293112+0000 mon.a [INF] mon.a calling monitor election
      2024-08-07T15:32:42.301939+0000 mon.a [INF] mon.a is new leader, mons a,b,c in quorum (ranks 0,1,2)
      2024-08-07T15:32:42.317047+0000 mon.a [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum a,c)
      2024-08-07T15:32:42.328443+0000 mon.a [WRN] Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 2128399/6385197 objects degraded (33.333%), 4 pgs degraded
      2024-08-07T15:32:42.328475+0000 mon.a [WRN] [WRN] OSD_DOWN: 1 osds down
      2024-08-07T15:32:42.328483+0000 mon.a [WRN] osd.0 (root=default,region=us-south,zone=us-south-2,host=ocs-deviceset-1-data-0gbnf7) is down
      2024-08-07T15:32:42.328488+0000 mon.a [WRN] [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
      2024-08-07T15:32:42.328495+0000 mon.a [WRN] zone us-south-2 has flags noout
      2024-08-07T15:32:42.328505+0000 mon.a [WRN] [WRN] OSD_HOST_DOWN: 1 host (1 osds) down
      2024-08-07T15:32:42.328511+0000 mon.a [WRN] host ocs-deviceset-1-data-0gbnf7 (root=default,region=us-south,zone=us-south-2) (1 osds) is down
      2024-08-07T15:32:42.328525+0000 mon.a [WRN] [WRN] OSD_ZONE_DOWN: 1 zone (1 osds) down
      2024-08-07T15:32:42.328539+0000 mon.a [WRN] zone us-south-2 (root=default,region=us-south) (1 osds) is down
      2024-08-07T15:32:42.328554+0000 mon.a [WRN] [WRN] PG_DEGRADED: Degraded data redundancy: 2128399/6385197 objects degraded (33.333%), 4 pgs degraded
      2024-08-07T15:32:42.328575+0000 mon.a [WRN] pg 1.0 is active+undersized+degraded, acting [2,1]
      2024-08-07T15:32:42.328584+0000 mon.a [WRN] pg 2.0 is active+undersized+degraded, acting [1,2]
      2024-08-07T15:32:42.328590+0000 mon.a [WRN] pg 3.0 is active+undersized+degraded, acting [2,1]
      2024-08-07T15:32:42.328615+0000 mon.a [WRN] pg 4.0 is active+undersized+degraded, acting [1,2]
      2024-08-07T15:32:46.365040+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2128694/6386082 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:32:51.582806+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2133105/6399315 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:00.365860+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2132468/6397404 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:05.369169+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2131774/6395322 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:07.745128+0000 mon.a [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
      2024-08-07T15:33:07.745153+0000 mon.a [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
      2024-08-07T15:33:07.745179+0000 mon.a [INF] Health check cleared: OSD_ZONE_DOWN (was: 1 zone (1 osds) down)
      2024-08-07T15:33:07.788895+0000 mon.a [INF] osd.0 [v2:10.131.0.39:6800/2123740462,v1:10.131.0.39:6801/2123740462] boot
      2024-08-07T15:33:07.002817+0000 osd.0 [WRN] OSD bench result of 29015.942307 IOPS exceeded the threshold limit of 500.000000 IOPS for osd.0. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
      2024-08-07T15:33:10.375207+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2144800/6434400 objects degraded (33.333%), 4 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:11.878638+0000 mon.a [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2144800/6434400 objects degraded (33.333%), 4 pgs degraded)
      2024-08-07T15:33:16.388746+0000 mon.a [WRN] Health check failed: Degraded data redundancy: 2147287/6445536 objects degraded (33.314%), 2 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:22.559904+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2146906/6444624 objects degraded (33.313%), 2 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:30.388359+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2155367/6470181 objects degraded (33.312%), 2 pgs degraded (PG_DEGRADED)
      2024-08-07T15:33:35.391672+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2168686/6510342 objects degraded (33.311%), 1 pg degraded (PG_DEGRADED)
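
      As a hedged follow-up sketch (run from the same toolbox pod), these commands can show why a specific PG stays degraded; the PG id 1.0 is only an example taken from the health detail above:

      # List any PGs that are not active+clean
      ceph pg dump pgs_brief | grep -v 'active+clean'

      # Query one of the degraded PGs for its acting set and recovery state
      ceph pg 1.0 query

      # Check OSD placement and utilization after the drained node rejoins
      ceph osd df tree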

      Version of all relevant components (if applicable):

      ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)
      ocp: 4.17.0-0.nightly-2024-08-06-235322
      odf: 4.17.0-65.stable

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?
      Yes. It impacts automation.

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      1

      Is this issue reproducible?
      Yes

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Run IO on PVCs created using the Ceph filesystem.
      2. Once the IO has driven up memory usage on the active MDS, drain the node where that MDS is running (a command sketch follows this list).
      3. Ceph health reports a PG degraded warning and stays in that state indefinitely, even though the MDS comes back up and all pods are running fine.
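
      A minimal sketch of steps 2-3, assuming the MDS pods carry the default Rook label app=rook-ceph-mds and <node> is the worker hosting the active MDS; drain flags may need adjusting for your cluster:

      # Find the node running the active MDS pod
      oc -n openshift-storage get pods -l app=rook-ceph-mds -o wide

      # Drain the node, then bring it back
      oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
      oc adm uncordon <node>

      # From the toolbox pod, watch the health state; the PG_DEGRADED warning persists here
      ceph health detail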

      Actual results:
      PG_DEGRADED warnings appear in Ceph health after the node drain and never clear.

      Expected results:
      Ceph health should return to OK once all pods are back up and running after the node drain.

      Additional info:

              Assignee: Venky Shankar (vshankar@redhat.com)
              Reporter: Nagendra Reddy (rhn-support-nagreddy)
              Participants: Nagendra Reddy, Venky Shankar
              QA Contact: Elad Ben Aharon