- Bug
- Resolution: Done-Errata
- Critical
- None
- 4.17.z, 4.16.z
This is a clone of issue RHEL-44944 to use for version rhel-9.4.z
–
Original description:
This is a clone of issue RHEL-7984 to use for version rhel-9.5
–
Original description:
Description of problem:
- Filesystem mount blocked for a long time; vmcore collected (594311032)
- we can see the original mount process blocked on this stack :
crash> bt
PID: 20412  TASK: ffff9b55dc432800  CPU: 0  COMMAND: "mount"
 #0 [ffffc0d4e40238c8] __schedule at ffffffff91beafe6
 #1 [ffffc0d4e4023920] schedule at ffffffff91beb735
 #2 [ffffc0d4e4023930] xfs_extent_busy_flush at ffffffffc03b49aa [xfs]
 #3 [ffffc0d4e4023980] xfs_alloc_ag_vextent_size at ffffffffc03622c5 [xfs]
 #4 [ffffc0d4e40239f0] xfs_alloc_ag_vextent at ffffffffc0362f16 [xfs]
 #5 [ffffc0d4e4023a00] xfs_alloc_fix_freelist at ffffffffc036369b [xfs]
 #6 [ffffc0d4e4023b08] xfs_free_extent_fix_freelist at ffffffffc0363e54 [xfs]
 #7 [ffffc0d4e4023bb8] __xfs_free_extent at ffffffffc0363ef8 [xfs]
 #8 [ffffc0d4e4023c18] xfs_trans_free_extent at ffffffffc03df456 [xfs]
 #9 [ffffc0d4e4023c70] xfs_efi_item_recover at ffffffffc03df6c1 [xfs]
#10 [ffffc0d4e4023cb8] xlog_recover_process_intents at ffffffffc03e7433 [xfs]
#11 [ffffc0d4e4023d60] xlog_recover_finish at ffffffffc03e794c [xfs]
#12 [ffffc0d4e4023d78] xfs_log_mount_finish at ffffffffc03d93d2 [xfs]
#13 [ffffc0d4e4023da0] xfs_mountfs at ffffffffc03cbf92 [xfs]
#14 [ffffc0d4e4023df8] xfs_fs_fill_super at ffffffffc03d08ac [xfs]
#15 [ffffc0d4e4023e30] get_tree_bdev at ffffffff915603ff
#16 [ffffc0d4e4023e70] vfs_get_tree at ffffffff9155ead5
#17 [ffffc0d4e4023e90] do_mount at ffffffff91585209
#18 [ffffc0d4e4023ef8] ksys_mount at ffffffff91585c1e
#19 [ffffc0d4e4023f30] __x64_sys_mount at ffffffff91585c61
#20 [ffffc0d4e4023f38] do_syscall_64 at ffffffff912043ab
#21 [ffffc0d4e4023f50] entry_SYSCALL_64_after_hwframe at ffffffff91c000a9
    RIP: 00007f1878cb135e  RSP: 00007fff7025cd78  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00005633b04ada40  RCX: 00007f1878cb135e
    RDX: 00005633b04b5250  RSI: 00005633b04adc40  RDI: 00005633b04adc20
    RBP: 00007f1879b23184   R8: 0000000000000000   R9: 0000000000000002
    R10: 00000000c0ed0000  R11: 0000000000000246  R12: 0000000000000000
    R13: 00000000c0ed0000  R14: 00005633b04adc20  R15: 00005633b04b5250
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

// here :
581 void
582 xfs_extent_busy_flush(
583         struct xfs_mount        *mp,
584         struct xfs_perag        *pag,
585         unsigned                busy_gen)
586 {
587         DEFINE_WAIT             (wait);
588         int                     error;
589
590         error = xfs_log_force(mp, XFS_LOG_SYNC);
591         if (error)
592                 return;
593
594         do {
595                 prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
596                 if (busy_gen != READ_ONCE(pag->pagb_gen))
597                         break;
598                 schedule();    <<<<---
599         } while (1);

// we've been running for a long time :
crash> ps -t 20412
PID: 20412  TASK: ffff9b55dc432800  CPU: 0  COMMAND: "mount"
    RUN TIME: 02:49:31
  START TIME: 304267682759
       UTIME: 5832582
       STIME: 9204315314778

// for some reason the process shows as being in RUN state in the vmcore, but I suspect that's not really the case
// if I look for busy extents I don't find any except the one on the current transaction :
crash> xfs_cil_ctx.busy_extents 0xffff9b5528f60300
  busy_extents = {
    next = 0xffff9b5528f60330,
    prev = 0xffff9b5528f60330
  },
crash> xfs_cil_ctx.busy_extents 0xffff9b5528f60300 -ox
struct xfs_cil_ctx {
  [ffff9b5528f60330] struct list_head busy_extents;
}
crash> xfs_trans.t_busy ffff9b54c9bfbed0 -ox
struct xfs_trans {
  [ffff9b54c9bfbf90] struct list_head t_busy;
}
crash> list -H ffff9b54c9bfbf90 -o xfs_extent_busy.list
ffff9b5598c14980
crash> xfs_extent_busy ffff9b5598c14980
struct xfs_extent_busy {
  rb_node = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  list = {
    next = 0xffff9b54c9bfbf90,
    prev = 0xffff9b54c9bfbf90
  },
  agno = 13,
  bno = 14770,
  length = 19,
  flags = 0
}
// which seems strange; we're on AG 13, and it's true this fs has a large number of
// AGs, but it looks like we're deadlocked/stuck flushing that particular busy extent
// a metadump was collected and I can reproduce the issue
// I suspect we might be dealing with the issue described in :
https://www.spinics.net/lists/linux-xfs/msg72870.html
// in that case we may need :
8ebbf262d468 xfs: don't block in busy flushing when freeing extents
// on RHEL8/9
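To make the suspected cycle concrete, here is a minimal userspace sketch (plain C, not kernel code; every type and helper in it is invented for illustration) of the pattern the backtrace and the t_busy list suggest: the EFI-recovery transaction frees an extent, that extent lands on the transaction's own busy list, and the allocator then waits for the AG busy generation to advance, which can only happen once this same transaction commits.

/*
 * Minimal userspace sketch -- NOT kernel code, all names invented -- of the
 * suspected self-deadlock: a transaction frees an extent, the extent becomes
 * "busy" on that same transaction's list, and the allocator then waits for
 * the AG busy generation to change, which only happens when that very
 * transaction commits.
 */
#include <stdio.h>
#include <stdbool.h>

struct fake_pag {
    unsigned busy_gen;   /* bumped whenever busy extents are cleared */
    int      nr_busy;    /* busy extents currently tracked in this AG */
};

struct fake_trans {
    int nr_busy;         /* busy extents owned by this transaction */
};

/* an extent freed inside the transaction stays "busy" until commit */
static void free_extent(struct fake_pag *pag, struct fake_trans *tp)
{
    pag->nr_busy++;
    tp->nr_busy++;
}

/* commit clears this transaction's busy extents and bumps the generation */
static void commit_trans(struct fake_pag *pag, struct fake_trans *tp)
{
    pag->nr_busy -= tp->nr_busy;
    tp->nr_busy = 0;
    pag->busy_gen++;
}

/* same shape as the wait loop in xfs_extent_busy_flush(): wait for the
 * generation to move; the spin cap only exists so this demo terminates */
static bool busy_flush_blocking(const struct fake_pag *pag, unsigned busy_gen,
                                long max_spins)
{
    while (busy_gen == pag->busy_gen) {
        if (--max_spins == 0)
            return false;    /* stand-in for "stuck in schedule() forever" */
    }
    return true;
}

int main(void)
{
    struct fake_pag   pag = { .busy_gen = 0, .nr_busy = 0 };
    struct fake_trans tp  = { .nr_busy = 0 };

    unsigned gen = pag.busy_gen;   /* snapshot, like busy_gen in the trace */
    free_extent(&pag, &tp);        /* EFI recovery frees an extent */

    /* the allocator cannot reuse the busy range, so it tries to flush: */
    if (!busy_flush_blocking(&pag, gen, 1000000))
        printf("stuck: waiting on a busy extent our own transaction holds\n");

    commit_trans(&pag, &tp);       /* would only ever run after the wait */
    return 0;
}

Built with plain gcc this prints the "stuck" line; the real kernel has no spin cap, so the mount thread just sits in schedule() as in the backtrace above.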
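For reference, a second sketch of the direction the upstream commit referenced above appears to take, going only by its subject line ("don't block in busy flushing when freeing extents"); the helper name, the caller_is_freeing flag and the retry convention below are my assumptions, not the actual patch.

/*
 * Userspace sketch of the suspected fix direction, not the real kernel
 * change: when the caller is itself freeing extents, the busy flush must
 * not sleep; it signals "retry" so the caller can commit the transaction
 * (clearing its own busy extents and bumping the generation) and try again.
 */
#include <stdio.h>
#include <stdbool.h>

struct fake_pag {
    unsigned busy_gen;   /* bumped whenever busy extents are cleared */
};

/* non-blocking flush: true means "back out, commit what you hold, retry" */
static bool busy_flush_wants_retry(const struct fake_pag *pag,
                                   unsigned busy_gen, bool caller_is_freeing)
{
    if (busy_gen != pag->busy_gen)
        return false;            /* busy extents already cleared elsewhere */
    return caller_is_freeing;    /* freeing path must never sleep here */
}

int main(void)
{
    struct fake_pag pag = { .busy_gen = 0 };
    unsigned gen = pag.busy_gen;           /* allocator's snapshot */

    if (busy_flush_wants_retry(&pag, gen, true)) {
        /* instead of sleeping forever, commit the recovery transaction:
         * its busy extents are cleared and the generation advances */
        pag.busy_gen++;
        printf("committed own transaction (gen %u -> %u), retrying the free\n",
               gen, pag.busy_gen);
    }
    return 0;
}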
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
- is cloned by OCPBUGS-56574 RHCOS 4.16 XFS: recovery blocked on xfs_extent_busy_flush [rhel-9.4.z] (Closed)
- links to RHBA-2025:8280 OpenShift Container Platform 4.17.32 bug fix update