-
Bug
-
Resolution: Unresolved
-
Major
-
rhel-9.5.z
-
kmod-kvdo-8.2.6.3-165.el9
-
No
-
Important
-
ZStream
-
Customer Facing
-
rhel-storage-dm
-
ssg_platform_storage
-
16
-
22
-
2
-
Dev ack, PXE ack
-
False
-
False
-
-
Yes
-
None
-
Approved Blocker
-
Pass
-
RegressionOnly
-
Bug Fix
-
-
Done
-
Done
-
Done
-
Not Required
-
-
x86_64
-
None
Problem:
This is essentially a reopen of Jira https://issues.redhat.com/browse/RHEL-42515
The system crashes with the following kernel panic stack trace:
crash> bt
PID: 1375 TASK: ff3746aac7fda300 CPU: 19 COMMAND: "kvdo0:hashQ0"
#0 [ff7b836248aa3bf0] machine_kexec at ffffffffb6e7a897
#1 [ff7b836248aa3c48] __crash_kexec at ffffffffb6ffaeba
#2 [ff7b836248aa3d08] crash_kexec at ffffffffb6ffbfe8
#3 [ff7b836248aa3d10] oops_end at ffffffffb6e31dea
#4 [ff7b836248aa3d30] page_fault_oops at ffffffffb6e8c25b
#5 [ff7b836248aa3d88] exc_page_fault at ffffffffb7ad2d62
#6 [ff7b836248aa3db0] asm_exc_page_fault at ffffffffb7c00bb2
[exception RIP: finish_querying+0xca]
RIP: ffffffffc0e6c87a RSP: ff7b836248aa3e60 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ff7b83624a2638b8 RCX: 0000000000000017
RDX: ff7b83624a248410 RSI: 0000000000000004 RDI: ff7b836249b60f50
RBP: ff7b836249b60f50 R8: ff7b83624a248410 R9: ff7b83624a248410
R10: 000000000000002a R11: ff7b836249b70fe0 R12: ff7b83624a2e2e28
R13: ff7b83624a23d148 R14: ff7b83624a263950 R15: ff7b83624a23d1c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ff7b836248aa3e98] service_work_queue at ffffffffc0eb2cb3 [kvdo]
#8 [ff7b836248aa3f00] work_queue_runner at ffffffffc0eb2ef8 [kvdo]
#9 [ff7b836248aa3f18] kthread at ffffffffb6f38abd
#10 [ff7b836248aa3f50] ret_from_fork at ffffffffb6e03e89
The NULL pointer dereference happens in:
finish_querying()
  start_locking() (inlined)
    launch_data_vio_duplicate_zone_callback() (inlined)
static void lock_duplicate_pbn(struct vdo_completion *completion)
{
    unsigned int increment_limit;
    struct pbn_lock *lock;
    int result;
    struct data_vio *agent = as_data_vio(completion);
    struct slab_depot *depot = vdo_from_data_vio(agent)->depot;
    struct physical_zone *zone = agent->duplicate.zone;   <--- this was NULL

    assert_data_vio_in_duplicate_zone(agent);   <-- dereference the NULL pointer in this function (see below)
    ...
}
/**
 * assert_data_vio_in_duplicate_zone() - Check that a data_vio is running on the correct thread for its duplicate zone.
 * @data_vio: The data_vio in question.
 */
static inline void assert_data_vio_in_duplicate_zone(struct data_vio *data_vio)
{
    thread_id_t expected = data_vio->duplicate.zone->thread_id;   <-- this is the place of dereference
    ...
}
crash> struct data_vio.duplicate 0xff7b836249b60f50
  duplicate = {
    pbn = 0x0,
    state = VDO_MAPPING_STATE_UNMAPPED,
    zone = 0x0   <-- reason for the crash
  },
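To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the kvdo source and not the fix shipped in any erratum. The struct layouts are simplified stand-ins for the real kvdo types, and the helper name duplicate_thread_id_checked() is hypothetical. It only shows that reading thread_id through a NULL duplicate.zone is the faulting access, and how a guard could turn that into an error return instead of a page fault.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kvdo types; the real structs have more fields. */
struct physical_zone {
    unsigned int thread_id;
};

struct zoned_pbn {
    uint64_t pbn;
    int state;                    /* stand-in for the mapping state enum */
    struct physical_zone *zone;   /* NULL in the vmcore from this report */
};

struct data_vio {
    struct zoned_pbn duplicate;
};

/*
 * Hypothetical helper: fetch the duplicate zone's thread id, but refuse to
 * dereference a NULL zone pointer. Returns 0 on success and stores the id in
 * *thread_id_out; returns -EINVAL and leaves *thread_id_out untouched when no
 * duplicate zone has been set.
 */
int duplicate_thread_id_checked(const struct data_vio *data_vio,
                                unsigned int *thread_id_out)
{
    const struct physical_zone *zone = data_vio->duplicate.zone;

    if (zone == NULL)
        return -EINVAL;   /* the unguarded code would fault here instead */

    *thread_id_out = zone->thread_id;
    return 0;
}
```

With the values seen in the vmcore (pbn = 0x0, zone = 0x0), this helper would return -EINVAL. Whether the right remedy is such a guard, or preventing the data_vio from reaching this callback without a duplicate zone in the first place, is a question for the root-cause analysis.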
Further vmcore analysis notes (and the core location) are in subsequent comments.
What is the impact of this issue to you?
System panic and crash, disruption to production
Please provide the package NVR for which the bug is seen:
- RHEL 9.5.z, kernel version 5.14.0-503.29.1.el9_5.x86_64
- kmod-kvdo 8.2.4.15-141.el9_5
How reproducible is this bug?:
Not reproducible at will; it occurred during normal production.
- links to
-
RHBA-2025:146896 kmod-kvdo bug fix and enhancement update