RHEL-38632

Sanlock considers a powered off host as alive, breaking DR

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • rhel-9.6
    • rhel-8.6.0
    • Component: sanlock
    • Severity: Important
    • sst_logical_storage
    • ssg_platform_storage
    • Product: Red Hat Enterprise Linux
    • Architecture: x86_64

      What were you trying to do that didn't work?

      The scenario is:

      • 2 RHV Data-Centers, A (active) and B (passive), one on each site.
      • Storage is replicated from A to B.
      • Upon failure of the first site, Storage Domain replication is stopped, and the RHV setup on Site B imports the Storage Domains to initiate the DR procedure.

      That import ultimately fails.
       
      To initiate the import on Site B, the RHV hypervisor first acquires host id 250 in the ids lockspace; this part works:
       

      2024-05-25 11:57:52 52 [53295]: sanlock daemon started 3.8.4 host 4c4c4544-0051-5210-8038-c8c04f585032 (c2prdrev51is.infra.e-mip.eu)
      2024-05-25 11:58:12 72 [53295]: cmd_status 2,12
      2024-05-25 12:05:37 517 [53295]: cmd_status 2,12
      2024-05-25 12:11:52 892 [53345]: cmd_add_lockspace 2,12 4484db8c-4e14-4f86-852b-b2cffe079c02:250:/dev/4484db8c-4e14-4f86-852b-b2cffe079c02/ids:0 flags 0 timeout 10
      2024-05-25 12:11:52 892 [53345]: s1 lockspace 4484db8c-4e14-4f86-852b-b2cffe079c02:250:/dev/4484db8c-4e14-4f86-852b-b2cffe079c02/ids:0
      2024-05-25 12:11:52 893 [92233]: s1 set_lockspace_max_sectors_kb hw_kb 32767 setting 1024
      2024-05-25 12:11:52 893 [92233]: write_sysfs_size open error 13 /sys/dev/block/253:195/queue/max_sectors_kb
      2024-05-25 12:11:52 893 [92233]: set_max_sectors_kb write 1024 error -1 /dev/4484db8c-4e14-4f86-852b-b2cffe079c02/ids
      2024-05-25 12:11:52 893 [92233]: s1 set_lockspace_max_sectors_kb hw_kb 32767 set 1024 error -1
      2024-05-25 12:11:52 893 [92233]: s1 delta_acquire begin 4484db8c-4e14-4f86-852b-b2cffe079c02:250
      2024-05-25 12:11:52 893 [92233]: s1 delta_acquire write 250 93 893 4c4c4544-0051-5210-8038-c8c04f585032
      2024-05-25 12:11:52 893 [92233]: s1 delta_acquire delta_short_delay 20
      2024-05-25 12:12:12 913 [92233]: s1 delta_acquire done 250 93 893
      2024-05-25 12:12:13 914 [53345]: s1 add_lockspace done
      2024-05-25 12:12:13 914 [53345]: cmd_add_lockspace 2,12 done 0
      2024-05-25 12:12:14 914 [53295]: s1 host 1 67 676851 4c4c4544-004e-4210-804a-b5c04f465632
      2024-05-25 12:12:14 914 [53295]: s1 host 2 74 673519 4c4c4544-004e-3710-804b-b5c04f465632
      2024-05-25 12:12:14 914 [53295]: s1 host 3 70 0 4c4c4544-004e-3910-8043-b5c04f465632
      2024-05-25 12:12:14 914 [53295]: s1 host 4 66 677373 4c4c4544-004e-4310-8043-b5c04f465632
      2024-05-25 12:12:14 914 [53295]: s1 host 5 22 676370 4c4c4544-0054-4a10-804c-b7c04f354233
      2024-05-25 12:12:14 914 [53295]: s1 host 6 19 676429 4c4c4544-0057-4a10-804c-b7c04f354233
      2024-05-25 12:12:14 914 [53295]: s1 host 7 21 674418 4c4c4544-0054-4a10-804c-b6c04f354233
      2024-05-25 12:12:14 914 [53295]: s1 host 8 19 674396 4c4c4544-0057-4a10-804c-b6c04f354233
      2024-05-25 12:12:14 914 [53295]: s1 host 9 15 674430 4c4c4544-0057-4a10-804c-b8c04f354233
      2024-05-25 12:12:14 914 [53295]: s1 host 10 17 674403 4c4c4544-0054-4a10-804c-b9c04f354233
      2024-05-25 12:12:14 914 [53295]: s1 host 250 93 893 4c4c4544-0051-5210-8038-c8c04f585032
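      For reference, the add_lockspace above corresponds to a request like the following (a sketch of the command-line equivalent, using the lockspace string from the log; in reality vdsm drives this through the sanlock API):

      sanlock client add_lockspace -s 4484db8c-4e14-4f86-852b-b2cffe079c02:250:/dev/4484db8c-4e14-4f86-852b-b2cffe079c02/ids:0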
      

      Now here comes the problem: the next step is acquiring the SDM resource lease (on the leases volume).

       

      2024-05-25 12:12:16 917 [53346]: cmd_acquire 2,12,56848 ci_in 2 fd 12 count 1 flags 0
      2024-05-25 12:12:16 917 [53346]: s1:r2 resource 4484db8c-4e14-4f86-852b-b2cffe079c02:SDM:/dev/4484db8c-4e14-4f86-852b-b2cffe079c02/leases:1048576
      2024-05-25 12:12:16 917 [53346]: s1:r2 paxos_acquire begin offset 1048576 0x2 512 1048576
      2024-05-25 12:12:16 917 [53346]: s1:r2 leader 562 owner 5 22 348655 dblocks 4:1122005:1122005:5:22:348655:562:0,
      2024-05-25 12:12:16 917 [53346]: s1:r2 paxos_acquire leader 562 owner 5 22 348655 max mbal[4] 1122005 our_dblock 1102250 1102250 250 92 4471 552
      2024-05-25 12:12:16 917 [53346]: s1:r2 paxos_acquire owner 5 22 348655 host_status 5 22 676370 wait_start 914
      2024-05-25 12:12:16 917 [53346]: s1:r2 paxos_acquire owner 5 delta 5 22 676370 alive
      2024-05-25 12:12:16 917 [53346]: s1:r2 acquire_disk rv -243 lver 562 at 348655
      2024-05-25 12:12:16 917 [53346]: s1:r2 acquire_token held error -243
      2024-05-25 12:12:16 917 [53346]: s1:r2 release_token r_flags 0 lver 0
      2024-05-25 12:12:16 917 [53346]: s1:r2 release_token done r_flags 0
      2024-05-25 12:12:16 917 [53346]: s1:r2 cmd_acquire 2,12,56848 acquire_token -243 lease owned by other host
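      Error -243 is SANLK_ACQUIRE_IDLIVE: sanlock believes the current lease owner (host id 5, generation 22) is still alive, so it refuses to take the lease over. For completeness, the on-disk paxos leader record for SDM can be inspected with a direct dump of the leases volume at that offset (a sketch, using the path and offset from the log above); it should show owner 5, generation 22, timestamp 348655, matching the leader line:

      sanlock direct dump /dev/4484db8c-4e14-4f86-852b-b2cffe079c02/leases:1048576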
      

      More than half an hour later, we are still failing to grab SDM from the dead host, which has been powered off for quite some time by now. Sanlock keeps hitting the same IDLIVE branch:
       

      2024-05-25 12:45:34 2914 [53345]: s10:r11 resource 4484db8c-4e14-4f86-852b-b2cffe079c02:SDM:/dev/4484db8c-4e14-4f86-852b-b2cffe079c02/leases:1048576
      2024-05-25 12:45:34 2914 [53345]: s10:r11 paxos_acquire begin offset 1048576 0x2 512 1048576
      2024-05-25 12:45:34 2914 [53345]: s10:r11 leader 562 owner 5 22 348655 dblocks 4:1122005:1122005:5:22:348655:562:0,
      2024-05-25 12:45:34 2914 [53345]: s10:r11 paxos_acquire leader 562 owner 5 22 348655 max mbal[4] 1122005 our_dblock 1102250 1102250 250 92 4471 552
      2024-05-25 12:45:34 2914 [53345]: s10:r11 paxos_acquire owner 5 22 348655 host_status 5 22 676370 wait_start 2912
      2024-05-25 12:45:34 2914 [53345]: s10:r11 paxos_acquire owner 5 delta 5 22 676370 alive
      2024-05-25 12:45:34 2914 [53345]: s10:r11 acquire_disk rv -243 lver 562 at 348655
      2024-05-25 12:45:34 2914 [53345]: s10:r11 acquire_token held error -243 
      2024-05-25 12:45:34 2914 [53345]: s10:r11 release_token r_flags 0 lver 0 
      2024-05-25 12:45:34 2914 [53345]: s10:r11 release_token done r_flags 0
      2024-05-25 12:45:34 2914 [53345]: s10:r11 cmd_acquire 2,12,56848 acquire_token -243 lease owned by other host

      Note that the owner's host timestamp is still exactly 676370; this owner is definitely not alive.

      So sanlock kept returning IDLIVE. The lease is owned by the host with id 5 on the primary site, and its timestamp never moves from 676370. That host is powered off with no access to storage; the whole site lost power.

      My understanding is that, for some reason, we entered this if clause, which is what gets us to IDLIVE:

      if ((host_id_leader.timestamp != last_timestamp) ||
          (hs.last_live && (hs.last_check == hs.last_live))) {
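      My reading of the two conditions (the comments are my interpretation, not from the source):

      /* 1) the owner's delta lease timestamp on disk changed since we
            started waiting, i.e. the owner renewed its host id lease */
      host_id_leader.timestamp != last_timestamp

      /* 2) our own lockspace monitor's in-memory host_status says the
            owner was seen live at its most recent check */
      hs.last_live && (hs.last_check == hs.last_live)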

      IMHO it shouldn't be the first condition, as that would evaluate to 676370 != 0000676370, which is false. A direct dump at the time:

        offset                            lockspace                                         resource  timestamp  own  gen lver
      00002048 4484db8c-4e14-4f86-852b-b2cffe079c02             4c4c4544-0054-4a10-804c-b7c04f354233 0000676370 0005 0022
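      (The dump above is of the owner's delta lease on the ids volume; presumably it was produced with something like the following, the path being an assumption based on the logs:)

      sanlock direct dump /dev/4484db8c-4e14-4f86-852b-b2cffe079c02/ids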
      

      The sanlock logs seem to suggest it simply jumps to the conclusion that the host is live when it isn't. AFAIK, to be considered live, sanlock on this host needs to see a renewal from the host currently owning the lock, but it never sees one, nor does it wait for one.
      I could be way off, but my impression is that we are taking the shortcut introduced in the commit below, though I don't really understand why.

       

      commit f1d109a5b5017c431db16b02f056a806f584c077
      Date:   Fri Jan 30 16:08:37 2015 -0600
       
          sanlock: fail acquire with IDLIVE more quickly 
      

      Maybe we are going into this part of the if clause instead of waiting to see if the timestamp changes:

      (hs.last_live && (hs.last_check == hs.last_live))

      If the in-memory host_status for host 5 was only populated when this host added the lockspace (so last_check and last_live were both set by the very first read of its non-zero delta lease), that condition would be true even though the owner never actually renews.

       

      Please provide the package NVR for which the bug is seen:

      sanlock-3.8.4-3.el8.x86_64

      How reproducible:

      Unable to reproduce so far; in my attempts it works and does what I expect at that stage:

      2024-05-27 12:26:55 463 [815]: s1:r2 resource dacc82b4-abb0-4905-957c-4afcc5e586db:SDM:/dev/dacc82b4-abb0-4905-957c-4afcc5e586db/leases:1048576
      2024-05-27 12:26:55 463 [815]: s1:r2 paxos_acquire begin offset 1048576 0x2 512 1048576
      2024-05-27 12:26:55 463 [815]: s1:r2 leader 15 owner 1 14 1338 dblocks 0:28001:28001:1:14:1338:15:0,
      2024-05-27 12:26:55 463 [815]: s1:r2 paxos_acquire leader 15 owner 1 14 1338 max mbal[0] 28001 our_dblock 26250 26250 250 2 1172 14
      2024-05-27 12:26:55 463 [815]: s1 host_info 1 use own io_timeout 10
      2024-05-27 12:26:55 463 [815]: s1:r2 paxos_acquire owner 1 14 1338 host_status 0 0 0 wait_start 463
      2024-05-27 12:26:56 464 [808]: s1 host 1 14 3123 5ec17b69-be22-5d56-bfcc-44580c0a49dc
      2024-05-27 12:26:56 464 [808]: s1 host 2 9 182972 75b0ce70-0018-5a47-9457-a286253002ca
      2024-05-27 12:26:56 464 [808]: s1 host 250 3 442 75b0ce70-0018-5a47-9457-a286253002ca
      2024-05-27 12:26:56 464 [815]: s1:r2 paxos_acquire leader 15 owner 1 14 1338
      ....
      2024-05-27 12:29:16 604 [815]: s1:r2 paxos_acquire leader 15 owner 1 14 1338
      2024-05-27 12:29:16 604 [815]: s1:r2 paxos_acquire owner 1 14 1338 delta 1 14 3123 dead 604-463>140
      2024-05-27 12:29:16 604 [815]: s1:r2 paxos_acquire leader 15 owner 1 14 1338
      2024-05-27 12:29:16 604 [815]: s1:r2 ballot 16 phase1 write mbal 30250
      2024-05-27 12:29:16 604 [815]: s1:r2 ballot 16 phase1 read 249:30250:0:0:0:0:16:0,
      2024-05-27 12:29:16 604 [815]: s1:r2 ballot 16 phase2 write bal 30250 inp 250 3 604 q_max -1
      2024-05-27 12:29:16 604 [815]: s1:r2 ballot 16 phase2 read 249:30250:30250:250:3:604:16:0,
      2024-05-27 12:29:16 604 [815]: s1:r2 ballot 16 commit self owner 250 3 604
      2024-05-27 12:29:16 604 [815]: s1:r2 acquire_disk rv 1 lver 16 at 604
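      The 140 in "dead 604-463>140" matches the usual sanlock timeout arithmetic, assuming host_dead_seconds = 8 * io_timeout + watchdog fire timeout (60s):

      8 * 10 + 60 = 140   (io_timeout 10, per the "use own io_timeout 10" line above)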
      

      Expected results

      The SDM lease is acquired by the host on the DR site, and DR can proceed.

      Actual results

      The host on the DR site cannot get the SDM lease, so DR failover does not work.

            Assignee: David Teigland
            Reporter: Germano Veit Michel
            QA Contact: Cluster QE