Bug
Resolution: Unresolved
Normal
rhel-8.1.0, rhel-9.4
Important
sst_high_availability
ssg_filesystems_storage_and_HA
13
All
Description of problem:
Assume that a ticket has a before-acquire-handler that checks some external condition to determine whether a site is allowed to acquire the ticket. The handler succeeds at grant time, so the ticket is granted (say, to site 1).
Later, when it is time to renew the ticket (after renewal-freq), the handler runs again and fails. The ticket is released, and site 2 can try to acquire it.
When site 2 tries to acquire the released ticket, it does NOT run the before-acquire-handler. Because the check never executes, site 2 can acquire the ticket even when it is not safe to do so. The handler runs at site 2 only once renewal-freq has elapsed there, at its first renewal.
In the demonstration below, I (like the customer who reported this) use a geostore attribute to mark whether a site is allowed to acquire the ticket.
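The toy model below restates the problem in code. It is a hypothetical sketch, not booth's implementation: the functions `handler_ok` and `acquire`, the site labels, and the `reported`/`expected` modes are invented for illustration, and the `SAFE` array stands in for the geostore attribute. It encodes only which acquisition paths consult the before-acquire-handler.

```shell
#!/bin/bash
# Toy model of this report, NOT booth's source code: it only encodes which
# acquisition paths consult the before-acquire-handler.
declare -A SAFE=( [site1]=0 [site2]=0 )   # stand-in for SAFE_TO_ACTIVATE

handler_ok() { [[ ${SAFE[$1]} == 1 ]]; }  # stand-in for the handler script

# acquire PATH SITE MODE
#   PATH: grant | renew | election
#   MODE: reported (election skips the handler) | expected (always checks)
acquire() {
    local path=$1 site=$2 mode=$3
    if [[ $path == election && $mode == reported ]]; then
        echo granted   # handler never runs: the behavior described above
    elif handler_ok "$site"; then
        echo granted
    else
        echo denied
    fi
}

SAFE[site1]=1; acquire grant site1 reported   # site 1 gets the ticket
SAFE[site1]=0; acquire renew site1 reported   # renewal fails, site 1 releases
acquire election site2 reported               # site 2 wins unchecked (the bug)
acquire election site2 expected               # what should happen instead
```

Run as-is, it prints `granted`, `denied`, `granted`, `denied`. The only difference between the last two calls is whether the election path consults the handler.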
DEMONSTRATION:
Booth environment:
Site 1 (192.168.22.71):
fastvm-rhel-8-0-23
fastvm-rhel-8-0-24
Site 2 (192.168.22.81):
fastvm-rhel-8-0-33
fastvm-rhel-8-0-34
Arbitrator (192.168.22.52):
fastvm-rhel-8-0-52
- # ticket configuration
[root@fastvm-rhel-8-0-23 ~]# cat /etc/booth/booth.conf
~~~
...
ticket = "apacheticket"
expire = 120
renewal-freq = 60
retries = 4
timeout = 10
before-acquire-handler = /tmp/before-acquire-handler.sh
~~~
- # before-acquire-handler at site 1
- # site 2 is the same but uses booth_ip=192.168.22.81
- # Exit 0 if the SAFE_TO_ACTIVATE geostore attribute is set to 1, else exit 1
[root@fastvm-rhel-8-0-23 ~]# cat /tmp/before-acquire-handler.sh
~~~
#!/bin/bash
begin_date=$(date)
h_name=$(hostname -s)
ticket_name=apacheticket
booth_ip=192.168.22.71
SAFE_TO_ACTIVATE=$(geostore get -t "$ticket_name" -s "$booth_ip" SAFE_TO_ACTIVATE)
echo "$begin_date - $h_name: Floating Cluster IP=$booth_ip, Ticket Name=$ticket_name, ACTIVATE=$SAFE_TO_ACTIVATE" >> /tmp/before-acquire-handler.log
[[ $SAFE_TO_ACTIVATE -eq 1 ]] && exit 0 || exit 1
~~~
- # Initially, SAFE_TO_ACTIVATE=0 at both sites
[root@fastvm-rhel-8-0-23 ~]# geostore get -t apacheticket -s 192.168.22.71 SAFE_TO_ACTIVATE
0
[root@fastvm-rhel-8-0-23 ~]# geostore get -t apacheticket -s 192.168.22.81 SAFE_TO_ACTIVATE
0
- # Grant fails as it should because SAFE_TO_ACTIVATE != 1
[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket 192.168.22.71
Error: unable to grant booth ticket 'apacheticket' for site '192.168.22.71', reason: Jan 10 20:05:57 fastvm-rhel-8-0-23 booth: [4856]: info: grant request sent, waiting for the result ...
Jan 10 20:05:57 fastvm-rhel-8-0-23 booth: [4856]: error: before-acquire-handler for ticket "apacheticket" failed, grant denied
- # Set SAFE_TO_ACTIVATE=1 at site 1
[root@fastvm-rhel-8-0-23 ~]# geostore set -t apacheticket -s 192.168.22.71 SAFE_TO_ACTIVATE 1
Jan 10 20:06:49 fastvm-rhel-8-0-23 geostore: [5278]: info: set succeeded!
- # Grant succeeds as it should
[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket 192.168.22.71
[root@fastvm-rhel-8-0-23 ~]# booth list
ticket: apacheticket, leader: 192.168.22.71, expires: 2020-01-10 20:09:38
- # Set SAFE_TO_ACTIVATE=0 so that the before-acquire-handler fails upon ticket renewal
[root@fastvm-rhel-8-0-23 ~]# geostore set -t apacheticket -s 192.168.22.71 SAFE_TO_ACTIVATE 0
Jan 10 20:07:56 fastvm-rhel-8-0-23 geostore: [5810]: info: set succeeded!
- # Watch it fail at renewal
[root@fastvm-rhel-8-0-23 ~]# tail -f /var/log/messages
...
Jan 10 20:08:38 fastvm-rhel-8-0-23 boothd-site[31555]: [warning] apacheticket (Lead/19/59806): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 10 20:08:38 fastvm-rhel-8-0-23 boothd-site[31555]: [warning] apacheticket (Lead/19/59806): we are not allowed to acquire ticket
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6181]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6181]: notice: Invoked: crm_ticket -t apacheticket -S owner -v 0
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6200]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6200]: notice: Invoked: crm_ticket -t apacheticket -S expires -v 1578715718
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: Transition 303 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-351.bz2): Complete
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-schedulerd[11241]: notice: Calculated transition 303, saving inputs in /var/lib/pacemaker/pengine/pe-input-351.bz2
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6210]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6210]: notice: Invoked: crm_ticket -t apacheticket -S term -v 19
Jan 10 20:08:38 fastvm-rhel-8-0-23 boothd-site[31555]: [info] setting crm_ticket attributes successful
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-schedulerd[11241]: notice: Calculated transition 304, saving inputs in /var/lib/pacemaker/pengine/pe-input-352.bz2
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6217]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6217]: notice: Invoked: crm_ticket -t apacheticket -r --force
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: Transition 304 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-352.bz2): Complete
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-schedulerd[11241]: notice: * Stop dummy1 ( fastvm-rhel-8-0-23 ) due to node availability
- # Site 2 acquires the ticket without running the before-acquire-handler
Jan 10 20:08:37 fastvm-rhel-8-0-33 boothd-site[20210]: [info] apacheticket (Fllw/19/58738): 192.168.22.71 wants to give the ticket away (ticket release)
Jan 10 20:08:37 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] apacheticket (Fllw/20/0): starting new election (term=20)
Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] apacheticket (Lead/20/119999): granted successfully here
Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
Jan 10 20:08:38 fastvm-rhel-8-0-33 pacemaker-controld[1365]: notice: Result of start operation for dummy1 on fastvm-rhel-8-0-33: 0 (ok)
- # Site 2's before-acquire-handler fails upon ticket renewal
Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [warning] apacheticket (Lead/20/59808): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [warning] apacheticket (Lead/20/59807): we are not allowed to acquire ticket
Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
Jan 10 20:09:38 fastvm-rhel-8-0-33 pacemaker-controld[1365]: notice: Result of stop operation for dummy1 on fastvm-rhel-8-0-33: 0 (ok)
Jan 10 20:09:39 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
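The timestamps in these logs can be cross-checked with a little arithmetic. This sketch assumes GNU `date` (for `-d`). It takes `expire` and `renewal-freq` from the booth.conf shown earlier and the expiry time from `booth list`. It also assumes that site 2 wins the election at essentially the moment site 1 releases the ticket, as the site 2 log shows. The variable names are illustrative.

```shell
#!/bin/bash
# Cross-check the demonstration timeline from booth.conf and `booth list`.
# Requires GNU date (-d). Parse and print in one zone so offsets cancel out.
export TZ=UTC

expire=120                # booth.conf: expire
renewal_freq=60           # booth.conf: renewal-freq
expires_at=$(date -d "2020-01-10 20:09:38" +%s)   # from `booth list`

fmt() { date -d "@$1" +%H:%M:%S; }

granted_at=$(( expires_at - expire ))             # grant to site 1
release_at=$(( granted_at + renewal_freq ))       # site 1 renewal: handler fails
# Assumption: site 2 wins the election at release_at, as its log shows.
site2_check_at=$(( release_at + renewal_freq ))   # site 2's first handler run

echo "granted to site 1:         $(fmt "$granted_at")"
echo "site 1 renewal fails:      $(fmt "$release_at")"
echo "site 2 first handler run:  $(fmt "$site2_check_at")"
echo "site 2 unchecked window:   $(( site2_check_at - release_at ))s"
```

It prints 20:07:38 for the grant, 20:08:38 for site 1's failed renewal (matching the first warning at site 1), and 20:09:38 for site 2's first handler run (matching the warning at site 2). It also reports a 60 s window, one renewal-freq, during which site 2 holds the ticket without its handler having run.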
Version-Release number of selected component (if applicable):
booth-core-1.0-5.f2d38ce.git.el8.x86_64
booth-site-1.0-5.f2d38ce.git.el8.noarch
pacemaker-2.0.2-3.el8_1.2.x86_64
How reproducible:
Always
Steps to Reproduce:
1. Configure a booth ticket with a before-acquire-handler. Ensure that the handler is set up to succeed at site 1 and to fail at site 2.
2. Grant the ticket to site 1.
3. Cause the before-acquire-handler to fail at ticket renewal time at site 1.
4. Observe site 1 release the ticket and site 2 try to acquire it.
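The steps above can be scripted roughly as follows, run from a site 1 node. This sketch rests on several assumptions. The handler shown earlier is installed at both sites, and the addresses and ticket name match the demonstration. `geostore set -s` is assumed to reach site 2 remotely in the same way `geostore get -s` does above. `DRY_RUN`, `run`, and `reproduce` are hypothetical helpers added here. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/bash
# Reproduction sketch for the steps above; run on a site 1 node.
# Prints the commands by default; set DRY_RUN=0 to execute them.
set -euo pipefail

ticket=apacheticket
site1=192.168.22.71
site2=192.168.22.81
renewal_freq=60          # must match renewal-freq in booth.conf

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [[ ${DRY_RUN:-1} == 1 ]]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 1: handler succeeds at site 1 and fails at site 2.
    run geostore set -t "$ticket" -s "$site1" SAFE_TO_ACTIVATE 1
    run geostore set -t "$ticket" -s "$site2" SAFE_TO_ACTIVATE 0
    # Step 2: grant the ticket to site 1.
    run pcs booth ticket grant "$ticket" "$site1"
    # Step 3: make site 1's handler fail at its next renewal.
    run geostore set -t "$ticket" -s "$site1" SAFE_TO_ACTIVATE 0
    # Step 4: wait past one renewal interval, then see who holds the ticket.
    run sleep $(( renewal_freq + 5 ))
    run booth list
}

reproduce
```

With the bug present, the final `booth list` would be expected to show site 2 (192.168.22.81) as leader. The sleep ends about 65 s after the grant, which is after site 1's failed renewal but before site 2's first renewal, the point at which site 2's handler finally runs.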
Actual results:
Site 2 does not run the before-acquire-handler and successfully acquires the released ticket.
Expected results:
Site 2 runs the before-acquire-handler and is not allowed to acquire the ticket.
Additional info:
This closed upstream issue was filed by the same customer who reported the problem via a Red Hat support case: https://github.com/ClusterLabs/booth/issues/79.