RHEL-123159

booth fails to load ticket state via crm_ticket after a node reboot


    • Bug
    • Resolution: Unresolved
    • Normal
    • rhel-10.0
    • booth
    • rhel-ha
    • 5
    • x86_64

      What were you trying to do that didn't work?

      SCENARIO
      cluster1 running on one site, cluster2 running on the second site, arbitrator running on a third site. Booth ticket granted to cluster2.

      ISSUE DESCRIPTION
      After a reset of cluster2 (simulating a disaster), booth on the Leader starts without ticket state. It tries to load the state from pcmk using 'crm_ticket', and the command fails with exit code 105. It looks like pcmk blocks the connection for some time after pcmk startup.

      Oct 21 15:29:01 cluster2-node1.example.com boothd-site[1877]: [debug] reset_ticket:531: apacheticket (Init/0/0): next state reset
      Oct 21 15:29:01 cluster2-node1.example.com boothd-site[1877]: [error] apacheticket (Init/0/0): crm_ticket xml output empty
      Oct 21 15:29:01 cluster2-node1.example.com boothd-site[1877]: [warning] apacheticket: no site matches; site got reconfigured?
      Oct 21 15:29:01 cluster2-node1.example.com boothd-site[1877]: [error] command "crm_ticket -t 'apacheticket' -q" exit code 105
      Oct 21 15:29:01 cluster2-node1.example.com boothd-site[1877]: [info] apacheticket (Init/0/0): broadcasting state query
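
If the failure really is a timing issue, one conceivable mitigation is to retry the query for a bounded time before giving up. The Python sketch below only illustrates that idea and is not booth's code (booth is written in C). The helper names `run_crm_ticket` and `load_ticket_state`, the injectable `run` and `sleep` parameters, and the `attempts` and `delay` values are all hypothetical. The only parts taken from this report are the `crm_ticket -t 'apacheticket' -q` command line and the observation that the failed call exits non-zero with empty output.

```python
import subprocess
import time


def run_crm_ticket(ticket):
    """Run the query that booth logs in this report; return (exit_code, stdout)."""
    proc = subprocess.run(
        ["crm_ticket", "-t", ticket, "-q"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


def load_ticket_state(ticket, run=run_crm_ticket, attempts=5, delay=2.0,
                      sleep=time.sleep):
    """Retry the query until it exits 0 with non-empty output.

    Returns the raw output on success, or None if every attempt failed.
    attempts and delay are illustrative values, not tuned recommendations.
    """
    for attempt in range(1, attempts + 1):
        code, output = run(ticket)
        if code == 0 and output.strip():
            return output
        if attempt < attempts:
            sleep(delay)  # give pcmk time to start accepting connections
    return None
```

Calling `load_ticket_state("apacheticket")` on a cluster node would run the real command. Passing a fake `run` callable lets the retry logic be exercised without a cluster.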

      Booth is able to get the ticket information from another Follower booth instance, but only if at least one booth instance "survives". If all instances are reset, the ticket is lost.

      Booth is able to get the ticket information from pcmk if only the daemon is restarted on a running system (instead of a restart of the whole node): 

      [root@cluster2-node1 ~]# ps aux | grep booth
      haclust+    1922  0.0  0.4  15932 15676 ?        SLs  17:18   0:00 boothd daemon -c /etc/booth/booth.conf
      root       16865  0.0  0.0   6380  2016 pts/0    S+   17:54   0:00 grep --color=auto booth
      [root@cluster2-node1 ~]# killall -9 boothd
      [root@cluster2-node1 ~]# ps aux | grep booth
      haclust+   17069  0.0  0.4  16060 15920 ?        SLs  17:54   0:00 boothd daemon -c /etc/booth/booth.conf
      root       18503  0.0  0.0   6380  2064 pts/0    S+   17:57   0:00 grep --color=auto booth

      Oct 21 17:54:27 cluster2-node1.example.com booth[17068]: [debug] read key of size 64 in authfile /etc/booth/booth.key
      Oct 21 17:54:27 cluster2-node1.example.com booth[17068]: [debug] found myself at 192.168.100.250 (32 bits matched)
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [info] BOOTH site 1.2 daemon is starting
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [debug] disown_ticket:509: apacheticket (/0/0): ticket leader set to NONE
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [debug] reset_ticket:530: apacheticket (/0/0): state transition:  -> Init
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [debug] reset_ticket:531: apacheticket (Init/0/0): next state reset
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [debug] command "crm_ticket -t 'apacheticket' -q"
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [debug] update_ticket_state:606: apacheticket (Init/0/195899): next state set to Lead
      Oct 21 17:54:27 cluster2-node1.example.com boothd-site[17069]: [info] apacheticket (Init/0/195899): broadcasting state query
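
To compare the failing and the working startup mechanically, the ticket prefix of each log line can be extracted. This is a minimal Python sketch that assumes the `name (state/N/M)` prefix format visible in the logs in this report. The report does not say what the two numbers mean, so the sketch only extracts them. `parse_ticket_prefix` is a hypothetical helper, not part of booth.

```python
import re

# Matches e.g. "apacheticket (Init/0/195899)" as seen in the logs.
# The state may be empty, as in "apacheticket (/0/0)".
TICKET_PREFIX = re.compile(
    r"(?P<name>\w+) \((?P<state>\w*)/(?P<a>\d+)/(?P<b>\d+)\)"
)


def parse_ticket_prefix(line):
    """Return (name, state, first_number, second_number), or None if absent."""
    m = TICKET_PREFIX.search(line)
    if m is None:
        return None
    return m["name"], m["state"], int(m["a"]), int(m["b"])
```

In the failing boot the prefix stays at `Init/0/0`, while after the daemon-only restart it becomes `Init/0/195899` once `crm_ticket` has returned data.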

      Please provide the package NVR for which the bug is seen:

      booth-core-1.2-3.el10.x86_64
      booth-site-1.2-3.el10.noarch
      pcs-0.12.0-3.el10_0.2.x86_64
      pacemaker-schemas-3.0.0-5.1.el10_0.noarch
      pacemaker-libs-3.0.0-5.1.el10_0.x86_64
      pacemaker-cluster-libs-3.0.0-5.1.el10_0.x86_64
      pacemaker-3.0.0-5.1.el10_0.x86_64
      pacemaker-cli-3.0.0-5.1.el10_0.x86_64

      How reproducible is this bug?:

      Reproducible

      Steps to reproduce

       

      • Reset of cluster2 with cluster1 and arbitrator always up. Result: kept granted ticket (cluster1 and arbitrator can still provide the info)
      • Reset of cluster2 with cluster1 off and arbitrator always up. Result: kept granted ticket (arbitrator can still provide the info)
      • Poweroff/on of cluster2 with cluster1 off, arbitrator reset. Result: ticket lost.
      • Poweroff/on of cluster2 with cluster1 on, arbitrator reset. Result: kept granted ticket (cluster1 can still provide the info)
      • Poweroff/on of cluster2 with cluster1 off, arbitrator off, then in order cluster1 on, cluster2 on. Result: ticket lost.
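
The five scenarios above are consistent with a simple rule: the ticket is kept only if at least one other booth instance stays up while cluster2 restarts. The Python sketch below transcribes the scenarios and checks that rule against them. The encoding ("up", "off", "reset") and the `predicted` helper are illustrative, not something booth provides.

```python
# Each tuple: (cluster1 state, arbitrator state, observed result),
# transcribed from the five scenarios above. cluster2 is restarted in all of them.
SCENARIOS = [
    ("up", "up", "kept"),
    ("off", "up", "kept"),
    ("off", "reset", "lost"),
    ("up", "reset", "kept"),
    ("off", "off", "lost"),
]


def predicted(cluster1, arbitrator):
    """Hypothesis: the ticket survives only if another booth instance stays up."""
    return "kept" if "up" in (cluster1, arbitrator) else "lost"


# Collect every scenario whose observed result contradicts the hypothesis.
mismatches = [s for s in SCENARIOS if predicted(s[0], s[1]) != s[2]]
print(mismatches)  # prints []
```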

      Expected results

      Booth is able to get the ticket information from pcmk even after a reset of all booth instances.

      Actual results

      Booth is able to get the ticket information from another Follower booth instance only if at least one booth instance "survives". If all instances are reset, the ticket is lost.

              rhn-support-nwahl Reid Wahl
              rhn-support-rfurlan Riccardo Furlan
              Christopher Lumens Christopher Lumens
              Cluster QE Cluster QE