RHEL-7713

Corosync hosts frequently lose connection to peers on Azure VMs.

    • Moderate
    • rhel-sst-high-availability
    • ssg_filesystems_storage_and_HA

      Description of problem:
      Microsoft Azure documentation states that the totem token in the Corosync configuration file should be set to 30000 (ms) to allow for memory-preserving maintenance.

      https://learn.microsoft.com/en-us/azure/sap/workloads/high-availability-guide-rhel-pacemaker?tabs=msi

      We still occasionally see Corosync lose connection to its peers even with the token set to 30000. However, the Corosync log suggests it is waiting only about 10 seconds before it declares the token lost and starts forming a new membership:

      Jan 27 02:48:49.832 [14503] <Hostname> corosync notice [TOTEM ] totemsrp.c:timer_function_orf_token_warning:1730 Token has not been received in 7500 ms
      Jan 27 02:48:52.332 [14503] <Hostname> corosync notice [TOTEM ] totemsrp.c:timer_function_orf_token_timeout:1746 A processor failed, forming new configuration.
      Jan 27 02:48:57.800 [14503] <Hostname> corosync info [KNET ] libknet.h:log_deliver_fn:682 rx: host: 1 link: 0 is up
      Jan 27 02:48:57.800 [14503] <Hostname> corosync info [KNET ] libknet.h:log_deliver_fn:682 host: host: 1 (passive) best link: 0 (pri: 1)
      Jan 27 02:49:04.337 [14503] <Hostname> corosync notice [TOTEM ] totemsrp.c:memb_state_operational_enter:2096 A new membership (2.93) was formed. Members left: 1
      Jan 27 02:49:04.337 [14503] <Hostname> corosync notice [TOTEM ] totemsrp.c:memb_state_operational_enter:2101 Failed to receive the leave message. failed: 1
      Jan 27 02:49:04.337 [14503] <Hostname> corosync notice [QUORUM] vsf_quorum.c:log_view_list:131 Members[1]: 2
      Jan 27 02:49:04.337 [14503] <Hostname> corosync notice [MAIN ] main.c:corosync_sync_completed:296 Completed service synchronization, ready to provide service.
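The roughly 10-second figure can be checked from the two TOTEM lines in the excerpt above. The following is a minimal sketch, not part of the original report: the helper names `log_time` and `observed_token_ms` are hypothetical, and it assumes the warning line's "7500 ms" counts from the last token received.

```python
from datetime import datetime

# Two lines copied verbatim from the log excerpt above.
WARNING_LINE = ("Jan 27 02:48:49.832 [14503] <Hostname> corosync notice [TOTEM ] "
                "totemsrp.c:timer_function_orf_token_warning:1730 "
                "Token has not been received in 7500 ms")
TIMEOUT_LINE = ("Jan 27 02:48:52.332 [14503] <Hostname> corosync notice [TOTEM ] "
                "totemsrp.c:timer_function_orf_token_timeout:1746 "
                "A processor failed, forming new configuration.")


def log_time(line):
    """Parse the leading 'Mon DD HH:MM:SS.mmm' timestamp (the log carries no year)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp, "%b %d %H:%M:%S.%f")


def observed_token_ms(warning_line, timeout_line):
    """Estimate how long Corosync waited for the token before giving up.

    Assumption: the warning's 'N ms' counts from the last received token,
    so the total wait is N plus the gap between warning and timeout.
    """
    warned_after_ms = int(warning_line.split()[-2])  # the '7500' before 'ms'
    gap = log_time(timeout_line) - log_time(warning_line)
    return warned_after_ms + round(gap.total_seconds() * 1000)


print(observed_token_ms(WARNING_LINE, TIMEOUT_LINE))  # 10000
```

Under that assumption the node gave up on the token after 10000 ms, not 30000 ms.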

      For reference, here are the totem section of corosync.conf and the relevant corosync-cmapctl output.
      corosync.conf
      totem {
          version: 2
          cluster_name: <HA Cluster>
          transport: knet
          token: 30000
          crypto_cipher: aes256
          crypto_hash: sha256
      }

      From corosync-cmapctl:
      runtime.config.totem.token (u32) = 30000
      runtime.config.totem.token_retransmit (u32) = 7142
      runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
      runtime.config.totem.token_warning (u32) = 75
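As a cross-check, the runtime values above can be compared with the logged warning interval. This sketch rests on two assumptions, both based on my reading of corosync.conf(5) and not confirmed in this report: that token_warning is a percentage of the token timeout, and that token_retransmit is derived as token / (token_retransmits_before_loss_const + 0.2). The helper names are hypothetical.

```python
import re

# Copied from the corosync-cmapctl output above.
CMAP_OUTPUT = """\
runtime.config.totem.token (u32) = 30000
runtime.config.totem.token_retransmit (u32) = 7142
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.token_warning (u32) = 75
"""

PREFIX = "runtime.config.totem."


def parse_cmap(text):
    """Turn 'key (u32) = value' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\S+) \((u\d+)\) = (\d+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(3))
    return values


def expected_timers(values):
    """Derive the timers implied by the runtime token (assumed formulas)."""
    token = values[PREFIX + "token"]
    retransmits = values[PREFIX + "token_retransmits_before_loss_const"]
    warning_pct = values[PREFIX + "token_warning"]
    return {
        "token_warning_ms": token * warning_pct // 100,
        "token_retransmit_ms": int(token / (retransmits + 0.2)),
    }


def implied_token_ms(logged_warning_ms, warning_pct):
    """Token timeout that would produce the logged warning interval."""
    return logged_warning_ms * 100 // warning_pct


values = parse_cmap(CMAP_OUTPUT)
print(expected_timers(values))
print(implied_token_ms(7500, values[PREFIX + "token_warning"]))
```

If those assumptions hold, a 30000 ms token should produce the warning at 22500 ms, and the derived retransmit value matches the reported 7142. The logged warning at 7500 ms instead corresponds to a 10000 ms token, which agrees with the timing seen in the log.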

      Based on the above I have the following questions:
      1. How can I verify that Corosync is honoring the 30-second token timeout?
      2. Are any additional Corosync (or Pacemaker) settings or workarounds recommended for Azure? Are there any known problems with Corosync/Pacemaker on Azure?

      Version-Release number of selected component (if applicable):
      corosync-3.0.4-2

      How reproducible:
      Not reproducible on demand.


              rhn-engineering-jfriesse Jan Friesse
              jira-bugzilla-migration RH Bugzilla Integration
              Cluster QE Cluster QE