RHEL-7620

When the root volume is unavailable on the DC node, the node keeps running but does not function as expected.


      Description of problem:

      When the root volume becomes unavailable on the DC node, the node keeps running as a cluster member but does not function as expected.

      Version-Release number of selected component (if applicable):

      pacemaker-2.0.3-5.el8_2.1.x86_64
      corosync-3.0.3-2.el8.x86_64
      pcs-0.10.4-6.el8_2.1.x86_64

      How reproducible:

      Always (from my testing)

      Steps to Reproduce:

      First, test on a non-DC node.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node2 ~]# pcs constraint --full
      Location Constraints:
      Resource: test1
      Enabled on:
      Node: ha8node1 (score:INFINITY) (id:location-test1-ha8node1-INFINITY)
      Ordering Constraints:
      Colocation Constraints:
      Ticket Constraints:
      [root@ha8node2 ~]#

      [root@ha8node1 ~]# watch -n 1 pcs status

      Every 1.0s: pcs status ha8node1: Tue Jul 7 16:19:57 2020

      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 16:19:57 2020
      • Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node1

      Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

      [root@ha8node1 ~]# pcs resource config test1
      Resource: test1 (class=ocf provider=pacemaker type=Dummy)
      Operations: migrate_from interval=0s timeout=20s (test1-migrate_from-interval-0s)
      migrate_to interval=0s timeout=20s (test1-migrate_to-interval-0s)
      monitor interval=5 on-fail=fence timeout=5 (test1-monitor-interval-5)
      reload interval=0s timeout=20s (test1-reload-interval-0s)
      start interval=0s timeout=20s (test1-start-interval-0s)
      stop interval=0s timeout=20s (test1-stop-interval-0s)
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
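
      For reference, a setup equivalent to the one shown above could be created with pcs commands roughly like the following (the resource and constraint names are taken from the output above; the exact options are assumptions):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Hypothetical re-creation of the test resource and its location constraint
      pcs resource create test1 ocf:pacemaker:Dummy \
          op monitor interval=5 timeout=5 on-fail=fence
      pcs constraint location test1 prefers ha8node1=INFINITY
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~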

      Disable (suspend) the root volume on node2:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node2 ~]# dmsetup suspend rhel-root
      [root@ha8node2 ~]#
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
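
      To double-check that the device is really suspended (a suggested sanity check, not part of the original report):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # "State: SUSPENDED" in the output indicates the suspend took effect
      dmsetup info rhel-root | grep State
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~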

      After a few minutes, nothing has happened in the cluster:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 16:21:30 2020
      • Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ] <=============

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node1

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      Re-enable (resume) the root volume on node2:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node2 ~]# dmsetup resume rhel-root
      [root@ha8node2 ~]#
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      Still nothing happened to the cluster:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 16:22:40 2020
      • Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node1

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      Disable (suspend) the root volume on node2 again:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node2 ~]# dmsetup suspend rhel-root
      [root@ha8node2 ~]#
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      Try to move resource 'test1' to node2:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node1 ~]# pcs resource move test1
      Warning: Creating location constraint 'cli-ban-test1-on-ha8node1' with a score of -INFINITY for resource test1 on ha8node1.
      This will prevent test1 from running on ha8node1 until the constraint is removed
      This will be the case even if ha8node1 is the last node in the cluster
      [root@ha8node1 ~]#
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
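
      Note: 'pcs resource move' leaves the 'cli-ban-test1-on-ha8node1' constraint in place; once the test is done it can be removed with something like the following (not part of the original reproduction steps):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Removes the location constraint created by 'pcs resource move'
      pcs resource clear test1
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~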

      'test1' did not move to node2 for a few minutes; it went to 'Stopped' and then to 'FAILED' status. Eventually node2 was fenced and the resource started on it. It takes some time, but it works.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 16:24:49 2020
      • Last change: Tue Jul 7 16:23:37 2020 by root via crm_resource on ha8node1
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Stopped <==============

      ...

      Later, the resource test1 goes to 'FAILED' status, then back to 'Stopped', and node2 gets fenced.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
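
      If needed, the fencing of node2 can be confirmed from the surviving node, for example with stonith_admin (a suggested check, not from the original report):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Show the fencing history for all nodes; the reboot of ha8node2 should appear here
      stonith_admin --history '*'
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~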

      However, the behavior is different when the same test is run on the DC node (ha8node1). The cluster status before the test:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 17:03:51 2020
      • Last change: Tue Jul 7 17:01:49 2020 by root via cibadmin on ha8node1
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node2

      Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

      [root@ha8node1 ~]# pcs constraint --full
      Location Constraints:
      Ordering Constraints:
      Colocation Constraints:
      Ticket Constraints:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
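
      The current DC can also be confirmed directly with crmadmin (a convenience check, not part of the original report):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Prints the node currently acting as Designated Controller
      crmadmin -D
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~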

      Suspend the root volume on node1 (the DC)
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node1 ~]# dmsetup suspend rhel-root
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      Nothing happens for about 10 minutes...
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 17:10:45 2020
      • Last change: Tue Jul 7 17:01:49 2020 by root via cibadmin on ha8node1
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node2

      Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
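
      One way to see whether the Pacemaker daemons on the DC are blocked on the suspended root device is to look for processes stuck in uninterruptible sleep (state 'D'); this is a suggested diagnostic, not part of the original report:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # List processes in uninterruptible sleep together with what they are waiting on
      ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~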

      Try to move the resource 'test1' to node1, which is the DC with the unavailable root volume:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node2 ~]# pcs resource move test1
      Warning: Creating location constraint 'cli-ban-test1-on-ha8node2' with a score of -INFINITY for resource test1 on ha8node2.
      This will prevent test1 from running on ha8node2 until the constraint is removed
      This will be the case even if ha8node2 is the last node in the cluster
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      Nothing happens:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ..
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 17:25:57 2020
      • Last change: Tue Jul 7 17:11:19 2020 by root via crm_resource on ha8node2
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node2

      ..
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
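
      To confirm that the scheduler actually wants to move test1 even though nothing is executed, crm_simulate can be run against the live CIB from the non-DC node (a suggested diagnostic, not in the original report):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Shows allocation scores and the pending transition for the live cluster state
      crm_simulate -sL
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~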

      It starts working again as if there had been no issue once the root volume is made available again:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [root@ha8node1 ~]# dmsetup resume rhel-root
      [root@ha8node1 ~]#
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Cluster name: ha8_cluster
      Cluster Summary:

      • Stack: corosync
      • Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
      • Last updated: Tue Jul 7 17:26:34 2020
      • Last change: Tue Jul 7 17:11:19 2020 by root via crm_resource on ha8node2
      • 2 nodes configured
      • 7 resource instances configured

      Node List:

      • Online: [ ha8node1 ha8node2 ]

      Full List of Resources:

      • xvmfence1 (stonith:fence_xvm): Started ha8node1
      • xvmfence2 (stonith:fence_xvm): Started ha8node2
      • Resource Group: webservice:
      • VIP (ocf::heartbeat:IPaddr2): Started ha8node1
      • WebSite (ocf::heartbeat:apache): Started ha8node1
      • lvm (ocf::heartbeat:LVM-activate): Started ha8node1
      • cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
      • test1 (ocf::pacemaker:Dummy): Started ha8node1

      Failed Resource Actions:

      • VIP_monitor_10000 on ha8node1 'error' (1): call=43, status='Timed Out', exitreason='', last-rc-change='2020-07-07 17:26:24 +09:00', queued=0ms, exec=0ms

      Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
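
      Once the root volume is available again, the leftover failed monitor action can be cleared, for example (not part of the original report):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Clears the failed VIP_monitor_10000 entry from the cluster status
      pcs resource cleanup VIP
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~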

      Actual results:
      The DC node with an unavailable root volume remains a cluster member but is not functioning (it is unable to host new resources).

      Expected results:
      The DC node with an unavailable root volume should be fenced or removed from the cluster.
      Perhaps one of the non-DC nodes should check whether the CIB on the DC is still working correctly (a rough sketch of such a check is included below).
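
      As a rough illustration of the kind of check suggested above (purely a sketch, not an existing feature; it probes the DC's controller rather than the CIB daemon directly):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # Hypothetical sketch: from a non-DC node, find the DC and probe its controller;
      # a timeout would suggest the DC is no longer processing cluster requests
      DC=$(crmadmin -D | awk '{print $NF}')
      if ! timeout 30 crmadmin -S "$DC" >/dev/null 2>&1; then
          echo "Controller on DC ${DC} is not responding" >&2
      fi
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~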

      Additional info:

      This is similar to bug 1725236, which has been fixed.
