Description of problem:
When the root volume becomes unavailable on the DC node, the node keeps running as a cluster member but no longer functions as expected.
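(For reference, the current DC can be identified from the cluster status; a quick check, assuming pcs is available on the node:)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs status | grep "Current DC"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~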
Version-Release number of selected component (if applicable):
pacemaker-2.0.3-5.el8_2.1.x86_64
corosync-3.0.3-2.el8.x86_64
pcs-0.10.4-6.el8_2.1.x86_64
How reproducible:
Always (from my testing)
Steps to Reproduce:
Testing on the non-DC node (node2).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs constraint --full
Location Constraints:
Resource: test1
Enabled on:
Node: ha8node1 (score:INFINITY) (id:location-test1-ha8node1-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
[root@ha8node2 ~]#
[root@ha8node1 ~]# watch -n 1 pcs status
Every 1.0s: pcs status ha8node1: Tue Jul 7 16:19:57 2020
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:19:57 2020
- Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@ha8node1 ~]# pcs resource config test1
Resource: test1 (class=ocf provider=pacemaker type=Dummy)
Operations: migrate_from interval=0s timeout=20s (test1-migrate_from-interval-0s)
migrate_to interval=0s timeout=20s (test1-migrate_to-interval-0s)
monitor interval=5 on-fail=fence timeout=5 (test1-monitor-interval-5)
reload interval=0s timeout=20s (test1-reload-interval-0s)
start interval=0s timeout=20s (test1-start-interval-0s)
stop interval=0s timeout=20s (test1-stop-interval-0s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
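For context, the dummy resource and its location constraint shown above could be recreated with commands along these lines (a hypothetical reconstruction of the configuration; the exact commands originally used are not recorded here):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource create test1 ocf:pacemaker:Dummy \
      op monitor interval=5 timeout=5 on-fail=fence
[root@ha8node1 ~]# pcs constraint location test1 prefers ha8node1=INFINITY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~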
Disable root volume on node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup suspend rhel-root
[root@ha8node2 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
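'dmsetup suspend' freezes all I/O to the device, so any process that subsequently needs the root filesystem blocks. The suspended state can be verified with, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup info rhel-root    # the "State" field reports SUSPENDED
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~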
After a few minutes, nothing has happened to the cluster:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:21:30 2020
- Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ] <=============
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
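This is consistent with corosync and the already-running pacemaker daemons operating entirely from memory: membership heartbeats keep flowing even though the root filesystem is frozen, so node2 still shows as online. Its membership can be double-checked from node1 with, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs status nodes
[root@ha8node1 ~]# corosync-cmapctl | grep members
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~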
Re-enable the root volume on node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup resume rhel-root
[root@ha8node2 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Still nothing happened to the cluster:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:22:40 2020
- Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Disable root volume on node2 again
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup suspend rhel-root
[root@ha8node2 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Trying to move resource 'test1' to node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource move test1
Warning: Creating location constraint 'cli-ban-test1-on-ha8node1' with a score of -INFINITY for resource test1 on ha8node1.
This will prevent test1 from running on ha8node1 until the constraint is removed
This will be the case even if ha8node1 is the last node in the cluster
[root@ha8node1 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
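(As the warning notes, 'pcs resource move' leaves a -INFINITY ban constraint behind; after the test it can be removed with, for example:)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource clear test1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~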
'test1' did not move to node2 for a few minutes; it cycled through 'Stopped' and 'FAILED' states, and in the end node2 was fenced and the resource started on it. It takes some time, but it works.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:24:49 2020
- Last change: Tue Jul 7 16:23:37 2020 by root via crm_resource on ha8node1
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Stopped <==============
...
Later, the resource test1 goes to 'FAILED' status, then to 'Stopped', and the node gets fenced.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
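The fencing of node2 can be confirmed afterwards from the fence history, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# stonith_admin --history ha8node2 --verbose
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~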
However, when testing the same on the DC node (node1):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:03:51 2020
- Last change: Tue Jul 7 17:01:49 2020 by root via cibadmin on ha8node1
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@ha8node1 ~]# pcs constraint --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# dmsetup suspend rhel-root
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nothing happens for about 10 minutes...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:10:45 2020
- Last change: Tue Jul 7 17:01:49 2020 by root via cibadmin on ha8node1
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
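During this period the pacemaker and corosync logs on the surviving node can be watched for any reaction, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# journalctl -f -u pacemaker -u corosync
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~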
Trying to move the resource 'test1' to node1, which is the DC with the unavailable root volume:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs resource move test1
Warning: Creating location constraint 'cli-ban-test1-on-ha8node2' with a score of -INFINITY for resource test1 on ha8node2.
This will prevent test1 from running on ha8node2 until the constraint is removed
This will be the case even if ha8node2 is the last node in the cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nothing happens; the resource stays on node2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:25:57 2020
- Last change: Tue Jul 7 17:11:19 2020 by root via crm_resource on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node2
..
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
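Note that the CIB itself did accept the change (the 'Last change' timestamp above was updated by crm_resource), and the ban constraint is recorded; it can be listed with:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs constraint --full | grep cli-ban
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~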
The cluster starts working again as if there had been no issue once the root volume is made available again:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# dmsetup resume rhel-root
[root@ha8node1 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:26:34 2020
- Last change: Tue Jul 7 17:11:19 2020 by root via crm_resource on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
Failed Resource Actions:
- VIP_monitor_10000 on ha8node1 'error' (1): call=43, status='Timed Out', exitreason='', last-rc-change='2020-07-07 17:26:24 +09:00', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
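The leftover failed monitor action on VIP can be cleaned up afterwards with, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource cleanup VIP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~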
Actual results:
The DC node with the unavailable root volume remains a member of the cluster but is not functioning (it is unable to host new resources).
Expected results:
The DC node with the unavailable root volume should be fenced or removed from the cluster.
Perhaps one of the non-DC nodes should check whether the CIB on the DC is still working properly.
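For example, such a check could be as simple as querying the controller on the DC from another node (an illustrative sketch only; whether these queries hang or time out while the DC's root volume is frozen was not verified here):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# crmadmin --dc_lookup            # ask the cluster who the DC is
[root@ha8node2 ~]# crmadmin --status ha8node1      # ask the controller on the DC for its state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~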
Additional info:
This is similar to bug 1725236, which has been fixed.
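A possible mitigation (not verified against this reproducer) is watchdog-based self-fencing with sbd, so that a node whose storage stops responding is reset by its hardware watchdog rather than lingering as a non-functional member:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs stonith sbd enable                              # uses /dev/watchdog by default
[root@ha8node1 ~]# pcs cluster stop --all && pcs cluster start --all   # sbd takes effect after a full cluster restart
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~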