Description of problem:
When the root volume becomes unavailable on the DC node, the node keeps running as a cluster member but no longer functions as expected.
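(For reference, the current DC can be identified from the cluster status; a quick check, assuming pcs is available on the node:)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs status | grep "Current DC"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~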
Version-Release number of selected component (if applicable):
pacemaker-2.0.3-5.el8_2.1.x86_64
corosync-3.0.3-2.el8.x86_64
pcs-0.10.4-6.el8_2.1.x86_64
How reproducible:
Always (from my testing)
Steps to Reproduce:
Testing on the non-DC node (node2).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs constraint --full
Location Constraints:
Resource: test1
Enabled on:
Node: ha8node1 (score:INFINITY) (id:location-test1-ha8node1-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
[root@ha8node2 ~]#
[root@ha8node1 ~]# watch -n 1 pcs status
Every 1.0s: pcs status ha8node1: Tue Jul 7 16:19:57 2020
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:19:57 2020
- Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@ha8node1 ~]# pcs resource config test1
Resource: test1 (class=ocf provider=pacemaker type=Dummy)
Operations: migrate_from interval=0s timeout=20s (test1-migrate_from-interval-0s)
migrate_to interval=0s timeout=20s (test1-migrate_to-interval-0s)
monitor interval=5 on-fail=fence timeout=5 (test1-monitor-interval-5)
reload interval=0s timeout=20s (test1-reload-interval-0s)
start interval=0s timeout=20s (test1-start-interval-0s)
stop interval=0s timeout=20s (test1-stop-interval-0s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
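For context, the dummy resource and its location constraint shown above could be recreated with commands along these lines (a hypothetical reconstruction of the configuration; the exact commands originally used are not recorded here):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource create test1 ocf:pacemaker:Dummy \
      op monitor interval=5 timeout=5 on-fail=fence
[root@ha8node1 ~]# pcs constraint location test1 prefers ha8node1=INFINITY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~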
Disable root volume on node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup suspend rhel-root
[root@ha8node2 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
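'dmsetup suspend' freezes all I/O to the device, so any process that subsequently needs the root filesystem blocks. The suspended state can be verified with, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup info rhel-root    # the "State" field reports SUSPENDED
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~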
After a few minutes, nothing has happened to the cluster:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:21:30 2020
- Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ] <=============
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
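This is consistent with corosync and the already-running pacemaker daemons operating entirely from memory: membership heartbeats keep flowing even though the root filesystem is frozen, so node2 still shows as online. Its membership can be double-checked from node1 with, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs status nodes
[root@ha8node1 ~]# corosync-cmapctl | grep members
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~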
Re-enable the root volume on node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup resume rhel-root
[root@ha8node2 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Still nothing happened to the cluster:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:22:40 2020
- Last change: Tue Jul 7 16:19:36 2020 by root via cibadmin on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Disable root volume on node2 again
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup suspend rhel-root
[root@ha8node2 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Trying to move resource 'test1' to node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource move test1
Warning: Creating location constraint 'cli-ban-test1-on-ha8node1' with a score of -INFINITY for resource test1 on ha8node1.
This will prevent test1 from running on ha8node1 until the constraint is removed
This will be the case even if ha8node1 is the last node in the cluster
[root@ha8node1 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
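(As the warning notes, 'pcs resource move' leaves a -INFINITY ban constraint behind; after the test it can be removed with, for example:)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource clear test1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~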
'test1' did not move to node2 for a few minutes; it cycled through 'Stopped' and 'FAILED' states, and in the end node2 was fenced and the resource started on it. It takes some time, but it works.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 16:24:49 2020
- Last change: Tue Jul 7 16:23:37 2020 by root via crm_resource on ha8node1
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Stopped <==============
...
Later, the resource test1 goes to 'FAILED' status, then to 'Stopped', and the node gets fenced.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
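The fencing of node2 can be confirmed afterwards from the fence history, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# stonith_admin --history ha8node2 --verbose
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~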
However, when testing the same on the DC node (node1):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:03:51 2020
- Last change: Tue Jul 7 17:01:49 2020 by root via cibadmin on ha8node1
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@ha8node1 ~]# pcs constraint --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# dmsetup suspend rhel-root
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nothing happens for about 10 minutes...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:10:45 2020
- Last change: Tue Jul 7 17:01:49 2020 by root via cibadmin on ha8node1
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
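During this period the pacemaker and corosync logs on the surviving node can be watched for any reaction, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# journalctl -f -u pacemaker -u corosync
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~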
Trying to move the resource 'test1' to node1, which is the DC with the unavailable root volume:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs resource move test1
Warning: Creating location constraint 'cli-ban-test1-on-ha8node2' with a score of -INFINITY for resource test1 on ha8node2.
This will prevent test1 from running on ha8node2 until the constraint is removed
This will be the case even if ha8node2 is the last node in the cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nothing happens; the resource stays on node2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:25:57 2020
- Last change: Tue Jul 7 17:11:19 2020 by root via crm_resource on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node2
..
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
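Note that the CIB itself did accept the change (the 'Last change' timestamp above was updated by crm_resource), and the ban constraint is recorded; it can be listed with:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs constraint --full | grep cli-ban
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~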
The cluster starts working again as if there had been no issue once the root volume is made available again:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# dmsetup resume rhel-root
[root@ha8node1 ~]#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
- Stack: corosync
- Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
- Last updated: Tue Jul 7 17:26:34 2020
- Last change: Tue Jul 7 17:11:19 2020 by root via crm_resource on ha8node2
- 2 nodes configured
- 7 resource instances configured
Node List:
- Online: [ ha8node1 ha8node2 ]
Full List of Resources:
- xvmfence1 (stonith:fence_xvm): Started ha8node1
- xvmfence2 (stonith:fence_xvm): Started ha8node2
- Resource Group: webservice:
- VIP (ocf::heartbeat:IPaddr2): Started ha8node1
- WebSite (ocf::heartbeat:apache): Started ha8node1
- lvm (ocf::heartbeat:LVM-activate): Started ha8node1
- cluster_fs (ocf::heartbeat:Filesystem): Started ha8node1
- test1 (ocf::pacemaker:Dummy): Started ha8node1
Failed Resource Actions:
- VIP_monitor_10000 on ha8node1 'error' (1): call=43, status='Timed Out', exitreason='', last-rc-change='2020-07-07 17:26:24 +09:00', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
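The leftover failed monitor action on VIP can be cleaned up afterwards with, for example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource cleanup VIP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~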
Actual results:
The DC node with the unavailable root volume remains a member of the cluster but is not functioning (it is unable to host new resources).
Expected results:
The DC node with the unavailable root volume should be fenced or removed from the cluster.
Perhaps one of the non-DC nodes should check whether the CIB on the DC is still working properly.
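For example, such a check could be as simple as querying the controller on the DC from another node (an illustrative sketch only; whether these queries hang or time out while the DC's root volume is frozen was not verified here):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# crmadmin --dc_lookup            # ask the cluster who the DC is
[root@ha8node2 ~]# crmadmin --status ha8node1      # ask the controller on the DC for its state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~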
Additional info:
This is similar to bug 1725236, which has been fixed.
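A possible mitigation (not verified against this reproducer) is watchdog-based self-fencing with sbd, so that a node whose storage stops responding is reset by its hardware watchdog rather than lingering as a non-functional member:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs stonith sbd enable                              # uses /dev/watchdog by default
[root@ha8node1 ~]# pcs cluster stop --all && pcs cluster start --all   # sbd takes effect after a full cluster restart
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~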