-
Bug
-
Resolution: Done-Errata
-
Major
-
rhel-8.9.0.z, rhel-8.10.z, rhel-9.3.0.z, rhel-9.4.z, rhel-9.5.z
-
sap-cluster-connector-3.0.1-10.el10_0.1
-
No
-
Low
-
rhel-sst-sap
-
Pass
-
None
-
Unspecified
-
Unspecified
-
Unspecified
-
None
What were you trying to do that didn't work?
The customer was operating the cluster as normal, but after software updates `sap-cluster-connector` no longer detected maintenance-mode ( MM ).
What is the impact of this issue to you?
Customer uses the below process for stopping / starting SAP system:
- issue command "pcs property set maintenance-mode=true" on cluster controller
- issue command "sapcontrol -nr 51 -function StopSystem ALL" on node running scs instance
- wait for SAP stop with command "sapcontrol -nr 00 -function WaitforStopped 1500 10" on node running scs instance
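The stop sequence above could be wrapped in a small script along the following lines. This is only a sketch: the names `run`, `stop_sap_in_maintenance`, `DRY_RUN` and `INSTANCE_NR` are hypothetical, a single instance number is assumed for both `sapcontrol` calls, and it assumes all three commands are issued from one node, whereas the customer runs the first on the cluster controller and the others on the node running the SCS instance. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper around the three commands listed above.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: one instance number is used for both sapcontrol calls.
INSTANCE_NR="${INSTANCE_NR:-00}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Put the cluster into maintenance-mode first, then stop SAP and wait.
# Each step runs only if the previous one succeeded.
stop_sap_in_maintenance() {
    run pcs property set maintenance-mode=true &&
    run sapcontrol -nr "$INSTANCE_NR" -function StopSystem ALL &&
    run sapcontrol -nr "$INSTANCE_NR" -function WaitforStopped 1500 10
}

stop_sap_in_maintenance
```

Setting `DRY_RUN=0` would execute the commands for real; that path needs a configured cluster and is not something this sketch has been verified against.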
Normally the resource using "sap_cluster_connector" would detect MM and go hands off during above changes. In this case though, customer describes the following behavior when the resource executes /usr/bin/sap_cluster_connector hook:
- SAPInstance resource is disabled - reports the cluster is not in maintenance (when it is showing maintenance)
- hangs stopping the SAP system (sapcontrol waits at stopping and eventually times out)
Please provide the package NVR for which the bug is seen:
$ grep -e pacemaker-[0-9] -e sap-cluster-connector */installed-rpms
pacemaker-2.1.7-5.2.el9_4.x86_64 Tue Feb 18 10:42:54 2025
sap-cluster-connector-3.0.1-7.el9.1.noarch Tue Feb 18 10:42:57 2025
How reproducible is this bug?:
100%
Steps to reproduce
- Configure an SAP cluster with sap-cluster-connector configured in RHEL 9.2+:
- Place the cluster into maintenance-mode:
$ pcs property set maintenance-mode=true
- Run an action against the application such as "sapcontrol -nr <##> -function Stop 1500 10", or manually run "sap_cluster_connector" ( the lsr option will show MM status ):
[root@clustera-rhel9 ~]# ./sap_cluster_connector --help | grep lsr
lsr --out FILE --sid SID --ino INO
[root@clustera-rhel9 ~]# ./sap-cluster-connector-3.0.1-8.el9_0 lsr --out output.txt --sid JOS --ino 10
The maintenance-mode status is logged to "/var/log/messages", and on RHEL 9.2+ it reports "not in maintenance" in all cases.
Expected results
We detect maintenance-mode ( example is RHEL 8 ):
$ tail /var/log/messages
-----------------------------------------8<-----------------------------------------
Feb 19 12:11:40 beaver-11 sap-cluster-connector-3.0.1-8.el9_0[2874733]: lsr call (sid=JOS,ino=10,out=output.txt)
Feb 19 12:11:40 beaver-11 sap-cluster-connector-3.0.1-8.el9_0[2874733]: resource jos_ers10 is in maintanance mode
Feb 19 12:11:40 beaver-11 sap-cluster-connector-3.0.1-8.el9_0[2874733]: check_maintenance_mode call (retcode=0)
Actual results
We do not detect maintenance-mode ( example is RHEL 9.4 ):
$ tail /var/log/messages
-----------------------------------------8<-----------------------------------------
Feb 18 10:52:33 rhel9-node1 sap_cluster_connector[29552]: Resource JOS_SAP_ASCS00 with InstanceName=JOS_ASCS00_sapjos-scs found
Feb 18 10:52:33 rhel9-node1 sap_cluster_connector[29552]: resource JOS_SAP_ASCS00 is not in maintanance mode
Feb 18 10:52:33 rhel9-node1 sap_cluster_connector[29552]: resource JOS_SAP_ASCS00 is not in maintanance mode
Additional Notes:
The issue is caused by how the "crm_resource" command represents "maintenance-mode" which appears to change at some point in RHEL 9:
RHEL 8 showed "(unmanaged)" to reflect a cluster in MM:
[root@beaver-1 ~]# crm_resource
Full List of Resources:
  * Resource Group: jos_ASCS00_group (unmanaged):
    * jos_vip_ascs00 (ocf::heartbeat:IPaddr2): Started (unmanaged)
    * jos_ascs00 (ocf::heartbeat:SAPInstance): Started (unmanaged)
  * Resource Group: jos_ERS10_group (unmanaged):
    * jos_vip_ers10 (ocf::heartbeat:IPaddr2): Started (unmanaged)
    * jos_ers10 (ocf::heartbeat:SAPInstance): Started (unmanaged)
  * Clone Set: sap-filesystems-clone [sap-filesystems] (unmanaged):
    * Resource Group: sap-filesystems:0 (unmanaged):
      * jos_fs_sapmnt (ocf::heartbeat:Filesystem): Started (unmanaged)
      * jos_fs_sys (ocf::heartbeat:Filesystem): Started (unmanaged)
      * jos_fs_trans (ocf::heartbeat:Filesystem): Started (unmanaged)
    * Resource Group: sap-filesystems:1 (unmanaged):
      * jos_fs_sapmnt (ocf::heartbeat:Filesystem): Started (unmanaged)
      * jos_fs_sys (ocf::heartbeat:Filesystem): Started (unmanaged)
      * jos_fs_trans (ocf::heartbeat:Filesystem): Started (unmanaged)
RHEL 9 shows "(maintenance)" to reflect a cluster in MM:
[root@clustera-rhel9 ~]# crm_resource
Full List of Resources:
  * dummy1 (ocf:heartbeat:Dummy): Started (maintenance)
  * postgres-vip (ocf:heartbeat:IPaddr2): Stopped (maintenance)
  * Clone Set: postgresdb-clone [postgresdb] (promotable, maintenance):
    * Stopped: [ clustera-rhel9 clusterb-rhel9 ]
  * ip_10_0_0_50 (ocf:heartbeat:IPaddr2): Started (maintenance)
The above change in pacemaker behavior is an issue for the `sap-cluster-connector` script, because the "check_maintanance_mode" function in this script currently looks only for the "unmanaged" keyword ( in all versions ):
# Check of a few "sap-cluster-connector" versions I had available:
$ grep -n cmd_crm_resource.*unmanaged sap-cluster-connector*
sap-cluster-connector-3.0.1-5.el8:305: my $retcode = system("$cmd_crm_resource | grep -E \"\[\[:space:\]\]$res\[\[:space:\]\]\" | grep -q unmanaged");
sap-cluster-connector-3.0.1-7.el9.1:305: my $retcode = system("$cmd_crm_resource | grep -E \"\[\[:space:\]\]$res\[\[:space:\]\]\" | grep -q unmanaged");
sap-cluster-connector-3.0.1-8.el8_6:305: my $retcode = system("$cmd_crm_resource | grep -E \"\[\[:space:\]\]$res\[\[:space:\]\]\" | grep -q unmanaged");
sap-cluster-connector-3.0.1-8.el9_0:305: my $retcode = system("$cmd_crm_resource | grep -E \"\[\[:space:\]\]$res\[\[:space:\]\]\" | grep -q unmanaged");
# Full function call:
$ cat sap-cluster-connector-3.0.1-8.el9_0
299 sub check_maintanance_mode {
300     my ($res) = @_;
301     my $retcode;
302     $nowstring = localtime;
303     printf "%s : check_maintenance_mode($res)\n", $nowstring;
304
305     my $retcode = system("$cmd_crm_resource | grep -E \"\[\[:space:\]\]$res\[\[:space:\]\]\" | grep -q unmanaged"); #<----
306     $retcode=$retcode >> 8;
307     if ( $retcode eq "0" ) {
308         syslog("LOG_INFO", "resource %s is in maintanance mode\n", $res);
309     } else {
310         syslog("LOG_INFO", "resource %s is not in maintanance mode\n", $res);
311     }
312     return $retcode;
313 }
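To illustrate why line 305 reports "not in maintanance mode" on RHEL 9, the same two greps can be run against canned status lines instead of a live cluster. This is a shell approximation of the Perl `system()` call, not the script itself; the helper names `legacy_check` and `report` are hypothetical, and the two sample lines are copied from the `crm_resource` listings in this report.

```shell
#!/bin/sh
# Status lines copied from the crm_resource listings in this report.
rhel8_line='    * jos_ers10 (ocf::heartbeat:SAPInstance): Started (unmanaged)'
rhel9_line='  * dummy1 (ocf:heartbeat:Dummy): Started (maintenance)'

# Hypothetical helper: the same two greps as line 305, fed from a string
# ($1) instead of live crm_resource output. $2 is the resource name.
# Exit status 0 corresponds to the script logging "is in maintanance mode".
legacy_check() {
    printf '%s\n' "$1" | grep -E "[[:space:]]$2[[:space:]]" | grep -q unmanaged
}

# Print the verdict the legacy check would reach for one resource.
report() {
    if legacy_check "$1" "$2"; then
        echo "$2: in maintenance mode"
    else
        echo "$2: not in maintenance mode"
    fi
}

report "$rhel8_line" jos_ers10
report "$rhel9_line" dummy1
```

With these samples the RHEL 8 style line is detected, while the RHEL 9 style line is reported as not in maintenance mode, because it carries `(maintenance)` rather than `(unmanaged)`.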
So the "check_maintanance_mode" function within `sap-cluster-connector` likely needs to be extended to match the change in `pacemaker` status output, which now reports `(maintenance)` instead of `(unmanaged)`. The customer in case 04064873 has noted that they made the below change to their own `sap-cluster-connector` script and that this avoids the issue ( the change greps for both "unmanaged" and "maintenance" ):
L305: my $retcode = system("$cmd_crm_resource | grep -E \"\[\[:space:\]\]$res\[\[:space:\]\]\" | grep -E -q \"maintenance|unmanaged\"");
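The effect of the customer's pattern can be checked offline against sample status lines. This is a minimal sketch and a shell approximation of the patched Perl line: the helper name `patched_check` is hypothetical, the first three sample lines are copied from the `crm_resource` listings in this report, and the last one (a resource with no marker) is an assumed example of a managed resource.

```shell
#!/bin/sh
# Lines copied from the crm_resource listings in this report.
rhel8_line='    * jos_ers10 (ocf::heartbeat:SAPInstance): Started (unmanaged)'
rhel9_line='  * dummy1 (ocf:heartbeat:Dummy): Started (maintenance)'
rhel9_clone='  * Clone Set: postgresdb-clone [postgresdb] (promotable, maintenance):'
# Assumption: a managed resource line simply lacks either marker.
managed_line='  * dummy1 (ocf:heartbeat:Dummy): Started'

# Hypothetical helper mirroring the customer's patched line 305:
# select the line for resource $2 from text $1, then accept either keyword.
# Exit status 0 means "in maintenance mode".
patched_check() {
    printf '%s\n' "$1" | grep -E "[[:space:]]$2[[:space:]]" |
        grep -E -q "maintenance|unmanaged"
}

patched_check "$rhel8_line" jos_ers10 && echo "RHEL 8 line: detected"
patched_check "$rhel9_line" dummy1 && echo "RHEL 9 line: detected"
patched_check "$rhel9_clone" postgresdb-clone && echo "RHEL 9 clone line: detected"
patched_check "$managed_line" dummy1 || echo "managed line: not detected"
```

Like the original check, this is a plain substring match on the selected line, so a line that contains either word elsewhere (for example in a resource name) would also be treated as being in maintenance mode.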
I believe below is the upstream pacemaker commit which changed the report from `(unmanaged)` to `(maintenance)` in status checks.
I am having trouble pinpointing exactly where this change made it into the RH package, but I would guess the rebase indicated below:
2024-05-22 Chris Lumens <clumens@redhat.com> - 2.1.8-1
- Rebase on upstream 2.1.8-rc1 release
- Fix a typo in the help output of pacemaker-fenced
- Fix escaping characters in XML attribute output
- Resolves: RHEL-25819
- Resolves: RHEL-30822

2024-03-21 Chris Lumens <clumens@redhat.com> - 2.1.7-5 <--- customer hit issue here
- Fix upgrading to this package on multilib systems
- Resolves: RHEL-28999

2024-01-31 Chris Lumens <clumens@redhat.com> - 2.1.7-4
- Properly validate attribute set type in pacemaker-attrd
- Fix `crm_attribute -t nodes --node localhost`
- Resolves: RHEL-13216
- Resolves: RHEL-17225
- Resolves: RHEL-23498

2024-01-16 Chris Lumens <clumens@redhat.com> - 2.1.7-3 <--- I am guessing MM report change was introduced here
- Rebase on upstream 2.1.7 final release
- Fix documentation for Pacemaker Remote schema transfers
- Do not check CIB feature set version when CIB_file is set
- Consolidate attrd cache handling
- Avoid duplicating option metadata across daemons
- Related: RHEL-7665
- Related: RHEL-13216
- Resolves: RHEL-7702
- links to
-
RHBA-2025:150095 sap-cluster-connector update