DFBUGS-368

[2313424] [GSS] MDSCacheUsageHigh alert firing


    • Bug
    • Resolution: Unresolved
    • Critical
    • odf-4.18
    • odf-4.15
    • ceph-monitoring

      Description of problem (please be as detailed as possible and provide log snippets):

      The customer is experiencing the MDSCacheUsageHigh alert firing. They have applied the fix for this [1], but the alert is still firing. Furthermore, the memory consumption of the MDS pods is currently only at ~25%.

      [root@bastionocpcrystal ~]# oc rsh -n openshift-storage $(oc get pods -n openshift-storage -o name -l app=rook-ceph-operator)
      sh-5.1$ export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
      sh-5.1$ ceph config dump | grep mds_cache_memory_limit
      mds.ocs-storagecluster-cephfilesystem-a basic mds_cache_memory_limit 8589934592
      mds.ocs-storagecluster-cephfilesystem-b basic mds_cache_memory_limit 8589934592
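
      To cross-check the alert against what the MDS daemons themselves report, the configured cache limit can be compared with the live cache usage. The commands below are a sketch run from the same shell (CEPH_ARGS exported as above, daemon names as in this cluster), not output captured from the case:

      sh-5.1$ ceph config get mds.ocs-storagecluster-cephfilesystem-a mds_cache_memory_limit
      sh-5.1$ ceph tell mds.ocs-storagecluster-cephfilesystem-a cache status
      sh-5.1$ ceph config get mds.ocs-storagecluster-cephfilesystem-b mds_cache_memory_limit
      sh-5.1$ ceph tell mds.ocs-storagecluster-cephfilesystem-b cache status

      If "cache status" reports usage well below the 8GiB limit while the alert is firing, that points at the alerting expression rather than at actual MDS memory pressure.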

      For node msplatform-x9ggd-storage-tnhvb:

      Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
      --------- ---- ------------ ---------- --------------- ------------- ---
      openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-75b6d77bggzfx 2 (12%) 2 (12%) 16Gi (25%) 16Gi (25%) 51m

      For node msplatform-x9ggd-storage-vh2hj:

      Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
      --------- ---- ------------ ---------- --------------- ------------- ---
      openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7d8c86bdzcrk2 2 (12%) 2 (12%) 16Gi (25%) 16Gi (25%) 52m
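
      The allocation tables above match the format of "oc describe node" output. To compare the 16Gi request against what the MDS pods are actually consuming, something along these lines can be run from the bastion (the label selector and flags are assumptions, not commands captured from the case):

      [root@bastionocpcrystal ~]# oc describe node msplatform-x9ggd-storage-tnhvb | grep rook-ceph-mds
      [root@bastionocpcrystal ~]# oc describe node msplatform-x9ggd-storage-vh2hj | grep rook-ceph-mds
      [root@bastionocpcrystal ~]# oc adm top pod -n openshift-storage -l app=rook-ceph-mds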

      [1] https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.15/html-single/troubleshooting_openshift_data_foundation/index?extIdCarryOver=true&sc_cid=7013a000003SyEYAA0#ceph_mds_cache_usage_high_rhodf

      Version of all relevant components (if applicable):

      ODF 4.15.6

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      No, but it is annoying given that it appears to be a faulty alert that is firing.

      Is there any workaround available to the best of your knowledge?

      Not to my knowledge

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      3

      Is this issue reproducible?

      Unknown

      Can this issue be reproduced from the UI?

      No

      If this is a regression, please provide more details to justify this:

      Unknown

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:
      MDSCacheUsageHigh alert firing

      Expected results:
      MDSCacheUsageHigh does not fire

      Additional info:

      • It should also be noted that the "mds_cache_memory_limit" value for both MDS daemons did not increase to half of the MDS pods' memory limit (8Gi of the 16Gi request), as it should have. I had to set "mds_cache_memory_limit" to "8589934592" manually using the rook-ceph-tools pod (a sketch of the likely commands follows this section). This still did not resolve the misfiring alert.
      • Ceph is HEALTH_OK:

      cluster:
      id: 05c475dc-e78e-4f2b-94c1-d97e7c6859fa
      health: HEALTH_OK
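
      As mentioned in the first bullet above, the limit was raised manually from the rook-ceph-tools pod; the commands were presumably of this form (a sketch, not a transcript from the case):

      [root@bastionocpcrystal ~]# oc rsh -n openshift-storage $(oc get pods -n openshift-storage -o name -l app=rook-ceph-tools)
      sh-5.1$ ceph config set mds.ocs-storagecluster-cephfilesystem-a mds_cache_memory_limit 8589934592
      sh-5.1$ ceph config set mds.ocs-storagecluster-cephfilesystem-b mds_cache_memory_limit 8589934592
      sh-5.1$ ceph config dump | grep mds_cache_memory_limit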

              Divyansh Kamboj (dkamboj@redhat.com)
              Brandon McMurray (rhn-support-bmcmurra)
              Brandon McMurray, Raimund Sacherer, Santosh Pillai
              Harish Nallur Vittal Rao