Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14.z
Component/s: Monitoring
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
None
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

KubePersistentVolumeFillingUp alerts (which fire when less than 3% of the PV size is left) are firing on several clusters that have a configured PV size of 100 GB and a maximum retention size of 90 GB.

Our CMO config: https://github.com/openshift/managed-cluster-config/blob/master/resources/cluster-monitoring-config/config.yaml#L29
  retention: 11d
  retentionSize: 90GB
  volumeClaimTemplate:
    metadata:
      name: prometheus-data
    spec:
      resources:
        requests:
          storage: 100Gi
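
For reference, a rough sketch of the headroom these settings leave before the 3% threshold is reached. It assumes the alert compares free space against the full 100Gi request, and it evaluates retentionSize both as decimal gigabytes and as powers of two, since the Prometheus documentation describes its size units as powers of 2. The function name `alert_headroom_bytes` is hypothetical and only used for illustration.

```python
GIB = 2**30   # binary gibibyte, the unit of the 100Gi storage request
GB = 10**9    # decimal gigabyte


def alert_headroom_bytes(pv_bytes: int, retention_bytes: int,
                         free_threshold: float = 0.03) -> float:
    """Bytes by which usage must exceed the retention size before
    free space drops below free_threshold of the PV size."""
    usage_at_alert = pv_bytes * (1 - free_threshold)
    return usage_at_alert - retention_bytes


pv = 100 * GIB
# Two readings of "90GB": decimal (assumption) and power-of-2 (assumption).
for label, retention in (("90 * 10^9 bytes", 90 * GB),
                         ("90 * 2^30 bytes", 90 * GIB)):
    headroom = alert_headroom_bytes(pv, retention)
    print(f"retentionSize as {label}: headroom {headroom / GIB:.2f} GiB")
```

Under either reading, used space would have to exceed the configured retention size by several GiB before the alert could fire at the 3% threshold.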

We also saw major differences between the sizes of the two PVs (screenshot attached), possibly pointing at https://access.redhat.com/solutions/7024829. We are not sure that this desync is the cause, though, as the difference is up to 80 GB.

Version-Release number of selected component (if applicable):

    4.14.z

How reproducible:

    Repeated pattern: alerts flap for 4 days, then no alerts fire for 4 days

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

    KubePersistentVolumeFillingUp firing

Expected results:

    KubePersistentVolumeFillingUp should not fire, because with the above CMO settings the data used on the PV should never come close to the PV limit (https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects)
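
To make the expectation concrete, here is a simplified sketch of the threshold condition described in this report (less than 3% of the volume left). It assumes the inputs correspond to the kubelet metrics `kubelet_volume_stats_available_bytes` and `kubelet_volume_stats_capacity_bytes`; the real alert rule may carry additional conditions, and the function name and sample values below are hypothetical.

```python
def below_free_threshold(available_bytes: float, capacity_bytes: float,
                         free_threshold: float = 0.03) -> bool:
    """Return True when the free fraction of the volume is under the threshold."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return available_bytes / capacity_bytes < free_threshold


GIB = 2**30
capacity = 100 * GIB

# Hypothetical sample: usage capped at 90 GiB leaves 10% free, above the threshold.
print(below_free_threshold(capacity - 90 * GIB, capacity))

# Hypothetical sample: 98 GiB used leaves 2% free, below the threshold.
print(below_free_threshold(capacity - 98 * GIB, capacity))
```

With usage held at the configured retention size, the condition stays false; it only becomes true once usage grows well past that size.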

Additional info:

Attaching:
- partial must gather (no logs)
- container logs for openshift-monitoring namespace
- screenshot of PV size difference

is cloned by

OBSDOCS-765 Add clarification about exceeded maximum retention size to docs

Closed

Assignee: Simon Pasquier

Reporter: Claudio Busse

Need Info From: None

Contributors: None

QA Contact: Junqi Zhao

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2024/01/23 12:21 PM

Updated: 2025/07/23 11:58 PM

Resolved: 2024/01/23 3:15 PM