Task
Resolution: Done
Normal
OBSDOCS (May 6 - May 28) #254, OBSDOCS (May 27 - Jun 17) #255
Our documentation suggests creating an alert after configuring scrape sample limits.
That PrometheusRule object has two alerts configured within it [1]:
- `ApproachingEnforcedSamplesLimit`
- `TargetDown`
The `TargetDown` alert is designed to fire after `ApproachingEnforcedSamplesLimit`, because the target is dropped once the enforced sample limit is reached.
The `TargetDown` alert is creating false positives: it fires for reasons other than pods in the namespace reaching their enforced sample limit (e.g. the metrics endpoint may be down).
User-defined monitoring should provide out-of-the-box metrics that help with troubleshooting:
- Update the user-workload Prometheus to enable additional scrape metrics [2].
- Rewrite the `ApproachingEnforcedSamplesLimit` alert expression in the OCP documentation as `(scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)) > 0.9`, which reads as "alert when the number of ingested samples reaches 90% of the configured limit".
- Document how a user would know that a target has hit the limit (e.g. the Targets page should show this information).
[2] - https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
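For illustration, the revised expression could be wired into a PrometheusRule along these lines (a minimal sketch; the rule name, namespace, `for` duration, and severity are assumptions, and the `scrape_sample_limit` metric is only exposed when Prometheus runs with the `extra-scrape-metrics` feature flag [2]):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-limit-alerts   # illustrative name
  namespace: ns1              # illustrative user namespace
spec:
  groups:
  - name: sample-limit
    rules:
    - alert: ApproachingEnforcedSamplesLimit
      # Fires when ingested samples reach 90% of the enforced limit.
      # scrape_sample_limit requires --enable-feature=extra-scrape-metrics,
      # hence the ask above to enable it for user-workload Prometheus.
      expr: (scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        message: >-
          Target {{ $labels.instance }} is approaching its enforced sample limit.
```

Dividing by `(scrape_sample_limit > 0)` makes the expression return no result for targets with no limit configured, so the alert stays silent for them instead of producing false positives.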
documents: MON-3256 Improve scrape sample alerts (Closed)