
Type: Task
Resolution: Done
Priority: Undefined
Fix Version/s: ACM 2.12.0, ACM 2.13.0, ACM 2.14.0, ACM 2.15.0
Affects Version/s: None
Component/s: Documentation, Observability
Labels:

Activity Type:
Product / Portfolio Work
Story Points:
6
Blocked:
False
Blocked Reason:

None
Ready:
False
Intelligence Requested:
Market:

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Note: The doc team updates the current version of the documentation and the
two previous versions (n-2), but we address *only high-priority or
customer-reported issues* for n-2 releases in support.
Describe the changes in the doc and link to your dev story:

1. - [x] Mandatory: Add the required version to the Fix version/s field.

2.14

2. - [x] Mandatory: Choose the type of documentation change or review.

[ ] We need to update an existing topic

[x] We need to add a new document to an existing section

This should be added to the Customizing Observability Configuration section, with the topic title "Scaling up metrics collection".

[ ] We need a whole new section; this is a function not
documented before and doesn't belong in any current section

[ ] We need an Operator Advisory review and approval

[ ] We need a z-Stream (Errata) Advisory and Release note for
MCE and/or ACM

3. - [x] Mandatory: Find the link to where the documentation update
should go and add it to the recommended changes. You can either use the
published doc or the staged repo for this step:

https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12/html-single/observability/index#customizing-observability

Note: As the feature and doc are better understood, this recommendation may
change. If this is new documentation, link to the section where you think
it should be placed.

Customer Portal published version

https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12

Doc staged repo within the ACM Workspace:
https://github.com/stolostron/rhacm-docs

4. - [x] Mandatory for GA content:

[x] Add steps, the diff, known issue, and/or other important
conceptual information in the following space:

We added a new sharding feature for metrics-collector in spoke clusters.

Previously, and still by default, metrics-collector starts with a single internal goroutine, which makes a /federate request to the in-cluster Prometheus with the metrics allowlist passed as URL parameters. The response contains the latest sample value of each matching time series, along with its labels, as stored in the Prometheus TSDB.

It then converts that response to a Prometheus Remote Write request and sends it to the hub cluster's Observatorium API endpoint, after which it is ingested into the hub cluster's Thanos instance.
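The single-request flow above can be illustrated with a minimal sketch of how an allowlist becomes the query string of one /federate call. This is not the metrics-collector code (which is written in Go); the function name `build_federate_url`, the base URL, and the example matchers are hypothetical. The only detail taken from Prometheus itself is that the federation endpoint accepts repeated `match[]` parameters.

```python
from urllib.parse import urlencode


def build_federate_url(prometheus_base_url, matchers):
    """Build one /federate URL carrying every allowlist matcher.

    Hypothetical helper for illustration only. Each matcher becomes a
    repeated `match[]` query parameter, URL-encoded by urlencode.
    """
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{prometheus_base_url}/federate?{query}"


# Example allowlist (made-up matchers, not the shipped ACM allowlist).
allowlist = ['{__name__="up"}', '{__name__="kube_pod_info"}']
url = build_federate_url("http://prometheus.example:9090", allowlist)
```

With one goroutine, the whole allowlist travels in a single URL, so the request and the response both grow as the allowlist grows.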

From 2.12, we added a new `workers` field to the MCO CRD (set to 1 by default). With this value set, metrics-collector starts that number of goroutines internally and shards the /federate calls: each goroutine takes only part of the allowlist as its URL parameters and sends the result as an individual remote write request. Requests are therefore much smaller and are sent individually, which lets users scale up metrics collection, and grow their allowlist, as far as they need. It also makes collection and sending more reliable.
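The sharding idea can be sketched as follows. This is a minimal illustration, assuming a round-robin split of the allowlist; the ticket does not say how metrics-collector actually partitions the matchers, and the name `shard_matchers` is hypothetical.

```python
def shard_matchers(matchers, workers):
    """Split allowlist matchers into at most `workers` shards.

    Assumption: round-robin assignment. Each returned shard would back
    one /federate call and one remote write request. Empty shards are
    dropped, so a short allowlist yields fewer shards than workers.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    shards = [[] for _ in range(workers)]
    for index, matcher in enumerate(matchers):
        shards[index % workers].append(matcher)
    return [shard for shard in shards if shard]


# Five made-up matchers spread over two workers.
shards = shard_matchers(["m1", "m2", "m3", "m4", "m5"], 2)
```

With `workers` set to 1, the result is a single shard holding the whole allowlist, which matches the default single-goroutine behavior.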

Usually, you want the same workers value across all your spoke clusters. In that case, setting it on the MCO CRD ensures that the value is propagated to every spoke cluster (through the spoke's ObservabilityAddon CRD).

In certain cases, users might want to override the workers value for specific spoke clusters only. To do this, they can set the annotation observability.open-cluster-management.io/addon-source: "override" on the spoke's ObservabilityAddon, and set a different workers value there.

To revert to setting it through the MCO, they can change the annotation back to observability.open-cluster-management.io/addon-source: "mco".
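The precedence between the two settings can be summarized in a small sketch. It only models the rule described above and is not the operator's reconciliation code; the function name `effective_workers` is hypothetical, and treating a missing annotation the same as "mco" is an assumption.

```python
ADDON_SOURCE_ANNOTATION = "observability.open-cluster-management.io/addon-source"


def effective_workers(mco_workers, addon_annotations, addon_workers):
    """Return the workers value a spoke's metrics-collector would use.

    Hypothetical model of the rule in this ticket:
    - annotation "override": the spoke's own ObservabilityAddon value wins
    - annotation "mco" (or, by assumption, no annotation): the MCO value
      is propagated to the spoke
    """
    source = addon_annotations.get(ADDON_SOURCE_ANNOTATION, "mco")
    if source == "override" and addon_workers is not None:
        return addon_workers
    return mco_workers


# MCO sets 2 workers fleet-wide; one spoke overrides to 4.
overridden = effective_workers(2, {ADDON_SOURCE_ANNOTATION: "override"}, 4)
reverted = effective_workers(2, {ADDON_SOURCE_ANNOTATION: "mco"}, 4)
```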

[ ] Add *Required access level* (example, *Cluster
Administrator*) for the user to complete the task:

[x] Add verification at the end of the task: how does the user
verify success (a command to run or a result to see)?

Users can verify this by checking the arguments on the metrics-collector pods.

[x] Add link to dev story here:

https://github.com/stolostron/multicluster-observability-operator/pull/1718

https://issues.redhat.com/browse/ACM-14316

5. - [ ] Mandatory for bugs: What is the diff? Clearly define what the
problem is, what the change is, and link to the current documentation. Only
use this for a documentation bug.
