OpenShift Logging
LOG-1561

fluentd ServiceMonitor in OpenShift Logging cannot be scraped by user workload Prometheus due to invalid TLS config on ROSA


Details


    Description

      On ROSA, the OpenShift Logging (EFK) stack runs as a user workload.
      However, the FluentdNodeDown critical alert is always firing because the required metrics are not collected through the user workload Prometheus.

      // The message from the Prometheus Operator in the "openshift-user-workload-monitoring" project.

      level=warn ts=2021-07-02T02:39:00.701373761Z caller=operator.go:1675 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via tls config which Prometheus specification prohibits" servicemonitor=openshift-logging/fluentd namespace=openshift-user-workload-monitoring prometheus=user-workload
      

      This happens because the fluentd ServiceMonitor TLS config is rejected as invalid by the user workload Prometheus, as shown below.

      // Why the above message is shown: the fluentd ServiceMonitor tlsConfig fails the following check.
      https://github.com/openshift/prometheus-operator/blob/ce7d979635b9d1210db48d54485bc924aed37cdb/pkg/prometheus/operator.go#L1964-L1966

      	if tlsConf.CAFile != "" || tlsConf.CertFile != "" || tlsConf.KeyFile != "" {
      		return errors.New("it accesses file system via tls config which Prometheus specification prohibits")
      	}
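
      For context, here is an illustrative sketch of the kind of file-based tlsConfig that trips this check; the exact fields and paths generated by the cluster-logging-operator may differ:

      spec:
        endpoints:
        - path: /metrics
          port: metrics
          scheme: https
          tlsConfig:
            # File-based fields such as caFile/certFile/keyFile are exactly what
            # the user workload Prometheus Operator refuses to load (see the check above).
            caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
            serverName: fluentd.openshift-logging.svc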
      

      Version-Release number of selected component (if applicable):

      On ROSA (4.7.z), OpenShift Logging 5.0 (EFK)

      How reproducible:

      You can reproduce this issue by installing OpenShift Logging on ROSA.

      Alternatively, you can reproduce this issue on OCP v4.7.z by installing OpenShift Logging without the "openshift.io/cluster-monitoring" label on the "openshift-logging" namespace (see the sketch below).
      You will see the "FluentdNodeDown" critical alert firing within about 10 minutes.
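
      For reference, this is the namespace label that decides which Prometheus stack picks up the ServiceMonitor; a minimal sketch (the rest of the namespace manifest is omitted):

      # With this label the platform Prometheus, which permits file-based TLS
      # config, scrapes ServiceMonitors in the namespace. Without it, the user
      # workload Prometheus takes over and rejects the fluentd ServiceMonitor.
      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-logging
        labels:
          openshift.io/cluster-monitoring: "true"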

      Actual results:

      The "FluentdNodeDown" critical alert fires constantly even though all fluentd pods are up and running, because the required metrics are not being collected due to the invalid TLS config in the fluentd ServiceMonitor.
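
      For context, the alert is essentially a check that fluentd targets are being scraped; an illustrative sketch of what such a PrometheusRule looks like (the exact expression and labels shipped by the cluster-logging-operator may differ, and the rule name here is hypothetical):

      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: fluentd-alerts            # hypothetical name for illustration
        namespace: openshift-logging
      spec:
        groups:
        - name: logging_fluentd.alerts
          rules:
          - alert: FluentdNodeDown
            # Fires when no up{job="fluentd"} samples are scraped, which is
            # exactly what happens when the ServiceMonitor is skipped.
            expr: absent(up{job="fluentd"} == 1)
            for: 10m
            labels:
              severity: critical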

      Expected results:

      The OpenShift Logging (EFK) stack should provide a valid TLS config for the fluentd ServiceMonitor so that its metrics can be collected by the user workload Prometheus. That would also prevent the incorrect "FluentdNodeDown" alert from firing.
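
      One possible fix is to switch the ServiceMonitor to the reference-based TLS fields of the Prometheus Operator API (ca/cert/keySecret pointing at a ConfigMap or Secret) instead of file paths. A minimal sketch, assuming the service CA bundle is published in a ConfigMap; the ConfigMap name and key below are hypothetical:

      spec:
        endpoints:
        - path: /metrics
          port: metrics
          scheme: https
          tlsConfig:
            # Reference-based fields are accepted by the user workload Prometheus,
            # unlike the prohibited caFile/certFile/keyFile file paths.
            ca:
              configMap:
                name: fluentd-metrics-ca      # hypothetical ConfigMap holding the service CA
                key: service-ca.crt
            serverName: fluentd.openshift-logging.svc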

      Additional info:

      I verified that the fluentd ServiceMonitor works once its TLS config passes the check (even if the config simply skips certificate verification), as follows.

      1. For testing, first stop the cluster-logging-operator so that it does not revert the change.
      2. Modify the TLS config of the fluentd ServiceMonitor (the tlsConfig: section) as follows.

      :
      spec:
        endpoints:
        - bearerTokenSecret:
            key: ""
          path: /metrics
          port: metrics
          scheme: https
          tlsConfig:
            insecureSkipVerify: true
            serverName: fluentd.openshift-logging.svc
        jobLabel: monitor-fluentd
        namespaceSelector:
          matchNames:
          - openshift-logging
        selector:
          matchLabels:
            logging-infra: support
      

      3. Check if the fluentd metrics are collected by user workload prometheus.

      $ oc rsh -n openshift-user-workload-monitoring -c prometheus prometheus-user-workload-1 \
        curl 'http://localhost:9090/api/v1/query?query=up%7Bjob%3D"fluentd"%7D+%3D%3D+1' | jq .
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {
                "__name__": "up",
                "container": "fluentd",
                "endpoint": "metrics",
                "instance": "10.128.0.7:24231",
                "job": "fluentd",
                "namespace": "openshift-logging",
                "pod": "fluentd-5rnpl",
                "service": "fluentd"
              },
              "value": [
                1625812558.084,
                "1"
              ]
            },
      :
      

    People

      Assignee: Unassigned
      Reporter: Daein Park (rhn-support-dapark)
