OpenShift Logging / LOG-1561

fluentd ServiceMonitor in OpenShift Logging is rejected by user workload Prometheus due to an invalid TLS config on ROSA



      On ROSA, the OpenShift Logging (EFK) stack runs as a user workload.
      But the FluentdNodeDown critical alert is always firing, because the required metrics are not collected through user workload Prometheus.

      // The message from the Prometheus operator in the "openshift-user-workload-monitoring" project:

      level=warn ts=2021-07-02T02:39:00.701373761Z caller=operator.go:1675 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via tls config which Prometheus specification prohibits" servicemonitor=openshift-logging/fluentd namespace=openshift-user-workload-monitoring prometheus=user-workload
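      // If needed, the same warning can be pulled from the user workload Prometheus operator logs,
      // e.g. (a sketch, assuming the default "prometheus-operator" Deployment name in that namespace):

      $ oc -n openshift-user-workload-monitoring logs deployment/prometheus-operator | grep 'skipping servicemonitor'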
      

      This is because the fluentd ServiceMonitor TLS config is rejected as invalid by user workload Prometheus, as follows.

      // The message above is emitted because the fluentd ServiceMonitor tlsConfig fails the following check:
      https://github.com/openshift/prometheus-operator/blob/ce7d979635b9d1210db48d54485bc924aed37cdb/pkg/prometheus/operator.go#L1964-L1966

      	if tlsConf.CAFile != "" || tlsConf.CertFile != "" || tlsConf.KeyFile != "" {
      		return errors.New("it accesses file system via tls config which Prometheus specification prohibits")
      	}
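
      // For reference, a tlsConfig of the following shape trips that check, because caFile/certFile/keyFile
      // reference the Prometheus pod's file system. The paths below are illustrative only, not copied from
      // the shipped ServiceMonitor.

      tlsConfig:
        caFile: /etc/prometheus/secrets/some-secret/ca.crt      # any file-based field is rejected
        certFile: /etc/prometheus/secrets/some-secret/tls.crt
        keyFile: /etc/prometheus/secrets/some-secret/tls.key
        serverName: fluentd.openshift-logging.svc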
      

      Version-Release number of selected component (if applicable):

      On ROSA (4.7.z), OpenShift Logging 5.0 (EFK)

      How reproducible:

      You can reproduce this issue by installing OpenShift Logging on ROSA.

      Alternatively, you can reproduce it on OCP v4.7.z by installing OpenShift Logging without the "openshift.io/cluster-monitoring" label on the "openshift-logging" namespace (see the label commands sketched below).
      The "FluentdNodeDown" critical alert will start firing within 10 minutes.
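
      // A sketch of the label check/removal for the second scenario (assuming user workload monitoring is
      // enabled, which it is by default on ROSA):

      $ oc get namespace openshift-logging --show-labels
      $ oc label namespace openshift-logging openshift.io/cluster-monitoring-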

      Actual results:

      As always "FluentdNodeDown" critical alert is firing even though the all fluentd pods are up and running without issues due to not collecting required metrics by invalid tls config at the fluentd servicemonitor.

      Expected results:

      The OpenShift Logging (EFK) stack should provide a valid TLS config for the fluentd ServiceMonitor so that the metrics can be collected by user workload Prometheus. That would also suppress the incorrect "FluentdNodeDown" alert.
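
      One possible shape for such a config is sketched below. It assumes the service CA bundle that the
      service CA operator injects into the "openshift-service-ca.crt" ConfigMap in the namespace; the
      cluster-logging-operator may of course choose a different mechanism.

      tlsConfig:
        ca:
          configMap:
            name: openshift-service-ca.crt   # assumption: injected service CA bundle ConfigMap
            key: service-ca.crt
        serverName: fluentd.openshift-logging.svc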

      Additional info:

      I've verified that the fluentd ServiceMonitor works as expected once it has a valid TLS config (even one that simply skips TLS verification), as follows.

      1. For testing, first stop the cluster-logging-operator so that it does not revert the manual change (see the sketch below).
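
      // A sketch of one way to do this, assuming the operator runs as the "cluster-logging-operator"
      // Deployment in "openshift-logging" (scale it back up after testing):

      $ oc -n openshift-logging scale deployment/cluster-logging-operator --replicas=0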
      2. Modify the fluentd ServiceMonitor TLS config (the tlsConfig: section) as follows.

      :
      spec:
        endpoints:
        - bearerTokenSecret:
            key: ""
          path: /metrics
          port: metrics
          scheme: https
          tlsConfig:
            insecureSkipVerify: true
            serverName: fluentd.openshift-logging.svc
        jobLabel: monitor-fluentd
        namespaceSelector:
          matchNames:
          - openshift-logging
        selector:
          matchLabels:
            logging-infra: support
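
      // For example, the change above can be applied with a merge patch (a sketch; note that a JSON merge
      // patch replaces the whole endpoints list, so it assumes the single endpoint shown above):

      $ oc -n openshift-logging patch servicemonitor fluentd --type=merge -p \
        '{"spec":{"endpoints":[{"bearerTokenSecret":{"key":""},"path":"/metrics","port":"metrics","scheme":"https","tlsConfig":{"insecureSkipVerify":true,"serverName":"fluentd.openshift-logging.svc"}}]}}'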
      

      3. Check whether the fluentd metrics are now collected by user workload Prometheus.

      $ oc rsh -n openshift-user-workload-monitoring -c prometheus prometheus-user-workload-1 \
        curl 'http://localhost:9090/api/v1/query?query=up%7Bjob%3D"fluentd"%7D+%3D%3D+1' | jq .
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {
                "__name__": "up",
                "container": "fluentd",
                "endpoint": "metrics",
                "instance": "10.128.0.7:24231",
                "job": "fluentd",
                "namespace": "openshift-logging",
                "pod": "fluentd-5rnpl",
                "service": "fluentd"
              },
              "value": [
                1625812558.084,
                "1"
              ]
            },
      :
      

              Assignee: Unassigned
              Reporter: Daein Park (rhn-support-dapark)