OpenShift Logging / LOG-1972

Getting message, "Prometheus could not scrape fluentd for more than 10m."

Details

    • False
    • False
    • NEW
    • VERIFIED
    • Before this change, the cluster-logging-operator used cluster-scoped roles and bindings to grant the prometheus service account permission to scrape metrics. These permissions were created only when the Operator was deployed through the console interface and were missing when it was deployed from the command line. This change fixes the issue by making the role and binding namespace-scoped.
    • Logging (Core) - Sprint 211, Logging (Core) - Sprint 216, Logging (Core) - Sprint 217

    Description

      This was originally opened as a bug against Monitoring: https://bugzilla.redhat.com/show_bug.cgi?id=2021342

      The Monitoring team moved it to the Logging component, but because the issue affects Logging 5.2, I am moving it to JIRA. The initial problem description is reported below, followed by copies of comments from the Monitoring team.

      ----------

      OpenShift 4.7.34

      OpenShift Logging: cluster-logging.5.2.2-21
       
      Description of problem:
      Getting message, "Prometheus could not scrape fluentd for more than 10m."

      How reproducible:
      Unconfirmed

      Additional info:
      The customer set the label openshift.io/cluster-monitoring: "true", but the error is still not clearing.

      The Prometheus pods log these errors repeatedly:

      2021-10-31T03:05:06.385693354Z level=error ts=2021-10-31T03:05:06.385Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
      2021-10-31T03:05:08.607296440Z level=error ts=2021-10-31T03:05:08.607Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""
      2021-10-31T03:05:31.197590776Z level=error ts=2021-10-31T03:05:31.197Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:426: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""

      We found a similar bug from an older version:
      https://bugzilla.redhat.com/show_bug.cgi?id=1774907
      Using diagnostic steps from that bug:

      1. token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
      2. oc auth can-i list endpoints -n openshift-logging --token $token
      3. oc auth can-i list endpoints -n openshift-logging --token $token
      4. oc auth can-i list endpoints -n openshift-logging --token $token
      5. oc auth can-i list endpoints -n openshift-logging --token $token
      6. oc auth can-i list endpoints -n openshift-logging --token $token
      7. oc auth can-i list endpoints -n openshift-logging --token $token

      These all return "no". I suspect something has failed to set the proper rolebindings for prometheus-k8s. Are there roles that should be added? Can they be added manually?
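
The diagnostic steps above could be extended to cover every resource named in the Prometheus errors (pods, services, endpoints). The following is a minimal sketch, not part of the original report. It assumes that service discovery needs the get, list and watch verbs, and it uses `--as` impersonation instead of a token, which requires the caller to hold impersonation rights (for example cluster-admin). The function name and the `OC` override variable are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: check the permissions Prometheus service discovery appears to need.
# OC defaults to the real client; set OC=echo to preview the commands instead.
OC="${OC:-oc}"
SA="system:serviceaccount:openshift-monitoring:prometheus-k8s"
NS="openshift-logging"

# check_scrape_permissions is a hypothetical helper name.
check_scrape_permissions() {
  local verb resource
  for resource in pods services endpoints; do
    for verb in get list watch; do   # assumed verb set, see the note above
      printf '%s %s: ' "$verb" "$resource"
      # "oc auth can-i" prints yes/no and exits non-zero on "no",
      # so "|| true" keeps the loop going after a denial.
      "$OC" auth can-i "$verb" "$resource" -n "$NS" --as "$SA" || true
    done
  done
}

# Usage: check_scrape_permissions
```

On a cluster affected by this issue, each of the nine lines would be expected to end in "no", matching the result reported above.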

      ----------

      Arunprasad Rajkumar 2021-11-09 06:35:35 UTC

      Other cluster operators (e.g. cluster-etcd-operator) define explicit roles [1] and role bindings [2] for the `prometheus-k8s` service account. You may need to do the same.

      But I wonder why this was not done by the cluster-logging-operator!

      [1] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_01_prometheusrole.yaml
      [2] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_02_prometheusrolebinding.yaml
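
A namespace-scoped role and binding of the kind this comment suggests could be created by hand roughly as follows. This is a sketch of a possible manual workaround, not the operator's actual manifests. The role name `prometheus-k8s-logging`, the function name, and the get/list/watch verb set are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: grant prometheus-k8s read access to scrape targets in one namespace.
# OC defaults to the real client; set OC=echo to preview the commands instead.
OC="${OC:-oc}"
NS="openshift-logging"
ROLE="prometheus-k8s-logging"   # hypothetical name, not taken from the operator

# grant_scrape_permissions is a hypothetical helper name.
grant_scrape_permissions() {
  # Namespace-scoped Role covering the resources named in the Prometheus errors.
  "$OC" create role "$ROLE" -n "$NS" \
    --verb=get,list,watch \
    --resource=pods,services,endpoints

  # Bind the Role to the service account that runs in openshift-monitoring.
  "$OC" create rolebinding "$ROLE" -n "$NS" \
    --role="$ROLE" \
    --serviceaccount=openshift-monitoring:prometheus-k8s
}

# Usage: grant_scrape_permissions
```

Running it with `OC=echo` prints the two commands without touching the cluster, which may be useful for review before applying them.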

      ----------

      Arunprasad Rajkumar 2021-11-09 07:52:24 UTC

      It seems the cluster-logging-operator already has the necessary role [1] and binding [2] for the `prometheus-k8s` service account.

      [1] https://github.com/openshift/cluster-logging-operator/blob/release-4.7/manifests/4.7/0100_clusterroles.yaml
      [2] https://github.com/openshift/cluster-logging-operator/blob/release-4.7/manifests/4.7/0110_clusterrolebindings.yaml

      Attachments

        1. cr (4 kB)
        2. image-2022-03-10-14-15-53-935.png (16 kB)
        3. screenshot-1.png (143 kB)

            People

              jcantril@redhat.com Jeffrey Cantrill
              rhn-support-stwalter Steven Walter
              Giriyamma K R Giriyamma K R