Getting message, "Prometheus could not scrape fluentd for more than 10m."


      Before this change, the cluster-logging-operator used cluster-scoped roles and bindings to grant the prometheus service account permission to scrape metrics. These permissions were created only when the Operator was deployed through the console interface and were missing when it was deployed from the command line. This change fixes the issue by making the role and binding namespace-scoped.
    • Logging (Core) - Sprint 211, Logging (Core) - Sprint 216, Logging (Core) - Sprint 217

      This was originally opened as a bug against Monitoring: https://bugzilla.redhat.com/show_bug.cgi?id=2021342

      The Monitoring team moved it to the Logging component, but because the issue is on Logging 5.2, I am moving it to JIRA. The initial problem description is reported below, followed by copies of the comments from the Monitoring team.


      OpenShift 4.7.34

      OpenShift Logging: cluster-logging.5.2.2-21
      Description of problem:
      Getting message, "Prometheus could not scrape fluentd for more than 10m."

      How reproducible:

      Additional info:
      The customer set the label openshift.io/cluster-monitoring: "true", but the error is still not clearing.

      The prometheus pods are noting this error on repeat:

      2021-10-31T03:05:06.385693354Z level=error ts=2021-10-31T03:05:06.385Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
      2021-10-31T03:05:08.607296440Z level=error ts=2021-10-31T03:05:08.607Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""
      2021-10-31T03:05:31.197590776Z level=error ts=2021-10-31T03:05:31.197Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:426: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""

      We found a similar bug from an older version:
      Using diagnostic steps from that bug:

      1. token=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
      2. oc auth can-i list endpoints -n openshift-logging --token "$token"
      These checks all return "no". I suspect something has failed to set the proper rolebindings for prometheus-k8s. Are there roles that should be added? Can they be added manually?
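
      The diagnostic steps above can be repeated for all three resources named in the Prometheus errors (pods, services, endpoints). The following is a minimal sketch, not part of the original report: the function name check_prometheus_access is hypothetical, and it swaps the token lookup for impersonation with the `--as` flag, which assumes the caller is allowed to impersonate service accounts (for example, a cluster-admin).

```shell
#!/usr/bin/env bash
# Hypothetical helper: check whether the prometheus-k8s service account can
# list the resources Prometheus service discovery needs in a namespace.
check_prometheus_access() {
  local namespace="${1:-openshift-logging}"
  local sa="system:serviceaccount:openshift-monitoring:prometheus-k8s"
  local resource answer failed=0

  for resource in pods services endpoints; do
    # `oc auth can-i` prints "yes" or "no"; --as impersonates the service account.
    answer="$(oc auth can-i list "$resource" -n "$namespace" --as="$sa" 2>/dev/null)" || true
    echo "list ${resource}: ${answer:-unknown}"
    [ "$answer" = "yes" ] || failed=1
  done

  # Non-zero when at least one check did not return "yes".
  return "$failed"
}
```

      Running `check_prometheus_access openshift-logging` prints one line per resource and returns a non-zero status if any of the checks does not answer "yes".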


      Arunprasad Rajkumar 2021-11-09 06:35:35 UTC

      Other cluster operators (e.g. cluster-etcd-operator) define an explicit role[1] and binding[2] for the `prometheus-k8s` service account. You may need to follow the same approach.

      But I'm wondering why it was not done from cluster-logging operator!

      [1] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_01_prometheusrole.yaml
      [2] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_02_prometheusrolebinding.yaml
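
      As an illustration of the namespace-scoped approach suggested above, the sketch below renders a Role and RoleBinding that would let `prometheus-k8s` get, list, and watch pods, services, and endpoints in a namespace. It is an assumption for illustration only, not the manifest shipped by either operator: the function name render_prometheus_rbac and the object name `prometheus-k8s` for the Role and RoleBinding are hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a namespace-scoped Role and RoleBinding for the
# prometheus-k8s service account. Names are illustrative assumptions.
render_prometheus_rbac() {
  local namespace="${1:-openshift-logging}"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: ${namespace}
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
EOF
}

# Review the rendered manifest first, then apply it if appropriate:
#   render_prometheus_rbac openshift-logging | oc apply -f -
```

      Objects created by hand like this may be overwritten or left orphaned when the operator is upgraded, so this is best treated as a temporary workaround.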


      Arunprasad Rajkumar 2021-11-09 07:52:24 UTC

      It seems cluster-logging-operator already has the necessary role[1] and binding[2] for the `prometheus-k8s` service account.

      [1] https://github.com/openshift/cluster-logging-operator/blob/release-4.7/manifests/4.7/0100_clusterroles.yaml
      [2] https://github.com/openshift/cluster-logging-operator/blob/release-4.7/manifests/4.7/0110_clusterrolebindings.yaml

