OpenShift Bugs / OCPBUGS-4316

Prometheus Operator pod is restarting after exactly 10 minutes

    • Moderate
    • Troubleshoot
    • Customer Facing
    • Openshift Cluster Manager

      Description of problem:

       

      The Prometheus Operator pod is restarting after exactly 10 minutes in OCP 4.10.

      The following logs are seen in the Prometheus Operator pod:

              message: 
                se retry. Original error: stream error: stream ID 399; INTERNAL_ERROR; received from peer"
                level=warn ts=2022-11-29T14:48:31.363380184Z caller=operator.go:346 component=alertmanageroperator informer=Secret msg="cache sync not yet completed"
                level=warn ts=2022-11-29T14:49:31.36160769Z caller=operator.go:346 component=alertmanageroperator informer=Secret msg="cache sync not yet completed"
                level=warn ts=2022-11-29T14:49:44.792499686Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 433; INTERNAL_ERROR; received from peer"
                level=error ts=2022-11-29T14:49:44.792634826Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 433; INTERNAL_ERROR; received from peer"
                level=warn ts=2022-11-29T14:50:31.362449753Z caller=operator.go:346 component=alertmanageroperator informer=Secret msg="cache sync not yet completed"
                level=error ts=2022-11-29T14:51:31.364389628Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="unable to sync caches for alertmanager"
                level=error ts=2022-11-29T14:51:31.364447429Z caller=operator.go:355 component=alertmanageroperator informer=Secret msg="failed to sync cache"
                level=warn ts=2022-11-29T14:51:31.364719365Z caller=main.go:407 msg="Server shutdown error" err="context canceled"
                level=warn ts=2022-11-29T14:51:31.364747962Z caller=operator.go:346 component=alertmanageroperator informer=Secret msg="cache sync not yet completed"
                level=warn ts=2022-11-29T14:51:31.368092701Z caller=main.go:412 msg="Unhandled error received. Exiting..." err="failed to sync cache for Secret informer"
              reason: Error
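
      For context, the exit after roughly 10 minutes matches the standard client-go informer startup pattern: the operator lists and watches Secrets, waits for its caches to sync under a bounded context, and terminates when the sync never completes. The sketch below is not the prometheus-operator source; the namespace, resync interval, and timeout are assumptions used only to illustrate why repeated Secret LIST failures end in the "Unhandled error received. Exiting..." behaviour seen above.

      // Minimal sketch of the client-go "wait for cache sync" pattern
      // (illustrative assumption, not prometheus-operator code).
      package main

      import (
          "context"
          "log"
          "time"

          "k8s.io/client-go/informers"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
          "k8s.io/client-go/tools/cache"
      )

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              log.Fatalf("building config: %v", err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Secret informer scoped to the monitoring namespace (assumed scope).
          factory := informers.NewSharedInformerFactoryWithOptions(
              client, 5*time.Minute, informers.WithNamespace("openshift-monitoring"))
          secretInformer := factory.Core().V1().Secrets().Informer()

          // Bound the initial sync. If the underlying Secret LIST keeps failing
          // (e.g. "stream error ... INTERNAL_ERROR"), the deadline expires and the
          // process gives up instead of waiting forever.
          ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
          defer cancel()

          factory.Start(ctx.Done())
          if !cache.WaitForCacheSync(ctx.Done(), secretInformer.HasSynced) {
              // Corresponds to "failed to sync cache for Secret informer" /
              // "Unhandled error received. Exiting..." in the log above.
              log.Fatal("failed to sync cache for Secret informer")
          }
          log.Print("caches synced")
      }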
      

      Checking Kube API server logs:

      2022-11-28T10:29:23.881004817Z I1128 10:29:23.880781   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-generated"
      2022-11-28T10:29:23.881933010Z I1128 10:29:23.881841   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-tls"
      2022-11-28T10:29:23.882833556Z I1128 10:29:23.882741   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-proxy"
      2022-11-28T10:29:23.883584958Z I1128 10:29:23.883508   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-kube-rbac-proxy-metric"
      2022-11-28T10:29:23.884483964Z I1128 10:29:23.884374   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-dockercfg-5sqcv"
      2022-11-28T10:29:23.886302739Z I1128 10:29:23.886217   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-tls-assets-0"
      2022-11-28T10:29:23.887236843Z I1128 10:29:23.887055   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-kube-rbac-proxy"
      2022-11-28T10:29:23.890385993Z I1128 10:29:23.890293   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get unknown configmap openshift-monitoring/alertmanager-trusted-ca-bundle-ev1qal76l341g"
      2022-11-28T10:29:23.963100847Z I1128 10:29:23.962964   18 node_authorizer.go:203] "NODE DENY" err="node 'sdeb-ocpin-p4001.sys.schwarz' cannot get pvc openshift-monitoring/alertmanager-pvc-alertmanager-main-0, no relationship to this object was found in the node authorizer graph"
      2022-11-28T10:29:24.002961502Z I1128 10:29:24.002749   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-tls-assets-0"
      2022-11-28T10:29:24.003760960Z I1128 10:29:24.003683   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-dockercfg-5sqcv"
      2022-11-28T10:29:24.004613539Z I1128 10:29:24.004505   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-generated"
      2022-11-28T10:29:24.005383738Z I1128 10:29:24.005298   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-proxy"
      2022-11-28T10:29:24.006671552Z I1128 10:29:24.005595   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-main-tls"
      2022-11-28T10:29:24.006671552Z I1128 10:29:24.006115   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-kube-rbac-proxy"
      2022-11-28T10:29:24.006671552Z I1128 10:29:24.006539   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown secret openshift-monitoring/alertmanager-kube-rbac-proxy-metric"
      2022-11-28T10:29:24.007216831Z I1128 10:29:24.007126   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get unknown configmap openshift-monitoring/alertmanager-trusted-ca-bundle-ev1qal76l341g"
      2022-11-28T10:29:24.007919813Z I1128 10:29:24.007840   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get pvc openshift-monitoring/alertmanager-pvc-alertmanager-main-1, no relationship to this object was found in the node authorizer graph"
      2022-11-28T10:29:24.112280598Z I1128 10:29:24.112137   18 node_authorizer.go:203] "NODE DENY" err="node 'se1-ocpin-p4000.sys.schwarz' cannot get pvc openshift-monitoring/alertmanager-pvc-alertmanager-main-1, no relationship to this object was found in the node authorizer graph"
      2022-11-28T10:29:28.803492855Z I1128 10:29:28.802757   18 trace.go:205] Trace[553366156]: "Update" url:/apis/rbac.authorization.k8s.io/v1/clusterroles/alertmanager-main,user-agent:Go-http-client/2.0,audit-id:e730e96f-c144-4385-9e9b-52f048f88f3b,client:4.160.57.16,accept:application/json, */*,protocol:HTTP/2.0 (28-Nov-2022 10:29:28.099) (total time: 703ms):
      2022-11-28T10:30:05.565948456Z I1128 10:30:05.565809   18 trace.go:205] Trace[1031716344]: "Get" url:/api/v1/namespaces/openshift-monitoring/pods/alertmanager-main-1/log,user-agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36,audit-id:c4415728-79ca-4e05-9fc4-6a39e4fefee5,client:4.160.59.201,accept:,protocol:HTTP/1.1 (28-Nov-2022 10:30:02.580) (total time: 2984ms):
      
      

      An important note about the cluster: it is very large (348 nodes) and the number of Secrets is very high (11017).
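
      To illustrate why the Secret count matters: with roughly 11017 Secrets, a single unpaginated LIST response is very large and therefore more exposed to the HTTP/2 stream resets shown in the operator log. The sketch below is not a proposed fix for the operator, only a hedged client-go example showing how the same data can be fetched in bounded chunks via Limit/Continue pagination.

      // Illustration only: listing Secrets in pages instead of one huge LIST.
      // The 500-item page size and cluster-wide scope are assumptions for the example.
      package main

      import (
          "context"
          "fmt"
          "log"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              log.Fatalf("building config: %v", err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          total := 0
          opts := metav1.ListOptions{Limit: 500} // fetch at most 500 Secrets per request
          for {
              list, err := client.CoreV1().Secrets(metav1.NamespaceAll).List(context.TODO(), opts)
              if err != nil {
                  log.Fatalf("listing secrets: %v", err)
              }
              total += len(list.Items)
              if list.Continue == "" {
                  break // no more pages
              }
              opts.Continue = list.Continue // resume from the server-side continue token
          }
          fmt.Printf("listed %d secrets\n", total)
      }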

      Version-Release number of selected component (if applicable):

      OCP 4.10

      How reproducible:

      NA

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      The Prometheus Operator pod fails to sync its Secret informer cache and restarts after 10 minutes.

      Expected results:

      The Prometheus Operator pod syncs its informer caches and keeps running without restarts.

      Additional info:

            Unassigned
            acandelp Adrian Candel
            Rahul Gangwar