Data Foundation Bugs / DFBUGS-370

[2314717] [ODF on ROSA HCP] PVC utilization alerts are not firing


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.19
    • odf-4.16
    • ocs-operator

      Description of problem (please be as detailed as possible and provide log
      snippets):

      While running test_pvc_expansion_when_full, the PersistentVolumeUsageNearFull and PersistentVolumeUsageCritical alerts are not firing.
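      A quick way to check the underlying condition by hand against the Prometheus that should raise these alerts (a sketch only; the pod name is a placeholder and the 0.85 threshold is an assumption, the shipped rule expression may differ):

      # Evaluate PVC utilization directly inside the ODF Prometheus pod
      oc -n odf-storage exec <odf-prometheus-pod> -c prometheus -- \
        promtool query instant http://localhost:9090 \
          'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85'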

      prometheus-operator logs:
      level=info ts=2024-09-25T15:35:38.553216149Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=info ts=2024-09-25T15:35:38.593611231Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T15:35:38.754635432Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=info ts=2024-09-25T15:35:39.352659418Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T15:35:39.352681394Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=warn ts=2024-09-25T16:16:34.488279555Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/thanos/operator.go:326: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488352741Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.Alertmanager ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488341065Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:409: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488480237Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PodMonitor ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=error ts=2024-09-25T16:16:34.488520917Z caller=controller.go:189 component=kubelet_endpoints kubelet_object=kube-system/kubelet msg="Failed to synchronize nodes" err="listing nodes failed: Get \"https://172.30.0.1:443/api/v1/nodes\": http2: client connection lost"
      level=warn ts=2024-09-25T16:16:34.488537442Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PrometheusRule ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488548191Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PrometheusRule ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488464049Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.Prometheus ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488536831Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:411: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488520589Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.StatefulSet ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488579151Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1alpha1.AlertmanagerConfig ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488602736Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.ThanosRuler ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488603637Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.ServiceMonitor ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488629059Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.StatefulSet ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488645182Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488642317Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.Probe ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488534986Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/thanos/operator.go:328: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488651016Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.StatefulSet ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488656742Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488642006Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488682305Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488636522Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/prometheus/server/operator.go:488: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488804733Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/prometheus/server/operator.go:486: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=info ts=2024-09-25T16:25:30.538200059Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T16:25:30.538273824Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=info ts=2024-09-25T16:25:34.432611416Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T16:25:34.432635855Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"

      These are the alerts currently firing on the cluster:
      ClusterMonitoringOperatorDeprecatedConfig
      AlertmanagerReceiversNotConfigured
      PrometheusDuplicateTimestamps
      PrometheusDuplicateTimestamps
      PrometheusOutOfOrderTimestamps
      PrometheusOutOfOrderTimestamps
      PrometheusRemoteStorageFailures
      PrometheusRemoteStorageFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      Watchdog

      Similar to bug #2304076, warning messages exist on the prometheus-k8s pods, but it is not clear whether 2304076 affects this issue.

      ts=2024-09-25T16:12:44.827Z caller=scrape.go:1735 level=warn component="scrape manager" scrape_pool=serviceMonitor/odf-storage/k8s-metrics-service-monitor/0 target="https://10.128.0.33:9091/federate?match%5B%5D=%7B__name__%3D%27kube_node_status_condition%27%7D&match%5B%5D=%7B__name__%3D%27kube_persistentvolume_info%27%7D&match%5B%5D=%7B__name__%3D%27kube_storageclass_info%27%7D&match%5B%5D=%7B__name__%3D%27kube_persistentvolumeclaim_info%27%7D&match%5B%5D=%7B__name__%3D%27kube_deployment_spec_replicas%27%7D&match%5B%5D=%7B__name__%3D%27kube_pod_status_phase%27%7D&match%5B%5D=%7B__name__%3D%27kubelet_volume_stats_capacity_bytes%27%7D&match%5B%5D=%7B__name__%3D%27kubelet_volume_stats_used_bytes%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_read_time_seconds_total%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_write_time_seconds_total%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_reads_completed_total%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_writes_completed_total%27%7D" msg="Error on ingesting out-of-order samples" num_dropped=148
      ts=2024-09-25T16:12:46.273Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.0.21:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=3
      ts=2024-09-25T16:12:58.330Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/odf-storage/noobaa-mgmt-service-monitor/0 target=http://10.130.0.30:8080/metrics/web_server msg="Error on ingesting samples with different value but same timestamp" num_dropped=134
      ts=2024-09-25T16:13:05.478Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/odf-storage/s3-service-monitor/0 target=http://10.130.0.32:7004/ msg="Error on ingesting samples with different value but same timestamp" num_dropped=148
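
      It is also worth confirming that the PrometheusRule carrying these alerts exists and that the federated metrics it needs actually reach the ODF Prometheus (a sketch; namespace, pod name and metric names are assumptions based on the ServiceMonitor shown above):

      # Is a rule defining the alerts deployed at all?
      oc -n odf-storage get prometheusrules -o yaml | grep -E 'PersistentVolumeUsage(NearFull|Critical)'
      # Do the federated kubelet_volume_stats_* series arrive?
      oc -n odf-storage exec <odf-prometheus-pod> -c prometheus -- \
        promtool query instant http://localhost:9090 'count(kubelet_volume_stats_used_bytes)'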

      Version of all relevant components (if applicable):

      OC version:
      Client Version: 4.16.11
      Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
      Server Version: 4.16.11
      Kubernetes Version: v1.29.7+d77deb8

      OCS version:
      NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
      ocs-operator.v4.16.2-rhodf   OpenShift Container Storage   4.16.2-rhodf   ocs-operator.v4.16.1-rhodf   Succeeded

      Cluster version:
      NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
      version 4.16.11 True False 33h Error while reconciling 4.16.11: the cluster operator insights is not available

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      It may impact: without these alerts, users are not warned when a PVC is approaching full capacity.

      Is there any workaround available to the best of your knowledge?
      no

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Log in to the cluster
      2. Create a PVC and a pod, attach the PVC to the pod, and run IO filling the PVC to 95% of capacity (see the sketch after this list)
      3. Open the management console, navigate to Observe / Alerts, and capture the existing alerts
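
      A minimal sketch of step 2 from the CLI (the storage class, image and sizes are illustrative assumptions), using a manifest such as pvc-util-test.yaml:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: pvc-util-test
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ocs-storagecluster-ceph-rbd
        resources:
          requests:
            storage: 1Gi
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: pvc-util-test-pod
      spec:
        containers:
        - name: writer
          image: registry.access.redhat.com/ubi9/ubi
          command: ["sleep", "infinity"]
          volumeMounts:
          - name: data
            mountPath: /mnt/data
        volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-util-test

      oc apply -f pvc-util-test.yaml
      # Fill the volume to roughly 95% of its 1Gi capacity
      oc exec pvc-util-test-pod -- dd if=/dev/zero of=/mnt/data/fill bs=1M count=970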

      Actual results:
      PersistentVolumeUsageNearFull and PersistentVolumeUsageCritical are not fired.

      Expected results:
      PersistentVolumeUsageNearFull and PersistentVolumeUsageCritical are fired for the affected PVC.

      Additional info:

      Cluster credentials will be provided to collect the data necessary for this bug.

              Assignee: Kaustav Majumder (kmajumder@redhat.com)
              Reporter: Daniel Osypenko (rh-ee-dosypenk)