Data Foundation Bugs / DFBUGS-370

[2314717] [ODF on ROSA HCP] PVC utilization alerts are not firing


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.19
    • odf-4.16
    • ocs-operator

      Description of problem (please be as detailed as possible and provide log
      snippets):

      While running test_pvc_expansion_when_full, the PersistentVolumeUsageNearFull and PersistentVolumeUsageCritical alerts are not firing.
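      A quick way to check the underlying condition by hand against the Prometheus that should raise these alerts (a sketch only; the pod name is a placeholder and the 0.85 threshold is an assumption, the shipped rule expression may differ):

      # Evaluate PVC utilization directly inside the ODF Prometheus pod
      oc -n odf-storage exec <odf-prometheus-pod> -c prometheus -- \
        promtool query instant http://localhost:9090 \
          'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85'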

      prometheus-operator logs:
      level=info ts=2024-09-25T15:35:38.553216149Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=info ts=2024-09-25T15:35:38.593611231Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T15:35:38.754635432Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=info ts=2024-09-25T15:35:39.352659418Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T15:35:39.352681394Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=warn ts=2024-09-25T16:16:34.488279555Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/thanos/operator.go:326: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488352741Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.Alertmanager ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488341065Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:409: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488480237Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PodMonitor ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=error ts=2024-09-25T16:16:34.488520917Z caller=controller.go:189 component=kubelet_endpoints kubelet_object=kube-system/kubelet msg="Failed to synchronize nodes" err="listing nodes failed: Get \"https://172.30.0.1:443/api/v1/nodes\": http2: client connection lost"
      level=warn ts=2024-09-25T16:16:34.488537442Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PrometheusRule ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488548191Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PrometheusRule ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488464049Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.Prometheus ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488536831Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:411: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488520589Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.StatefulSet ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488579151Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1alpha1.AlertmanagerConfig ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488602736Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.ThanosRuler ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488603637Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.ServiceMonitor ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488629059Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.StatefulSet ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488645182Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488642317Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.Probe ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488534986Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/thanos/operator.go:328: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488651016Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.StatefulSet ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488656742Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488642006Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488682305Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:118: watch of *v1.PartialObjectMetadata ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488636522Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/prometheus/server/operator.go:488: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=warn ts=2024-09-25T16:16:34.488804733Z caller=klog.go:118 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/prometheus/server/operator.go:486: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
      level=info ts=2024-09-25T16:25:30.538200059Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T16:25:30.538273824Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
      level=info ts=2024-09-25T16:25:34.432611416Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
      level=info ts=2024-09-25T16:25:34.432635855Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"

      These are the alerts currently firing on the cluster:
      ClusterMonitoringOperatorDeprecatedConfig
      AlertmanagerReceiversNotConfigured
      PrometheusDuplicateTimestamps
      PrometheusDuplicateTimestamps
      PrometheusOutOfOrderTimestamps
      PrometheusOutOfOrderTimestamps
      PrometheusRemoteStorageFailures
      PrometheusRemoteStorageFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      PrometheusRuleFailures
      Watchdog

      Similar to bug #2304076, warning messages exist on the prometheus-k8s pods, but it is not clear whether 2304076 affects this issue.

      ts=2024-09-25T16:12:44.827Z caller=scrape.go:1735 level=warn component="scrape manager" scrape_pool=serviceMonitor/odf-storage/k8s-metrics-service-monitor/0 target="https://10.128.0.33:9091/federate?match%5B%5D=%7B__name__%3D%27kube_node_status_condition%27%7D&match%5B%5D=%7B__name__%3D%27kube_persistentvolume_info%27%7D&match%5B%5D=%7B__name__%3D%27kube_storageclass_info%27%7D&match%5B%5D=%7B__name__%3D%27kube_persistentvolumeclaim_info%27%7D&match%5B%5D=%7B__name__%3D%27kube_deployment_spec_replicas%27%7D&match%5B%5D=%7B__name__%3D%27kube_pod_status_phase%27%7D&match%5B%5D=%7B__name__%3D%27kubelet_volume_stats_capacity_bytes%27%7D&match%5B%5D=%7B__name__%3D%27kubelet_volume_stats_used_bytes%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_read_time_seconds_total%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_write_time_seconds_total%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_reads_completed_total%27%7D&match%5B%5D=%7B__name__%3D%27node_disk_writes_completed_total%27%7D" msg="Error on ingesting out-of-order samples" num_dropped=148
      ts=2024-09-25T16:12:46.273Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.0.21:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=3
      ts=2024-09-25T16:12:58.330Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/odf-storage/noobaa-mgmt-service-monitor/0 target=http://10.130.0.30:8080/metrics/web_server msg="Error on ingesting samples with different value but same timestamp" num_dropped=134
      ts=2024-09-25T16:13:05.478Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/odf-storage/s3-service-monitor/0 target=http://10.130.0.32:7004/ msg="Error on ingesting samples with different value but same timestamp" num_dropped=148
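
      It is also worth confirming that the PrometheusRule carrying these alerts exists and that the federated metrics it needs actually reach the ODF Prometheus (a sketch; namespace, pod name and metric names are assumptions based on the ServiceMonitor shown above):

      # Is a rule defining the alerts deployed at all?
      oc -n odf-storage get prometheusrules -o yaml | grep -E 'PersistentVolumeUsage(NearFull|Critical)'
      # Do the federated kubelet_volume_stats_* series arrive?
      oc -n odf-storage exec <odf-prometheus-pod> -c prometheus -- \
        promtool query instant http://localhost:9090 'count(kubelet_volume_stats_used_bytes)'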

      Version of all relevant components (if applicable):

      OC version:
      Client Version: 4.16.11
      Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
      Server Version: 4.16.11
      Kubernetes Version: v1.29.7+d77deb8

      OCS version:
      NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
      ocs-operator.v4.16.2-rhodf   OpenShift Container Storage   4.16.2-rhodf   ocs-operator.v4.16.1-rhodf   Succeeded

      Cluster version:
      NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
      version 4.16.11 True False 33h Error while reconciling 4.16.11: the cluster operator insights is not available

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      It may impact: without these alerts, users are not warned when a PVC is approaching full capacity.

      Is there any workaround available to the best of your knowledge?
      no

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Log in to the cluster
      2. Create a PVC and a pod, attach the PVC to the pod, and run IO filling the PVC to 95% of capacity (see the sketch after this list)
      3. Open the management console, navigate to Observe / Alerts, and capture the existing alerts
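
      A minimal sketch of step 2 from the CLI (the storage class, image and sizes are illustrative assumptions), using a manifest such as pvc-util-test.yaml:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: pvc-util-test
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ocs-storagecluster-ceph-rbd
        resources:
          requests:
            storage: 1Gi
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: pvc-util-test-pod
      spec:
        containers:
        - name: writer
          image: registry.access.redhat.com/ubi9/ubi
          command: ["sleep", "infinity"]
          volumeMounts:
          - name: data
            mountPath: /mnt/data
        volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-util-test

      oc apply -f pvc-util-test.yaml
      # Fill the volume to roughly 95% of its 1Gi capacity
      oc exec pvc-util-test-pod -- dd if=/dev/zero of=/mnt/data/fill bs=1M count=970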

      Actual results:
      PersistentVolumeUsageNearFull and PersistentVolumeUsageCritical are not fired.

      Expected results:
      PersistentVolumeUsageNearFull and PersistentVolumeUsageCritical are fired for the affected PVC.

      Additional info:

      Cluster credentials will be provided to collect the data necessary for this bug.

              Assignee: Kaustav Majumder (kmajumder@redhat.com)
              Reporter: Daniel Osypenko (rh-ee-dosypenk)