Bug
Resolution: Unresolved
odf-4.16.17
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
There are gaps in the metrics produced by the ocs-metrics-exporter caused by the cache refresh.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
- IPI
- Platform: Baremetal
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
- Internal
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
- OpenShift 4.16.43
- ODF 4.16.11
Does this issue impact your ability to continue to work with the product?
- No
Is there any workaround available to the best of your knowledge?
- No
Can this issue be reproduced? If so, please provide the hit rate
- Yes
- It is reproducible on environments with a higher number of attached volumes
# oc rsh -c ocs-metrics-exporter ocs-metrics-exporter-79665688cd-mlc97
# while true; do echo -n "`date` - "; curl -sS --max-time 3 -w "%{stderr} %{http_code}\n" 127.0.0.1:8080/metrics 1>/dev/null; sleep 1; done
Can this issue be reproduced from the UI?
- No, only gaps in metrics are visible
If this is a regression, please provide more details to justify this:
- No
Steps to Reproduce:
1. Connect 100 RBD devices
2. Run the script above
3. Watch for timeouts
The exact date and time when the issue was observed, including timezone details:
- Every 5 minutes
Actual results:
while true; do echo -n "`date` - "; curl -sS --max-time 3 -w "%{stderr} %{http_code}\n" 127.0.0.1:8080/metrics 1>/dev/null; sleep 1; done
Tue Dec 23 08:50:36 UTC 2025 - 200
Tue Dec 23 08:50:37 UTC 2025 - 200
Tue Dec 23 08:50:38 UTC 2025 - 200
Tue Dec 23 08:50:40 UTC 2025 - 200
Tue Dec 23 08:50:41 UTC 2025 - 200
Tue Dec 23 08:50:42 UTC 2025 - 200
Tue Dec 23 08:50:43 UTC 2025 - 200
Tue Dec 23 08:50:44 UTC 2025 - 200
Tue Dec 23 08:50:45 UTC 2025 - 200
Tue Dec 23 08:50:47 UTC 2025 - 200
Tue Dec 23 08:50:48 UTC 2025 - 200
Tue Dec 23 08:50:49 UTC 2025 - 200
Tue Dec 23 08:50:50 UTC 2025 - 200
Tue Dec 23 08:50:51 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received 000
Tue Dec 23 08:50:55 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received 000
Tue Dec 23 08:50:59 UTC 2025 - curl: (28) Operation timed out after 3001 milliseconds with 0 bytes received 000
...
Tue Dec 23 08:53:41 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received 000
Tue Dec 23 08:53:45 UTC 2025 - 200
Tue Dec 23 08:53:47 UTC 2025 - 200
Expected results:
- No timeouts
Logs collected and log location:
{"level":"error","ts":1766479842.3808334,"caller":"metrics/main.go:27","msg":"error encoding and sending metric family: write tcp 10.131.8.154:8080->10.131.8.2:50098: write: broken pipe","stacktrace":"main.promhttplogger.Println\n\t/src/remote_source/app/metrics/main.go:27\ngithub>
{"level":"info","ts":1766479850.501043,"caller":"cache/pv.go:329","msg":"PV store Resync started at 2025-12-23 08:50:50.500957127 +0000 UTC m=+64093.896566641"}
{"level":"info","ts":1766479850.5250835,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-112c7aae"}
{"level":"info","ts":1766479850.5251598,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-1257d5c0"}
{"level":"info","ts":1766479850.5251713,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-12cd84c5"}
{"level":"info","ts":1766479850.5251808,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-14e63ec7"}
...
{"level":"info","ts":1766479860.0986178,"caller":"cache/ceph-blocklist.go:108","msg":"Blocklist store sync started 2025-12-23 08:51:00.098520699 +0000 UTC m=+64103.494130213"}
{"level":"info","ts":1766479861.7315803,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-04ee2f10-5565-41a8-a9ff-c9a09fa90022"}
{"level":"info","ts":1766479864.4305835,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-081197f7-f48e-4f65-8721-a704fac4d2c9"}
{"level":"info","ts":1766479871.608866,"caller":"cache/rbd-mirror.go:301","msg":"RBD mirror store resync started at 2025-12-23 08:51:11.608785763 +0000 UTC m=+64115.004395277"}
{"level":"info","ts":1766479871.608966,"caller":"cache/rbd-mirror.go:326","msg":"RBD mirror store resync ended at 2025-12-23 08:51:11.608954064 +0000 UTC m=+64115.004563572"}
{"level":"info","ts":1766479893.5390751,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-4a3e2f8a-35e6-4676-a66e-8b9d6059372b"}
{"level":"info","ts":1766479894.7108696,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-4c5a475b-d56d-4fe3-ba18-b59784676b73"}
...
{"level":"info","ts":1766480005.6306944,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-e8ed5c29-24df-4dd9-981c-25b731b7b1dc"}
{"level":"info","ts":1766480006.3181431,"caller":"cache/pv.go:345","msg":"PV store Resync ended at 2025-12-23 08:53:26.318103786 +0000 UTC m=+64249.713713299"}
Additional info:
Notice the lines:
{"level":"info","ts":1766479850.501043,"caller":"cache/pv.go:329","msg":"PV store Resync started at 2025-12-23 08:50:50.500957127 +0000 UTC m=+64093.896566641"}
...
{"level":"info","ts":1766480006.3181431,"caller":"cache/pv.go:345","msg":"PV store Resync ended at 2025-12-23 08:53:26.318103786 +0000 UTC m=+64249.713713299"}
And compare them with the timeouts from the script:
- The last working request:
Tue Dec 23 08:50:50 UTC 2025 - 200
- The first working request afterwards:
Tue Dec 23 08:53:45 UTC 2025 - 200
The times match the resync window exactly.
The following code shows that we take the lock (p.Mutex.Lock()) for the entire cache refresh. The cache refresh is a long-running operation, so it blocks metrics production for its whole duration: readers wait until the lock is released. We should instead collect the data into a temporary structure and take the lock only to swap the new data into the cache (see the sketch after the snippet below).
https://github.com/red-hat-storage/ocs-operator/blob/main/metrics/internal/cache/pv.go:
// Resync implements the Resync method of the store interface.
func (p *PersistentVolumeStore) Resync() error {
    klog.Infof("PV store Resync started at %v", time.Now())
    pvList, err := p.kubeClient.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
    if err != nil {
        return fmt.Errorf("failed to list persistent volumes: %v", err)
    }
    p.Mutex.Lock()
    defer p.Mutex.Unlock()
    for _, pv := range pvList.Items {
        err := p.add(&pv)
        if err != nil {
            return fmt.Errorf("failed to process PV: %s err: %v", pv.Name, err)
        }
    }
    klog.Infof("PV store Resync ended at %v", time.Now())
    return nil
}
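A minimal sketch of the proposed change, assuming the store's per-PV state can be kept in a map keyed by PV name and that the work done by p.add can be factored into a lock-free helper. The names pvEntry, p.entries, and p.buildEntry below are hypothetical placeholders for illustration, not the actual identifiers in pv.go; imports and the surrounding type are the same as in the original file:

// Resync rebuilds the PV data into a temporary map without holding the
// lock, then takes the lock only long enough to swap the new map in.
// NOTE: pvEntry, p.entries, and p.buildEntry are hypothetical placeholders;
// the real store in pv.go keeps different internal state.
func (p *PersistentVolumeStore) Resync() error {
    klog.Infof("PV store Resync started at %v", time.Now())
    pvList, err := p.kubeClient.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
    if err != nil {
        return fmt.Errorf("failed to list persistent volumes: %v", err)
    }

    // Do the expensive per-PV work into a temporary structure first,
    // so concurrent readers of the store are not blocked while it runs.
    fresh := make(map[string]pvEntry, len(pvList.Items))
    for i := range pvList.Items {
        pv := &pvList.Items[i]
        entry, err := p.buildEntry(pv) // hypothetical lock-free variant of p.add
        if err != nil {
            return fmt.Errorf("failed to process PV: %s err: %v", pv.Name, err)
        }
        fresh[pv.Name] = entry
    }

    // Hold the lock only for the swap, which is a cheap map assignment.
    p.Mutex.Lock()
    p.entries = fresh
    p.Mutex.Unlock()

    klog.Infof("PV store Resync ended at %v", time.Now())
    return nil
}

With this shape, metrics readers block only for the duration of the map swap instead of for the whole list-and-process loop, which should remove the multi-minute gaps.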