Data Foundation Bugs / DFBUGS-4988

ocs-metrics-exporter does not provide metrics during cache refresh

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Fix Version/s: odf-4.16.17
    • Component/s: ceph-monitoring
    • Architecture: x86_64
    • Severity: Moderate
      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      There are gaps in the metrics produced by ocs-metrics-exporter: while its internal caches are being refreshed, the /metrics endpoint stops responding, so scrapes time out and the time series show gaps.

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      • IPI
      • Platform: Baremetal

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      • Internal

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      • OpenShift 4.16.43
      • ODF 4.16.11

      Does this issue impact your ability to continue to work with the product?

      • No

      Is there any workaround available to the best of your knowledge?

      • No

      Can this issue be reproduced? If so, please provide the hit rate

      • Yes
      • It is reproducible on environments with a larger number of attached volumes:

        # oc rsh -c ocs-metrics-exporter ocs-metrics-exporter-79665688cd-mlc97
        # while true; do echo -n "`date` - "; curl -sS --max-time 3 -w "%{stderr} %{http_code}\n" 127.0.0.1:8080/metrics 1>/dev/null; sleep 1; done

      Can this issue be reproduced from the UI?

      • No, only gaps in metrics are visible

      If this is a regression, please provide more details to justify this:

      • No

      Steps to Reproduce:

      1. Connect 100 RBD devices.

      2. Run the script above.

      3. Watch for timeouts.

      The exact date and time when the issue was observed, including timezone details:

      Every 5 minutes, on each cache resync (see the log timestamps below).

      Actual results:

      while true; do echo -n "`date` - "; curl -sS --max-time 3 -w "%{stderr} %{http_code}\n" 127.0.0.1:8080/metrics 1>/dev/null; sleep 1; done
      Tue Dec 23 08:50:36 UTC 2025 -  200
      Tue Dec 23 08:50:37 UTC 2025 -  200
      Tue Dec 23 08:50:38 UTC 2025 -  200
      Tue Dec 23 08:50:40 UTC 2025 -  200
      Tue Dec 23 08:50:41 UTC 2025 -  200
      Tue Dec 23 08:50:42 UTC 2025 -  200
      Tue Dec 23 08:50:43 UTC 2025 -  200
      Tue Dec 23 08:50:44 UTC 2025 -  200
      Tue Dec 23 08:50:45 UTC 2025 -  200
      Tue Dec 23 08:50:47 UTC 2025 -  200
      Tue Dec 23 08:50:48 UTC 2025 -  200
      Tue Dec 23 08:50:49 UTC 2025 -  200
      Tue Dec 23 08:50:50 UTC 2025 -  200
      Tue Dec 23 08:50:51 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received
       000
      Tue Dec 23 08:50:55 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received
       000
      Tue Dec 23 08:50:59 UTC 2025 - curl: (28) Operation timed out after 3001 milliseconds with 0 bytes received
       000
      
      ...
      
      Tue Dec 23 08:53:41 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received
       000
      Tue Dec 23 08:53:45 UTC 2025 -  200
      Tue Dec 23 08:53:47 UTC 2025 -  200

      Expected results:

      • No timeouts

      Logs collected and log location:

       

      {"level":"error","ts":1766479842.3808334,"caller":"metrics/main.go:27","msg":"error encoding and sending metric family: write tcp 10.131.8.154:8080->10.131.8.2:50098: write: broken pipe","stacktrace":"main.promhttplogger.Println\n\t/src/remote_source/app/metrics/main.go:27\ngithub>
      {"level":"info","ts":1766479850.501043,"caller":"cache/pv.go:329","msg":"PV store Resync started at 2025-12-23 08:50:50.500957127 +0000 UTC m=+64093.896566641"}
      {"level":"info","ts":1766479850.5250835,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-112c7aae"}
      {"level":"info","ts":1766479850.5251598,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-1257d5c0"}
      {"level":"info","ts":1766479850.5251713,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-12cd84c5"}
      {"level":"info","ts":1766479850.5251808,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-14e63ec7"}
      ...
      {"level":"info","ts":1766479860.0986178,"caller":"cache/ceph-blocklist.go:108","msg":"Blocklist store sync started 2025-12-23 08:51:00.098520699 +0000 UTC m=+64103.494130213"}
      {"level":"info","ts":1766479861.7315803,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-04ee2f10-5565-41a8-a9ff-c9a09fa90022"}
      {"level":"info","ts":1766479864.4305835,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-081197f7-f48e-4f65-8721-a704fac4d2c9"}
      {"level":"info","ts":1766479871.608866,"caller":"cache/rbd-mirror.go:301","msg":"RBD mirror store resync started at 2025-12-23 08:51:11.608785763 +0000 UTC m=+64115.004395277"}
      {"level":"info","ts":1766479871.608966,"caller":"cache/rbd-mirror.go:326","msg":"RBD mirror store resync ended at 2025-12-23 08:51:11.608954064 +0000 UTC m=+64115.004563572"}
      {"level":"info","ts":1766479893.5390751,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-4a3e2f8a-35e6-4676-a66e-8b9d6059372b"}
      {"level":"info","ts":1766479894.7108696,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-4c5a475b-d56d-4fe3-ba18-b59784676b73"}
      ...
      {"level":"info","ts":1766480005.6306944,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-e8ed5c29-24df-4dd9-981c-25b731b7b1dc"}
      {"level":"info","ts":1766480006.3181431,"caller":"cache/pv.go:345","msg":"PV store Resync ended at 2025-12-23 08:53:26.318103786 +0000 UTC m=+64249.713713299"}
      
      

      Additional info:

      Notice the lines: 

      {"level":"info","ts":1766479850.501043,"caller":"cache/pv.go:329","msg":"PV store Resync started at 2025-12-23 08:50:50.500957127 +0000 UTC m=+64093.896566641"}
      ...
      {"level":"info","ts":1766480006.3181431,"caller":"cache/pv.go:345","msg":"PV store Resync ended at 2025-12-23 08:53:26.318103786 +0000 UTC m=+64249.713713299"}
      
      

      And compare them with the timeouts from the script:

      • The last successful scrape:
        Tue Dec 23 08:50:50 UTC 2025 -  200

      • The first successful scrape afterwards:
        Tue Dec 23 08:53:45 UTC 2025 -  200

      The times line up exactly: the endpoint stops responding the moment the PV store resync starts (ts 1766479850 = 08:50:50 UTC) and only recovers after it ends (ts 1766480006 = 08:53:26 UTC).

      The following code shows that we take a lock ({{p.Mutex.Lock()}}) for the entire cache refresh. The refresh is a long-running operation, and it blocks metrics production for its whole duration because readers have to wait until the lock is released. We should instead collect the data into a temporary structure and take the lock only to swap the new data into the cache; see the demonstration and the sketch after the code below.

      https://github.com/red-hat-storage/ocs-operator/blob/main/metrics/internal/cache/pv.go:

       

        // Resync implements the Resync method of the store interface.
        func (p *PersistentVolumeStore) Resync() error {
          klog.Infof("PV store Resync started at %v", time.Now())

          pvList, err := p.kubeClient.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
          if err != nil {
            return fmt.Errorf("failed to list persistent volumes: %v", err)
          }
          // The write lock is taken here and, via the defer, held until every
          // PV in the list has been reprocessed. All readers block meanwhile.
          p.Mutex.Lock()
          defer p.Mutex.Unlock()
          for _, pv := range pvList.Items {
            err := p.add(&pv)
            if err != nil {
              return fmt.Errorf("failed to process PV: %s err: %v", pv.Name, err)
            }
          }
          klog.Infof("PV store Resync ended at %v", time.Now())
          return nil
        }
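
      To make the stall concrete, here is a small standalone Go program (an illustration, not ocs-operator code) showing how a reader on a sync.RWMutex waits out a long-held write lock, the same way a scrape waits out Resync:

        // Standalone demonstration: a reader blocked behind a long write lock.
        package main

        import (
          "fmt"
          "sync"
          "time"
        )

        func main() {
          var mu sync.RWMutex

          // "Resync": grab the write lock and hold it for three seconds.
          go func() {
            mu.Lock()
            defer mu.Unlock()
            time.Sleep(3 * time.Second)
          }()

          time.Sleep(100 * time.Millisecond) // let the writer take the lock first

          // "Scrape": a reader arriving mid-refresh must wait for the writer.
          start := time.Now()
          mu.RLock()
          fmt.Printf("reader waited %v\n", time.Since(start)) // prints roughly 2.9s
          mu.RUnlock()
        }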
      
      
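      A minimal sketch of the proposed fix, the build-then-swap pattern. All types and names below (the name-keyed Store map, the CSI filter standing in for p.add(), the Collect helper) are illustrative assumptions, not the actual ocs-operator implementation:

        // Sketch: rebuild the cache off to the side, lock only for the swap.
        package cache

        import (
          "context"
          "fmt"
          "sync"

          "github.com/prometheus/client_golang/prometheus"
          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
        )

        type PersistentVolumeStore struct {
          Mutex      sync.RWMutex
          kubeClient kubernetes.Interface
          // Hypothetical layout: PV name -> cached PV. The real store
          // keeps processed attributes rather than whole objects.
          Store map[string]corev1.PersistentVolume
        }

        // Resync rebuilds the cache into a temporary map and holds the
        // write lock only for the final swap, so readers are blocked for
        // microseconds instead of the full multi-minute refresh.
        func (p *PersistentVolumeStore) Resync() error {
          pvList, err := p.kubeClient.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
          if err != nil {
            return fmt.Errorf("failed to list persistent volumes: %v", err)
          }

          // Build the replacement state without holding the lock.
          next := make(map[string]corev1.PersistentVolume, len(pvList.Items))
          for i := range pvList.Items {
            pv := pvList.Items[i]
            if pv.Spec.CSI == nil {
              continue // stands in for the "non Ceph CSI RBD volume" filtering
            }
            next[pv.Name] = pv
          }

          // The critical section is now just a map swap.
          p.Mutex.Lock()
          p.Store = next
          p.Mutex.Unlock()
          return nil
        }

        // Collect shows the reader side: with the swap above, a scrape that
        // arrives mid-refresh immediately serves the previous snapshot
        // instead of timing out. desc is assumed to carry one label (PV name).
        func (p *PersistentVolumeStore) Collect(ch chan<- prometheus.Metric, desc *prometheus.Desc) {
          p.Mutex.RLock()
          defer p.Mutex.RUnlock()
          for name := range p.Store {
            ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, name)
          }
        }

      The trade-off is holding two copies of the PV data in memory for the duration of the rebuild, which is usually acceptable for an exporter of this kind.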

       

      Assignee: Divyansh Kamboj
      Reporter: Roman Hodain
      QA Contact: Harish NV Rao
      Votes: 0
      Watchers: 28