Data Foundation Bugs / DFBUGS-4988

ocs-metrics-exporter does not provide metrics during cache refresh

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Fix Version/s: odf-4.16.17
    • Component/s: ceph-monitoring
    • Architecture: x86_64
    • Severity: Moderate
      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      There are gaps in the metrics produced by ocs-metrics-exporter: while its internal caches are being refreshed, the /metrics endpoint stops responding, so scrapes time out and the time series show gaps.

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      • IPI
      • Platform: Baremetal

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      • Internal

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      • OpenShift 4.16.43
      • ODF 4.16.11

      Does this issue impact your ability to continue to work with the product?

      • No

      Is there any workaround available to the best of your knowledge?

      • No

      Can this issue be reproduced? If so, please provide the hit rate

      • Yes
      • It is reproducible on environments with a larger number of attached volumes:

        # oc rsh -c ocs-metrics-exporter ocs-metrics-exporter-79665688cd-mlc97
        # while true; do echo -n "`date` - "; curl -sS --max-time 3 -w "%{stderr} %{http_code}\n" 127.0.0.1:8080/metrics 1>/dev/null; sleep 1; done

      Can this issue be reproduced from the UI?

      • No, only gaps in metrics are visible

      If this is a regression, please provide more details to justify this:

      • No

      Steps to Reproduce:

      1. Connect 100 RBD devices.

      2. Run the script above.

      3. Watch for timeouts.

      The exact date and time when the issue was observed, including timezone details:

      Every 5 minutes, on each cache resync (see the log timestamps below).

      Actual results:

      while true; do echo -n "`date` - "; curl -sS --max-time 3 -w "%{stderr} %{http_code}\n" 127.0.0.1:8080/metrics 1>/dev/null; sleep 1; done
      Tue Dec 23 08:50:36 UTC 2025 -  200
      Tue Dec 23 08:50:37 UTC 2025 -  200
      Tue Dec 23 08:50:38 UTC 2025 -  200
      Tue Dec 23 08:50:40 UTC 2025 -  200
      Tue Dec 23 08:50:41 UTC 2025 -  200
      Tue Dec 23 08:50:42 UTC 2025 -  200
      Tue Dec 23 08:50:43 UTC 2025 -  200
      Tue Dec 23 08:50:44 UTC 2025 -  200
      Tue Dec 23 08:50:45 UTC 2025 -  200
      Tue Dec 23 08:50:47 UTC 2025 -  200
      Tue Dec 23 08:50:48 UTC 2025 -  200
      Tue Dec 23 08:50:49 UTC 2025 -  200
      Tue Dec 23 08:50:50 UTC 2025 -  200
      Tue Dec 23 08:50:51 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received
       000
      Tue Dec 23 08:50:55 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received
       000
      Tue Dec 23 08:50:59 UTC 2025 - curl: (28) Operation timed out after 3001 milliseconds with 0 bytes received
       000
      
      ...
      
      Tue Dec 23 08:53:41 UTC 2025 - curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received
       000
      Tue Dec 23 08:53:45 UTC 2025 -  200
      Tue Dec 23 08:53:47 UTC 2025 -  200

      Expected results:

      • No timeouts

      Logs collected and log location:

       

      {"level":"error","ts":1766479842.3808334,"caller":"metrics/main.go:27","msg":"error encoding and sending metric family: write tcp 10.131.8.154:8080->10.131.8.2:50098: write: broken pipe","stacktrace":"main.promhttplogger.Println\n\t/src/remote_source/app/metrics/main.go:27\ngithub>
      {"level":"info","ts":1766479850.501043,"caller":"cache/pv.go:329","msg":"PV store Resync started at 2025-12-23 08:50:50.500957127 +0000 UTC m=+64093.896566641"}
      {"level":"info","ts":1766479850.5250835,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-112c7aae"}
      {"level":"info","ts":1766479850.5251598,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-1257d5c0"}
      {"level":"info","ts":1766479850.5251713,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-12cd84c5"}
      {"level":"info","ts":1766479850.5251808,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume local-pv-14e63ec7"}
      ...
      {"level":"info","ts":1766479860.0986178,"caller":"cache/ceph-blocklist.go:108","msg":"Blocklist store sync started 2025-12-23 08:51:00.098520699 +0000 UTC m=+64103.494130213"}
      {"level":"info","ts":1766479861.7315803,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-04ee2f10-5565-41a8-a9ff-c9a09fa90022"}
      {"level":"info","ts":1766479864.4305835,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-081197f7-f48e-4f65-8721-a704fac4d2c9"}
      {"level":"info","ts":1766479871.608866,"caller":"cache/rbd-mirror.go:301","msg":"RBD mirror store resync started at 2025-12-23 08:51:11.608785763 +0000 UTC m=+64115.004395277"}
      {"level":"info","ts":1766479871.608966,"caller":"cache/rbd-mirror.go:326","msg":"RBD mirror store resync ended at 2025-12-23 08:51:11.608954064 +0000 UTC m=+64115.004563572"}
      {"level":"info","ts":1766479893.5390751,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-4a3e2f8a-35e6-4676-a66e-8b9d6059372b"}
      {"level":"info","ts":1766479894.7108696,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-4c5a475b-d56d-4fe3-ba18-b59784676b73"}
      ...
      {"level":"info","ts":1766480005.6306944,"caller":"cache/pv.go:190","msg":"Skipping non Ceph CSI RBD volume pvc-e8ed5c29-24df-4dd9-981c-25b731b7b1dc"}
      {"level":"info","ts":1766480006.3181431,"caller":"cache/pv.go:345","msg":"PV store Resync ended at 2025-12-23 08:53:26.318103786 +0000 UTC m=+64249.713713299"}
      
      

      Additional info:

      Notice the lines: 

      {"level":"info","ts":1766479850.501043,"caller":"cache/pv.go:329","msg":"PV store Resync started at 2025-12-23 08:50:50.500957127 +0000 UTC m=+64093.896566641"}
      ...
      {"level":"info","ts":1766480006.3181431,"caller":"cache/pv.go:345","msg":"PV store Resync ended at 2025-12-23 08:53:26.318103786 +0000 UTC m=+64249.713713299"}
      
      

      And compare them with the timeouts from the script:

      • The last successful scrape:
        Tue Dec 23 08:50:50 UTC 2025 -  200

      • The first successful scrape afterwards:
        Tue Dec 23 08:53:45 UTC 2025 -  200

      The times line up exactly: the endpoint stops responding the moment the PV store resync starts (ts 1766479850 = 08:50:50 UTC) and only recovers after it ends (ts 1766480006 = 08:53:26 UTC).

      The following code shows that we take a lock ({{p.Mutex.Lock()}}) for the entire cache refresh. The refresh is a long-running operation, and it blocks metrics production for its whole duration because readers have to wait until the lock is released. We should instead collect the data into a temporary structure and take the lock only to swap the new data into the cache; see the demonstration and the sketch after the code below.

      https://github.com/red-hat-storage/ocs-operator/blob/main/metrics/internal/cache/pv.go:

       

        // Resync implements the Resync method of the store interface.
        func (p *PersistentVolumeStore) Resync() error {
          klog.Infof("PV store Resync started at %v", time.Now())

          pvList, err := p.kubeClient.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
          if err != nil {
            return fmt.Errorf("failed to list persistent volumes: %v", err)
          }
          // The write lock is taken here and, via the defer, held until every
          // PV in the list has been reprocessed. All readers block meanwhile.
          p.Mutex.Lock()
          defer p.Mutex.Unlock()
          for _, pv := range pvList.Items {
            err := p.add(&pv)
            if err != nil {
              return fmt.Errorf("failed to process PV: %s err: %v", pv.Name, err)
            }
          }
          klog.Infof("PV store Resync ended at %v", time.Now())
          return nil
        }
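
      To make the stall concrete, here is a small standalone Go program (an illustration, not ocs-operator code) showing how a reader on a sync.RWMutex waits out a long-held write lock, the same way a scrape waits out Resync:

        // Standalone demonstration: a reader blocked behind a long write lock.
        package main

        import (
          "fmt"
          "sync"
          "time"
        )

        func main() {
          var mu sync.RWMutex

          // "Resync": grab the write lock and hold it for three seconds.
          go func() {
            mu.Lock()
            defer mu.Unlock()
            time.Sleep(3 * time.Second)
          }()

          time.Sleep(100 * time.Millisecond) // let the writer take the lock first

          // "Scrape": a reader arriving mid-refresh must wait for the writer.
          start := time.Now()
          mu.RLock()
          fmt.Printf("reader waited %v\n", time.Since(start)) // prints roughly 2.9s
          mu.RUnlock()
        }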
      
      
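      A minimal sketch of the proposed fix, the build-then-swap pattern. All types and names below (the name-keyed Store map, the CSI filter standing in for p.add(), the Collect helper) are illustrative assumptions, not the actual ocs-operator implementation:

        // Sketch: rebuild the cache off to the side, lock only for the swap.
        package cache

        import (
          "context"
          "fmt"
          "sync"

          "github.com/prometheus/client_golang/prometheus"
          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
        )

        type PersistentVolumeStore struct {
          Mutex      sync.RWMutex
          kubeClient kubernetes.Interface
          // Hypothetical layout: PV name -> cached PV. The real store
          // keeps processed attributes rather than whole objects.
          Store map[string]corev1.PersistentVolume
        }

        // Resync rebuilds the cache into a temporary map and holds the
        // write lock only for the final swap, so readers are blocked for
        // microseconds instead of the full multi-minute refresh.
        func (p *PersistentVolumeStore) Resync() error {
          pvList, err := p.kubeClient.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
          if err != nil {
            return fmt.Errorf("failed to list persistent volumes: %v", err)
          }

          // Build the replacement state without holding the lock.
          next := make(map[string]corev1.PersistentVolume, len(pvList.Items))
          for i := range pvList.Items {
            pv := pvList.Items[i]
            if pv.Spec.CSI == nil {
              continue // stands in for the "non Ceph CSI RBD volume" filtering
            }
            next[pv.Name] = pv
          }

          // The critical section is now just a map swap.
          p.Mutex.Lock()
          p.Store = next
          p.Mutex.Unlock()
          return nil
        }

        // Collect shows the reader side: with the swap above, a scrape that
        // arrives mid-refresh immediately serves the previous snapshot
        // instead of timing out. desc is assumed to carry one label (PV name).
        func (p *PersistentVolumeStore) Collect(ch chan<- prometheus.Metric, desc *prometheus.Desc) {
          p.Mutex.RLock()
          defer p.Mutex.RUnlock()
          for name := range p.Store {
            ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, name)
          }
        }

      The trade-off is holding two copies of the PV data in memory for the duration of the rebuild, which is usually acceptable for an exporter of this kind.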

       

      Assignee: Divyansh Kamboj
      Reporter: Roman Hodain
      QA Contact: Harish NV Rao
      Votes: 0
      Watchers: 28