OpenShift Bugs / OCPBUGS-29634

PTP CI: invalid prometheus response in soak testing PTP CPU Utilization (Intermittent High)


    • Sprint: CNF RAN Sprint 249, CNF RAN Sprint 250

      The "Soak testing PTP CPU Utilization" test case failed because the rate(container_cpu_usage_seconds_total) query returned an invalid (empty) result vector from prometheus.

      This can happen in any of the dualnicbc-parallel, bc-parallel, or oc-parallel test suites.

      The issue is intermittent but occurs with high frequency.
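
      For reference, a minimal sketch of the query-and-validate step that fails here, written against the upstream github.com/prometheus/client_golang API (illustrative only, not the test suite's actual helper code; the prometheus address and the missing auth setup are placeholder assumptions). In the failing runs the query itself succeeds at the HTTP level (Status:success), but the returned vector is empty:

      package main

      import (
          "context"
          "fmt"
          "time"

          "github.com/prometheus/client_golang/api"
          v1 "github.com/prometheus/client_golang/api/prometheus/v1"
          "github.com/prometheus/common/model"
      )

      func main() {
          // Placeholder address: in the CI cluster the test goes through the
          // openshift-monitoring prometheus, which also requires a bearer token.
          client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
          if err != nil {
              panic(err)
          }
          promAPI := v1.NewAPI(client)

          // Same query shape as in the Actual Result log below.
          query := `rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s])`

          ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
          defer cancel()

          result, warnings, err := promAPI.Query(ctx, query, time.Now())
          if err != nil {
              panic(err)
          }
          if len(warnings) > 0 {
              fmt.Println("warnings:", warnings)
          }

          // The test expects exactly one sample for the pod/container; an empty
          // vector here is what surfaces as "Invalid result vector length".
          vec, ok := result.(model.Vector)
          if !ok || len(vec) != 1 {
              fmt.Printf("invalid result vector length: %d\n", len(vec))
              return
          }
          fmt.Printf("cpu usage: %v (ts: %v)\n", vec[0].Value, vec[0].Timestamp.Time())
      }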

      Actual Result:

      ft5.1: INFO   [Feb 18 00:44:14.868][ptp.go: 135] CPU Utilization TC Config: {CpuTestSpec:{TestSpec:{Enable:true FailureThreshold:3 Duration:5} CustomParams:{PromTimeWindow:70s Node:{CpuUsageThreshold:100} Pod:[{PodType:ptp-operator Container:<nil> CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:<nil> CpuUsageThreshold:80} {PodType:linuxptp-daemon Container:cloud-event-proxy CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:linuxptp-daemon-container CpuUsageThreshold:40}]}} Description:The test measures PTP CPU usage and fails if >15mcores}
      INFO   [Feb 18 00:44:14.964][ptp.go: 165] Configured rate timeWindow: 1m10s, cadvisor scrape interval: 30 secs.
      INFO   [Feb 18 00:45:14.965][ptp.go: 186] Running test for 5m0s (failure threshold: 3)
      INFO   [Feb 18 00:46:14.965][ptp.go: 196] Retrieving cpu usage of the ptp pods.
      DEBUG  [Feb 18 00:46:14.965][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 0
      WARNING[Feb 18 00:46:15.134][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
      WARNING[Feb 18 00:46:15.134][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
      DEBUG  [Feb 18 00:46:15.134][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 1
      WARNING[Feb 18 00:46:16.295][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
      WARNING[Feb 18 00:46:16.295][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
      DEBUG  [Feb 18 00:46:16.295][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 2
      WARNING[Feb 18 00:46:17.460][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
      WARNING[Feb 18 00:46:17.460][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
      DEBUG  [Feb 18 00:46:17.460][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 3
      WARNING[Feb 18 00:46:18.622][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
      WARNING[Feb 18 00:46:18.622][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
      DEBUG  [Feb 18 00:46:18.622][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 4
      WARNING[Feb 18 00:46:19.771][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
      WARNING[Feb 18 00:46:19.771][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
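      The retry pattern above (attempts 0 through 4, each logging "Invalid result vector length" followed by "Failed to get a prometheus response") looks like a fixed-count retry of the same instant query. A rough, purely illustrative reconstruction of that loop, based only on the log output and not on the actual test code:

      package main

      import (
          "errors"
          "fmt"
      )

      // queryFn stands in for the real prometheus helper; for this sketch it
      // only reports the length of the returned result vector.
      type queryFn func(query string) (resultLen int, err error)

      // queryWithRetries mirrors the behaviour visible in the log: re-run the
      // same query up to `attempts` times and treat anything other than a
      // single-sample vector as an invalid response.
      func queryWithRetries(query string, attempts int, run queryFn) error {
          for attempt := 0; attempt < attempts; attempt++ {
              fmt.Printf("Querying prometheus, query %s, attempt %d\n", query, attempt)
              n, err := run(query)
              if err == nil && n == 1 {
                  return nil
              }
              fmt.Printf("Invalid result vector length in prometheus response (len=%d)\n", n)
          }
          return errors.New("failed to get a valid prometheus response after retries")
      }

      func main() {
          // A handler that always returns an empty vector reproduces the
          // failing pattern seen in the Actual Result above.
          err := queryWithRetries(`rate(container_cpu_usage_seconds_total{...}[1m10s])`, 5, func(string) (int, error) {
              return 0, nil
          })
          fmt.Println(err)
      }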

      Expected Result:

              
      pt8.1: INFO   [Feb 17 17:05:22.685][ptp.go: 135] CPU Utilization TC Config: {CpuTestSpec:{TestSpec:{Enable:true FailureThreshold:3 Duration:5} CustomParams:{PromTimeWindow:70s Node:{CpuUsageThreshold:100} Pod:[{PodType:ptp-operator Container:<nil> CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:<nil> CpuUsageThreshold:80} {PodType:linuxptp-daemon Container:cloud-event-proxy CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:linuxptp-daemon-container CpuUsageThreshold:40}]}} Description:The test measures PTP CPU usage and fails if >15mcores}
      INFO   [Feb 17 17:05:22.783][ptp.go: 165] Configured rate timeWindow: 1m10s, cadvisor scrape interval: 30 secs.
      INFO   [Feb 17 17:06:22.784][ptp.go: 186] Running test for 5m0s (failure threshold: 3)
      INFO   [Feb 17 17:07:22.785][ptp.go: 196] Retrieving cpu usage of the ptp pods.
      DEBUG  [Feb 17 17:07:22.785][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-7v69m", container=""}[1m10s]), attempt 0
      DEBUG  [Feb 17 17:07:22.922][ptptesthelper.go: 497] Pod: linuxptp-daemon-7v69m, container:  (ns openshift-ptp) cpu usage: 0.0005737646459500775 (ts: 2024-02-17 17:06:46.349 +0000 UTC)
      INFO   [Feb 17 17:07:22.922][ptp.go: 232] Node master1.ptpcimno.telco5gran.eng.rdu2.redhat.com: pod: linuxptp-daemon-7v69m (ns:openshift-ptp) cpu usage: 0.00057
      DEBUG  [Feb 17 17:07:22.922][ptp.go: 240] Checking cpu usage of pod linuxptp-daemon-7v69m. Cpu Usage: 0.00057 - Threshold: 0.08000
      DEBUG  [Feb 17 17:07:22.922][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-7v69m", container="cloud-event-proxy"}[1m10s]), attempt 0
      DEBUG  [Feb 17 17:07:23.074][ptptesthelper.go: 497] Pod: linuxptp-daemon-7v69m, container: cloud-event-proxy (ns openshift-ptp) cpu usage: 0.00010342890579286081 (ts: 2024-02-17 17:06:46.5 +0000 UTC)
      INFO   [Feb 17 17:07:23.074][ptp.go: 254] Node master1.ptpcimno.telco5gran.eng.rdu2.redhat.com: pod: linuxptp-daemon-7v69m, container: cloud-event-proxy (ns:openshift-ptp) cpu usage: 0.00010
      DEBUG  [Feb 17 17:07:23.074][ptp.go: 262] Checking cpu usage of container cloud-event-proxy (pod linuxptp-daemon-7v69m). Cpu Usage: 0.00010 - Threshold: 0.03000
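
      Note on the configuration logged above ("rate timeWindow: 1m10s, cadvisor scrape interval: 30 secs"): rate() needs at least two raw samples inside the window to return anything, and a 70s window over a 30s scrape interval normally contains only 2-3 samples. If even one scrape is missing from that window (for example because its samples were dropped as out-of-order, see the prometheus log excerpt further down), rate() can return an empty vector for that pod, which matches the failure above. A small arithmetic sketch of that margin:

      package main

      import (
          "fmt"
          "time"
      )

      func main() {
          window := 70 * time.Second         // PromTimeWindow from the TC config above
          scrapeInterval := 30 * time.Second // cadvisor scrape interval from the log

          // rate() needs at least two samples inside the window to produce a
          // value; with 70s / 30s there are normally only 2-3 samples, so a
          // single dropped or delayed scrape can leave the window with one.
          samplesInWindow := int(window / scrapeInterval)
          fmt.Printf("samples expected in window: %d (need >= 2 for rate())\n", samplesInWindow)

          // A commonly suggested lower bound is a window of at least 2x the
          // scrape interval, with some extra margin (e.g. 2.5x-4x).
          fmt.Printf("window/scrape ratio: %.2f\n", float64(window)/float64(scrapeInterval))
      }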
      

      Not sure if it is related, but the prometheus pod logs a lot of errors like these during the test:

      ts=2024-02-18T00:44:22.011Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:46:11.747Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.102:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:46:45.202Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.110:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:47:41.743Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.102:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:48:52.046Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:53:22.107Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:53:52.051Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
      ts=2024-02-18T00:56:22.054Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https: 
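
      If it helps to confirm or rule that out, one possible diagnostic is to count the raw cadvisor samples that fall inside the same 1m10s window around the failure time, e.g. with count_over_time. A hedged sketch (same illustrative client setup and placeholder address as in the earlier snippet):

      package main

      import (
          "context"
          "fmt"
          "time"

          "github.com/prometheus/client_golang/api"
          v1 "github.com/prometheus/client_golang/api/prometheus/v1"
          "github.com/prometheus/common/model"
      )

      func main() {
          // Placeholder address/auth, as in the earlier sketch.
          client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
          if err != nil {
              panic(err)
          }
          promAPI := v1.NewAPI(client)

          // Count the raw cadvisor samples that fall inside the same 1m10s
          // window the test uses; anywhere this drops below 2, rate() over
          // that window returns nothing and the test sees an empty vector.
          query := `count_over_time(container_cpu_usage_seconds_total{namespace="openshift-ptp", container=""}[1m10s])`

          end := time.Now()
          start := end.Add(-10 * time.Minute) // roughly the span of a soak-test run
          ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
          defer cancel()

          result, _, err := promAPI.QueryRange(ctx, query, v1.Range{Start: start, End: end, Step: 30 * time.Second})
          if err != nil {
              panic(err)
          }
          matrix, ok := result.(model.Matrix)
          if !ok {
              panic(fmt.Sprintf("unexpected result type %T", result))
          }
          for _, series := range matrix {
              for _, sample := range series.Values {
                  if sample.Value < 2 {
                      fmt.Printf("%s at %s: only %.0f samples in window\n",
                          series.Metric["pod"], sample.Timestamp.Time(), float64(sample.Value))
                  }
              }
          }
      }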

            jacding@redhat.com Jack Ding
            Hen Shay Hassid