- Bug
- Resolution: Done
- Normal
- None
- 4.15.0
- No
- CNF RAN Sprint 249, CNF RAN Sprint 250
- 2
- False
The Soak testing PTP CPU Utilization test failed due to an invalid (empty) result vector returned by the rate(container_cpu_usage_seconds_total) Prometheus query.
This can happen in any of the dualnicbc-parallel, bc-parallel, or oc-parallel test suites.
The issue is intermittent, but it occurs with HIGH frequency.
Actual Result:
ft5.1:
INFO    [Feb 18 00:44:14.868][ptp.go: 135] CPU Utilization TC Config: {CpuTestSpec:{TestSpec:{Enable:true FailureThreshold:3 Duration:5} CustomParams:{PromTimeWindow:70s Node:{CpuUsageThreshold:100} Pod:[{PodType:ptp-operator Container:<nil> CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:<nil> CpuUsageThreshold:80} {PodType:linuxptp-daemon Container:cloud-event-proxy CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:linuxptp-daemon-container CpuUsageThreshold:40}]}} Description:The test measures PTP CPU usage and fails if >15mcores}
INFO    [Feb 18 00:44:14.964][ptp.go: 165] Configured rate timeWindow: 1m10s, cadvisor scrape interval: 30 secs.
INFO    [Feb 18 00:45:14.965][ptp.go: 186] Running test for 5m0s (failure threshold: 3)
INFO    [Feb 18 00:46:14.965][ptp.go: 196] Retrieving cpu usage of the ptp pods.
DEBUG   [Feb 18 00:46:14.965][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 0
WARNING [Feb 18 00:46:15.134][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
WARNING [Feb 18 00:46:15.134][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
DEBUG   [Feb 18 00:46:16.295][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]), attempt 1
WARNING [Feb 18 00:46:16.295][ptptesthelper.go: 481] Invalid result vector length in prometheus response: {Status:success Error: Data:{ResultType:vector Result:0xc00012c060}}
WARNING [Feb 18 00:46:16.295][prometheus.go: 135] Failed to get a prometheus response for query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-ghrpk", container=""}[1m10s]): <nil>
(attempts 2, 3, and 4 — timestamps 00:46:17.460 through 00:46:19.771 — repeat the same DEBUG/WARNING/WARNING sequence for the identical query)
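A possible explanation for the intermittent empty vector, given the logged config (PromTimeWindow: 70s, cadvisor scrape interval: 30 secs): PromQL's rate() needs at least two samples of the series inside the range window, so a 70s window normally holds 2-3 samples, but if scrapes are dropped (e.g. the out-of-order ingestion errors below), a window can end up with fewer than two samples and rate() returns no result at all, which ptptesthelper.go then rejects as an invalid vector length. The following is a minimal, simplified Go sketch of that failure mode (the sample type and rateOverWindow helper are illustrative, not the project's code, and real PromQL rate() additionally extrapolates to the window boundaries):

```go
package main

import "fmt"

// sample is a hypothetical timestamped counter reading (illustrative only).
type sample struct {
	ts    float64 // seconds
	value float64 // cumulative CPU seconds, as in container_cpu_usage_seconds_total
}

// rateOverWindow mimics the core rate() rule: with fewer than two samples
// inside (windowStart, windowEnd], there is nothing to compute a rate from,
// so the series is simply absent from the result vector.
func rateOverWindow(samples []sample, windowStart, windowEnd float64) (float64, bool) {
	var inWindow []sample
	for _, s := range samples {
		if s.ts > windowStart && s.ts <= windowEnd {
			inWindow = append(inWindow, s)
		}
	}
	if len(inWindow) < 2 {
		return 0, false // empty result vector for this series
	}
	first, last := inWindow[0], inWindow[len(inWindow)-1]
	return (last.value - first.value) / (last.ts - first.ts), true
}

func main() {
	// 30s scrape interval, 70s window ending at t=100: three samples present.
	healthy := []sample{{40, 1.0}, {70, 1.5}, {100, 2.0}}
	if r, ok := rateOverWindow(healthy, 30, 100); ok {
		fmt.Printf("healthy: rate=%.4f cores\n", r)
	}

	// Two of the three scrapes dropped (e.g. out-of-order samples rejected
	// at ingestion): only one sample left, so rate() yields nothing.
	degraded := []sample{{70, 1.5}}
	if _, ok := rateOverWindow(degraded, 30, 100); !ok {
		fmt.Println("degraded: empty result vector")
	}
}
```

If this is the mechanism, retrying the same instant query (as prometheus.go already does) cannot help while the window stays under-sampled, which would match the five consecutive failed attempts in the log above.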
Expected Result:
pt8.1:
INFO    [Feb 17 17:05:22.685][ptp.go: 135] CPU Utilization TC Config: {CpuTestSpec:{TestSpec:{Enable:true FailureThreshold:3 Duration:5} CustomParams:{PromTimeWindow:70s Node:{CpuUsageThreshold:100} Pod:[{PodType:ptp-operator Container:<nil> CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:<nil> CpuUsageThreshold:80} {PodType:linuxptp-daemon Container:cloud-event-proxy CpuUsageThreshold:30} {PodType:linuxptp-daemon Container:linuxptp-daemon-container CpuUsageThreshold:40}]}} Description:The test measures PTP CPU usage and fails if >15mcores}
INFO    [Feb 17 17:05:22.783][ptp.go: 165] Configured rate timeWindow: 1m10s, cadvisor scrape interval: 30 secs.
INFO    [Feb 17 17:06:22.784][ptp.go: 186] Running test for 5m0s (failure threshold: 3)
INFO    [Feb 17 17:07:22.785][ptp.go: 196] Retrieving cpu usage of the ptp pods.
DEBUG   [Feb 17 17:07:22.785][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-7v69m", container=""}[1m10s]), attempt 0
DEBUG   [Feb 17 17:07:22.922][ptptesthelper.go: 497] Pod: linuxptp-daemon-7v69m, container: (ns openshift-ptp) cpu usage: 0.0005737646459500775 (ts: 2024-02-17 17:06:46.349 +0000 UTC)
INFO    [Feb 17 17:07:22.922][ptp.go: 232] Node master1.ptpcimno.telco5gran.eng.rdu2.redhat.com: pod: linuxptp-daemon-7v69m (ns:openshift-ptp) cpu usage: 0.00057
DEBUG   [Feb 17 17:07:22.922][ptp.go: 240] Checking cpu usage of pod linuxptp-daemon-7v69m. Cpu Usage: 0.00057 - Threshold: 0.08000
DEBUG   [Feb 17 17:07:22.922][prometheus.go: 119] Querying prometheus, query rate(container_cpu_usage_seconds_total{namespace="openshift-ptp", pod="linuxptp-daemon-7v69m", container="cloud-event-proxy"}[1m10s]), attempt 0
DEBUG   [Feb 17 17:07:23.074][ptptesthelper.go: 497] Pod: linuxptp-daemon-7v69m, container: cloud-event-proxy (ns openshift-ptp) cpu usage: 0.00010342890579286081 (ts: 2024-02-17 17:06:46.5 +0000 UTC)
INFO    [Feb 17 17:07:23.074][ptp.go: 254] Node master1.ptpcimno.telco5gran.eng.rdu2.redhat.com: pod: linuxptp-daemon-7v69m, container: cloud-event-proxy (ns:openshift-ptp) cpu usage: 0.00010
DEBUG   [Feb 17 17:07:23.074][ptp.go: 262] Checking cpu usage of container cloud-event-proxy (pod linuxptp-daemon-7v69m). Cpu Usage: 0.00010 - Threshold: 0.03000
Not sure if it is related, but the Prometheus pod logs many errors like these during the test:
ts=2024-02-18T00:44:22.011Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:46:11.747Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.102:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:46:45.202Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.110:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:47:41.743Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.102:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:48:52.046Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:53:22.107Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:53:52.051Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https://10.8.34.105:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=13
ts=2024-02-18T00:56:22.054Z caller=scrape.go:1655 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kubelet/1 target=https: