Type: Bug
Resolution: Not a Bug
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.15.z
Component/s: Networking / router
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
2
Severity:
Critical
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
NE Sprint 257
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

  SREP began receiving an increased number of errors from the console probes and observed frequent restarts of the `router-default` pods.

What triage steps have been taken so far?:

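Console probes are failing, the `router-default` pods are hitting i/o timeouts on their `backend-proxy-http` health check, and the `sdn-controller` pod logs RBAC warnings: its service account is forbidden from listing pods and services at the cluster scope.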

Issues are persistent.

What logs have been reviewed (attach them?):

blackbox-exporter probes for the console are failing.

ts=2024-07-09T09:40:31.413312762Z caller=main.go:189 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Beginning probe" probe=http timeout_seconds=5
ts=2024-07-09T09:40:31.413415127Z caller=http.go:328 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Resolving target address" target=console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com ip_protocol=ip6
ts=2024-07-09T09:40:31.41467833Z caller=http.go:328 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Resolved target address" target=console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com ip=54.237.159.91
ts=2024-07-09T09:40:31.414763196Z caller=client.go:252 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Making HTTP request" url=https://54.237.159.91/health host=console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com
ts=2024-07-09T09:40:36.417621502Z caller=handler.go:119 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=error msg="Error for HTTP request" err="Get \"https://54.237.159.91/health\": context deadline exceeded"
ts=2024-07-09T09:40:36.417670499Z caller=handler.go:119 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Response timings for roundtrip" roundtrip=0 start=2024-07-09T09:40:31.41484715Z dnsDone=2024-07-09T09:40:31.41484715Z connectDone=2024-07-09T09:40:31.41583556Z gotConn=0001-01-01T00:00:00Z responseStart=0001-01-01T00:00:00Z tlsStart=2024-07-09T09:40:31.41585747Z tlsDone=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
ts=2024-07-09T09:40:36.417692223Z caller=main.go:189 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=error msg="Probe failed" duration_seconds=5.004348192
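The "Response timings for roundtrip" line above records one timestamp per request phase, and phases that were never reached are logged as Go's zero time (0001-01-01T00:00:00Z). The following is a minimal sketch, not part of the original triage, that reads those fields to find the last phase the probe reached. The helper names `parse_timings` and `last_completed_phase` are hypothetical, and the phase ordering is an assumption based on the usual sequence of an HTTPS round trip.

```python
import re

# Go's zero time.Time, which appears in the log for phases never reached.
ZERO_TIME = "0001-01-01T00:00:00Z"

# Assumed order of phases during an HTTPS round trip.
PHASES = ["start", "dnsDone", "connectDone", "tlsStart", "tlsDone",
          "gotConn", "responseStart", "end"]


def parse_timings(line: str) -> dict:
    """Extract the key=value timing fields from a 'Response timings' log line."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    return {p: fields[p] for p in PHASES if p in fields}


def last_completed_phase(line: str):
    """Return the last phase with a non-zero timestamp, or None if there is none."""
    timings = parse_timings(line)
    reached = [p for p in PHASES if timings.get(p, ZERO_TIME) != ZERO_TIME]
    return reached[-1] if reached else None


# Timing fields copied from the failing probe's log line above.
sample = (
    'msg="Response timings for roundtrip" roundtrip=0 '
    "start=2024-07-09T09:40:31.41484715Z dnsDone=2024-07-09T09:40:31.41484715Z "
    "connectDone=2024-07-09T09:40:31.41583556Z gotConn=0001-01-01T00:00:00Z "
    "responseStart=0001-01-01T00:00:00Z tlsStart=2024-07-09T09:40:31.41585747Z "
    "tlsDone=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z"
)

print(last_completed_phase(sample))  # prints "tlsStart"
```

For this log line the TCP connect finished and the TLS handshake started, but `tlsDone` was never recorded before the 5-second probe timeout. That suggests the stall is at or behind the TLS-terminating endpoint, not in DNS or TCP connect.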

$ oc describe pod router-default-7bf67dcb5c-h7x2z -n openshift-ingress
Last State: Terminated Reason: Error Message: http failed: read tcp 127.0.0.1:40526->127.0.0.1:80: i/o timeout 
I0705 09:50:58.430264 1 healthz.go:261] backend-proxy-http check failed: healthz [-]backend-proxy-http failed: read tcp 127.0.0.1:53010->127.0.0.1:80: i/o timeout 
I0705 09:50:58.430273 1 healthz.go:261] backend-proxy-http check failed: healthz [-]backend-proxy-http failed: read tcp 127.0.0.1:53002->127.0.0.1:80: i/o timeout

$ oc logs sdn-controller-hlq4w -n openshift-sdn
I0705 10:51:49.147384 1 master.go:56] Initializing SDN master 
W0705 10:51:49.161823 1 master.go:156] Failed to list pods: pods is forbidden: User "system:serviceaccount:openshift-sdn:sdn-controller" cannot list resource "pods" in API group "" at the cluster scope 
W0705 10:51:49.163066 1 master.go:161] Failed to list services: services is forbidden: User "system:serviceaccount:openshift-sdn:sdn-controller" cannot list resource "services" in API group "" at the cluster scope
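The two warnings above follow the standard Kubernetes "forbidden" message shape, so the denied permissions can be summarized mechanically when reviewing longer logs. The following is a minimal sketch under that assumption. The function name `denied_permissions` and the regular expression are illustrative and do not come from any OpenShift tooling.

```python
import re

# Matches the "forbidden" message format seen in the warnings above.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def denied_permissions(log_text: str) -> list:
    """Return sorted, de-duplicated (user, verb, resource, api_group) tuples."""
    found = {
        (m["user"], m["verb"], m["resource"], m["group"])
        for m in FORBIDDEN_RE.finditer(log_text)
    }
    return sorted(found)


# Warning lines copied from the sdn-controller log above.
sample = '''\
W0705 10:51:49.161823 1 master.go:156] Failed to list pods: pods is forbidden: User "system:serviceaccount:openshift-sdn:sdn-controller" cannot list resource "pods" in API group "" at the cluster scope
W0705 10:51:49.163066 1 master.go:161] Failed to list services: services is forbidden: User "system:serviceaccount:openshift-sdn:sdn-controller" cannot list resource "services" in API group "" at the cluster scope
'''

for user, verb, resource, group in denied_permissions(sample):
    # An empty API group denotes the core group.
    print(f"{user}: {verb} {resource} (group={group or 'core'})")
```

Each tuple can then be cross-checked against the live cluster, for example with `oc auth can-i list pods --as=system:serviceaccount:openshift-sdn:sdn-controller --all-namespaces`, assuming the caller has impersonation rights.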

The cluster completed an upgrade to 4.15.19 shortly before we started seeing the issue.

Additional info:

Ongoing thread with networking team - https://redhat-internal.slack.com/archives/CDCP2LA9L/p1720179270455589

Assignee: Andrey Lebedev

Reporter: Raphael But

Need Info From: None

Contributors: None

QA Contact: Hongan Li

Doc Contact: None

Votes: 0

Watchers: 7

Created: 2024/07/09 10:05 AM

Updated: 2025/07/22 11:23 AM

Resolved: 2024/07/23 6:02 PM
