Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: 4.15, 4.16
Description
A frequent cause of test-run failures in the e2e-metal-ipi-ovn-ipv6 job is:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers
The specific "excessive" event is something like:
event [namespace/openshift-e2e-loki node/master-1.ostest.test.metalkube.org pod/loki-promtail-s2ggf hmsg/12a03e9173 - Back-off restarting failed container prod-bearer-token in pod loki-promtail-s2ggf_openshift-e2e-loki(5cfc8a21-bc04-4c34-a68e-7c60e04834ea)] happened 408 times
Poking around in the must-gather reveals that the prod-bearer-token container is exiting because:
level=info name=token-refresher ts=2024-02-06T13:27:36.138331121Z caller=main.go:169 msg=token-refresher
2024/02/06 13:27:36 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": proxyconnect tcp: dial tcp [fd00:1101::1]:8213: connect: connection refused
(Aside: sso.redhat.com has an IPv6 address, so theoretically we could no_proxy it?)
But since this works fine in other runs, it seems like the problem must be that squid sometimes crashes or becomes unreachable partway through the run? (Although I'd expect that to cause more failures than just this one, so maybe not?)
Unfortunately, I can't find any information about the state of the squid proxy in the e2e artifacts: squid is run by hand via podman, so its output isn't captured by must-gather, and it doesn't seem to log anything to the journal either. (Aside: the script that starts squid does "ssh root@${IP}" but then prefixes every command with "sudo"...)
So that's as far as I got with debugging this...
Issue Links
- clones: OCPBUGS-29478 "squid proxy sometimes crashing/unreachable/? in e2e-metal-ipi-ovn-ipv6 jobs" (Closed)