Details
Bug | Resolution: Done | Major | None | 4.10 | Important | Rejected | Unspecified
Description
+++ This bug was initially created as a clone of Bug #2048575 +++
Description of problem:
This occurs on single-node clusters on AWS.
The Prow job fails because an alert fires when the ip-reconciler Job (as in a Kubernetes Job) fails.
Version-Release number of selected component (if applicable): registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954
Please see the following Prow job for more information:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816
The failure occurred before any of our test code ran, which indicates that the issue arose during installation.
How reproducible:
This is intermittent but frequent; the majority of single-node Prow jobs fail.
Steps to Reproduce:
1. Run the e2e job for single node.
2. Note that the job fails due to issues in the ip-reconciler.
3. Grab the logs for the failed pod.
Here is an example:
```
oc logs -n openshift-multus ip-reconciler-27389955-cvl7z
/home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log
2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880 1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
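When checking several must-gather archives for this failure, it can help to count how often the error signature appears in a given log file. The following is a minimal sketch, assuming the logs are plain text files on disk; the function name `count_pool_timeouts` is hypothetical and not part of `oc` or whereabouts.

```shell
# Hypothetical helper: count the lines in a log file that carry the
# ip-reconciler failure signature quoted in this report.
count_pool_timeouts() {
  # -c prints the number of matching lines; -F treats the pattern as a fixed string.
  # grep exits non-zero when nothing matches, so "|| true" keeps a count of 0
  # from being reported as a failure of the helper itself.
  grep -cF 'failed to retrieve all IP pools: context deadline exceeded' "$1" || true
}
```

Usage would look like `count_pool_timeouts path/to/current.log`. Both error lines in the example above contain the signature, so a single failed run contributes two matching lines.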
Actual results:
The IP reconciliation fails with the error:
"failed to retrieve all IP pools: context deadline exceeded"
Expected results:
The job should not fail.
Additional info:
— Additional comment from Douglas Smith on 2022-01-31 22:11:50 UTC —
I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once.
The logs I got were:
```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763 1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
However, if I create a job from the cronjob manually, the job completes successfully, e.g.:
```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```
I see the job complete like so:
```
$ oc get pods | grep -iP "name|testrun"
NAME READY STATUS RESTARTS AGE
testrun-ip-reconciler-pwrmc 0/1 Completed 0 102s
```
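For repeated checks, the manual run can be wrapped so that it also waits for the Job to finish. This is only a sketch, assuming a logged-in `oc` session with access to `openshift-multus`; the function name `run_reconciler_once` and the 120-second timeout are illustrative choices, not taken from the commands above.

```shell
# Hypothetical wrapper: create a one-off Job from the ip-reconciler CronJob
# and block until it completes or the timeout expires.
run_reconciler_once() {
  # Job name defaults to the one used in the manual test above.
  job_name="${1:-testrun-ip-reconciler}"
  ns="openshift-multus"
  oc create job --from=cronjob/ip-reconciler -n "$ns" "$job_name" || return 1
  # oc wait exits non-zero if the condition is not met within the timeout.
  oc wait --for=condition=complete "job/$job_name" -n "$ns" --timeout=120s
}
```

Calling `run_reconciler_once` with no argument reuses the `testrun-ip-reconciler` name; passing a different name avoids a conflict if that Job already exists. A non-zero exit status means either the Job could not be created or it did not reach `Complete` within the timeout.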
This appears to be an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).