OCPBUGS-9110: IP reconciler cron job failing on single node

Details

    • Important
    • Rejected
    • Unspecified

    Description

      +++ This bug was initially created as a clone of Bug #2048575 +++

      Description of problem:

      This occurs on single-node OpenShift on AWS.
      The Prow job fails because an alert fires when the ip-reconciler Job (the Kubernetes Job created by the CronJob) fails.

      Version-Release number of selected component (if applicable): registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

      Please see the following Prow job for more information:
      https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816

      The failure occurred before any of our test code ran, indicating that the issue was during installation.

      How reproducible:

      This is intermittent but frequent; the majority of single-node Prow jobs fail.

      Steps to Reproduce:
      1. Run the e2e job for single node.
      2. Note that the job fails due to issues in the ip-reconciler.
      3. Grab the logs for the failed pod.

      Here is an example:

      ```
      oc logs -n openshift-multus ip-reconciler-27389955-cvl7z
      /home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log
      2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880 1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
      2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
      2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
      ```
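
      As a follow-up diagnostic (not part of the original report), one could list the whereabouts custom resources the reconciler reads, to check whether those API calls are themselves slow on single node or whether only the cron-scheduled pod hits its deadline. This is a sketch and assumes the standard upstream whereabouts CRD names:

      ```
      # Sketch: enumerate the resources the ip-reconciler walks (assumed CRD
      # names from the upstream whereabouts plugin; adjust if they differ).
      oc get ippools.whereabouts.cni.cncf.io -A
      oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
      ```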

      Actual results:
      The IP reconciliation fails with the error:

      "failed to retrieve all IP pools: context deadline exceeded"

      Expected results:
      The job should not fail.

      Additional info:

      — Additional comment from Douglas Smith on 2022-01-31 22:11:50 UTC —

      I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once.

      The logs I got were:

      ```
      $ oc logs ip-reconciler-27394425-wqwdj
      I0131 21:47:10.708763 1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
      2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
      2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
      ```
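
      (For reference, reproducing this way is mostly a matter of waiting for the next scheduled run; a minimal sketch, assuming the default ip-reconciler CronJob name and namespace:)

      ```
      # Sketch: watch for the next scheduled run of the reconciler.
      oc get cronjob ip-reconciler -n openshift-multus
      oc get jobs,pods -n openshift-multus --watch
      ```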

      However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

      ```
      $ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
      ```

      I see the job complete like so:

      ```
      $ oc get pods | grep -iP "name|testrun"
      NAME                          READY   STATUS      RESTARTS   AGE
      testrun-ip-reconciler-pwrmc   0/1     Completed   0          102s
      ```

      This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, around installation).
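
      A rough way to test that hypothesis (not from the original report) would be to time the discovery requests that show up throttled in the logs above, to see whether client-side rate limiting plus slow API discovery eats into the reconciler's context deadline:

      ```
      # Sketch: time the discovery endpoints seen in the throttling messages.
      time oc get --raw /apis/security.openshift.io/v1
      time oc get --raw /apis/helm.openshift.io/v1beta1
      ```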
