OCPBUGS-9110: IP reconciler cron job failing on single node

Details

    • Important
    • Rejected
    • Unspecified

    Description

      +++ This bug was initially created as a clone of Bug #2048575 +++

      Description of problem:

      This occurs on single-node OpenShift on AWS.
      The Prow job fails because an alert fires when the ip-reconciler Job (the Kubernetes Job created by the CronJob) fails.

      Version-Release number of selected component (if applicable): registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

      Please see the following Prow job for more information:
      https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816

      The failure occurred before any of our test code ran, indicating that the issue was during installation.

      How reproducible:

      This is intermittent but frequent; the majority of single-node Prow jobs fail.

      Steps to Reproduce:
      1. Run the e2e job for single node.
      2. Note that the job fails due to issues in the ip-reconciler.
      3. Grab the logs for the failed pod.

      Here is an example:

      ```
      oc logs -n openshift-multus ip-reconciler-27389955-cvl7z
      /home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log
      2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880 1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
      2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
      2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
      ```
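
      As a follow-up diagnostic (not part of the original report), one could list the whereabouts custom resources the reconciler reads, to check whether those API calls are themselves slow on single node or whether only the cron-scheduled pod hits its deadline. This is a sketch and assumes the standard upstream whereabouts CRD names:

      ```
      # Sketch: enumerate the resources the ip-reconciler walks (assumed CRD
      # names from the upstream whereabouts plugin; adjust if they differ).
      oc get ippools.whereabouts.cni.cncf.io -A
      oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
      ```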

      Actual results:
      The IP reconciliation fails with the error:

      "failed to retrieve all IP pools: context deadline exceeded"

      Expected results:
      The job should not fail.

      Additional info:

      — Additional comment from Douglas Smith on 2022-01-31 22:11:50 UTC —

      I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once.

      The logs I got were:

      ```
      $ oc logs ip-reconciler-27394425-wqwdj
      I0131 21:47:10.708763 1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
      2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
      2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
      ```
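
      (For reference, reproducing this way is mostly a matter of waiting for the next scheduled run; a minimal sketch, assuming the default ip-reconciler CronJob name and namespace:)

      ```
      # Sketch: watch for the next scheduled run of the reconciler.
      oc get cronjob ip-reconciler -n openshift-multus
      oc get jobs,pods -n openshift-multus --watch
      ```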

      However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

      ```
      $ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
      ```

      I see the job complete like so:

      ```
      $ oc get pods | grep -iP "name|testrun"
      NAME                          READY   STATUS      RESTARTS   AGE
      testrun-ip-reconciler-pwrmc   0/1     Completed   0          102s
      ```

      This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, around installation).
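
      A rough way to test that hypothesis (not from the original report) would be to time the discovery requests that show up throttled in the logs above, to see whether client-side rate limiting plus slow API discovery eats into the reconciler's context deadline:

      ```
      # Sketch: time the discovery endpoints seen in the throttling messages.
      time oc get --raw /apis/security.openshift.io/v1
      time oc get --raw /apis/helm.openshift.io/v1beta1
      ```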
