OCP Technical Release Team / TRT-855

Intervals from node logs can block multiple critical job run artifacts from being published

    • Type: Bug
    • Resolution: Done
    • Priority: Major

      Found in: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-machine-config-operator-3485-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1625681121771524096

      Specifically the build log: https://storage.googleapis.com/origin-ci-test/logs/openshift-machine-config-operator-3485-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1625681121771524096/build-log.txt

      We got no conformance intervals chart here because of this error in the log:

      Collection of node logs and analysis took: 30.078488779s
      Suite run returned error: InsertIntervalsFromClusterError: [error trying to reach service: dial tcp 10.0.128.6:10250: i/o timeout, error trying to reach service: dial tcp 10.0.128.4:10250: i/o timeout]
      error: InsertIntervalsFromClusterError: [error trying to reach service: dial tcp 10.0.128.6:10250: i/o timeout, error trying to reach service: dial tcp 10.0.128.4:10250: i/o timeout]
      Shutting down SimultaneousPodIPController
      SimultaneousPodIPController shut down
      

      The error comes from InsertIntervalsFromCluster, which gathers node logs. In this job two nodes stopped responding, so fetching their node logs failed. In options_monitor_events.go, the call to this intervals function simply returns as soon as it gets an error, so none of the remaining artifacts are produced.

      The code must be updated to log and continue as best it can.

              rhn-engineering-dgoodwin Devan Goodwin