OpenShift Bugs / OCPBUGS-4455

[RHEL scale up] increase the wait time so that the node has enough time to get ready


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • None
    • 4.9.z
    • None

      This bug is a backport clone of [Bugzilla Bug 2090151](https://bugzilla.redhat.com/show_bug.cgi?id=2090151). The following is the description of the original bug:

      Version :
      4.9.0-0.nightly-2022-05-24-200205

      The scale-up job sometimes hits the following error, even though all nodes eventually become Ready and the cluster is healthy.

      TASK [openshift_node : Wait for node to report ready] **************************
      Wednesday 25 May 2022 14:25:10 +0800 (0:00:19.202) 0:13:32.778 *********
      FAILED - RETRYING: Wait for node to report ready (30 retries left).
      <-SNIP->
      FAILED - RETRYING: Wait for node to report ready (1 retries left).
      fatal: [ip-10-0-60-71.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-60-71.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.249540", "end": "2022-05-25 14:35:24.212666", "rc": 0, "start": "2022-05-25 14:35:23.963126", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
      fatal: [ip-10-0-61-254.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-61-254.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.266898", "end": "2022-05-25 14:35:24.213355", "rc": 0, "start": "2022-05-25 14:35:23.946457", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
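The failing task is a bounded polling loop: it runs `oc get node` up to 30 times and gives up if the Ready condition never reports `True`. A minimal Python sketch of that pattern follows. The function names `oc_ready_status` and `wait_for_ready` are hypothetical, and the 20-second delay in the usage note is only inferred from 30 attempts taking roughly 10 minutes; the actual `retries` and `delay` values of the Ansible task are not shown in this report.

```python
import subprocess
import time
from typing import Callable

# JSONPath taken from the failing command in the log above.
READY_JSONPATH = '{.status.conditions[?(@.type=="Ready")].status}'


def oc_ready_status(node: str, kubeconfig: str) -> str:
    """Return the node's Ready condition status ("True", "False", "Unknown")."""
    result = subprocess.run(
        ["oc", "get", "node", node,
         f"--kubeconfig={kubeconfig}",
         f"--output=jsonpath={READY_JSONPATH}"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def wait_for_ready(check: Callable[[], str], retries: int, delay: float,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns "True" or the attempts run out."""
    for attempt in range(retries):
        if check() == "True":
            return True
        # Do not sleep after the final attempt.
        if attempt < retries - 1:
            sleep(delay)
    return False
```

Under these assumptions, a call such as `wait_for_ready(lambda: oc_ready_status(node, kubeconfig), retries=30, delay=20)` gives up after about 10 minutes, which matches the failure pattern in the log.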

      The timeline is:

      1. [06:24-06:34 UTC] The CSRs are approved and the job waits about 10 minutes for the node
      TASK [openshift_node : Approve node CSRs] **************************************
      Wednesday 25 May 2022 14:24:51 +0800 (0:04:04.743) 0:13:13.576 *********

      2. [06:34 UTC] The scale-up job times out and reports the error

      3. [06:37:09 UTC] The node reports Ready
      May 25 06:37:09 ip-10-0-60-71.us-east-2.compute.internal hyperkube[2526]: I0525 06:37:09.201219 2526 kubelet_node_status.go:581] "Recording event message for node" node="ip-10-0-60-71.us-east-2.compute.internal" event="NodeReady"

      - lastHeartbeatTime: "2022-05-25T07:16:01Z"
        lastTransitionTime: "2022-05-25T06:37:09Z"
        message: kubelet is posting ready status
        reason: KubeletReady
        status: "True"
        type: Ready
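The Ansible timestamps are in +0800 while the kubelet journal and the node condition are in UTC, so the gap is easier to see once everything is in one timezone. The sketch below uses only timestamps quoted in this report (truncated to whole seconds); the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Timezone of the Ansible log lines ("+0800").
ANSIBLE_TZ = timezone(timedelta(hours=8))

# Start of "Wait for node to report ready" and end of its last attempt.
wait_start = datetime(2022, 5, 25, 14, 25, 10, tzinfo=ANSIBLE_TZ)
last_attempt = datetime(2022, 5, 25, 14, 35, 24, tzinfo=ANSIBLE_TZ)

# lastTransitionTime of the Ready condition (UTC).
node_ready = datetime(2022, 5, 25, 6, 37, 9, tzinfo=timezone.utc)

budget_used = last_attempt - wait_start   # how long the task waited
time_to_ready = node_ready - wait_start   # how long the node needed
shortfall = node_ready - last_attempt     # how much more time was needed

print("waited:      ", budget_used)    # 0:10:14
print("node needed: ", time_to_ready)  # 0:11:59
print("shortfall:   ", shortfall)      # 0:01:45
```

In this run the node became Ready less than two minutes after the task gave up.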

      How reproducible:
      > 30% of attempts

      Steps to Reproduce:
      1. Create a cluster with OVN networking
      2. Run a RHEL scale-up against that cluster

      Expected results:
      The scale-up job finishes successfully.

      Suggestion:
      Increase the wait time to 16-18 minutes.
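As a rough guide to what a 16-18 minute wait means in retry terms, the sketch below converts a target wait time into a retry count. The helper `retries_for` is hypothetical, and the 20-second delay is an assumption inferred from 30 attempts taking roughly 10 minutes in the log; the task's real `delay` setting is not given in this report.

```python
import math


def retries_for(wait_minutes: float, delay_seconds: float) -> int:
    """Number of attempts needed to cover wait_minutes at a fixed delay."""
    return math.ceil(wait_minutes * 60 / delay_seconds)


ASSUMED_DELAY_SECONDS = 20  # assumption, not taken from the playbook

for minutes in (10, 16, 18):
    print(minutes, "min ->", retries_for(minutes, ASSUMED_DELAY_SECONDS), "retries")
```

With that assumed delay, the suggested range corresponds to roughly 48 to 54 retries instead of 30.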

      Additional info:
      This issue applies to 4.9, 4.10, and 4.11.

            Unassigned Unassigned
            openshift-crt-jira-prow OpenShift Prow Bot
            Gaoyun Pei Gaoyun Pei