OCPBUGS-3277

Install failure in create-cluster-and-infraenv.service

Details

    • Agent Sprint 227, Agent Sprint 228
    • 2
    • False
    • Release Note Not Required

    Description

      I saw this occur once while running installs in a continuous loop. This was with COMPACT_IPV4 in a non-disconnected setup.

      WaitForBootstrapComplete shows that it can't access the API:

      level=info msg=Unable to retrieve cluster metadata from Agent Rest API: no clusterID known for the cluster
      level=debug msg=cluster is not registered in rest API
      level=debug msg=infraenv is not registered in rest API

      This is because create-cluster-and-infraenv.service failed:

      Failed Units: 2
        create-cluster-and-infraenv.service
        NetworkManager-wait-online.service

      The agentbasedinstaller register command wasn't able to read the release image to get the version, because the DNS lookup for the registry timed out:

      Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"
      Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="failed to get image openshift version from release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451" error="command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"

      This occurs when attempting to get the release version here:
      https://github.com/openshift/assisted-service/blob/master/cmd/agentbasedinstaller/register.go#L58

       

      We should add a retry mechanism or restart the service to handle spurious network failures like this.

       

       

            People

              bfournie@redhat.com Robert Fournier
              bfournie@redhat.com Robert Fournier
              zhenying niu zhenying niu