OpenShift Bugs: OCPBUGS-5140

[alibabacloud] IPI install fails at bootstrap with no node ready, due to enforced EIP bandwidth of 5 Mbit/s


Details

    • Critical
    • Rejected
    • False

    Description

      Description of problem:

      IPI installation in some regions fails at the bootstrap stage, with no node available or ready.

      Version-Release number of selected component (if applicable):

      12-22 16:22:27.970  ./openshift-install 4.12.0-0.nightly-2022-12-21-202045
      12-22 16:22:27.970  built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9
      12-22 16:22:27.970  release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270
      12-22 16:22:27.971  release architecture amd64
      

      How reproducible:

      Always

      Steps to Reproduce:

      1. Try an IPI installation in one of the affected regions (so far tried and failed in ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing).

      Actual results:

      Bootstrap failed to complete

      Expected results:

      Installation in those regions should succeed.

      Additional info:

      FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/
      
      No node is available/ready, and no operator is available.
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          30m     Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors
      $ oc get nodes
      No resources found
      $ oc get machines -n openshift-machine-api -o wide
      NAME                         PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
      jiwei-1222f-v729x-master-0                                  30m                       
      jiwei-1222f-v729x-master-1                                  30m                       
      jiwei-1222f-v729x-master-2                                  30m                       
      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication
      baremetal
      cloud-controller-manager                                                                          
      cloud-credential                                                                                  
      cluster-autoscaler                                                                                
      config-operator                                                                                   
      console                                                                                           
      control-plane-machine-set                                                                         
      csi-snapshot-controller                                                                           
      dns                                                                                               
      etcd                                                                                              
      image-registry                                                                                    
      ingress                                                                                           
      insights                                                                                          
      kube-apiserver                                                                                    
      kube-controller-manager                                                                           
      kube-scheduler                                                                                    
      kube-storage-version-migrator                                                                     
      machine-api                                                                                       
      machine-approver                                                                                  
      machine-config                                                                                    
      marketplace                                                                                       
      monitoring                                                                                        
      network                                                                                           
      node-tuning                                                                                       
      openshift-apiserver                                                                               
      openshift-controller-manager                                                                      
      openshift-samples                                                                                 
      operator-lifecycle-manager                                                                        
      operator-lifecycle-manager-catalog                                                                
      operator-lifecycle-manager-packageserver
      service-ca
      storage
      $
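The "no node ready" check above can also be scripted. The following is a minimal sketch, assuming `oc get nodes --no-headers` output is piped in; the function name `count_ready_nodes` is hypothetical and not part of any OpenShift tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nodes whose STATUS column starts with "Ready".
# Reads `oc get nodes --no-headers` output on stdin.
# "NotReady" is not counted; "Ready,SchedulingDisabled" is counted.
count_ready_nodes() {
    awk '$2 ~ /^Ready/ { n++ } END { print n + 0 }'
}

# Example (requires a reachable cluster):
#   oc get nodes --no-headers | count_ready_nodes
```

With the cluster state shown in this report, the pipeline would receive no node lines and the helper would print 0.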
      
      Master nodes are not running services such as kubelet and crio.
      [core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps
      FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
      [core@jiwei-1222f-v729x-master-0 ~]$ 
      
      The machine-config-daemon firstboot log reports "failed to update OS".
      [jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log 
      Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
      Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
      Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking
      Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
      Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
      Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info>  [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory
      Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn>  [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
      Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused
      ...repeated logs omitted...
      Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
      Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
      Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
      Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found.
      Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found.
      Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found.
      Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
      Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
      Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425    2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units.
      Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
      Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
      Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never
      Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory
      Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported
      Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'.
      Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube.
      Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'.
      [jiwei@jiwei log-bundle-20221222085846]$ 
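To relate the journal timestamps to the 5 Mbit/s EIP limit named in the title, a rough upper bound on transferable data can be computed. This is only a sketch under stated assumptions: the two timestamps are the first logged machine-config-daemon retry (08:35:32) and the rpm-ostree rebase failure (08:56:06) from the excerpt above, the whole 5 Mbit/s is assumed to be available to a single download, and the helper names are hypothetical. The report does not state the image sizes, so no comparison against an actual payload is made.

```shell
#!/usr/bin/env bash
# Convert an HH:MM:SS timestamp to seconds since midnight.
# The 10# prefix forces base 10 so that "08" is not parsed as octal.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Upper bound (in whole megabytes, 10^6 bytes) on data transferable
# in a given number of seconds at a given bandwidth in Mbit/s.
max_megabytes() {
    local seconds="$1" mbit_per_s="$2"
    echo $(( seconds * mbit_per_s / 8 ))
}

start=$(hms_to_seconds 08:35:32)
end=$(hms_to_seconds 08:56:06)
elapsed=$(( end - start ))
echo "elapsed seconds: $elapsed"                         # prints "elapsed seconds: 1234"
echo "max MB at 5 Mbit/s: $(max_megabytes "$elapsed" 5)" # prints "max MB at 5 Mbit/s: 771"
```

Under these assumptions, at most about 771 MB could cross the link in that window, and less if the bandwidth is shared by several nodes pulling at once.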
      

       

      Attachments

        1. .openshift_install.log
          335 kB
        2. log-bundle-20221222085846.tar.gz
          1.32 MB
        3. log-bundle-20230128114109.tar.gz
          1.35 MB
        4. openshift_install.log
          4 kB
        5. openshift_install.stdout
          71 kB

            People

              bteng@redhat.com Bo Teng
              rhn-support-jiwei Jianli Wei
              Jianli Wei Jianli Wei