OCPBUGS-19568

[gcp] installation with "featureSet: TechPreviewNoUpgrade" failed, possibly due to nodes getting taint - "node.kubernetes.io/network-unavailable"


    • Critical
    • No
    • Approved
    • False

      Description of problem:

      Installation fails when userLabels and userTags are set in install-config.yaml.

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-09-02-132842

      How reproducible:

      Always

      Steps to Reproduce:

      1. Run "create install-config".
      2. Insert the userLabels and userTags settings into install-config.yaml (see below).
      3. Make sure the GCP credential has the Tag User role at both the project level and the organization level.
      4. Run "create cluster".

      Actual results:

      The installation failed because the cluster operators authentication, console, image-registry, ingress, monitoring, olm, platform-operators-aggregated, and storage are not available.

      Expected results:

      The installation succeeds.

      Additional info:

      FYI, the installation succeeded with 4.14.0-0.nightly-2023-08-28-154013.
      
      $ openshift-install version
      openshift-install 4.14.0-0.nightly-2023-09-02-132842
      built from commit 43cffbbdbba4e3bbc6dcbb141518b3728f401e51
      release image registry.ci.openshift.org/ocp/release@sha256:87077b3b95eba15e96758d04d0b69fb0b2b1eb78a3c2269c0db9cd0df2223a12
      release architecture amd64
      $ yq-3.3.0 r test-lt/install-config.yaml platform
      gcp:
        projectID: openshift-qe
        region: us-central1
        userLabels:
        - key: createdby
          value: installer-qe
        - key: environment
          value: test
        userTags:
        - parentID: openshift-qe
          key: department
          value: engineering
        - parentID: 54643501348
          key: ocp_tag_dev
          value: foo
        - parentID: openshift-qe
          key: team
          value: 'installer qe'
      $ yq-3.3.0 r test-lt/install-config.yaml credentialsMode
      Passthrough
      $ yq-3.3.0 r test-lt/install-config.yaml featureSet
      TechPreviewNoUpgrade
      $ gcloud config get account
      ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com
      $ gcloud config get project
      openshift-qe
      $ openshift-install create cluster --dir test-lt
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
      INFO Consuming Install Config from target directory
      WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
      INFO Creating infrastructure resources...
      INFO Waiting up to 20m0s (until 6:03PM CST) for the Kubernetes API at https://api.jiwei-0905l.qe.gcp.devcluster.openshift.com:6443...
      INFO API v1.27.4+2c83a9f up
      INFO Waiting up to 30m0s (until 6:15PM CST) for bootstrapping to complete...
      INFO Destroying the bootstrap resources...        
      INFO Waiting up to 40m0s (until 6:38PM CST) for the cluster at https://api.jiwei-0905l.qe.gcp.devcluster.openshift.com:6443 to initialize... 
      ...output omitted...
      ERROR Cluster initialization failed because one or more operators are not functioning properly.
      ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
      ERROR failed to initialize the cluster: Cluster operators authentication, console, image-registry, ingress, monitoring, olm, platform-operators-aggregated, storage are not available
      $ export KUBECONFIG=test-lt/auth/kubeconfig 
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          16h     Unable to apply 4.14.0-0.nightly-2023-09-02-132842: some cluster operators are not available
      $ oc get nodes
      NAME                                                       STATUS   ROLES                  AGE   VERSION
      jiwei-0905l-9s6f6-master-0.c.openshift-qe.internal         Ready    control-plane,master   16h   v1.27.4+2c83a9f
      jiwei-0905l-9s6f6-master-1.c.openshift-qe.internal         Ready    control-plane,master   16h   v1.27.4+2c83a9f
      jiwei-0905l-9s6f6-master-2.c.openshift-qe.internal         Ready    control-plane,master   16h   v1.27.4+2c83a9f
      jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal   Ready    worker                 16h   v1.27.4+2c83a9f
      jiwei-0905l-9s6f6-worker-b-ff8gc.c.openshift-qe.internal   Ready    worker                 16h   v1.27.4+2c83a9f
      $ oc get co | grep -v 'True        False         False'
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.0-0.nightly-2023-09-02-132842   False       False         True       16h     OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found...
      console                                    4.14.0-0.nightly-2023-09-02-132842   False       False         True       16h     RouteHealthAvailable: console route is not admitted
      image-registry                                                                  False       True          True       16h     Available: The deployment does not have available replicas...
      ingress                                                                         False       True          True       16h     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
      kube-controller-manager                    4.14.0-0.nightly-2023-09-02-132842   True        False         True       16h     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
      monitoring                                                                      False       True          True       16h     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded
      network                                    4.14.0-0.nightly-2023-09-02-132842   True        True          False      16h     Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
      olm                                        4.14.0-0.nightly-2023-09-02-132842   False       True          False      16h     CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment...
      platform-operators-aggregated                                                                                                
      storage                                    4.14.0-0.nightly-2023-09-02-132842   False       False         False      16h     SHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
      $ oc describe node jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
      Name:               jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
      Roles:              worker
      ...output omitted...
      CreationTimestamp:  Tue, 05 Sep 2023 17:59:45 +0800
      Taints:             node.kubernetes.io/network-unavailable:NoSchedule
                          UpdateInProgress:PreferNoSchedule
      ...output omitted...
      Conditions:
        Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
        ----                 ------  -----------------                 ------------------                ------                       -------
        NetworkUnavailable   True    Tue, 05 Sep 2023 17:59:47 +0800   Tue, 05 Sep 2023 17:59:47 +0800   NoRouteCreated               Node created without a route
        MemoryPressure       False   Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 17:59:45 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
        DiskPressure         False   Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 17:59:45 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
        PIDPressure          False   Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 17:59:45 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available  
        Ready                True    Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 18:00:25 +0800   KubeletReady                 kubelet is posting ready status
      Addresses:
        InternalIP:  10.0.128.2
        Hostname:    jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
      ...output omitted...
      $ oc debug node/jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
      Starting pod/jiwei-0905l-9s6f6-worker-a-jf2kgcopenshift-qeinternal-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.128.2
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-5.1# ip route show
      default via 10.0.128.1 dev br-ex proto dhcp src 10.0.128.2 metric 48 
      10.0.128.1 dev br-ex proto dhcp scope link src 10.0.128.2 metric 48 
      10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0 
      10.128.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.2.2 
      169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2 
      169.254.169.1 dev br-ex src 10.0.128.2 
      169.254.169.3 via 10.128.2.1 dev ovn-k8s-mp0 
      172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1360 
      sh-5.1# exit
      exit
      sh-4.4# exit
      exit
      Removing debug pod ...
      $ oc describe co ingress | grep network
          Message:               The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-8588454847-h5t2n" cannot be scheduled: 0/5 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.. Pod "router-default-8588454847-fnghk" cannot be scheduled: 0/5 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/2 of replicas are available), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
      $ 
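
The manual checks above (which nodes carry the "node.kubernetes.io/network-unavailable" taint, and which cluster operators are not available) could be repeated on a later build with a small helper. The following is only a sketch: the function names are hypothetical, and it assumes the input is the parsed JSON from "oc get nodes -o json" and "oc get clusteroperators -o json". An operator with no Available condition at all (as platform-operators-aggregated appears in the output above) is counted as unavailable here, which is an assumption of this sketch.

```python
import json
import subprocess

# Taint reported on the worker nodes in this bug.
TAINT_KEY = "node.kubernetes.io/network-unavailable"


def nodes_with_taint(node_list, taint_key=TAINT_KEY):
    """Return names of nodes in a parsed NodeList that carry the given taint key."""
    tainted = []
    for node in node_list.get("items", []):
        # spec.taints is absent on untainted nodes, so fall back to an empty list.
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("key") == taint_key for t in taints):
            tainted.append(node["metadata"]["name"])
    return tainted


def unavailable_operators(co_list):
    """Return names of cluster operators whose Available condition is not "True"."""
    names = []
    for co in co_list.get("items", []):
        conditions = co.get("status", {}).get("conditions") or []
        available = next(
            (c for c in conditions if c.get("type") == "Available"), None
        )
        # Assumption: a missing Available condition counts as unavailable.
        if available is None or available.get("status") != "True":
            names.append(co["metadata"]["name"])
    return names


def oc_get_json(resource):
    """Run `oc get <resource> -o json` against the cluster in KUBECONFIG."""
    result = subprocess.run(
        ["oc", "get", resource, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

With KUBECONFIG exported as in the transcript, nodes_with_taint(oc_get_json("nodes")) and unavailable_operators(oc_get_json("clusteroperators")) would be expected to list the tainted workers and the failing operators, respectively.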

            jtanenba@redhat.com Jacob Tanenbaum
            rhn-support-jiwei Jianli Wei
            Jianli Wei
            Zhanqi Zhao