Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.14.0
Affects Version/s: 4.14.0
Component/s: Networking / ovn-kubernetes
Labels:
- TestBlocker
- pre-merge

Severity: Critical
Regression: No
Release Blocker: Approved
Blocked: False
Blocked Reason: None
Target Version: 4.15.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Installation fails when userLabels and userTags are set in install-config.yaml.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-02-132842

How reproducible:

Always

Steps to Reproduce:

1. "create install-config"
2. Insert the userLabels and userTags settings into install-config.yaml (see below).
3. Make sure your GCP credential has the Tag User role at both the project level and the organization level.
4. "create cluster"

Actual results:

The installation failed: the cluster operators authentication, console, image-registry, ingress, monitoring, olm, platform-operators-aggregated, and storage are not available.

Expected results:

The installation succeeds.

Additional info:

FYI: the installation succeeded with 4.14.0-0.nightly-2023-08-28-154013.

$ openshift-install version
openshift-install 4.14.0-0.nightly-2023-09-02-132842
built from commit 43cffbbdbba4e3bbc6dcbb141518b3728f401e51
release image registry.ci.openshift.org/ocp/release@sha256:87077b3b95eba15e96758d04d0b69fb0b2b1eb78a3c2269c0db9cd0df2223a12
release architecture amd64
$ yq-3.3.0 r test-lt/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  userLabels:
  - key: createdby
    value: installer-qe
  - key: environment
    value: test
  userTags:
  - parentID: openshift-qe
    key: department
    value: engineering
  - parentID: 54643501348
    key: ocp_tag_dev
    value: foo
  - parentID: openshift-qe
    key: team
    value: 'installer qe'
$ yq-3.3.0 r test-lt/install-config.yaml credentialsMode
Passthrough
$ yq-3.3.0 r test-lt/install-config.yaml featureSet
TechPreviewNoUpgrade
$ gcloud config get account
ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com
$ gcloud config get project
openshift-qe
$ openshift-install create cluster --dir test-lt
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Consuming Install Config from target directory
WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 6:03PM CST) for the Kubernetes API at https://api.jiwei-0905l.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.27.4+2c83a9f up
INFO Waiting up to 30m0s (until 6:15PM CST) for bootstrapping to complete...
INFO Destroying the bootstrap resources...        
INFO Waiting up to 40m0s (until 6:38PM CST) for the cluster at https://api.jiwei-0905l.qe.gcp.devcluster.openshift.com:6443 to initialize... 
...output omitted...
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operators authentication, console, image-registry, ingress, monitoring, olm, platform-operators-aggregated, storage are not available
$ export KUBECONFIG=test-lt/auth/kubeconfig 
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          16h     Unable to apply 4.14.0-0.nightly-2023-09-02-132842: some cluster operators are not available
$ oc get nodes
NAME                                                       STATUS   ROLES                  AGE   VERSION
jiwei-0905l-9s6f6-master-0.c.openshift-qe.internal         Ready    control-plane,master   16h   v1.27.4+2c83a9f
jiwei-0905l-9s6f6-master-1.c.openshift-qe.internal         Ready    control-plane,master   16h   v1.27.4+2c83a9f
jiwei-0905l-9s6f6-master-2.c.openshift-qe.internal         Ready    control-plane,master   16h   v1.27.4+2c83a9f
jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal   Ready    worker                 16h   v1.27.4+2c83a9f
jiwei-0905l-9s6f6-worker-b-ff8gc.c.openshift-qe.internal   Ready    worker                 16h   v1.27.4+2c83a9f
$ oc get co | grep -v 'True        False         False'
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-0.nightly-2023-09-02-132842   False       False         True       16h     OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found...
console                                    4.14.0-0.nightly-2023-09-02-132842   False       False         True       16h     RouteHealthAvailable: console route is not admitted
image-registry                                                                  False       True          True       16h     Available: The deployment does not have available replicas...
ingress                                                                         False       True          True       16h     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
kube-controller-manager                    4.14.0-0.nightly-2023-09-02-132842   True        False         True       16h     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
monitoring                                                                      False       True          True       16h     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded
network                                    4.14.0-0.nightly-2023-09-02-132842   True        True          False      16h     Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
olm                                        4.14.0-0.nightly-2023-09-02-132842   False       True          False      16h     CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment...
platform-operators-aggregated                                                                                                
storage                                    4.14.0-0.nightly-2023-09-02-132842   False       False         False      16h     SHARESCSIDriverOperatorCRAvailable: SharedResourcesDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
$ oc describe node jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
Name:               jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
Roles:              worker
...output omitted...
CreationTimestamp:  Tue, 05 Sep 2023 17:59:45 +0800
Taints:             node.kubernetes.io/network-unavailable:NoSchedule
                    UpdateInProgress:PreferNoSchedule
...output omitted...
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   True    Tue, 05 Sep 2023 17:59:47 +0800   Tue, 05 Sep 2023 17:59:47 +0800   NoRouteCreated               Node created without a route
  MemoryPressure       False   Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 17:59:45 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 17:59:45 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 17:59:45 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available  
  Ready                True    Wed, 06 Sep 2023 10:28:09 +0800   Tue, 05 Sep 2023 18:00:25 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.0.128.2
  Hostname:    jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
...output omitted...
$ oc debug node/jiwei-0905l-9s6f6-worker-a-jf2kg.c.openshift-qe.internal
Starting pod/jiwei-0905l-9s6f6-worker-a-jf2kgcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ip route show
default via 10.0.128.1 dev br-ex proto dhcp src 10.0.128.2 metric 48 
10.0.128.1 dev br-ex proto dhcp scope link src 10.0.128.2 metric 48 
10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0 
10.128.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.2.2 
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2 
169.254.169.1 dev br-ex src 10.0.128.2 
169.254.169.3 via 10.128.2.1 dev ovn-k8s-mp0 
172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1360 
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
$ oc describe co ingress | grep network
    Message:               The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-8588454847-h5t2n" cannot be scheduled: 0/5 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.. Pod "router-default-8588454847-fnghk" cannot be scheduled: 0/5 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/2 of replicas are available), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
$
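
The `oc get co | grep -v 'True        False         False'` filter above depends on exact column spacing. As a rough sketch only, the same check can be done by column value instead. The function name `unavailable_operators` is hypothetical and not part of `oc`; the helper assumes the plain-text table layout shown in the transcript above, where VERSION may be blank.

```shell
# Hypothetical helper: print cluster operators whose AVAILABLE column is not "True".
# Reads the plain-text table produced by `oc get co` on stdin.
unavailable_operators() {
  awk 'NR > 1 && NF > 0 {
    # VERSION can be blank for operators that have not reported one yet,
    # which shifts AVAILABLE from field 3 to field 2.
    avail = ($2 ~ /^[0-9]/) ? $3 : $2
    # An operator with no status columns at all is also treated as not available.
    if (avail != "True") print $1
  }'
}

# Usage against a live cluster (assumes `oc` is logged in and KUBECONFIG is set):
#   oc get co | unavailable_operators
```

Unlike the `grep -v` filter, this lists only operators that are not Available, so operators that are Available but Progressing or Degraded (such as network and kube-controller-manager above) are not reported.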

blocks

OCPBUGS-19568 [gcp] installation with "featureSet: TechPreviewNoUpgrade" failed, possibly due to nodes getting taint - "node.kubernetes.io/network-unavailable"

Closed

is cloned by

OCPBUGS-19568 [gcp] installation with "featureSet: TechPreviewNoUpgrade" failed, possibly due to nodes getting taint - "node.kubernetes.io/network-unavailable"

Closed

is related to

OCPBUGS-19081 internal-registry-pull-secret.json updates and causes nodes to be stuck

Closed

links to

openshift/ovn-kubernetes#1885: OCPBUGS-18572,OCPBUGS-19331,OCPBUGS-16641: [DownstreamMerge] 9-18-23

RHEA-2023:7198 rpm

Assignee: Jacob Tanenbaum
Reporter: Jianli Wei
QA Contact: Jianli Wei
Contributors: Zhanqi Zhao
Votes: 0
Watchers: 11
Created: 2023/09/06 2:55 AM
Updated: 2024/02/27 9:00 PM
Resolved: 2024/02/27 9:00 PM
