OpenShift Bugs / OCPBUGS-17841

GCP SNO installation fails because redirect ipt doesn't take effect on SGW


    • Important
    • No
    • SDN Sprint 241, SDN Sprint 242, SDN Sprint 243
    • 3
    • Rejected
    • False
    • N/A
    • Release Note Not Required

      I tried upgrading a 4.14 SNO cluster from one nightly image to another and, while on AWS the upgrade works fine, it fails on GCP.

      The Cluster Network Operator successfully upgrades ovn-kubernetes but then gets stuck on the cloud network config controller (CNCC), which is in CrashLoopBackOff because it receives the wrong IP address from the name server when trying to reach the API server. The node IP is actually 10.0.0.3, but the name server returns 10.0.0.2, which I suspect is the bootstrap node IP, though that's only a guess.

      Some relevant logs:

       

      $ oc get co network
      network                                    4.14.0-0.nightly-2023-08-15-200133   True        True          False      86m     Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is not available (awaiting 1 nodes)
      
      $ oc get pods -n openshift-ovn-kubernetes -o wide
      NAME                                     READY   STATUS    RESTARTS       AGE   IP         NODE                                 NOMINATED NODE   READINESS GATES
      ovnkube-control-plane-844c8f76fb-q4tvp   2/2     Running   3              24m   10.0.0.3   ci-ln-rij2p1b-72292-xmzf4-master-0   <none>           <none>
      ovnkube-node-24kb7                       10/10   Running   12 (13m ago)   25m   10.0.0.3   ci-ln-rij2p1b-72292-xmzf4-master-0   <none>           <none>
      
      $ oc get pods -n openshift-cloud-network-config-controller -o wide
      openshift-cloud-network-config-controller          cloud-network-config-controller-d65ccbc5b-dnt69               0/1     CrashLoopBackOff   15 (2m37s ago)   40m    10.128.0.141   ci-ln-rij2p1b-72292-xmzf4-master-0   <none>           <none>
      
      $ oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-d65ccbc5b-dnt69
      W0816 11:06:00.666825       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
      F0816 11:06:30.673952       1 main.go:345] Error building controller runtime client: Get "https://api-int.ci-ln-rij2p1b-72292.gcp-2.ci.openshift.org:6443/api?timeout=32s": dial tcp 10.0.0.2:6443: i/o timeout
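To make it easier to see which address the controller actually tried to reach, here is a minimal Python sketch that pulls the `dial tcp <ip>:<port>` target out of the fatal log line above. The helper name `dial_target` is hypothetical and not part of any OpenShift tooling, and the pattern only handles IPv4 targets.

```python
import re

# Matches the "dial tcp <ipv4>:<port>" fragment seen in the fatal log line.
# IPv6 targets (bracketed addresses) are deliberately not handled here.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dial_target(log_line):
    """Return (ip, port) from a 'dial tcp' error message, or None if absent."""
    match = DIAL_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# The fatal line from the CNCC pod log in this report.
line = (
    'F0816 11:06:30.673952       1 main.go:345] Error building controller runtime client: '
    'Get "https://api-int.ci-ln-rij2p1b-72292.gcp-2.ci.openshift.org:6443/api?timeout=32s": '
    'dial tcp 10.0.0.2:6443: i/o timeout'
)
print(dial_target(line))  # ('10.0.0.2', 6443)
```

The extracted address, 10.0.0.2, differs from the node IP 10.0.0.3 shown in the pod listings, which matches the DNS symptom described in this report.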

       

      I also get 10.0.0.2 if I run a DNS query from the node itself or from a pod:

      dig api-int.ci-ln-zp7dbyt-72292.gcp-2.ci.openshift.org
      ...
      ;; ANSWER SECTION:
      api-int.ci-ln-zp7dbyt-72292.gcp-2.ci.openshift.org. 60 IN A 10.0.0.2
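The comparison between the DNS answer and the node IP can be sketched in Python as follows. This is only an illustration: `a_records` is a hypothetical helper that parses saved `dig` output as plain text, and the node IP is the one reported earlier in this issue rather than something the script discovers.

```python
def a_records(dig_output):
    """Collect IPv4 addresses from 'name TTL IN A addr' lines of dig output."""
    addresses = []
    for raw in dig_output.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines, dig comments, and section headers
        fields = line.split()
        # Answer lines have five fields: name, TTL, class, type, rdata.
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "A":
            addresses.append(fields[4])
    return addresses


# The answer section quoted above.
sample = """\
;; ANSWER SECTION:
api-int.ci-ln-zp7dbyt-72292.gcp-2.ci.openshift.org. 60 IN A 10.0.0.2
"""

node_ip = "10.0.0.3"  # node IP as reported in this issue (an input, not discovered)
answers = a_records(sample)
print(answers, node_ip in answers)  # ['10.0.0.2'] False
```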

       

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      Always.

      Steps to Reproduce:

      1. On clusterbot: launch 4.14 gcp,single-node
      2. In a terminal: oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-08-15-200133 --allow-explicit-upgrade --force
      

      Actual results:

      The name server returns 10.0.0.2, so CNCC fails to reach the API server.

      Expected results:

      The name server should return 10.0.0.3.
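The expected result could be turned into a small live check, sketched here in Python under some assumptions. `resolves_to` is a hypothetical helper, the resolver is injectable so the check can be exercised without network access, and the hostname and addresses in the stub are placeholders standing in for the real `api-int` record.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if expected_ip is among the IPv4 addresses for hostname.

    resolver must return a (name, aliases, addresses) tuple, like
    socket.gethostbyname_ex does.
    """
    _name, _aliases, addresses = resolver(hostname)
    return expected_ip in addresses


def stub_resolver(_hostname):
    # Stand-in for the behavior described in this report: the name server
    # answers with 10.0.0.2 instead of the node IP.
    return ("api-int.example", [], ["10.0.0.2"])


# With the stub, the check fails in the same way the cluster does.
print(resolves_to("api-int.example", "10.0.0.3", resolver=stub_resolver))
```

On a node or in a pod, the same function could be called without the `resolver` argument and with the cluster's real `api-int` hostname, in which case it uses the system resolver.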

       

      Must-gather: https://drive.google.com/file/d/1MDbsMgIQz7dE6e76z4ad95dwaxbSNrJM/view?usp=sharing

      I'm assigning this bug to the network edge team for a first pass. Please reassign it if necessary.

       

            sseethar Surya Seetharaman
            rravaiol@redhat.com Riccardo Ravaioli
            Huiran Wang Huiran Wang