OpenShift Bugs / OCPBUGS-9972

Azure; NLB; OVN-K: Requests from CNI pods to the internal API server domain fail intermittently


    • Moderate
    • No
    • SDN Sprint 234, SDN Sprint 235, SDN Sprint 236, SDN Sprint 237, SDN Sprint 238, SDN Sprint 239, SDN Sprint 240, SDN Sprint 241
    • 8
    • Approved
    • False

      SNO installation on 4.14 fails in Azure/GCP

      Description of problem:

      An OpenShift Container Platform 4.12.5 cluster installed on Microsoft Azure with the IPI installation method shows undesired behavior when running curl against "https://api.<clustername>.<domain>:6443/readyz". From a pod that uses `HostNetwork`, every request succeeds. From a pod that does not use `HostNetwork`, and therefore has an IP from the SDN range, a large portion of the requests fails.
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.5    True        False         29m     Cluster version is 4.12.5
      
      $ oc get network cluster -o yaml
      apiVersion: config.openshift.io/v1
      kind: Network
      metadata:
        creationTimestamp: "2023-03-10T13:12:06Z"
        generation: 2
        name: cluster
        resourceVersion: "2975"
        uid: e1e9c464-526c-4ebf-ab84-0deedf092cac
      spec:
        clusterNetwork:
        - cidr: 10.128.0.0/14
          hostPrefix: 23
        externalIP:
          policy: {}
        networkType: OVNKubernetes
        serviceNetwork:
        - 172.30.0.0/16
      status:
        clusterNetwork:
        - cidr: 10.128.0.0/14
          hostPrefix: 23
        clusterNetworkMTU: 1400
        networkType: OVNKubernetes
        serviceNetwork:
        - 172.30.0.0/16
      
      $ oc get infrastructure cluster -o yaml
      apiVersion: config.openshift.io/v1
      kind: Infrastructure
      metadata:
        creationTimestamp: "2023-03-10T13:12:04Z"
        generation: 1
        name: cluster
        resourceVersion: "430"
        uid: 5c260276-d901-40f7-a28c-172c492e81e6
      spec:
        cloudConfig:
          key: config
          name: cloud-provider-config
        platformSpec:
          type: Azure
      status:
        apiServerInternalURI: https://api-int.clustername.domain.lab:6443
        apiServerURL: https://api.clustername.domain.lab:6443
        controlPlaneTopology: HighlyAvailable
        etcdDiscoveryDomain: ""
        infrastructureName: sreberazure-njj24
        infrastructureTopology: HighlyAvailable
        platform: Azure
        platformStatus:
          azure:
            cloudName: AzurePublicCloud
            networkResourceGroupName: sreberazure-njj24-rg
            resourceGroupName: sreberazure-njj24-rg
          type: Azure
      
      $ oc project openshift-apiserver
      Already on project "openshift-apiserver" on server "https://api.clustername.domain.lab:6443".
      $ oc get pod
      NAME                         READY   STATUS    RESTARTS   AGE
      apiserver-6f58784797-kq4kr   2/2     Running   0          41m
      apiserver-6f58784797-l69jr   2/2     Running   0          38m
      apiserver-6f58784797-nn6tn   2/2     Running   0          45m
      
      $ oc get pod -o wide
      NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
      apiserver-6f58784797-kq4kr   2/2     Running   0          42m   10.130.0.21   sreberazure-njj24-master-0   <none>           <none>
      apiserver-6f58784797-l69jr   2/2     Running   0          38m   10.129.0.29   sreberazure-njj24-master-2   <none>           <none>
      apiserver-6f58784797-nn6tn   2/2     Running   0          45m   10.128.0.36   sreberazure-njj24-master-1   <none>           <none>
      
      $ oc rsh apiserver-6f58784797-l69jr
      Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
      sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
      curl: (28) Connection timed out after 1000 milliseconds
      okokokcurl: (28) Connection timed out after 1001 milliseconds
      okokcurl: (28) Connection timed out after 1003 milliseconds
      curl: (28) Connection timed out after 1001 milliseconds
      curl: (28) Connection timed out after 1001 milliseconds
      okokokokokokokokokcurl: (28) Connection timed out after 1001 milliseconds
      okokcurl: (28) Connection timed out after 1001 milliseconds
      curl: (28) Connection timed out after 1001 milliseconds
      ^C
      sh-4.4# exit
      exit
      command terminated with exit code 130
      
      $ oc project openshift-kube-apiserver
      Now using project "openshift-kube-apiserver" on server "https://api.clustername.domain.lab:6443".
      
      $ oc get pod -o wide
      NAME                                              READY   STATUS      RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
      apiserver-watcher-sreberazure-njj24-master-0      1/1     Running     0          55m   10.0.0.6      sreberazure-njj24-master-0   <none>           <none>
      apiserver-watcher-sreberazure-njj24-master-1      1/1     Running     0          57m   10.0.0.8      sreberazure-njj24-master-1   <none>           <none>
      apiserver-watcher-sreberazure-njj24-master-2      1/1     Running     0          57m   10.0.0.7      sreberazure-njj24-master-2   <none>           <none>
      installer-2-sreberazure-njj24-master-2            0/1     Completed   0          51m   10.129.0.27   sreberazure-njj24-master-2   <none>           <none>
      installer-3-sreberazure-njj24-master-2            0/1     Completed   0          50m   10.129.0.32   sreberazure-njj24-master-2   <none>           <none>
      installer-4-sreberazure-njj24-master-2            0/1     Completed   0          49m   10.129.0.36   sreberazure-njj24-master-2   <none>           <none>
      installer-5-sreberazure-njj24-master-2            0/1     Completed   0          46m   10.129.0.15   sreberazure-njj24-master-2   <none>           <none>
      installer-6-sreberazure-njj24-master-0            0/1     Completed   0          37m   10.130.0.27   sreberazure-njj24-master-0   <none>           <none>
      installer-6-sreberazure-njj24-master-1            0/1     Completed   0          39m   10.128.0.45   sreberazure-njj24-master-1   <none>           <none>
      installer-6-sreberazure-njj24-master-2            0/1     Completed   0          36m   10.129.0.37   sreberazure-njj24-master-2   <none>           <none>
      kube-apiserver-guard-sreberazure-njj24-master-0   1/1     Running     0          37m   10.130.0.29   sreberazure-njj24-master-0   <none>           <none>
      kube-apiserver-guard-sreberazure-njj24-master-1   1/1     Running     0          38m   10.128.0.47   sreberazure-njj24-master-1   <none>           <none>
      kube-apiserver-guard-sreberazure-njj24-master-2   1/1     Running     0          50m   10.129.0.31   sreberazure-njj24-master-2   <none>           <none>
      kube-apiserver-sreberazure-njj24-master-0         5/5     Running     0          37m   10.0.0.6      sreberazure-njj24-master-0   <none>           <none>
      kube-apiserver-sreberazure-njj24-master-1         5/5     Running     0          38m   10.0.0.8      sreberazure-njj24-master-1   <none>           <none>
      kube-apiserver-sreberazure-njj24-master-2         5/5     Running     0          34m   10.0.0.7      sreberazure-njj24-master-2   <none>           <none>
      revision-pruner-6-sreberazure-njj24-master-0      0/1     Completed   0          33m   10.130.0.35   sreberazure-njj24-master-0   <none>           <none>
      revision-pruner-6-sreberazure-njj24-master-1      0/1     Completed   0          33m   10.128.0.56   sreberazure-njj24-master-1   <none>           <none>
      revision-pruner-6-sreberazure-njj24-master-2      0/1     Completed   0          33m   10.129.0.39   sreberazure-njj24-master-2   <none>           <none>
      
      $ oc rsh kube-apiserver-sreberazure-njj24-master-1
      sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
      okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok
      
      Changing curl's `--connect-timeout 1` to, for example, `--connect-timeout 10` has no impact. Each failing request simply takes longer to reach the timeout.
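
The two behaviors line up with the pod addresses shown above: the failing `openshift-apiserver` pods have IPs inside the `clusterNetwork` CIDR `10.128.0.0/14`, while the working `kube-apiserver` pods use node addresses such as `10.0.0.8`. As a rough sketch, a plain shell check for whether an IPv4 address falls inside a CIDR could look like the following. The helper names `ip_to_int` and `in_cidr` are made up for illustration and are not part of `oc` or any OpenShift tooling.

```shell
# Hypothetical helpers (not part of oc): IPv4-only CIDR membership check.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  # Split the address on dots into $1..$4.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Exit 0 if address $1 lies inside CIDR $2 (for example 10.128.0.0/14).
in_cidr() {
  local addr net bits mask
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Build the netmask from the prefix length, limited to 32 bits.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Classify a few pod IPs taken from the listings above.
for ip in 10.130.0.21 10.129.0.29 10.0.0.8; do
  if in_cidr "$ip" 10.128.0.0/14; then
    echo "$ip: pod network (SDN range)"
  else
    echo "$ip: outside the pod network (host network)"
  fi
done
```

Feeding the IP column of `oc get pod -o wide` through `in_cidr` is one way to pick pods on either side of the boundary when comparing results.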

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.12 (previous versions were not tested)

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install OpenShift Container Platform 4.12 on Azure using the IPI installation method, with OVN-Kubernetes as the network type
      2. Once the installation has succeeded, run `oc project openshift-apiserver`
      3. Run `oc rsh apiserver-<podID>`
      4. Inside the pod, run `while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done`
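
To put a number on the intermittent failures, the loop from step 4 can be wrapped in a small counter. This is only a sketch under a few assumptions: the function name `probe_readyz` and the `PROBE_CMD` and `PROBE_INTERVAL` variables are invented here, and `-s -o /dev/null` is added to silence curl so that only the summary is printed. curl exit code 28 is the timeout seen as `curl: (28)` in the output above.

```shell
# Hypothetical helper: probe a URL a fixed number of times and count outcomes.
# Usage: probe_readyz URL [ATTEMPTS] [CONNECT_TIMEOUT_SECONDS]
probe_readyz() {
  local url="$1" attempts="${2:-30}" timeout="${3:-1}"
  local ok=0 timed_out=0 other=0 i=0 rc
  while [ "$i" -lt "$attempts" ]; do
    rc=0
    # PROBE_CMD defaults to curl; it exists only so the probe can be stubbed.
    "${PROBE_CMD:-curl}" -k -s -o /dev/null \
      --connect-timeout "$timeout" "$url" || rc=$?
    case "$rc" in
      0)  ok=$((ok + 1)) ;;                # request completed
      28) timed_out=$((timed_out + 1)) ;;  # curl timeout exit code
      *)  other=$((other + 1)) ;;          # any other curl failure
    esac
    i=$((i + 1))
    sleep "${PROBE_INTERVAL:-1}"
  done
  echo "ok=$ok timeout=$timed_out other=$other total=$attempts"
}
```

Inside the pod, something like `probe_readyz https://api.clustername.domain.lab:6443/readyz 60 1` would print a single summary line, which is easier to compare between a pod on the SDN range and a `HostNetwork` pod than the raw `ok`/timeout stream.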

      Actual results:

      sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
      curl: (28) Connection timed out after 1000 milliseconds
      okokokcurl: (28) Connection timed out after 1001 milliseconds
      okokcurl: (28) Connection timed out after 1003 milliseconds
      curl: (28) Connection timed out after 1001 milliseconds
      curl: (28) Connection timed out after 1001 milliseconds
      okokokokokokokokokcurl: (28) Connection timed out after 1001 milliseconds
      okokcurl: (28) Connection timed out after 1001 milliseconds
      curl: (28) Connection timed out after 1001 milliseconds
       

      Expected results:

      sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
      okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok
       

      Additional info:

       

            sseethar Surya Seetharaman
            rhn-support-sreber Simon Reber
            Huiran Wang Huiran Wang