OCPBUGS-23968

Internal NLB issue (OCPBUGS-9026) causes random failures on HCP private cluster without infra nodes


    • Critical

      This is a clone of issue OCPBUGS-23300. The following is the description of the original issue:

      Description of problem:

      This issue has the same root cause as https://issues.redhat.com/browse/OCPBUGS-9026, but I am opening a new one because the problem has become critical now that ROSA uses NLB by default as of 4.14. HCP (HyperShift) private clusters without infra nodes are hit hardest: they have worker nodes only, and no workaround is currently available.

      If the old bug should be used to track this issue instead, please close this one.

      Version-Release number of selected component (if applicable):

      4.14.1
      HyperShift Private cluster

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create a ROSA HCP (HyperShift) cluster.
      2. Run qe-e2e-test on this cluster, or curl a route from a pod inside the cluster.
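
Because the failure is intermittent, a single curl may not show it. The following is a minimal sketch, not part of the original report, of a loop that repeats the same HEAD request and counts how many attempts complete. The function names `probe_once` and `probe_route`, the default of 20 attempts, and the 5-second timeout are assumptions chosen for illustration.

```shell
# probe_once URL: one HEAD request, using the same -k -I flags as the report.
# Exit status is 0 when curl completes an HTTP exchange and non-zero on a
# connection error or timeout. The 5-second limit is an assumed value.
probe_once() {
  curl -k -I -s -o /dev/null --max-time 5 "$1"
}

# probe_route URL [ATTEMPTS]: prints "ok=<n> failed=<m>".
# The body runs in a subshell so the counters do not leak into the caller.
probe_route() (
  url=$1
  attempts=${2:-20}
  ok=0
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if probe_once "$url"; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=${ok} failed=${failed}"
)
```

Pasted into a shell inside a pod, `probe_route https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com 50` prints one summary line; a non-zero `failed` count would correspond to the timeouts described in this report.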

      Actual results:

      1. The co/console status flaps because the route is only intermittently accessible.
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.1    True        False         4h56m   Error while reconciling 4.14.1: the cluster operator console is not available
      
      
      2. Check the nodes and router pods: one router pod is running on each of the two worker nodes.
      $ oc get node
      NAME                          STATUS   ROLES    AGE    VERSION
      ip-10-0-49-184.ec2.internal   Ready    worker   5h5m   v1.27.6+f67aeb3
      ip-10-0-63-210.ec2.internal   Ready    worker   5h8m   v1.27.6+f67aeb3
      
      $ oc -n openshift-ingress get pod -owide
      NAME                              READY   STATUS    RESTARTS   AGE    IP           NODE                          NOMINATED NODE   READINESS GATES
      router-default-86d569bf84-bq66f   1/1     Running   0          5h8m   10.130.0.7   ip-10-0-49-184.ec2.internal   <none>           <none>
      router-default-86d569bf84-v54hp   1/1     Running   0          5h8m   10.128.0.9   ip-10-0-63-210.ec2.internal   <none>           <none>
      
      3. Check the ingresscontroller LB setting: it uses an internal NLB.
      
      spec:
        endpointPublishingStrategy:
          loadBalancer:
            dnsManagementPolicy: Managed
            providerParameters:
              aws:
                networkLoadBalancer: {}
                type: NLB
              type: AWS
            scope: Internal
          type: LoadBalancerService
      
      4. Repeatedly curl the route from a pod inside the cluster: some requests succeed and others time out.
      $ oc rsh console-operator-86786df488-w6fks
      Defaulted container "console-operator" out of: console-operator, conversion-webhook-server
      
      sh-4.4$ curl https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com -k -I
      HTTP/1.1 200 OK
      
      sh-4.4$ curl https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com -k -I
      Connection timed out
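
The load balancer setting from step 3 can also be read with a single query instead of inspecting the full YAML. This is a hedged sketch: it assumes the IngressController is named `default` and lives in the `openshift-ingress-operator` namespace, neither of which is stated in the report, and the helper names `ingress_lb_summary` and `is_internal_nlb` are hypothetical.

```shell
# ingress_lb_summary [NAME]: prints "<aws type> <scope>", e.g. "NLB Internal".
# Assumption: the IngressController is in openshift-ingress-operator and is
# called "default" unless another name is passed.
ingress_lb_summary() (
  name=${1:-default}
  oc -n openshift-ingress-operator get ingresscontroller "$name" \
    -o jsonpath='{.spec.endpointPublishingStrategy.loadBalancer.providerParameters.aws.type}{" "}{.spec.endpointPublishingStrategy.loadBalancer.scope}{"\n"}'
)

# is_internal_nlb [NAME]: succeeds only when the summary is "NLB Internal",
# which is the configuration shown in step 3.
is_internal_nlb() {
  [ "$(ingress_lb_summary "$@")" = "NLB Internal" ]
}
```

A call such as `is_internal_nlb && echo "affected configuration"` would then flag a cluster whose spec matches the one shown in step 3.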
      
      

      Expected results:

      1. co/console should be stable, and curl requests to the console route should always succeed.
      2. qe-e2e-test should not fail
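
One way to check the first expectation is to sample the console operator's Available condition over time and count how often it changes. The sketch below is an illustration only: `co_available` and `count_flaps` are hypothetical helper names, and the defaults of 30 samples taken 10 seconds apart are assumed values.

```shell
# co_available NAME: prints the status of the Available condition
# ("True" or "False") for the given cluster operator.
co_available() {
  oc get clusteroperator "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# count_flaps NAME [SAMPLES] [INTERVAL_SECONDS]: prints how many times the
# Available status changed between consecutive samples.
count_flaps() (
  name=$1
  samples=${2:-30}
  interval=${3:-10}
  prev=""
  flaps=0
  i=0
  while [ "$i" -lt "$samples" ]; do
    cur=$(co_available "$name")
    if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
      flaps=$((flaps + 1))
    fi
    prev=$cur
    i=$((i + 1))
    # Wait between samples, but not after the last one.
    if [ "$i" -lt "$samples" ]; then
      sleep "$interval"
    fi
  done
  echo "$flaps"
)
```

On a stable cluster, `count_flaps console` would be expected to print 0; any higher number means the Available status changed during the sampling window.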

      Additional info:

      qe-e2e-test on the cluster:
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/45369/rehearse-45369-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-stable-aws-rosa-sts-hypershift-sec-guest-prod-private-link-full-f2/1724307074235502592
       

            agarcial@redhat.com Alberto Garcia Lamela
            openshift-crt-jira-prow OpenShift Prow Bot
            He Liu He Liu