OCPBUGS-23300

Internal NLB issue (OCPBUGS-9026) causes random failures on HCP private cluster without infra nodes

    • Critical
    • No
    • False
    • Previously, if the console Operator and Ingress pods were located on the same node, the console Operator would fail and mark the console cluster Operator as unavailable. With this release, if the console Operator and Ingress pods are located on the same node, the console Operator no longer fails. (link:https://issues.redhat.com/browse/OCPBUGS-23300[*OCPBUGS-23300*])
    • Bug Fix
    • Done

      Description of problem:

      This issue has the same root cause as https://issues.redhat.com/browse/OCPBUGS-9026, but I'd like to open a new one because the issue has become critical now that ROSA uses NLB by default since 4.14. HCP (HyperShift) private clusters without infra nodes are hit hardest: they have worker nodes only, and no workaround is currently available.

      If the old bug can be used to track this issue, please close this one.

      Version-Release number of selected component (if applicable):

      4.14.1
      HyperShift Private cluster

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create a ROSA HCP (HyperShift) cluster.
      2. Run qe-e2e-test on this cluster, or curl a route from a pod inside the cluster.

      Actual results:

      1. The co/console status is flapping because the route is only intermittently accessible.
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.1    True        False         4h56m   Error while reconciling 4.14.1: the cluster operator console is not available
      
      
      2. Check the nodes and router pods: a router pod is running on each of the two worker nodes.
      $ oc get node
      NAME                          STATUS   ROLES    AGE    VERSION
      ip-10-0-49-184.ec2.internal   Ready    worker   5h5m   v1.27.6+f67aeb3
      ip-10-0-63-210.ec2.internal   Ready    worker   5h8m   v1.27.6+f67aeb3
      
      $ oc -n openshift-ingress get pod -owide
      NAME                              READY   STATUS    RESTARTS   AGE    IP           NODE                          NOMINATED NODE   READINESS GATES
      router-default-86d569bf84-bq66f   1/1     Running   0          5h8m   10.130.0.7   ip-10-0-49-184.ec2.internal   <none>           <none>
      router-default-86d569bf84-v54hp   1/1     Running   0          5h8m   10.128.0.9   ip-10-0-63-210.ec2.internal   <none>           <none>
      
      3. Check the ingresscontroller LB setting: it uses an internal NLB.
      
      spec:
        endpointPublishingStrategy:
          loadBalancer:
            dnsManagementPolicy: Managed
            providerParameters:
              aws:
                networkLoadBalancer: {}
                type: NLB
              type: AWS
            scope: Internal
          type: LoadBalancerService
      
      4. Repeatedly curl the route from a pod inside the cluster: some requests succeed and others time out.
      $ oc rsh console-operator-86786df488-w6fks
      Defaulted container "console-operator" out of: console-operator, conversion-webhook-server
      
      sh-4.4$ curl https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com -k -I
      HTTP/1.1 200 OK
      
      sh-4.4$ curl https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com -k -I
      Connection timed out
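
The intermittent behaviour in step 4 can be quantified with a small loop instead of repeating curl by hand. The following is a minimal sketch and not part of the original report: `probe_route` is a hypothetical helper name, and the default request count and timeout are arbitrary assumptions. It uses only standard curl options (`-k`, `-I`, `-s`, `-o`, `--max-time`) and is meant to be run from a pod inside the cluster.

```shell
#!/bin/sh
# probe_route URL [COUNT] [TIMEOUT]
# Hypothetical helper: sends COUNT HEAD requests to URL and prints how many
# connections succeeded and how many failed (for example, by timing out).
probe_route() {
    url=$1
    count=${2:-20}      # assumed default: 20 requests
    timeout=${3:-5}     # assumed default: 5 seconds per request
    ok=0
    fail=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # -k skips TLS verification and -I sends a HEAD request, as in the report.
        # -s silences progress output, -o /dev/null discards the response headers,
        # and --max-time bounds each attempt so a hung connection counts as a failure.
        # Without -f, curl exits 0 for any HTTP response, so only connection-level
        # errors such as timeouts are counted as failures.
        if curl -k -I -s -o /dev/null --max-time "$timeout" "$url"; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    echo "ok=$ok fail=$fail total=$count"
}

# Example invocation (commented out; substitute the console route of the cluster):
# probe_route https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com 20 5
```

A non-zero `fail` count alongside a non-zero `ok` count would match the flapping described above, whereas a route that is simply down would fail every request.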
      
      

      Expected results:

      1. co/console should be stable, and curl requests to the console route should always succeed.
      2. qe-e2e-test should not fail.

      Additional info:

      qe-e2e-test on the cluster:
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/45369/rehearse-45369-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-stable-aws-rosa-sts-hypershift-sec-guest-prod-private-link-full-f2/1724307074235502592
       

              jhadvig@redhat.com Jakub Hadvig
              rhn-support-hongli Hongan Li
              He Liu He Liu