ACM-4644: hypershift on BM :: Canary route checks for the default ingress controller are failing


    • OCPSTRAT-618 - [GA] Self-managed Hosted Control Planes support for BM using the Agent Provider

      Description of the problem:

      As part of the ecosystem QE testing of hypershift clusters on real bare metal, we tried several times, across several environments, to deploy a hypershift cluster with 1 or 2 workers.

      While the cluster is created successfully, we never reach a state where the ingress operator is functioning properly, even though we set up all the necessary DNS entries (A records). The route hostname is never reachable, even though we can ping the address it resolves to:

      $ oc get co
      NAME                    VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE  MESSAGE
      console                  4.12.9  False    False     False   119m  RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.registry.ocp-edge4.lab.eng.tlv2.redhat.com): Get "https://console-openshift-console.apps.registry.ocp-edge4.lab.eng.tlv2.redhat.com": dial tcp 10.46.29.134:443: connect: no route to host
      csi-snapshot-controller          4.12.9  True    False     False   3h16m  
      dns                    4.12.9  True    False     False   118m   
      image-registry               4.12.9  True    False     False   119m   
      ingress                  4.12.9  True    False     True    9m59s  The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
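
      For reference, these are the kinds of checks we run (against the hosted cluster's kubeconfig) to confirm that DNS resolves and the address answers pings while HTTPS to the route still fails, and to inspect the canary failure itself. This is an illustrative sketch: the hostname/IP are the ones from this environment, and the object names and namespaces are the standard OpenShift defaults.

      # DNS resolves and the address answers pings, but HTTPS to the route fails:
      $ dig +short console-openshift-console.apps.registry.ocp-edge4.lab.eng.tlv2.redhat.com
      $ ping -c1 10.46.29.134
      $ curl -vk https://console-openshift-console.apps.registry.ocp-edge4.lab.eng.tlv2.redhat.com

      # Inspect the default ingress controller, the canary route, and how the router is exposed:
      $ oc get ingresscontroller default -n openshift-ingress-operator -o yaml
      $ oc get route canary -n openshift-ingress-canary
      $ oc -n openshift-ingress get pods -o wide
      $ oc -n openshift-ingress get svc router-default
      # The canary check is essentially an HTTPS GET against the canary route host:
      $ curl -k https://canary-openshift-ingress-canary.apps.registry.ocp-edge4.lab.eng.tlv2.redhat.com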

      The same error is shown when we try to deploy a cluster with MetalLB and use one of the load-balancer IPs as the API endpoint.
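
      On the MetalLB attempt we also verify that the address pool exists and that the LoadBalancer service actually received an external IP; a minimal sketch, assuming MetalLB is installed in the default metallb-system namespace:

      # Check the MetalLB pools/advertisements and the LoadBalancer services:
      $ oc -n metallb-system get ipaddresspools.metallb.io,l2advertisements.metallb.io
      $ oc get svc -A | grep LoadBalancer
      # The EXTERNAL-IP column should show the pool address that the API record points at.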

      I'm opening this incident report to track this effort and to see whether there is something we are missing here.

      I'm attaching must-gather logs so we can check whether ingress on BM behaves differently from KubeVirt, for example, where this works as expected.

      Also, this works on libvirt as well, so our thought is that the cause of the issue is that the workers and the hub are not on the exact same network. If that is by design or expected, we would appreciate any comments. Thanks.
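
      To check that theory, a quick probe from one of the hosted-cluster workers toward the address that the *.apps records resolve to (10.46.29.134 in the console error above) shows whether a path exists at all; a sketch, run with the hosted cluster's kubeconfig and a placeholder node name:

      $ oc debug node/<worker-node> -- chroot /host ping -c1 10.46.29.134
      $ oc debug node/<worker-node> -- chroot /host curl -sk -o /dev/null -w '%{http_code}\n' https://10.46.29.134:443
      # "connect: no route to host" here would point at the network split between the hub and the workers.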

      Version-Release number of selected component:

      • advanced-cluster-management.v2.7.3 - 2.7.3-DOWNSTREAM-2023-03-22-19-15-09
      • 4.12.0-0.nightly-2023-03-21-173554
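
      For reference, the versions above are read from the hub with the usual commands (assuming ACM is installed in the default open-cluster-management namespace):

      $ oc get csv -n open-cluster-management | grep advanced-cluster-management
      $ oc get clusterversion version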

      How reproducible:

      100%

      Steps to reproduce:

      1. Deploy hub with 3 masters on real BM

      2. Install ACM and hypershift operator

      3. Deploy a hypershift cluster with 1 or 2 real BM workers (see the command sketch below)
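
      For step 3 we create the hosted cluster with the hypershift CLI against the agent platform; a rough sketch of the command (flag names can differ between hypershift versions, and all values below are placeholders for this environment):

      $ hypershift create cluster agent \
          --name <hosted-cluster-name> \
          --namespace clusters \
          --agent-namespace <agent-namespace> \
          --base-domain <base-domain> \
          --pull-secret ./pull-secret.json \
          --release-image quay.io/openshift-release-dev/ocp-release:4.12.9-x86_64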

       

      must-gather: https://drive.google.com/drive/folders/1NH0Q5nmARQ5c9P-qTKd211p-38OiNsNY?usp=sharing 
