OpenShift Bugs / OCPBUGS-54159

Cluster operators degraded after workloads are moved to infra nodes because route access is failing


    • Quality / Stability / Reliability
    • Rejected
    • Sprint: NI&D Sprint 272, NI&D Sprint 274, NI&D Sprint 278
      Description of problem:

      I deployed an OCP 4.19 cluster on bare metal with 22 worker nodes and 2 infra nodes using 4.19.0-ec.3, then applied the OVN-Kubernetes BGP image built from PR openshift/ovn-kubernetes#2239 (build 4.19,openshift/ovn-kubernetes#2239).

      After 4 to 5 hours, some cluster operators become degraded:
        [root@e33-h03-000-r650 debug_oc]# oc get co | grep  "False         True"
      authentication                             4.19.0-ec.3   False       False         True       37h     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.vkommadieip29.rdu2.scalelab.redhat.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      console                                    4.19.0-ec.3   False       False         True       37h     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com): Get "https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      ingress                                    4.19.0-ec.3   True        False         True       3d13h   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing. Last 1 error messages:...
      insights                                   4.19.0-ec.3   False       False         True       33h     Failed to upload data: unable to build request to connect to Insights server: Post "https://console.redhat.com/api/ingress/v1/upload": dial tcp 23.40.100.203:443: i/o timeout
      kube-controller-manager                    4.19.0-ec.3   True        False         True       3d13h   GarbageCollectorDegraded: error fetching rules: client_error: client error: 401

      The ingress operator logs the following error:
      2025-03-24T12:12:20.904051049Z 2025-03-24T12:12:20.903Z    ERROR    operator.canary_controller    wait/backoff.go:226    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
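      To confirm the route itself is unreachable (rather than the canary controller misbehaving), the endpoint can be probed directly. This check is my addition; the hostname comes from the log line above, and <any-worker> is a placeholder node name:

      # Probe the canary route from outside the cluster.
      curl -kv --max-time 10 https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com

      # Probe the same route from a cluster node, to separate external
      # (BGP-advertised) reachability from in-cluster reachability.
      oc debug node/<any-worker> -- chroot /host curl -kv --max-time 10 https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com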

      This does not happen when the BGP patch is not applied. It blocks BGP testing on the bare-metal deployment because Prometheus is down.

       

      Version-Release number of selected component (if applicable):

      4.19.0-ec.3

      OVN-Kubernetes (OVNK) image from build 4.19,openshift/ovn-kubernetes#2239

       

      How reproducible:

      Always

       

      Steps to Reproduce:

      1. Deploy 4.19.0-ec.3 on bare metal with 24 worker nodes.

      2. oc patch featuregate cluster --type=merge -p='{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
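      A quick sanity check (my addition, not part of the original steps) that the feature set took effect:

      # Expect "TechPreviewNoUpgrade".
      oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'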

      3. oc patch Network.operator.openshift.io cluster --type=merge -p='{"spec":{"additionalRoutingCapabilities":{"providers": ["FRR"]}, "defaultNetwork":{"ovnKubernetesConfig":{"routeAdvertisements":"Enabled"}}}}'

      4. Label 2 nodes as infra and move ingress, registry, and Prometheus to the infra nodes (a sketch follows this step).
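      A minimal sketch of this step, assuming the standard infra-node pattern; node names are placeholders, and the registry and monitoring moves follow the same nodeSelector idea via their own operator configs:

      # Label the two nodes as infra.
      oc label node <node-a> <node-b> node-role.kubernetes.io/infra=

      # Move the default ingress controller onto the infra nodes.
      oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge -p='{"spec":{"nodePlacement":{"nodeSelector":{"matchLabels":{"node-role.kubernetes.io/infra":""}}}}}'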

      5. oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version

      oc -n openshift-network-operator set env deployment.apps/network-operator OVN_IMAGE=quay.io/vkommadi/bgppr2239ovnk:latest
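      To confirm the image override propagated (my addition, not in the original steps):

      # ovnkube pods should roll out with the overridden image.
      oc get pods -n openshift-ovn-kubernetes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'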

      6. git clone -b ovnk-bgp https://github.com/jcaamano/frr-k8s
      cd frr-k8s/hack/demo/

      ./demo.sh

      7. oc apply -f ~/frr-k8s/hack/demo/configs/receive_all.yaml
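      The contents of receive_all.yaml are not reproduced here; for orientation only, a receive-all neighbor configuration in the frr-k8s API generally looks like the sketch below. The ASNs and neighbor address are illustrative, and the actual file in the demo repo may differ:

      apiVersion: frrk8s.metallb.io/v1beta1
      kind: FRRConfiguration
      metadata:
        name: receive-all
        namespace: openshift-frr-k8s
      spec:
        bgp:
          routers:
          - asn: 64512
            neighbors:
            - address: 192.0.2.1
              asn: 64512
              toReceive:
                allowed:
                  mode: all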

      8. cat ~/ra.yaml
      apiVersion: k8s.ovn.org/v1
      kind: RouteAdvertisements
      metadata:
        name: default
      spec:
        networkSelector:
          matchLabels:
            k8s.ovn.org/default-network: ""
        advertisements:
        - "PodNetwork"
        - "EgressIP"

       

       oc apply -f ~/ra.yaml
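      Before the long wait in the next step, it may help to confirm the advertisements were accepted and BGP sessions are up. This is my addition; the pod label and the "frr" container name are assumptions, so adjust them to the actual pod spec:

      # The RouteAdvertisements CR should report an Accepted condition.
      oc get routeadvertisements default -o yaml

      # Check BGP session state from one of the frr-k8s pods.
      FRR_POD=$(oc -n openshift-frr-k8s get pods -l app=frr-k8s -o jsonpath='{.items[0].metadata.name}')
      oc -n openshift-frr-k8s exec "$FRR_POD" -c frr -- vtysh -c 'show bgp summary'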

      9. Wait 5 to 6 hours; some operators become degraded because of health-check failures (mainly ingress, Prometheus, authentication, console). The watch loop after these steps can catch this as it happens.
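      A simple watch loop for step 9 (a sketch I added; columns 3 and 5 of oc get clusteroperators are AVAILABLE and DEGRADED):

      # Log any operator that goes Available=False or Degraded=True, every 5 minutes.
      while true; do
        date
        oc get clusteroperators --no-headers | awk '$3=="False" || $5=="True"'
        sleep 300
      done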

       

      Actual results:

      Operator health checks fail because route access fails when BGP is enabled. We cannot run scale tests because Prometheus is down.

       

      Expected results:

      Operator health checks should not fail.

       

      Additional info:


      Affected Platforms:

      This is an internal Red Hat testing failure.

       

              Assignee: Miciah Masters (mmasters1@redhat.com)
              Reporter: Venkata Anil Kumar Kommaddi (vkommadi@redhat.com)