OCPBUGS-21898: Ingress stuck in progressing when maxConnections increased to 2000000


Details

    • Bug
    • Resolution: Done-Errata
    • Critical
    • 4.14.0
    • 4.14, 4.14.z, 4.15
    • Networking / router
    • None
    • 2
    • Sprint 243
    • 1
    • No
    • Approved
    • False
      This release introduces a necessary configuration adjustment required by the transition to HAProxy 2.6. The change addresses a specific issue with 'maxconn' resource limits.

      Change Details:

      - Adjusted HAProxy behavior: the configuration now uses 'no strict-limits', so HAProxy no longer exits fatally when 'maxconn' cannot be satisfied. Instead, it emits warnings and continues running.

      Warnings and Resolution:

      In some cases, when 'maxconn' limits cannot be met, HAProxy may emit warnings like the following:

      {noformat}
      [WARNING] (50) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000237, limit is 1048576.
      [ALERT] (50) : [/usr/sbin/haproxy.main()] FD limit (1048576) too low for maxconn=2000000/maxsock=4000237. Please raise 'ulimit-n' to 4000237 or more to avoid any trouble.
      {noformat}
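The numbers in these messages follow from HAProxy's file-descriptor accounting: each connection needs roughly two sockets (one on the frontend side, one on the backend side), plus a fixed overhead for listeners, health checks, and internal pipes. A minimal sketch of the arithmetic behind the values above (the exact overhead of 237 descriptors depends on the listener and check configuration, so treat it as illustrative):

```shell
# Sketch of HAProxy's FD accounting for the log lines above.
# ~2 sockets per connection, plus a config-dependent fixed overhead
# (listeners, health checks, internal pipes).
maxconn=2000000
overhead=237
maxsock=$((2 * maxconn + overhead))
echo "maxsock=$maxsock"   # matches the 4000237 in the ALERT line
```

Since the container's FD limit here is 1048576, any maxconn above roughly (1048576 - overhead) / 2 cannot be satisfied.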

      To resolve these warnings, we recommend specifying '-1' (or 'auto') for maxConnections when tuning an IngressController. This allows HAProxy to calculate the maximum dynamically from the resource limits available in the running container, eliminating the warnings.
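As a sketch, resetting the field on the default IngressController from this report could look like the following (assumes cluster-admin access and the merge-patch form of `oc patch`):

```shell
# Set maxConnections to -1 ("auto") so HAProxy derives maxconn from
# the container's actual FD limit instead of a fixed value.
oc -n openshift-ingress-operator patch ingresscontroller/default \
  --type=merge \
  -p '{"spec":{"tuningOptions":{"maxConnections":-1}}}'
```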

      Important Note:

      Please be aware that the 'strict-limits' setting is not configurable
      by end users and remains under the control of the HAProxy template.

      Impact:

      The transition to HAProxy 2.6 enabled 'strict-limits' enforcement, which caused fatal errors when 'maxconn' requirements could not be met. Switching to 'no strict-limits' ensures that HAProxy continues to operate, emitting warnings instead of failing fatally when 'maxconn' limits cannot be satisfied.
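In the rendered haproxy.config, the change amounts to the global section carrying the directive. A hypothetical excerpt (the actual template output differs and, as noted below, is not user-configurable):

```
global
  maxconn 2000000
  no strict-limits   # warn instead of exiting when the FD limit cannot be raised
```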
    • Bug Fix

    Description

      This is a clone of issue OCPBUGS-21803. The following is the description of the original issue:

      Description of problem:

      The test case https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926 was created for the NE-577 epic. When 'spec.tuningOptions.maxConnections' is increased to 2000000, the default ingress controller gets stuck in Progressing.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926

      Steps to Reproduce:

      1. Edit the default ingress controller and set the max value to 2000000:
         oc -n openshift-ingress-operator edit ingresscontroller default
           tuningOptions:
             maxConnections: 2000000
      2. melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml | grep  -A1 tuningOptions
        tuningOptions:
          maxConnections: 2000000
      3. melvinjoseph@mjoseph-mac openshift-tests-private % oc get co/ingress 
      NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      ingress   4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h42m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
      

      Actual results:

      The default ingress controller is stuck in Progressing.

      Expected results:

      The ingress controller should complete the rollout and work as normal.

      Additional info:

      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
      NAME                              READY   STATUS        RESTARTS   AGE
      router-default-7cf67f448-gb7mr    0/1     Running       0          38s
      router-default-7cf67f448-qmvks    0/1     Running       0          38s
      router-default-7dcd556587-kvk8d   0/1     Terminating   0          3h53m
      router-default-7dcd556587-vppk4   1/1     Running       0          3h53m
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
      NAME                              READY   STATUS    RESTARTS   AGE
      router-default-7cf67f448-gb7mr    0/1     Running   0          111s
      router-default-7cf67f448-qmvks    0/1     Running   0          111s
      router-default-7dcd556587-vppk4   1/1     Running   0          3h55m
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h28m   
      baremetal                                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      cloud-controller-manager                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h58m   
      cloud-credential                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h59m   
      cluster-autoscaler                         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      config-operator                            4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      console                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h34m   
      control-plane-machine-set                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      csi-snapshot-controller                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      dns                                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      etcd                                       4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h47m   
      image-registry                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      176m    
      ingress                                    4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h39m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
      insights                                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h49m   
      kube-apiserver                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m   
      kube-controller-manager                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m   
      kube-scheduler                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m   
      kube-storage-version-migrator              4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      machine-api                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m   
      machine-approver                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      machine-config                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h53m   
      marketplace                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      monitoring                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h35m   
      network                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h57m   
      node-tuning                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      openshift-apiserver                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      openshift-controller-manager               4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      openshift-samples                          4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      operator-lifecycle-manager                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      service-ca                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      storage                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h36m   
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get po
      NAME                               READY   STATUS    RESTARTS        AGE
      ingress-operator-c6fd989fd-jsrzv   2/2     Running   4 (3h45m ago)   3h58m
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator logs ingress-operator-c6fd989fd-jsrzv -c ingress-operator --tail=20
      2023-10-17T11:34:54.327Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.348Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.348Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.394Z    INFO    operator.ingressclass_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.394Z    INFO    operator.route_metrics_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.394Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.397Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.429Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.446Z    INFO    operator.certificate_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.ingressclass_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.route_metrics_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.557Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "59m59.9999758s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      2023-10-17T11:34:54.558Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.583Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.657Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "59m59.345629987s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      2023-10-17T11:34:54.794Z    INFO    operator.certificate_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:36:11.151Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:36:11.151Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:36:11.248Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "58m42.755479533s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
       
      melvinjoseph@mjoseph-mac openshift-tests-private % oc get po -n openshift-ingress
      NAME                              READY   STATUS    RESTARTS      AGE
      router-default-7cf67f448-gb7mr    0/1     Running   1 (71s ago)   3m57s
      router-default-7cf67f448-qmvks    0/1     Running   1 (70s ago)   3m57s
      router-default-7dcd556587-vppk4   1/1     Running   0             3h57m
      
      melvinjoseph@mjoseph-mac openshift-tests-private %   oc -n openshift-ingress logs router-default-7cf67f448-gb7mr --tail=20 
      I1017 11:39:22.623928       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:23.623924       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:24.623373       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:25.627359       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:26.623337       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:27.623603       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:28.623866       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:29.623183       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:30.623475       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:31.623949       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private %   oc -n openshift-ingress logs router-default-7cf67f448-qmvks --tail=20
      I1017 11:39:34.553475       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:35.551412       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:36.551421       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      E1017 11:39:37.052068       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
      I1017 11:39:37.551648       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:38.551632       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:39.551410       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:40.552620       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:41.552050       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:42.551076       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:42.564293       1 template.go:828] router "msg"="Shutdown requested, waiting 45s for new connections to cease" 
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller 
      NAME      AGE
      default   3h59m
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml
      apiVersion: operator.openshift.io/v1
      <-----snip---->
      status:
        availableReplicas: 1
        conditions:
        - lastTransitionTime: "2023-10-17T07:41:42Z"
          reason: Valid
          status: "True"
          type: Admitted
        - lastTransitionTime: "2023-10-17T07:57:01Z"
          message: The deployment has Available status condition set to True
          reason: DeploymentAvailable
          status: "True"
          type: DeploymentAvailable
        - lastTransitionTime: "2023-10-17T07:57:01Z"
          message: Minimum replicas requirement is met
          reason: DeploymentMinimumReplicasMet
          status: "True"
          type: DeploymentReplicasMinAvailable
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: 1/2 of replicas are available
          reason: DeploymentReplicasNotAvailable
          status: "False"
          type: DeploymentReplicasAllAvailable
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: |
            Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
          reason: DeploymentRollingOut
          status: "True"
          type: DeploymentRollingOut
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: The endpoint publishing strategy supports a managed load balancer
          reason: WantedByEndpointPublishingStrategy
          status: "True"
          type: LoadBalancerManaged
        - lastTransitionTime: "2023-10-17T07:57:24Z"
          message: The LoadBalancer service is provisioned
          reason: LoadBalancerProvisioned
          status: "True"
          type: LoadBalancerReady
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: LoadBalancer is not progressing
          reason: LoadBalancerNotProgressing
          status: "False"
          type: LoadBalancerProgressing
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: DNS management is supported and zones are specified in the cluster DNS
            config.
          reason: Normal
          status: "True"
          type: DNSManaged
        - lastTransitionTime: "2023-10-17T07:57:26Z"
          message: The record is provisioned in all reported zones.
          reason: NoFailedZones
          status: "True"
          type: DNSReady
        - lastTransitionTime: "2023-10-17T07:57:26Z"
          status: "True"
          type: Available
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: |-
            One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
            )
          reason: IngressControllerProgressing
          status: "True"
          type: Progressing
        - lastTransitionTime: "2023-10-17T07:57:28Z"
          status: "False"
          type: Degraded
        - lastTransitionTime: "2023-10-17T07:41:43Z"
      <-----snip---->

       

            People

              amcdermo@redhat.com ANDREW MCDERMOTT
              openshift-crt-jira-prow OpenShift Prow Bot
              Shudi Li
