  OpenShift Bugs
  OCPBUGS-21803

Ingress stuck in progressing when maxConnections increased to 2000000

    • Bug
    • Resolution: Done-Errata
    • Critical
    • 4.15.0
    • 4.14, 4.14.z, 4.15
    • Networking / router
    • None
    • Yes
    • Sprint 243, Sprint 244
    • 2
    • Approved
    • False
      * The transition to HAProxy 2.6 enforced the `strict-limits` configuration, which resulted in fatal errors when the `maxConnections` requirement could not be met. This release adjusts the HAProxy configuration to address the resulting `maxConnections` issues.
      +
      With this update, the HAProxy configuration switches to using `no strict-limits`. As a result, HAProxy no longer fatally exits when the `maxConnections` setting cannot be satisfied. Instead, it emits warnings and continues running. When the `maxConnections` limit cannot be met, warnings like the following might be returned:
      *
      `[WARNING] (50) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000237, limit is 1048576.
      [ALERT] (50) : [/usr/sbin/haproxy.main()] FD limit (1048576) too low for maxconn=2000000/maxsock=4000237. Please raise 'ulimit-n' to 4000237 or more to avoid any trouble.`
      +
      To resolve these warnings, we recommend specifying `-1` or `auto` for the `maxConnections` field when tuning an IngressController. This choice allows HAProxy to dynamically calculate the maximum value based on the resource limits available in the running container, which eliminates these warnings (see the sketch after this note).
      +
      [IMPORTANT]
      ====
      * The `strict-limits` setting is not configurable by end users and remains under the control of the HAProxy template.
      ====
      +
      (https://issues.redhat.com/browse/OCPBUGS-21803[*OCPBUGS-21803*])
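
      A minimal sketch of applying the recommended value with a merge patch, assuming the default IngressController in the openshift-ingress-operator namespace:

      # Set maxConnections to -1 so HAProxy computes maxconn from the limits
      # available in the router container (the "auto" behaviour described above).
      oc -n openshift-ingress-operator patch ingresscontroller/default \
        --type=merge \
        -p '{"spec":{"tuningOptions":{"maxConnections":-1}}}'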

    • Bug Fix
    • Done

      Description of problem:

      The test case https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926 was created for the NE-577 epic. When we increase spec.tuningOptions.maxConnections to 2000000, the default ingress controller gets stuck in Progressing.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926

      Steps to Reproduce:

      1. Edit the default ingress controller and set maxConnections to 2000000 (an equivalent non-interactive patch is sketched after these steps):
         oc -n openshift-ingress-operator edit ingresscontroller default
           tuningOptions:
             maxConnections: 2000000
      2. melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml | grep  -A1 tuningOptions
        tuningOptions:
          maxConnections: 2000000
      3. melvinjoseph@mjoseph-mac openshift-tests-private % oc get co/ingress 
      NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      ingress   4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h42m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
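      
      An equivalent, non-interactive way to apply step 1, sketched here as a merge patch against the default IngressController:
      
      # Same change as step 1, without opening an editor.
      oc -n openshift-ingress-operator patch ingresscontroller/default \
        --type=merge \
        -p '{"spec":{"tuningOptions":{"maxConnections":2000000}}}'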
      

      Actual results:

      The default ingress controller is stuck in Progressing.

      Expected results:

      The ingress controller should continue to work normally.

      Additional info:

      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
      NAME                              READY   STATUS        RESTARTS   AGE
      router-default-7cf67f448-gb7mr    0/1     Running       0          38s
      router-default-7cf67f448-qmvks    0/1     Running       0          38s
      router-default-7dcd556587-kvk8d   0/1     Terminating   0          3h53m
      router-default-7dcd556587-vppk4   1/1     Running       0          3h53m
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
      NAME                              READY   STATUS    RESTARTS   AGE
      router-default-7cf67f448-gb7mr    0/1     Running   0          111s
      router-default-7cf67f448-qmvks    0/1     Running   0          111s
      router-default-7dcd556587-vppk4   1/1     Running   0          3h55m
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h28m   
      baremetal                                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      cloud-controller-manager                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h58m   
      cloud-credential                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h59m   
      cluster-autoscaler                         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      config-operator                            4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      console                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h34m   
      control-plane-machine-set                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      csi-snapshot-controller                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      dns                                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      etcd                                       4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h47m   
      image-registry                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      176m    
      ingress                                    4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h39m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
      insights                                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h49m   
      kube-apiserver                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m   
      kube-controller-manager                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m   
      kube-scheduler                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m   
      kube-storage-version-migrator              4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      machine-api                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m   
      machine-approver                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      machine-config                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h53m   
      marketplace                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      monitoring                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h35m   
      network                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h57m   
      node-tuning                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      openshift-apiserver                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      openshift-controller-manager               4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      openshift-samples                          4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      operator-lifecycle-manager                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      service-ca                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      storage                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h36m   
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get po
      NAME                               READY   STATUS    RESTARTS        AGE
      ingress-operator-c6fd989fd-jsrzv   2/2     Running   4 (3h45m ago)   3h58m
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator logs ingress-operator-c6fd989fd-jsrzv -c ingress-operator --tail=20
      2023-10-17T11:34:54.327Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.348Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.348Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.394Z    INFO    operator.ingressclass_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.394Z    INFO    operator.route_metrics_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.394Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.397Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.429Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.446Z    INFO    operator.certificate_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.ingressclass_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.route_metrics_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.557Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "59m59.9999758s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      2023-10-17T11:34:54.558Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.583Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.657Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "59m59.345629987s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      2023-10-17T11:34:54.794Z    INFO    operator.certificate_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:36:11.151Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:36:11.151Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:36:11.248Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "58m42.755479533s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
       
      melvinjoseph@mjoseph-mac openshift-tests-private % oc get po -n openshift-ingress
      NAME                              READY   STATUS    RESTARTS      AGE
      router-default-7cf67f448-gb7mr    0/1     Running   1 (71s ago)   3m57s
      router-default-7cf67f448-qmvks    0/1     Running   1 (70s ago)   3m57s
      router-default-7dcd556587-vppk4   1/1     Running   0             3h57m
      
      melvinjoseph@mjoseph-mac openshift-tests-private %   oc -n openshift-ingress logs router-default-7cf67f448-gb7mr --tail=20 
      I1017 11:39:22.623928       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:23.623924       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:24.623373       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:25.627359       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:26.623337       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:27.623603       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:28.623866       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:29.623183       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:30.623475       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:31.623949       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private %   oc -n openshift-ingress logs router-default-7cf67f448-qmvks --tail=20
      I1017 11:39:34.553475       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:35.551412       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:36.551421       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      E1017 11:39:37.052068       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
      I1017 11:39:37.551648       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:38.551632       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:39.551410       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:40.552620       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:41.552050       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:42.551076       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:42.564293       1 template.go:828] router "msg"="Shutdown requested, waiting 45s for new connections to cease" 
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller 
      NAME      AGE
      default   3h59m
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml
      apiVersion: operator.openshift.io/v1
      <-----snip---->
      status:
        availableReplicas: 1
        conditions:
        - lastTransitionTime: "2023-10-17T07:41:42Z"
          reason: Valid
          status: "True"
          type: Admitted
        - lastTransitionTime: "2023-10-17T07:57:01Z"
          message: The deployment has Available status condition set to True
          reason: DeploymentAvailable
          status: "True"
          type: DeploymentAvailable
        - lastTransitionTime: "2023-10-17T07:57:01Z"
          message: Minimum replicas requirement is met
          reason: DeploymentMinimumReplicasMet
          status: "True"
          type: DeploymentReplicasMinAvailable
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: 1/2 of replicas are available
          reason: DeploymentReplicasNotAvailable
          status: "False"
          type: DeploymentReplicasAllAvailable
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: |
            Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
          reason: DeploymentRollingOut
          status: "True"
          type: DeploymentRollingOut
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: The endpoint publishing strategy supports a managed load balancer
          reason: WantedByEndpointPublishingStrategy
          status: "True"
          type: LoadBalancerManaged
        - lastTransitionTime: "2023-10-17T07:57:24Z"
          message: The LoadBalancer service is provisioned
          reason: LoadBalancerProvisioned
          status: "True"
          type: LoadBalancerReady
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: LoadBalancer is not progressing
          reason: LoadBalancerNotProgressing
          status: "False"
          type: LoadBalancerProgressing
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: DNS management is supported and zones are specified in the cluster DNS
            config.
          reason: Normal
          status: "True"
          type: DNSManaged
        - lastTransitionTime: "2023-10-17T07:57:26Z"
          message: The record is provisioned in all reported zones.
          reason: NoFailedZones
          status: "True"
          type: DNSReady
        - lastTransitionTime: "2023-10-17T07:57:26Z"
          status: "True"
          type: Available
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: |-
            One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
            )
          reason: IngressControllerProgressing
          status: "True"
          type: Progressing
        - lastTransitionTime: "2023-10-17T07:57:28Z"
          status: "False"
          type: Degraded
        - lastTransitionTime: "2023-10-17T07:41:43Z"
      <-----snip---->

       

            [OCPBUGS-21803] Ingress stuck in progressing when maxConnections increased to 2000000

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.15.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:7198


            Hongan Li added a comment -

            shudili@redhat.com it seems there is no accepted 4.15 nightly build so far, but CI build 4.15.0-0.ci-2023-10-18-220257 has the fix. Could you verify this with the CI build, since it is a blocker bug?

            OpenShift Jira Bot added a comment -

            Hi amcdermo@redhat.com,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.

            Andrew McDermott added a comment - - edited

            I did start with this untested patch to haproxy (https://gist.github.com/frobware/2fc5d5f27e639c1f04f7ad868923fd3f), but I changed my approach when I realised that we could potentially fix the issue by adding 'no strict-limits' to the config file. The two linked PRs, and the proposed fix, use a template change:

            https://github.com/openshift/cluster-ingress-operator/pull/983
            https://github.com/openshift/router/pull/527
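
            For reference, a quick way to confirm the template change in a running router pod, assuming the fixed template emits `no strict-limits` in the global section of haproxy.config:

            # Look for the new keyword in the generated HAProxy configuration.
            oc -n openshift-ingress rsh deploy/router-default \
              grep -n 'strict-limits' haproxy.config
            # With the fix in place, the output should include a "no strict-limits" line.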

            Andrew McDermott added a comment -

            calfonso@redhat.com the proposal is to revert and carry a haproxy patch that will turn the fatal error back to a warning, as it was in haproxy-2.2. I will first do the patch for 4.15, then apply to 4.14. ETA for patches? Tentatively tomorrow.

            Andrew McDermott added a comment - - edited

            In 4.14, if you use 2000000 for maxconn we see:

             

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.14.0-0.nightly-2023-10-13-073537
            Kubernetes Version: v1.27.6+98158f9
            
            sh-4.4$ head haproxy.config
            global
              maxconn 2000000
              nbthread 4
            
            sh-4.4$ vi haproxy.config
            sh-4.4$ ../reload-haproxy
            [NOTICE]   (48) : haproxy version is 2.6.13-234aa6d
            [NOTICE]   (48) : path to executable is /usr/sbin/haproxy
            [ALERT]    (48) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000237, limit is 1048576.
            

            whereas in 4.13 we see:

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.13.14
            Kubernetes Version: v1.26.9+52589e6
            
            % oc rsh -n openshift-ingress router-default-85976855d7-fw8h2
            Defaulted container "router" out of: router, logs
            sh-4.4$ head haproxy.config
            global
              maxconn 2000000
              nbthread 1
            
            sh-4.4$ ../reload-haproxy
            [WARNING] 289/172332 (418) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000411, limit is 1048576. This will fail in >= v2.3
            [NOTICE] 289/172332 (418) : haproxy version is 2.2.24-26b8015
            [NOTICE] 289/172332 (418) : path to executable is /usr/sbin/haproxy
            [ALERT] 289/172332 (418) : [/usr/sbin/haproxy.main()] FD limit (1048576) too low for maxconn=2000000/maxsock=4000411. Please raise 'ulimit-n' to 4000411 or more to avoid any trouble.This will fail in >= v2.3
             - Checking http://localhost:80 ...
             - Health check ok : 0 retry attempt(s).
            

            The root cause is the bump from haproxy-2.2 => 2.6. Per the log message, this was previously a warning, now fatal. Investigating.


            Miciah Masters added a comment -

            I have marked this as a 4.14.0 blocker because this appears to be a regression that could break clusters where the cluster-admin has set maxConnections to a value larger than 1048576 and then upgrades to 4.14.

            Andrew McDermott added a comment -

            I was also unable to reproduce on 4.13.

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.13.12
            Kubernetes Version: v1.26.7+0ef5eae
            
            % oc get pods -n openshift-ingress
            NAME                                       READY   STATUS    RESTARTS   AGE
            helloworld-1                               1/1     Running   5          21d
            router-default-7454dd97c5-sck87            2/2     Running   0          7m31s
            router-default-7454dd97c5-sg9mz            2/2     Running   0          8m55s
            router-my-custom-router-68657bf69d-vkk4t   1/1     Running   1          7d5h
            [spicy] ~nix-config
            % oc rsh -n openshift-ingress router-default-7454dd97c5-sck87 head haproxy.config
            Defaulted container "router" out of: router, logs
            global
              maxconn 2000000
              nbthread 1
            

             


            Melvin Joseph added a comment -

            This issue is not there in 4.13. I ran automation on 4.13 and it is passing.

            The same automation is failing in 4.14 and 4.15.

            passed: (8m28s) 2023-10-17T14:14:32 "[sig-network-edge] Network_Edge should Author:shudili-NonPreRelease-Longduration-High-50926-Support a Configurable ROUTER_MAX_CONNECTIONS in HAproxy"

            Writing JUnit report to junit_e2e_20231017-141432.xml

            1 pass, 0 skip (8m28s)
            melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.13.0-0.nightly-2023-10-14-180128   True        False         False      5h12m   


            Andrew McDermott added a comment - - edited

            The issue is that the container limit is 1048576.

            sh-4.4$ ulimit -a
            core file size          (blocks, -c) unlimited
            data seg size           (kbytes, -d) unlimited
            scheduling priority             (-e) 0
            file size               (blocks, -f) unlimited
            pending signals                 (-i) 63606
            max locked memory       (kbytes, -l) 8192
            max memory size         (kbytes, -m) unlimited
            open files                      (-n) 1048576
            pipe size            (512 bytes, -p) 8
            POSIX message queues     (bytes, -q) 819200
            real-time priority              (-r) 0
            stack size              (kbytes, -s) 8192
            cpu time               (seconds, -t) unlimited
            max user processes              (-u) 4194304
            virtual memory          (kbytes, -v) unlimited
            file locks                      (-x) unlimited
            

            You would need to use tuned to raise the limit on the node and, in turn, for all containers.

             

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.14.0-0.nightly-2023-10-13-073537
            Kubernetes Version: v1.27.6+98158f9
            


              amcdermo@redhat.com Andrew McDermott
              rhn-support-mjoseph Melvin Joseph
              Shudi Li Shudi Li
              Shudi Li
              Votes: 0
              Watchers: 11

                Created:
                Updated:
                Resolved: