  OpenShift Bugs
  OCPBUGS-21803

Ingress stuck in progressing when maxConnections increased to 2000000

    • Bug
    • Resolution: Done-Errata
    • Critical
    • 4.15.0
    • 4.14, 4.14.z, 4.15
    • Networking / router
    • None
    • Yes
    • Sprint 243, Sprint 244
    • 2
    • Approved
    • False
      * The transition to HAProxy 2.6 enforced the `strict-limits` configuration, which resulted in fatal errors when the `maxConnections` requirement could not be met. This release adjusts the HAProxy configuration to address the resulting `maxConnections` issues.
      +
      With this update, the HAProxy configuration switches to using `no strict-limits`. As a result, HAProxy no longer fatally exits when the `maxConnections` setting cannot be satisfied. Instead, it emits warnings and continues running. When the `maxConnections` limit cannot be met, warnings like the following might be returned:
      *
      `[WARNING] (50) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000237, limit is 1048576.
      [ALERT] (50) : [/usr/sbin/haproxy.main()] FD limit (1048576) too low for maxconn=2000000/maxsock=4000237. Please raise 'ulimit-n' to 4000237 or more to avoid any trouble.`
      +
      To resolve these warnings, we recommend specifying `-1` or `auto` for the `maxConnections` field when tuning an IngressController. This choice allows HAProxy to dynamically calculate the maximum value based on the resource limits available in the running container, which eliminates these warnings (see the sketch after this note).
      +
      [IMPORTANT]
      ====
      * The `strict-limits` setting is not configurable by end users and remains under the control of the HAProxy template.
      ====
      +
      (https://issues.redhat.com/browse/OCPBUGS-21803[*OCPBUGS-21803*])
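
      A minimal sketch of applying the recommended value with a merge patch, assuming the default IngressController in the openshift-ingress-operator namespace:

      # Set maxConnections to -1 so HAProxy computes maxconn from the limits
      # available in the router container (the "auto" behaviour described above).
      oc -n openshift-ingress-operator patch ingresscontroller/default \
        --type=merge \
        -p '{"spec":{"tuningOptions":{"maxConnections":-1}}}'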

    • Bug Fix
    • Done

      Description of problem:

      The test case https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926 was created for the NE-577 epic. When we increase spec.tuningOptions.maxConnections to 2000000, the default ingress controller gets stuck in Progressing.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926

      Steps to Reproduce:

      1. Edit the default ingress controller and set maxConnections to 2000000 (an equivalent non-interactive patch is sketched after these steps):
         oc -n openshift-ingress-operator edit ingresscontroller default
           tuningOptions:
             maxConnections: 2000000
      2. melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml | grep  -A1 tuningOptions
        tuningOptions:
          maxConnections: 2000000
      3. melvinjoseph@mjoseph-mac openshift-tests-private % oc get co/ingress 
      NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      ingress   4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h42m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
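      
      An equivalent, non-interactive way to apply step 1, sketched here as a merge patch against the default IngressController:
      
      # Same change as step 1, without opening an editor.
      oc -n openshift-ingress-operator patch ingresscontroller/default \
        --type=merge \
        -p '{"spec":{"tuningOptions":{"maxConnections":2000000}}}'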
      

      Actual results:

      The default ingress controller is stuck in Progressing.

      Expected results:

      The ingress controller should continue to work normally.

      Additional info:

      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
      NAME                              READY   STATUS        RESTARTS   AGE
      router-default-7cf67f448-gb7mr    0/1     Running       0          38s
      router-default-7cf67f448-qmvks    0/1     Running       0          38s
      router-default-7dcd556587-kvk8d   0/1     Terminating   0          3h53m
      router-default-7dcd556587-vppk4   1/1     Running       0          3h53m
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
      NAME                              READY   STATUS    RESTARTS   AGE
      router-default-7cf67f448-gb7mr    0/1     Running   0          111s
      router-default-7cf67f448-qmvks    0/1     Running   0          111s
      router-default-7dcd556587-vppk4   1/1     Running   0          3h55m
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h28m   
      baremetal                                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      cloud-controller-manager                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h58m   
      cloud-credential                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h59m   
      cluster-autoscaler                         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      config-operator                            4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      console                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h34m   
      control-plane-machine-set                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      csi-snapshot-controller                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      dns                                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      etcd                                       4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h47m   
      image-registry                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      176m    
      ingress                                    4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h39m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
      insights                                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h49m   
      kube-apiserver                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m   
      kube-controller-manager                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m   
      kube-scheduler                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m   
      kube-storage-version-migrator              4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      machine-api                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m   
      machine-approver                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      machine-config                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h53m   
      marketplace                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m   
      monitoring                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h35m   
      network                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h57m   
      node-tuning                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      openshift-apiserver                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      openshift-controller-manager               4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      openshift-samples                          4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m   
      operator-lifecycle-manager                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m   
      operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m   
      service-ca                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m   
      storage                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h36m   
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get po
      NAME                               READY   STATUS    RESTARTS        AGE
      ingress-operator-c6fd989fd-jsrzv   2/2     Running   4 (3h45m ago)   3h58m
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator logs ingress-operator-c6fd989fd-jsrzv -c ingress-operator --tail=20
      2023-10-17T11:34:54.327Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.348Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.348Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:34:54.394Z    INFO    operator.ingressclass_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.394Z    INFO    operator.route_metrics_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.394Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.397Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.429Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.446Z    INFO    operator.certificate_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.ingressclass_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.route_metrics_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.553Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.557Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "59m59.9999758s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      2023-10-17T11:34:54.558Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.583Z    INFO    operator.status_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:34:54.657Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "59m59.345629987s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      2023-10-17T11:34:54.794Z    INFO    operator.certificate_controller    controller/controller.go:118    Reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:36:11.151Z    INFO    operator.ingress_controller    handler/enqueue_mapped.go:81    queueing ingress    {"name": "default", "related": ""}
      2023-10-17T11:36:11.151Z    INFO    operator.ingress_controller    controller/controller.go:118    reconciling    {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
      2023-10-17T11:36:11.248Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "58m42.755479533s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      
       
      melvinjoseph@mjoseph-mac openshift-tests-private % oc get po -n openshift-ingress
      NAME                              READY   STATUS    RESTARTS      AGE
      router-default-7cf67f448-gb7mr    0/1     Running   1 (71s ago)   3m57s
      router-default-7cf67f448-qmvks    0/1     Running   1 (70s ago)   3m57s
      router-default-7dcd556587-vppk4   1/1     Running   0             3h57m
      
      melvinjoseph@mjoseph-mac openshift-tests-private %   oc -n openshift-ingress logs router-default-7cf67f448-gb7mr --tail=20 
      I1017 11:39:22.623928       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:23.623924       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:24.623373       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:25.627359       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:26.623337       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:27.623603       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:28.623866       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:29.623183       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:30.623475       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:31.623949       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private % 
      melvinjoseph@mjoseph-mac openshift-tests-private %   oc -n openshift-ingress logs router-default-7cf67f448-qmvks --tail=20
      I1017 11:39:34.553475       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:35.551412       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:36.551421       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      E1017 11:39:37.052068       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
      I1017 11:39:37.551648       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:38.551632       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:39.551410       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:40.552620       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:41.552050       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:42.551076       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I1017 11:39:42.564293       1 template.go:828] router "msg"="Shutdown requested, waiting 45s for new connections to cease" 
      
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller 
      NAME      AGE
      default   3h59m
      melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml
      apiVersion: operator.openshift.io/v1
      <-----snip---->
      status:
        availableReplicas: 1
        conditions:
        - lastTransitionTime: "2023-10-17T07:41:42Z"
          reason: Valid
          status: "True"
          type: Admitted
        - lastTransitionTime: "2023-10-17T07:57:01Z"
          message: The deployment has Available status condition set to True
          reason: DeploymentAvailable
          status: "True"
          type: DeploymentAvailable
        - lastTransitionTime: "2023-10-17T07:57:01Z"
          message: Minimum replicas requirement is met
          reason: DeploymentMinimumReplicasMet
          status: "True"
          type: DeploymentReplicasMinAvailable
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: 1/2 of replicas are available
          reason: DeploymentReplicasNotAvailable
          status: "False"
          type: DeploymentReplicasAllAvailable
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: |
            Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
          reason: DeploymentRollingOut
          status: "True"
          type: DeploymentRollingOut
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: The endpoint publishing strategy supports a managed load balancer
          reason: WantedByEndpointPublishingStrategy
          status: "True"
          type: LoadBalancerManaged
        - lastTransitionTime: "2023-10-17T07:57:24Z"
          message: The LoadBalancer service is provisioned
          reason: LoadBalancerProvisioned
          status: "True"
          type: LoadBalancerReady
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: LoadBalancer is not progressing
          reason: LoadBalancerNotProgressing
          status: "False"
          type: LoadBalancerProgressing
        - lastTransitionTime: "2023-10-17T07:41:43Z"
          message: DNS management is supported and zones are specified in the cluster DNS
            config.
          reason: Normal
          status: "True"
          type: DNSManaged
        - lastTransitionTime: "2023-10-17T07:57:26Z"
          message: The record is provisioned in all reported zones.
          reason: NoFailedZones
          status: "True"
          type: DNSReady
        - lastTransitionTime: "2023-10-17T07:57:26Z"
          status: "True"
          type: Available
        - lastTransitionTime: "2023-10-17T11:34:54Z"
          message: |-
            One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
            )
          reason: IngressControllerProgressing
          status: "True"
          type: Progressing
        - lastTransitionTime: "2023-10-17T07:57:28Z"
          status: "False"
          type: Degraded
        - lastTransitionTime: "2023-10-17T07:41:43Z"
      <-----snip---->

       

            [OCPBUGS-21803] Ingress stuck in progressing when maxConnections increased to 2000000

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.15.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:7198


            Hongan Li added a comment -

            shudili@redhat.com it seems there is no accepted 4.15 nightly build so far, but CI build 4.15.0-0.ci-2023-10-18-220257 has the fix. Could you verify this with the CI build, since it is a blocker bug?

            OpenShift Jira Bot added a comment -

            Hi amcdermo@redhat.com,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.

            Andrew McDermott added a comment - - edited

            I did start with this untested patch to haproxy (https://gist.github.com/frobware/2fc5d5f27e639c1f04f7ad868923fd3f), but I changed my approach when I realised that we could potentially fix the issue by adding 'no strict-limits' to the config file. The two linked PRs, and the proposed fix, use a template change:

            https://github.com/openshift/cluster-ingress-operator/pull/983
            https://github.com/openshift/router/pull/527
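
            For reference, a quick way to confirm the template change in a running router pod, assuming the fixed template emits `no strict-limits` in the global section of haproxy.config:

            # Look for the new keyword in the generated HAProxy configuration.
            oc -n openshift-ingress rsh deploy/router-default \
              grep -n 'strict-limits' haproxy.config
            # With the fix in place, the output should include a "no strict-limits" line.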

            Andrew McDermott added a comment -

            calfonso@redhat.com the proposal is to revert and carry a haproxy patch that will turn the fatal error back to a warning, as it was in haproxy-2.2. I will first do the patch for 4.15, then apply to 4.14. ETA for patches? Tentatively tomorrow.

            Andrew McDermott added a comment - - edited

            In 4.14, if you use 2000000 for maxconn we see:

             

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.14.0-0.nightly-2023-10-13-073537
            Kubernetes Version: v1.27.6+98158f9
            
            sh-4.4$ head haproxy.config
            global
              maxconn 2000000
              nbthread 4
            
            sh-4.4$ vi haproxy.config
            sh-4.4$ ../reload-haproxy
            [NOTICE]   (48) : haproxy version is 2.6.13-234aa6d
            [NOTICE]   (48) : path to executable is /usr/sbin/haproxy
            [ALERT]    (48) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000237, limit is 1048576.
            

            whereas in 4.13 we see:

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.13.14
            Kubernetes Version: v1.26.9+52589e6
            
            % oc rsh -n openshift-ingress router-default-85976855d7-fw8h2
            Defaulted container "router" out of: router, logs
            sh-4.4$ head haproxy.config
            global
              maxconn 2000000
              nbthread 1
            
            sh-4.4$ ../reload-haproxy
            [WARNING] 289/172332 (418) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 4000411, limit is 1048576. This will fail in >= v2.3
            [NOTICE] 289/172332 (418) : haproxy version is 2.2.24-26b8015
            [NOTICE] 289/172332 (418) : path to executable is /usr/sbin/haproxy
            [ALERT] 289/172332 (418) : [/usr/sbin/haproxy.main()] FD limit (1048576) too low for maxconn=2000000/maxsock=4000411. Please raise 'ulimit-n' to 4000411 or more to avoid any trouble.This will fail in >= v2.3
             - Checking http://localhost:80 ...
             - Health check ok : 0 retry attempt(s).
            

            The root cause is the bump from haproxy-2.2 => 2.6. Per the log message, this was previously a warning, now fatal. Investigating.


            Miciah Masters added a comment -

            I have marked this as a 4.14.0 blocker because this appears to be a regression that could break clusters where the cluster-admin has set maxConnections to a value larger than 1048576 and then upgrades to 4.14.

            Andrew McDermott added a comment -

            I was also unable to reproduce on 4.13.

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.13.12
            Kubernetes Version: v1.26.7+0ef5eae
            
            % oc get pods -n openshift-ingress
            NAME                                       READY   STATUS    RESTARTS   AGE
            helloworld-1                               1/1     Running   5          21d
            router-default-7454dd97c5-sck87            2/2     Running   0          7m31s
            router-default-7454dd97c5-sg9mz            2/2     Running   0          8m55s
            router-my-custom-router-68657bf69d-vkk4t   1/1     Running   1          7d5h
            [spicy] ~nix-config
            % oc rsh -n openshift-ingress router-default-7454dd97c5-sck87 head haproxy.config
            Defaulted container "router" out of: router, logs
            global
              maxconn 2000000
              nbthread 1
            

             


            Melvin Joseph added a comment -

            This issue is not there in 4.13. I ran automation on 4.13 and it is passing.

            The same automation is failing in 4.14 and 4.15.

            passed: (8m28s) 2023-10-17T14:14:32 "[sig-network-edge] Network_Edge should Author:shudili-NonPreRelease-Longduration-High-50926-Support a Configurable ROUTER_MAX_CONNECTIONS in HAproxy"

            Writing JUnit report to junit_e2e_20231017-141432.xml

            1 pass, 0 skip (8m28s)
            melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.13.0-0.nightly-2023-10-14-180128   True        False         False      5h12m   


            Andrew McDermott added a comment - - edited

            The issue is that the container limit is 1048576.

            sh-4.4$ ulimit -a
            core file size          (blocks, -c) unlimited
            data seg size           (kbytes, -d) unlimited
            scheduling priority             (-e) 0
            file size               (blocks, -f) unlimited
            pending signals                 (-i) 63606
            max locked memory       (kbytes, -l) 8192
            max memory size         (kbytes, -m) unlimited
            open files                      (-n) 1048576
            pipe size            (512 bytes, -p) 8
            POSIX message queues     (bytes, -q) 819200
            real-time priority              (-r) 0
            stack size              (kbytes, -s) 8192
            cpu time               (seconds, -t) unlimited
            max user processes              (-u) 4194304
            virtual memory          (kbytes, -v) unlimited
            file locks                      (-x) unlimited
            

            You would need to use tuned to raise the limit on the node and, in turn, for all containers.

             

            % oc version
            Client Version: 4.13.14
            Kustomize Version: v4.5.7
            Server Version: 4.14.0-0.nightly-2023-10-13-073537
            Kubernetes Version: v1.27.6+98158f9
            


              amcdermo@redhat.com Andrew McDermott
              rhn-support-mjoseph Melvin Joseph
              Shudi Li Shudi Li
              Shudi Li
              Votes: 0
              Watchers: 11

                Created:
                Updated:
                Resolved: