Loading...

XML

Word

Printable

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.10
Component/s: Networking / ovn-kubernetes
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:

4.13.0
Release Blocker:
Rejected
Sprint:
SDN Sprint 226, SDN Sprint 227
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Test Coverage:

-

PX Priority Data:
PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Liveness probe of ipsec pods fail with large clusters. Currently the command that is executed in the ipsec container is
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with command "ipsec/status". In clusters with high node count this command will return a list with all the node daemons of the cluster. This means that as the node count raises the completion time of the command raises too.

This makes the main command

ovs-appctl -t ovs-monitor-ipsec

To hang until the subcommand is finished.

As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container herehttps//github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml] the liveness timeout of the container probe of 60 seconds start to be insufficient as the node count list is growing. This resulted in a cluster with 170 + nodes to have 15+ ipsec pods in a crashloopbackoff state.

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.10 but i think the same will be visible to other versions too.

How reproducible:

I was not able to reproduce due to an extreamely high amount of resources are needed and i think that there is no point as we have spotted the issue.

Steps to Reproduce:

1. Install an Openshift cluster with IPSEC enabled
2. Scale to 170+ nodes or more
3. Notice that the ipsec pods will start getting in a Crashloopbackoff state with failed Liveness/Readiness probes.

Actual results:

Ip Sec pods are stuck in a Crashloopbackoff state

Expected results:

Ip Sec pods to work normally

Additional info:

We have provided a workaround where CVO and CNO operators are scaled to 0 replicas in order for us to be able to increase the liveness probe limit to a value of 600 that recovered the cluster. 
As a next step the customer will try to reduce the node count and restore the default liveness timeout value along with bringing the operators back to see if the cluster will stabilize.

blocks

OCPBUGS-3823 Ipsec pods restart due to liveness probes fail in cluster with more than 150 +

Closed

OCPBUGS-3824 [4.12] Ipsec pods restart due to liveness probes fail in cluster with more than 150 +

Closed

is cloned by

OCPBUGS-3823 Ipsec pods restart due to liveness probes fail in cluster with more than 150 +

Closed

OCPBUGS-3824 [4.12] Ipsec pods restart due to liveness probes fail in cluster with more than 150 +

Closed

CORENET-2481 Add taint mechanism: Ipsec pods restart due to liveness probes fail in cluster with more than 150 +

Closed

links to

BZ Clone

openshift/cluster-network-operator#1606: OCPBUGS-2598: ipsec: Run ovs-monitor-ipsec in the foreground and change probes

(2 links to)

Assignee:: Andreas Karis

Reporter:: Nikolaos Stamatelopoulos

Need Info From:: None

Contributors:: None

QA Contact:: Anurag Saxena

Doc Contact:: None

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Created:: 2022/10/19 2:41 PM

Updated:: 2025/09/13 6:18 AM

Resolved:: 2023/01/22 7:05 PM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates