Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.16.0
Affects Version/s: 4.12
Component/s: Networking / ovn-kubernetes
Labels:
- OVN-Kubernetes

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:

4.12.z
Target Version:

4.16.0
Release Blocker:
None
Sprint:
SDN Sprint 253
sprint_count:
1

Customer Impact:

Customer Escalated

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Test Coverage:

+

PX Priority Data:
PX Impact Score:

Release Note Status:
In Progress
Release Note Type:
Release Note Not Required
Release Note Text:
None

Escape Reason:
Escape Impact:
Corrective Measures:
SDLC stage when should've been found:

After the cert expires, ovnkube-master starts to log the error below. At our gRPC log level the underlying cause is not shown; it is actually an x509 error. I also confirmed that a TCP connection was successfully established and torn down in <100ms. See https://github.com/grpc/grpc-go/issues/2561

egressip_healthcheck.go:162] Could not connect to $hostname ($ip:9107):context deadline exceeded
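
To illustrate why the log line above hides the real cause: a probe that reports only its deadline loses the TLS error that made each attempt fail. The sketch below is not the ovn-kubernetes code. `probe`, `ProbeResult`, and the injected `dial` callable are hypothetical names, and Python's `ssl.SSLCertVerificationError` stands in for the x509 error that the gRPC client would hit.

```python
import ssl
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool
    cause: str                     # "ok", "x509", or "timeout"
    detail: Optional[str] = None


def probe(dial: Callable[[], None], timeout_s: float,
          retry_delay_s: float = 0.01) -> ProbeResult:
    """Retry dial() until it succeeds or the deadline passes.

    Hypothetical helper: it remembers the last certificate error so the
    final report names it instead of only saying the deadline expired.
    """
    deadline = time.monotonic() + timeout_s
    last_cert_error: Optional[BaseException] = None
    while True:
        try:
            dial()
            return ProbeResult(ok=True, cause="ok")
        except ssl.SSLCertVerificationError as exc:
            # TCP connected, but TLS certificate verification failed.
            last_cert_error = exc
        except OSError:
            # Any other connection error: keep retrying until the deadline.
            pass
        if time.monotonic() >= deadline:
            break
        time.sleep(retry_delay_s)
    if last_cert_error is not None:
        return ProbeResult(ok=False, cause="x509", detail=repr(last_cert_error))
    return ProbeResult(ok=False, cause="timeout", detail="deadline exceeded")
```

With a reporting path like this, an expired cert would show up as an `x509` cause in the health-check log rather than as a bare timeout.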

Version-Release number of selected component (if applicable):

How reproducible:

Rotate the openshift-ovn-kubernetes/ovn-cert and wait for the old cert to expire.

Steps to Reproduce:

    1. Wait 10% of the cert validity (in days) after ovn-cert rotation, with no pod restarts.
    2. 18 days after cert rotation (10% of validity), egressIPs will be removed from all nodes.
    3. All nodes will start to fail egress health probes with "context deadline exceeded".
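
The timing in the steps above follows from simple arithmetic. A minimal sketch, assuming the old cert stays valid for 10% of the validity period after rotation (as step 2 states) and assuming a 180-day validity, which is only inferred from "18 days = 10%" and is not stated in this report. `old_cert_expiry` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def old_cert_expiry(rotated_at: datetime, validity_days: int,
                    overlap_percent: int = 10) -> datetime:
    """Return when the pre-rotation cert expires.

    Assumes rotation happens overlap_percent of the validity period
    before expiry, so probes still using the old cert break at this time.
    """
    return rotated_at + timedelta(days=validity_days) * overlap_percent / 100


# Example: rotation on 2024-01-01 with an assumed 180-day validity.
rotated = datetime(2024, 1, 1)
print(old_cert_expiry(rotated, 180))  # 2024-01-19 00:00:00
```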

Actual results:

Silent failure of egressIP healthchecks

Expected results:

    No noticeable impact; automatic loading of the new cert or restart of the pod.
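
One way to get "automatic loading of the new cert" without a pod restart is to re-read the cert file whenever it changes on disk. A minimal sketch, assuming the cert is mounted as a file and the client can ask for the current material before each connection. `ReloadingCert` is a hypothetical name, and this is not necessarily how the linked fix works.

```python
import os
from typing import Optional, Tuple


class ReloadingCert:
    """Cache a cert file's bytes and re-read them when the file changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._stamp: Optional[Tuple[int, int]] = None
        self._data: bytes = b""

    def current(self) -> bytes:
        st = os.stat(self._path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            # First call, or the file was rewritten since the last read.
            with open(self._path, "rb") as f:
                self._data = f.read()
            self._stamp = stamp
        return self._data
```

A long-lived client would call `current()` when building each new connection, so a rotated cert is picked up on the next dial.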

Additional info:

    Whoever rotates the cert should restart the daemonsets, and we should also log x509 issues from the gRPC library.

blocks

OCPBUGS-33619 EgressIP Healthcheck silently breaks 18 days after ovn-cert rotation

Closed

is cloned by

OCPBUGS-33619 EgressIP Healthcheck silently breaks 18 days after ovn-cert rotation

Closed

is triggering

CORENET-969 Corrective Measure for OCPBUGS-32203: EgressIP Healthcheck silently breaks 18 days after ovn-cert rotation

Closed

links to

https://github.com/openshift/ovn-kubernetes/pull/2162

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Upstream fix

(1 links to)

Assignee: Patryk Diak

Reporter: Tim Dawson

Need Info From: None

Contributors: None

QA Contact: Jean Chen

Doc Contact: None

Votes: 3

Watchers: 16

Created: 2024/04/14 8:02 AM

Updated: 2025/09/13 7:30 PM

Resolved: 2024/06/27 11:46 AM
