Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 4.11
Affects Version/s: 4.11
Component/s: Networking / router
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
1
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:

4.11
Release Blocker:
None
Sprint:
Sprint 217
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Manually mirroring for backport from https://bugzilla.redhat.com/show_bug.cgi?id=2076297
Description of problem:
For a brief window while the openshift-router binary is starting up, it ignores shutdown signals (SIGTERM). A SIGTERM delivered during that window is lost, so the process never shuts down on its own.

This becomes a larger issue when Kubernetes sends a graceful shutdown while the router is starting up: it then waits for the full terminationGracePeriodSeconds specified in the router deployment, which is 1 hour.

This becomes even more of an issue with
https://github.com/openshift/cluster-ingress-operator/pull/724
which makes the ingress controller wait for all pods before deleting itself. So if these pods are stuck in Terminating for an hour, then the ingress controller will be stuck in Terminating for an hour.

OpenShift release version:

Cluster Platform:

How reproducible:
Always, if the router pod is stopped shortly after it starts: the pod then gets stuck in an hour-long Terminating state.

Steps to Reproduce (in detail):
1. Create a YAML file with the following content:

apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: loadbalancer
    namespace: openshift-ingress-operator
  spec:
    replicas: 1
    routeSelector:
      matchLabels:
        type: loadbalancer
    endpointPublishingStrategy:
      type: LoadBalancerService
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

2. Run the following command:

oc apply -f <YAML_FILE>.yaml && while ! oc get pod -n openshift-ingress | grep -q router-loadbalancer; do echo "Waiting"; done; oc delete pod -n openshift-ingress $(oc get pod -n openshift-ingress --no-headers | grep router-loadbalancer | awk '{print $1}');

The test fails if the command hangs for more than 45 seconds. After the pod deletion has been issued, you can press Ctrl-C and run "oc get pods -n openshift-ingress" to confirm that the pod is stuck in the Terminating state with an AGE longer than 45 seconds.

The pod will take 1 hour to terminate, but you can always clean up by force deleting it.

Actual results:
Pod takes 1 hour to be deleted.

Expected results:
Pod should be deleted in about 45 seconds.

Impact of the problem:
Router pods hang in Terminating for 1 hour, which degrades the user experience.

Additional info:

blocks

OCPBUGS-1619 Bug 2076297 - Router process ignores shutdown signal while starting up

Closed

is cloned by

OCPBUGS-1619 Bug 2076297 - Router process ignores shutdown signal while starting up

Closed

Assignee: Grant Spence
Reporter: Grant Spence
Need Info From: None
Contributors: None
QA Contact: Hongan Li
Doc Contact: None
Votes: 0
Watchers: 1
Created: 2022/09/21 9:41 PM
Updated: 2025/07/29 5:45 AM
Resolved: 2022/09/21 9:42 PM
