Type: Bug
Resolution: Done-Errata
Priority: Critical
Affects Version: 4.12
Severity: Important
Sprints: SDN Sprint 246, SDN Sprint 247, SDN Sprint 248
Customer Escalated
Description of problem:
Users are experiencing an issue with NodePort traffic forwarding: TCP traffic continues to be directed to pods that are in the terminating state, so new connections cannot be established successfully. According to the customer, this is causing connection disruptions in business transactions.
Version-Release number of selected component (if applicable):
OpenShift 4.12.13 with RHEL 8.6 workers and OVN-Kubernetes.
How reproducible:
Here is the relevant code:
https://github.com/openshift/ovn-kubernetes/blob/dd3c7ed8c1f41873168d3df26084ecbfd3d9a36b/go-controller/pkg/util/kube.go#L360
—
func IsEndpointServing(endpoint discovery.Endpoint) bool {
	if endpoint.Conditions.Serving != nil {
		return *endpoint.Conditions.Serving
	} else {
		return IsEndpointReady(endpoint)
	}
}
// IsEndpointValid takes as input an endpoint from an endpoint slice and a boolean that indicates whether to include
// all terminating endpoints, as per the PublishNotReadyAddresses feature in kubernetes service spec. It always returns true
// if includeTerminating is true and falls back to IsEndpointServing otherwise.
func IsEndpointValid(endpoint discovery.Endpoint, includeTerminating bool) bool {
	return includeTerminating || IsEndpointServing(endpoint)
}
—
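A minimal, self-contained sketch (not the upstream code; `discovery.EndpointConditions` is stubbed with a local struct) shows how these helpers treat the customer's terminating endpoint:

```go
package main

import "fmt"

// Local stand-in for discovery.EndpointConditions (the real type lives in
// k8s.io/api/discovery/v1); this is a sketch, not the upstream source.
type endpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

func ptr(b bool) *bool { return &b }

// Mirrors upstream IsEndpointReady: a nil Ready is treated as ready.
func isEndpointReady(c endpointConditions) bool {
	return c.Ready == nil || *c.Ready
}

// Mirrors upstream IsEndpointServing: prefer Serving, fall back to Ready.
func isEndpointServing(c endpointConditions) bool {
	if c.Serving != nil {
		return *c.Serving
	}
	return isEndpointReady(c)
}

// Mirrors the documented contract of IsEndpointValid.
func isEndpointValid(c endpointConditions, includeTerminating bool) bool {
	return includeTerminating || isEndpointServing(c)
}

func main() {
	// The terminating endpoint from the customer's EndpointSlice:
	// ready=false, serving=true, terminating=true.
	c := endpointConditions{Ready: ptr(false), Serving: ptr(true), Terminating: ptr(true)}
	fmt.Println(isEndpointValid(c, false)) // true: still treated as a valid backend
}
```

Because the filter consults serving rather than ready, a terminating pod that still passes its readiness probe remains a valid backend.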
It looks like the 'IsEndpointValid' function returns endpoints with serving=true; it does not check for ready=true. The code in this section was changed recently (Ready=true was replaced with Serving=true):
[Check the "Serving" field for endpoints]
https://github.com/openshift/ovn-kubernetes/commit/aceef010daf0697fe81dba91a39ed0fdb6563dea#diff-daf9de695e0ff81f9173caf83cb88efa138e92a9b35439bd7044aa012ff931c0
https://github.com/openshift/ovn-kubernetes/blob/release-4.12/go-controller/pkg/util/kube.go#L326-L386
—
out.Port = *port.Port
for _, endpoint := range slice.Endpoints {
	// Skip endpoint if it's not valid
	if !IsEndpointValid(endpoint, includeTerminating) {
		continue
	}
	for _, ip := range endpoint.Addresses {
		klog.V(4).Infof("Adding slice %s endpoint: %v, port: %d", slice.Name, endpoint.Addresses, *port.Port)
		ipStr := utilnet.ParseIPSloppy(ip).String()
		switch slice.AddressType {
		...
		}
	}
}
—
Steps to Reproduce:
Here are the customer's sample pods for reference:
mbgateway-st-8576f6f6f8-5jc75 1/1 Running 0 104m 172.30.195.124 appn01-100.app.paas.example.com <none> <none>
mbgateway-st-8576f6f6f8-q8j6k 1/1 Running 0 5m51s 172.31.2.97 appn01-202.app.paas.example.com <none> <none>
pod yaml:
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 40
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: 9190
timeoutSeconds: 5
name: mbgateway-st
ports:
- containerPort: 9190
protocol: TCP
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 40
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: 9190
timeoutSeconds: 5
resources:
limits:
cpu: "2"
ephemeral-storage: 10Gi
memory: 2G
requests:
cpu: 50m
ephemeral-storage: 100Mi
memory: 1111M
When pod mbgateway-st-8576f6f6f8-5jc75 is deleted, check the EndpointSlice status:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
- 172.30.195.124
conditions:
ready: false
serving: true
terminating: true
nodeName: appn01-100.app.paas.example.com
targetRef:
kind: Pod
name: mbgateway-st-8576f6f6f8-5jc75
namespace: lb59-10-st-unigateway
uid: 5e8a375d-ba56-4894-8034-0009d0ab8ebe
  zone: AZ61QEBIZ_AZ61QEM02_FD3
- addresses:
- 172.31.2.97
conditions:
ready: true
serving: true
terminating: false
nodeName: appn01-202.app.paas.example.com
targetRef:
kind: Pod
name: mbgateway-st-8576f6f6f8-q8j6k
namespace: lb59-10-st-unigateway
uid: 5bd195b7-e342-4b34-b165-12988a48e445
zone: AZ61QEBIZ_AZ61QEM02_FD1
After waiting a moment, check the OVN service load balancer: the endpoint information has not been updated to the latest state.
9349d703-1f28-41fe-b505-282e8abf4c40 Service_lb59-10- tcp 172.35.0.185:31693 172.30.195.124:9190,172.31.2.97:9190
dca65745-fac4-4e73-b412-2c7530cf4a91 Service_lb59-10- tcp 172.35.0.170:31693 172.30.195.124:9190,172.31.2.97:9190
a5a65766-b0f2-4ac6-8f7c-cdebeea303e3 Service_lb59-10- tcp 172.35.0.89:31693 172.30.195.124:9190,172.31.2.97:9190
a36517c5-ecaa-4a41-b686-37c202478b98 Service_lb59-10- tcp 172.35.0.213:31693 172.30.195.124:9190,172.31.2.97:9190
16d997d1-27f0-41a3-8a9f-c63c8872d7b8 Service_lb59-10- tcp 172.35.0.92:31693 172.30.195.124:9190,172.31.2.97:9190
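Applying the same serving-based filter to the two endpoints in the EndpointSlice above reproduces the stale backend list (a hypothetical simulation with local types, not ovn-kubernetes code):

```go
package main

import "fmt"

// Simplified endpoint record carrying the EndpointSlice conditions.
type endpoint struct {
	addr        string
	ready       bool
	serving     bool
	terminating bool
}

// backends keeps every endpoint the serving-based filter considers valid,
// mirroring the IsEndpointValid check in the loop quoted earlier.
func backends(eps []endpoint, includeTerminating bool) []string {
	var out []string
	for _, ep := range eps {
		if includeTerminating || ep.serving {
			out = append(out, ep.addr+":9190")
		}
	}
	return out
}

func main() {
	// The two endpoints from the customer's EndpointSlice dump.
	eps := []endpoint{
		{addr: "172.30.195.124", ready: false, serving: true, terminating: true},
		{addr: "172.31.2.97", ready: true, serving: true, terminating: false},
	}
	fmt.Println(backends(eps, false))
	// [172.30.195.124:9190 172.31.2.97:9190] — terminating pod is still a backend
}
```

This matches the OVN load-balancer rows above, which still list 172.30.195.124:9190 alongside the healthy endpoint.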
Wait a little longer and check the EndpointSlice again:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
- 172.30.195.124
conditions:
ready: false
serving: true
terminating: true
nodeName: appn01-100.app.paas.example.com
targetRef:
kind: Pod
name: mbgateway-st-8576f6f6f8-5jc75
namespace: lb59-10-st-unigateway
uid: 5e8a375d-ba56-4894-8034-0009d0ab8ebe
  zone: AZ61QEBIZ_AZ61QEM02_FD3
- addresses:
- 172.31.2.97
conditions:
ready: true
serving: true
terminating: false
nodeName: appn01-202.app.paas.example.com
targetRef:
kind: Pod
name: mbgateway-st-8576f6f6f8-q8j6k
namespace: lb59-10-st-unigateway
uid: 5bd195b7-e342-4b34-b165-12988a48e445
  zone: AZ61QEBIZ_AZ61QEM02_FD1
- addresses:
- 172.30.132.78
conditions:
ready: false
serving: false
terminating: false
nodeName: appn01-089.app.paas.example.com
targetRef:
kind: Pod
name: mbgateway-st-8576f6f6f8-8lp4s
namespace: lb59-10-st-unigateway
uid: 755cbd49-792b-4527-b96a-087be2178e9d
zone: AZ61QEBIZ_AZ61QEM02_FD3
Check the OVN service LB: the deleted pod's endpoint information is still present:
fceeaf8f-e747-4290-864c-ba93fb565a8a Service_lb59-10- tcp 172.35.0.56:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
bef42efd-26db-4df3-b99d-370791988053 Service_lb59-10- tcp 172.35.1.26:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
84172e2c-081c-496a-afec-25ebcb83cc60 Service_lb59-10- tcp 172.35.0.118:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
34412ddd-ab5c-4b6b-95a3-6e718dd20a4f Service_lb59-10- tcp 172.35.1.14:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
Actual results:
Service LB backend membership is determined by the endpoint's serving condition.
Expected results:
Service LB backend membership should be determined by the endpoint's ready condition.
Additional info:
ovn-controller decides whether an endpoint should be added to the service load balancer (service LB) based on condition.serving. The problem is that when a pod enters the terminating state, condition.serving remains true even though condition.ready has gone false. When the pod is then deleted, the EndpointSlice condition.serving stays unchanged, so the service LB backend pool still includes the deleted pod's IP. Why doesn't ovn-controller use condition.ready to decide whether a pod's IP should be added to the service LB backend pool?
Could the shift-networking experts confirm whether this is an OpenShift OVN service LB bug?
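For illustration, a ready-based filter (the behavior the customer expects, sketched here with hypothetical local types rather than the real ovn-kubernetes API) would drop the terminating pod from the backend pool immediately:

```go
package main

import "fmt"

// Simplified endpoint record carrying the EndpointSlice conditions.
type endpoint struct {
	addr    string
	ready   bool
	serving bool
}

// readyBackends keeps only endpoints with ready=true, which is the
// selection rule the customer is asking for.
func readyBackends(eps []endpoint) []string {
	var out []string
	for _, ep := range eps {
		if ep.ready {
			out = append(out, ep.addr+":9190")
		}
	}
	return out
}

func main() {
	eps := []endpoint{
		{addr: "172.30.195.124", ready: false, serving: true}, // terminating pod
		{addr: "172.31.2.97", ready: true, serving: true},
	}
	fmt.Println(readyBackends(eps)) // [172.31.2.97:9190]
}
```

Note the trade-off: filtering on ready also removes terminating-but-serving endpoints that upstream Kubernetes deliberately keeps as a graceful-termination fallback, which is presumably why the code was changed to check serving.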
is depended on by:
- OCPBUGS-27852 ovnkube-controller bug: ovn service lb still has the endpoint when pod is in terminating state (Closed)
links to:
- RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update