[OCPBUGS-8741] [4.13] Pods in same deployment will have different ability to query services in same namespace from one another; ocp 4.10

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.10
Component/s: Networking / ovn-kubernetes
Labels: None
Severity: Important
Regression: No
Sprint: CNF Network Sprint 233
sprint_count: 1
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
Target Version: 4.13.0

This is a clone of issue ~~OCPBUGS-5889~~. The following is the description of the original issue:
—
Description of problem:

The customer is running a cluster with the following configuration:
4.10.23
AWS/IPI
OVNKubernetes

Observed that in a namespace with networkpolicy rules enabled and an allow-from-same-namespace policy in place, pods behave differently from one another when calling service IPs hosted in that same namespace.

Example:
Deployment1 with two pods (A and B) exists in namespace <EXAMPLE>.
Deployment2, with one pod hosting a service and a route, exists in the same namespace.
Pod A unexpectedly loses the ability to call the service IP of Deployment2; Pod B never loses access to the service IP of Deployment2.

Pod A remains able to call out through the br-ex interface, hit the route address, and reach the Deployment2 pod via haproxy (this never breaks).

Pod A remains able to reach the local gateway on the node.

The host node for Pod A can reach the service IP of Deployment2 and remains able to do so, even while Pod A is impacted.

The issue can be mitigated by applying a label or annotation to Pod A, which immediately allows it to reach internal service IPs within the namespace again.

I suspect the issue is that the networkpolicy rules fail to stay updated on the pod object, and the pod needs to be 'refreshed' (by adding a label or making some other update) to force it to 'remember' that it is allowed to call peers within the namespace.
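
The mitigation described above can be sketched as a small helper that builds an `oc annotate` command for the affected pod. This is only an illustration: the annotation key `example.com/refreshed-at`, the namespace `example`, and the pod name `pod-a` are hypothetical placeholders, and the report does not say which label or annotation the customer actually applied.

```python
import shlex
from datetime import datetime, timezone


def build_annotate_command(namespace, pod, key="example.com/refreshed-at"):
    """Build an `oc annotate` argv that touches the pod object.

    The key is a hypothetical placeholder; per the report, any label or
    annotation update appears to be enough to restore connectivity.
    """
    # A timestamp value makes every invocation a real change to the pod.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return [
        "oc", "annotate", "pod", pod,
        "-n", namespace,
        f"{key}={stamp}",
        "--overwrite",  # allow re-running against an already annotated pod
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_annotate_command("example", "pod-a")))
```

The command is printed rather than executed so that it can be checked before being run against a live cluster.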

Additional relevant data:
- pods are affected throughout the cluster; the problem is not specific to any project/service/deployment/application
- affected pods run on different nodes (no single node is affected)
- pods in the failed state share nodes with other pods that show no issue
- multiple namespaces see this problem
- all namespaces use similar networkpolicy isolation and an allow-from-same-namespace ruleset (which matches our documentation on syntax).
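
For reference, an allow-from-same-namespace ruleset of the kind mentioned above can be sketched as follows. The customer's actual policies are not included in this report, so this is an assumed, typical form: the policy name `allow-same-namespace` and the namespace `example` are placeholders. The script emits JSON, which `oc apply -f -` accepts.

```python
import json


def allow_same_namespace_policy(namespace):
    """Return a NetworkPolicy manifest allowing ingress from same-namespace pods.

    This is an assumed typical shape, not the customer's actual policy.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A podSelector peer without a namespaceSelector matches only
            # pods in the policy's own namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(allow_same_namespace_policy("example"), indent=2))
```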

Version-Release number of selected component (if applicable):

4.10.23

How reproducible:

Every time, although the trigger is unclear: pods are functional and then, several hours or days later, stop being able to talk to peer services.

Steps to Reproduce:

1. deploy a deployment with at least two replicas in a namespace with an allow-from-same-namespace network policy
2. deploy a different service and route (for example, an httpd instance) in the same namespace
3. observe that one of the two pods may fail to reach the service IP after some time
4. apply an annotation to the affected pod; it is immediately able to reach services again.
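
Step 3 can be checked with a simple TCP connect probe run from inside each replica. The sketch below assumes the pod image ships Python 3 and that the service listens on a TCP port; the service IP and port are supplied by whoever runs it and are not taken from this report.

```python
import socket
import sys


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python probe.py <service-ip> <port>
    ok = can_reach(sys.argv[1], int(sys.argv[2]))
    print("reachable" if ok else "unreachable")
```

Running the same probe from both replicas against the same service IP should show the asymmetry described in this report: one pod reports reachable while the other does not.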

Actual results:

Pods intermittently fail to reach internal service addresses, but otherwise remain responsive and can still reach upstream/external addresses, including routes on the cluster.

Expected results:

pods should not lose access to service network peers.

Additional info:

see next comments for relevant uploads/sosreports and inspects.

Issue Links:
blocks: OCPBUGS-10314 [4.12] Pods in same deployment will have different ability to query services in same namespace from one another; ocp 4.10 (Closed)
clones: OCPBUGS-5889 Pods in same deployment will have different ability to query services in same namespace from one another; ocp 4.10 (Closed)
is blocked by: OCPBUGS-5889 Pods in same deployment will have different ability to query services in same namespace from one another; ocp 4.10 (Closed)
is cloned by: OCPBUGS-10314 [4.12] Pods in same deployment will have different ability to query services in same namespace from one another; ocp 4.10 (Closed)
links to: openshift/ovn-kubernetes#1567: OCPBUGS-8741: [release-4.13] Handle Completed pods deletion

Assignee: Andrea Panattoni
Reporter: OpenShift Prow Bot
QA Contact: Arti Sood
Votes: 0
Watchers: 8
Created: 2023/03/08 5:54 PM
Updated: 2023/05/17 10:34 PM
Resolved: 2023/05/17 10:34 PM
