OpenShift Bugs / OCPBUGS-3778

[4.10] ovn-k network policy races


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version/s: 4.12, 4.11, 4.10
    • Sprint: SDN Sprint 228, SDN Sprint 229
    • Story Points: 2

    Description

      Description of problem:

      Network policy code has several problems, most of them races, so they can be difficult to reproduce and verify. Here is the list:
      
      1. All kinds of failures when adding/deleting a port to/from the default deny port group, possible symptoms:
        - a port should’ve been added to the default deny port group, but wasn’t: connections that should’ve been dropped are allowed
        - a port should’ve been deleted from the default deny port group, but wasn’t: connections that should be allowed are dropped
        - db ops failures when an attempt to add/delete a port to/from the default deny port group fails, e.g. because the operation was already done
      2. The default deny port group is overwritten when two network policies are created in a namespace at the same time. This can leave ports out of the default deny port group, so connections that should be denied are allowed.
      3. An error when getting a local pod from the cache is not handled, possible symptoms:
        - "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
        - the pod is not added to netpol port groups, so the network policy is not applied to it
      4. A deleted namespace is re-created via ensureNamespaceLocked, symptom:
        - the namespace was deleted, but its address set is still present in the db
      5. A policy ACL loglevel update isn’t applied, possible symptom:
        - the netpol ACL log level isn’t set/updated to the namespace loglevel
      6. Netpol cleanup failures, symptoms:
        - the network policy fails to be deleted, something is still left in the db, with error messages like
        - "failed to destroy network policy"
        - "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
      7. Concurrent writes to a sets.String, which panic, so this one is hard to miss.
      8. The network policy handler retries after the network policy was deleted; you should see failures saying that some network-policy-related object is nil or doesn’t exist, e.g.
        - "peer AddressSet is nil, cannot add <object>"
      9. Host-network and completed pods selected by a network policy can produce error logs, with no real harm:
        - "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
      10. Namespace pod handlers are never stopped, which can affect memory usage and look like a memory leak.
      11. Adding a local pod fails because the netpol port group is not committed to the db yet; the error looks like:
        - "Failed to create *factory.localPodSelector <name>, error: object not found"
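      The overwrite in item 2 is a classic lost-update race, and item 7 is its unsynchronized cousin. A minimal sketch of both, in Python for brevity (ovn-kubernetes itself is Go, and every name below is invented for illustration):

```python
import threading

# Lost update (item 2): two policy handlers read the same snapshot of the
# default deny port group, each adds its own port, and the later write-back
# clobbers the earlier one.
port_group = {"ports": set()}  # stands in for the OVN default deny port group

def racy_add(snapshot_ports, new_port):
    """Handler that writes back a full snapshot (read-modify-write)."""
    return {"ports": snapshot_ports | {new_port}}

# Both handlers read the same (empty) snapshot before either writes:
snap_a = set(port_group["ports"])
snap_b = set(port_group["ports"])
port_group = racy_add(snap_a, "pod-a-port")
port_group = racy_add(snap_b, "pod-b-port")   # overwrites pod-a-port!
assert port_group["ports"] == {"pod-b-port"}  # pod-a-port was lost

# The fix for both races: mutate the shared structure under a lock
# instead of writing back snapshots (or writing without synchronization).
lock = threading.Lock()
fixed_group = set()

def safe_add(port):
    with lock:
        fixed_group.add(port)

safe_add("pod-a-port")
safe_add("pod-b-port")
assert fixed_group == {"pod-a-port", "pod-b-port"}
```

In the real Go code the unguarded structure is a sets.String, where concurrent writes crash the process rather than silently losing data, which is why item 7 is described as impossible to miss.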
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      Example 1
      1. Create a network policy with an ingress or egress selector that applies to a namespace labeled project: myproject
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: test-network-policy
        namespace: test
      spec:
        podSelector: {}
        policyTypes:
          - Ingress
        ingress:
          - from:
              - namespaceSelector:
                  matchLabels:
                    project: myproject
      
      2. Delete the network policy (e.g. with oc delete) and create a pod in a namespace labeled project: myproject at the same time
      3. Check the ovnkube-master logs for "peer AddressSet is nil, cannot add peer pod(s)"; this retries with the same error 15 times
      4. This may not work on the first try, since a specific ordering of the network policy delete and pod add handling must be hit
      5. With the new version, no error messages should be present
      
      Example 2
      1. create network policy that applies to a namespace test
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: test-network-policy
        namespace: test
      spec:
        podSelector: {}
        policyTypes:
          - Ingress
        ingress:
      2. Create a host-network pod in namespace test
      3. Check for 15 logs saying "Failed to get LSP for pod %s/%s for networkPolicy %s refetching err: "
      4. Check the final log "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy"
      5. With the new version, no error messages should be present
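      The retry behavior in steps 3 and 4 can be modeled in a few lines. This is an illustrative Python sketch of the log pattern only, not the real handler code; the names and the retry constant are assumptions based on the messages above:

```python
MAX_RETRIES = 15  # assumed from "this should retry ... 15 times" above
logs = []

def get_lsp(pod):
    # Host-network pods never get a logical switch port, so every
    # refetch attempt fails.
    return None

def apply_netpol_to_pod(pod, policy):
    """Retry fetching the LSP, logging each failure, then give up."""
    for _ in range(MAX_RETRIES):
        if get_lsp(pod) is not None:
            return True
        logs.append(f"Failed to get LSP for pod {pod} for networkPolicy {policy} refetching err: ")
    logs.append(f"Failed to get LSP after multiple retries for pod {pod} for networkPolicy")
    return False

apply_netpol_to_pod("test/host-pod", "test-network-policy")
assert len(logs) == 16  # 15 retry logs plus the final failure log
```

The fix described above skips pods that will never have an LSP (host-network and completed pods), so neither message is emitted.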
      
      All the other cases are difficult to reproduce; running the standard network policy tests and making sure everything works should be good enough verification.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Only some parts were backported to 4.10 due to significant differences between the releases.
      The problems that are fixed, plus performance improvements:
      1. Don't retry unscheduled pods; wait for the update event instead.
      2. Clean up the pod handler for the namespaceAndPod handler on the namespace delete event.
      3. Only update localPods after a successful db transaction, and return fast from localPod handlers based on `np.localPods`.
      4. Don't retry fetching the lsp from the lspCache, which "stops" the handler for 1 second.
      5. Use the stored portUUID to delete local pods instead of getting that info from the lspCache.
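      Fix 3 is a commit-then-cache pattern: the in-memory localPods map is updated only after the db transaction succeeds, so a failed commit leaves no stale entry, and a repeated event for an already-handled pod returns fast without touching the db. A minimal Python sketch under those assumptions (all names invented, not ovn-kubernetes code):

```python
local_pods = {}  # port name -> port UUID; stands in for np.localPods

def db_commit(ops):
    """Stand-in for an OVSDB transaction; raises to simulate failure."""
    if ops.get("fail"):
        raise RuntimeError("db transaction failed")
    return "uuid-" + ops["port"]

def add_local_pod(port, ops):
    if port in local_pods:      # return fast: pod already handled
        return local_pods[port]
    uuid = db_commit(ops)       # commit first...
    local_pods[port] = uuid     # ...then record, so a failed commit
    return uuid                 # leaves no stale cache entry behind

add_local_pod("pod-a", {"port": "pod-a"})
try:
    add_local_pod("pod-b", {"port": "pod-b", "fail": True})
except RuntimeError:
    pass
# pod-b was never cached, so a later retry will redo the transaction:
assert local_pods == {"pod-a": "uuid-pod-a"}
# pod-a short-circuits without hitting the db at all:
assert add_local_pod("pod-a", {"port": "pod-a", "fail": True}) == "uuid-pod-a"
```

Updating the cache before the commit (the old order) would leave an entry for pod-b that no db row backs, which is exactly the stale-state class of bug listed in the description.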

      Attachments

        Issue Links

          Activity

            People

              Assignee: Nadia Pinaeva (npinaeva@redhat.com)
              Reporter: Nadia Pinaeva (npinaeva@redhat.com)
              QA Contact: Anurag Saxena
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: