The network policy code has several problems. Most of them are races, so they can be difficult to reproduce and verify. Here is the list:
1. Various failures when adding a port to, or deleting it from, the default deny port group. Possible symptoms:
- a port should have been added to the default deny port group, but wasn't: connections that should have been dropped are allowed
- a port should have been deleted from the default deny port group, but wasn't: connections that should be allowed are dropped
- db ops failures when an attempt to add/delete a port to/from the default deny port group fails, e.g. because the operation was already done
2. The default deny port group is overwritten when 2 network policies are created in a namespace at the same time. This can lead to ports not being added to the default deny port group, so connections that should be denied are allowed.
3. Error handling when getting a local pod from the cache fails. Possible symptoms:
- "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
- the pod is not added to the netpol port groups, so the network policy is not applied
4. A deleted namespace is re-created via ensureNamespaceLocked. Symptoms:
- the namespace was deleted, but its address set is still present in the db
5. Policy ACL log level update isn't applied. Possible symptoms:
- the netpol ACL log level isn't set/updated to the namespace log level
6. Netpol cleanup failures. Symptoms:
- the network policy failed to be deleted and something is still left in the db, with error messages like:
- "failed to destroy network policy"
- "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
7. Concurrent write to sets.String. This will panic, so you won't miss it.
8. Retry of the network policy handler after the network policy was deleted. You should see failures saying that some network-policy-related object is nil or doesn't exist, e.g.
- "peer AddressSet is nil, cannot add <object>"
9. Host-network pods and completed pods selected by a network policy can produce error logs, with no real harm:
- "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
10. Namespace pod handlers are never stopped. This can affect memory usage and look like a memory leak.
11. Adding a local pod fails because the netpol port group is not committed to the db yet. The error looks like:
- "Failed to create *factory.localPodSelector <name>, error: object not found"
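
To illustrate item 7: sets.String is a plain Go map underneath, and the Go runtime aborts when it detects unsynchronized concurrent map writes. The sketch below shows one common way to avoid that, by guarding the set with a mutex. It is only an illustration under assumptions: safeStringSet, newSafeStringSet and insertConcurrently are hypothetical names, and this is not necessarily how the network policy code itself fixes the problem.

```go
package main

import (
	"fmt"
	"sync"
)

// safeStringSet is a hypothetical stand-in for sets.String guarded by a mutex.
// It is an illustration only, not a type from the network policy code.
type safeStringSet struct {
	mu    sync.Mutex
	items map[string]struct{}
}

func newSafeStringSet() *safeStringSet {
	return &safeStringSet{items: map[string]struct{}{}}
}

// Insert adds a value while holding the lock, so concurrent callers
// never write to the underlying map at the same time.
func (s *safeStringSet) Insert(v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[v] = struct{}{}
}

// Len returns the number of stored values.
func (s *safeStringSet) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.items)
}

// insertConcurrently inserts n distinct port names from n goroutines
// and waits for all of them to finish.
func insertConcurrently(s *safeStringSet, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Insert(fmt.Sprintf("port-%d", i))
		}(i)
	}
	wg.Wait()
}

func main() {
	s := newSafeStringSet()
	insertConcurrently(s, 100)
	fmt.Println(s.Len()) // prints 100
}
```

Without the mutex, the same concurrent inserts into a bare map are a data race that the runtime may turn into a crash, and running with `go run -race` would report it.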