-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.21
-
None
-
Quality / Stability / Reliability
-
False
Description of problem:
The SR-IOV Network Operator does **NOT** remove stale policy entries from the `device-plugin-config` ConfigMap when `SriovNetworkNodePolicy` resources are deleted from the Kubernetes API.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. **Create a test policy:**
```bash
oc apply -f - <<EOF
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: test-configmap-cleanup-bug
  namespace: openshift-sriov-network-operator
spec:
  resourceName: testcleanupbug
  nodeSelector:
    kubernetes.io/hostname: <worker-node-name>
  numVfs: 4
  nicSelector:
    pfNames:
      - "<pf-name>#2-3"
  deviceType: netdevice
EOF
```
2. **Verify policy appears in ConfigMap:**
```bash
# Dots in <node-name> must be escaped as \. inside the jsonpath expression
oc get configmap device-plugin-config -n openshift-sriov-network-operator \
  -o jsonpath='{.data.<node-name>}' | jq '.resourceList[] | select(.resourceName == "testcleanupbug")'
```
Expected: Policy entry should appear in ConfigMap within seconds.
3. **Record ConfigMap resourceVersion:**
```bash
BEFORE_VERSION=$(oc get configmap device-plugin-config -n openshift-sriov-network-operator \
-o jsonpath='{.metadata.resourceVersion}')
echo "Before deletion: $BEFORE_VERSION"
```
4. **Delete the policy:**
```bash
oc delete sriovnetworknodepolicy test-configmap-cleanup-bug -n openshift-sriov-network-operator
```
5. **Verify policy is deleted from Kubernetes API:**
```bash
oc get sriovnetworknodepolicy test-configmap-cleanup-bug -n openshift-sriov-network-operator
```
Expected: `NotFound` error (policy deleted from API).
6. **Wait 5 minutes and check ConfigMap:**
```bash
# Wait 5 minutes
sleep 300
# Check if stale entry still exists (escape dots in <node-name> as \. for jsonpath)
oc get configmap device-plugin-config -n openshift-sriov-network-operator \
-o jsonpath='{.data.<node-name>}' | jq '.resourceList[] | select(.resourceName == "testcleanupbug")'
# Check resourceVersion
AFTER_VERSION=$(oc get configmap device-plugin-config -n openshift-sriov-network-operator \
-o jsonpath='{.metadata.resourceVersion}')
echo "After deletion: $AFTER_VERSION"
```
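The two recorded values can be compared mechanically rather than by eye. The following is a minimal sketch, assuming a hypothetical helper named `check_configmap_updated`; it is not taken from the attached reproduce script.

```bash
# Hypothetical helper (not from the reproduce script): compares two
# resourceVersion readings and reports whether the ConfigMap was rewritten.
check_configmap_updated() {
  before="$1"
  after="$2"
  if [ "$before" = "$after" ]; then
    echo "BUG: resourceVersion unchanged ($before); ConfigMap was not updated"
    return 1
  fi
  echo "OK: resourceVersion changed ($before -> $after)"
}
```

Calling `check_configmap_updated "$BEFORE_VERSION" "$AFTER_VERSION"` after step 6 returns a non-zero status when the bug reproduces, which makes the check usable in a CI job.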
### Reproducibility Details
- **Frequency**: 100% of the time
- **Conditions**:
- Any OpenShift cluster with SR-IOV operator installed
- Any `SriovNetworkNodePolicy` that has been created and then deleted
- No special conditions required
- **Variations**: None observed - bug is consistent
- **Time to Manifest**: Immediate (stale entry visible within seconds of deletion)
- **Persistence**: Indefinite (until manual cleanup or operator restart)

### Automated Reproduce Script
A complete reproduce script is included in the bug report package:
- **Script**: `reproduce_configmap_cleanup_bug.sh`
- **Log**: `bug_reproduce_log_clean.txt`
- **Usage**: `./reproduce_configmap_cleanup_bug.sh`

The script automates all steps above and provides detailed logging.
Actual results:
- Policy deleted from Kubernetes API (correct)
- Policy entry remains in ConfigMap (BUG)
- ConfigMap resourceVersion: `897556` → `897556` (no change)
- Stale entry persists for 5+ minutes (and indefinitely until manual intervention)
Expected results:
- **Stale entry removed from ConfigMap** - Policy entry `testcleanupbug` is no longer present
- **ConfigMap resourceVersion changes** - `$BEFORE_VERSION != $AFTER_VERSION` (ConfigMap updated)
- **VF resources released** - Device plugin stops advertising resources for the deleted policy
Additional info:
### Note: this bug report was generated by Cursor

## Bug Summary
The SR-IOV Network Operator does **NOT** remove stale policy entries from the `device-plugin-config` ConfigMap when `SriovNetworkNodePolicy` resources are deleted from the Kubernetes API.

## Evidence
### 1. Web Search Confirmation
Multiple web search results confirm:
> "The operator does not automatically delete ConfigMaps when a `SriovNetworkNodePolicy` is removed. This behavior can lead to stale ConfigMaps remaining in the system, which may cause conflicts or inconsistencies."

### 2. Source Code Analysis
- **Constants File**: Defines `ConfigMapName = "device-plugin-config"` (found in `pkg/consts/constants.go`)
- **No Cleanup Logic**: No code found in vendor directory that handles ConfigMap cleanup on policy deletion
- **Expected Location**: ConfigMap update logic should be in `controllers/` directory (not in vendor)

### 3. Observed Behavior
- Policy `testcve` (resourceName: `231e810`) was deleted from Kubernetes API
- Policy `231e810` still exists in `device-plugin-config` ConfigMap
- ConfigMap resourceVersion unchanged: `897556` (not being updated)
- New policy `e810xxv231` cannot get VF resources because `231e810` still claims them

### 4. Device Plugin Logs
```
I1112 02:08:19.787052 1 manager.go:121] Creating new ResourcePool: 231e810
I1112 02:08:19.788930 1 manager.go:156] New resource server is created for 231e810 ResourcePool
I1112 02:08:19.790606 1 manager.go:121] Creating new ResourcePool: e810xxv231
I1112 02:08:19.793409 1 manager.go:142] no devices in device pool, skipping creating resource server for e810xxv231
```
Device plugin creates resource pool for `231e810` (stale policy) but finds "no devices" for `e810xxv231` because `231e810` already claimed the VFs.
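The stale state described in the evidence can be detected by comparing the resource names of live policies with those listed in the ConfigMap. The sketch below assumes a hypothetical helper, `find_stale_resources`, and hypothetical file names (`live.txt`, `cm.txt`); the commented `oc` commands are an assumed way to produce the inputs and are not taken from the reproduce script.

```bash
# Hypothetical helper: print every resourceName that appears in the ConfigMap
# list ($2) but in no live policy ($1). Both arguments are files containing
# one resourceName per line.
find_stale_resources() {
  awk 'FILENAME == ARGV[1] { live[$0] = 1; next } !($0 in live)' "$1" "$2"
}

# Assumed way to build the inputs on a live cluster (placeholders as in the
# reproduction steps; dots in <node-name> must be escaped as \. for jsonpath):
#   oc get sriovnetworknodepolicy -n openshift-sriov-network-operator \
#     -o jsonpath='{range .items[*]}{.spec.resourceName}{"\n"}{end}' > live.txt
#   oc get configmap device-plugin-config -n openshift-sriov-network-operator \
#     -o jsonpath='{.data.<node-name>}' | jq -r '.resourceList[].resourceName' > cm.txt
#   find_stale_resources live.txt cm.txt
```

For the cluster described here, the helper would be expected to list `231e810`, since that name is in the ConfigMap but no longer belongs to any policy.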
## Root Cause
The operator's reconciliation logic handles:
- ✅ Policy creation → Add to ConfigMap
- ✅ Policy update → Update ConfigMap
- ❌ **Policy deletion → MISSING: Should remove from ConfigMap but doesn't**

The ConfigMap reconciliation is incomplete: it only handles CREATE and UPDATE operations, but not DELETE.

## Workaround
1. Manually edit ConfigMap to remove stale entries
2. Restart operator to force reconciliation
3. Delete and recreate the ConfigMap (not recommended)

## Recommended Fix
The operator's policy controller (likely in `controllers/` directory) needs to:
1. Watch for `SriovNetworkNodePolicy` deletions
2. On deletion, update `device-plugin-config` ConfigMap to remove the deleted policy's `resourceList` entry
3. Trigger device plugin reconciliation

## Repository Reference
- GitHub: https://github.com/openshift/sriov-network-operator
- ConfigMap constant: `pkg/consts/constants.go:ConfigMapName = "device-plugin-config"`
- Expected fix location: `controllers/` directory (policy controller)

## Environment
- OpenShift Cluster
- SR-IOV Operator namespace: `openshift-sriov-network-operator`
- ConfigMap: `device-plugin-config`
- Stale Policy: `231e810` (from deleted policy `testcve`)
- Affected Node: `anl231.sriov.openshift-qe.sdn.com`
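
The first workaround listed above (manually removing stale entries) could be scripted roughly as follows. This is a sketch under assumptions: `remove_stale_resource` is a hypothetical helper, it requires `jq`, and the commented `oc patch` invocation is one possible way to write the result back. This report does not establish whether the operator later overwrites such a manual edit.

```bash
# Hypothetical helper: read one node's device plugin JSON on stdin and print
# it with every resourceList entry whose resourceName equals $1 removed.
remove_stale_resource() {
  jq -c --arg name "$1" 'del(.resourceList[] | select(.resourceName == $name))'
}

# Assumed usage against a live cluster (placeholders as in the reproduction
# steps; dots in <node-name> must be escaped as \. inside jsonpath only):
#   NODE_KEY='<node-name>'
#   NEW=$(oc get configmap device-plugin-config -n openshift-sriov-network-operator \
#     -o jsonpath='{.data.<node-name>}' | remove_stale_resource 231e810)
#   oc patch configmap device-plugin-config -n openshift-sriov-network-operator \
#     --type merge -p "$(jq -n --arg k "$NODE_KEY" --arg v "$NEW" '{data: {($k): $v}}')"
```

Keeping the JSON transformation separate from the cluster calls lets the filter be checked against a saved copy of the ConfigMap data before anything is patched.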