  1. OpenShift Bugs
  2. OCPBUGS-65507

SR-IOV Operator: ConfigMap Cleanup Logic Missing


    • Bug
    • Resolution: Unresolved
    • 4.21
    • Networking / SR-IOV
    • Quality / Stability / Reliability

      Description of problem:

      The SR-IOV Network Operator does **NOT** remove stale policy entries from the `device-plugin-config` ConfigMap when `SriovNetworkNodePolicy` resources are deleted from the Kubernetes API.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

       100%   

      Steps to Reproduce:

      1. **Create a test policy:**
         ```bash
         oc apply -f - <<EOF
         apiVersion: sriovnetwork.openshift.io/v1
         kind: SriovNetworkNodePolicy
         metadata:
           name: test-configmap-cleanup-bug
           namespace: openshift-sriov-network-operator
         spec:
           resourceName: testcleanupbug
           nodeSelector:
             kubernetes.io/hostname: <worker-node-name>
           numVfs: 4
           nicSelector:
             pfNames:
               - <pf-name>#2-3
           deviceType: netdevice
         EOF
         ```
      2. **Verify policy appears in ConfigMap:**
         ```bash
         oc get configmap device-plugin-config -n openshift-sriov-network-operator \
           -o jsonpath='{.data.<node-name>}' | jq '.resourceList[] | select(.resourceName == "testcleanupbug")'
         ```
         Expected: Policy entry should appear in ConfigMap within seconds.
      3. **Record ConfigMap resourceVersion:**
         ```bash
         BEFORE_VERSION=$(oc get configmap device-plugin-config -n openshift-sriov-network-operator \
           -o jsonpath='{.metadata.resourceVersion}')
         echo "Before deletion: $BEFORE_VERSION"
         ```
      4. **Delete the policy:**
         ```bash
         oc delete sriovnetworknodepolicy test-configmap-cleanup-bug -n openshift-sriov-network-operator
         ```
      5. **Verify policy is deleted from Kubernetes API:**
         ```bash
         oc get sriovnetworknodepolicy test-configmap-cleanup-bug -n openshift-sriov-network-operator
         ```
         Expected: `NotFound` error (policy deleted from API).
      6. **Wait 5 minutes and check ConfigMap:**
         ```bash
         # Wait 5 minutes
         sleep 300
         
         # Check if stale entry still exists
         oc get configmap device-plugin-config -n openshift-sriov-network-operator \
           -o jsonpath='{.data.<node-name>}' | jq '.resourceList[] | select(.resourceName == "testcleanupbug")'
         
         # Check resourceVersion
         AFTER_VERSION=$(oc get configmap device-plugin-config -n openshift-sriov-network-operator \
           -o jsonpath='{.metadata.resourceVersion}')
         echo "After deletion: $AFTER_VERSION"
         ```     
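One practical detail for steps 2 and 6: `oc`/`kubectl` jsonpath treats `.` as a path separator, so a ConfigMap key that contains dots (such as a fully qualified node name) has to be written with each dot escaped as `\.`. The sketch below assumes the ConfigMap key is the node name, as the steps above imply; the helper names `escape_jsonpath_key` and `node_jsonpath` are hypothetical and only illustrate the escaping.

```shell
# Escape every dot in a ConfigMap data key so jsonpath reads it as one key.
escape_jsonpath_key() {
  printf '%s\n' "$1" | sed 's/\./\\./g'
}

# Build the jsonpath expression that selects one node's entry.
# Assumption: the data key is the node name.
node_jsonpath() {
  printf '{.data.%s}\n' "$(escape_jsonpath_key "$1")"
}

# Example usage against a cluster (not executed here):
#   oc get configmap device-plugin-config -n openshift-sriov-network-operator \
#     -o jsonpath="$(node_jsonpath "$NODE")" \
#     | jq '.resourceList[] | select(.resourceName == "testcleanupbug")'
```

Without the escaping, a key such as `anl231.sriov.openshift-qe.sdn.com` would be parsed as nested fields and the query would typically return nothing, which could be mistaken for a cleaned-up entry.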
      
      ### Reproducibility Details

      - **Frequency**: 100% of the time
      - **Conditions**: 
        - Any OpenShift cluster with SR-IOV operator installed
        - Any `SriovNetworkNodePolicy` that has been created and then deleted
        - No special conditions required
      - **Variations**: None observed - bug is consistent
      - **Time to Manifest**: Immediate (stale entry visible within seconds of deletion)
      - **Persistence**: Indefinite (until manual cleanup or operator restart)

      ### Automated Reproduce Script

      A complete reproduce script is included in the bug report package:
      - **Script**: `reproduce_configmap_cleanup_bug.sh`
      - **Log**: `bug_reproduce_log_clean.txt`
      - **Usage**: `./reproduce_configmap_cleanup_bug.sh`

      The script automates all steps above and provides detailed logging.

      Actual results:

      -  Policy deleted from Kubernetes API (correct)
      -  Policy entry remains in ConfigMap (BUG)
      -  ConfigMap resourceVersion: `897556` → `897556` (no change)
      -  Stale entry persists for 5+ minutes (and indefinitely until manual intervention)

      Expected results:

      -  **Stale entry removed from ConfigMap** - Policy entry `testcleanupbug` is no longer present
      -  **ConfigMap resourceVersion changes** - `$BEFORE_VERSION != $AFTER_VERSION` (ConfigMap updated)
      -  **VF resources released** - Device plugin stops advertising resources for the deleted policy
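The pass/fail decision for step 6 can be written down as a small check. This is a minimal sketch, not part of the attached reproduce script: `check_cleanup` is a hypothetical helper that takes the two recorded resourceVersion values and whatever the step 6 `jq` query printed (empty when the entry is gone).

```shell
# check_cleanup BEFORE_VERSION AFTER_VERSION REMAINING_ENTRY
# REMAINING_ENTRY is the jq output from step 6; empty means the entry is gone.
check_cleanup() {
  if [ -n "$3" ]; then
    echo "FAIL: stale entry still present"
    return 1
  fi
  if [ "$1" = "$2" ]; then
    echo "FAIL: resourceVersion unchanged ($1)"
    return 1
  fi
  echo "PASS: entry removed, resourceVersion $1 -> $2"
}

# Example usage after running steps 3 and 6 (not executed here):
#   check_cleanup "$BEFORE_VERSION" "$AFTER_VERSION" "$REMAINING_ENTRY"
```

With the values reported under "Actual results" (`897556` before and after, entry still present), the check reports a failure on the first condition.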
      
      
      

      Additional info:

      ### Note: this bug report was generated by Cursor ###
      
      ## Evidence
      
      ### 1. Web Search Confirmation
      Multiple web search results confirm:
      > "The operator does not automatically delete ConfigMaps when a `SriovNetworkNodePolicy` is removed. This behavior can lead to stale ConfigMaps remaining in the system, which may cause conflicts or inconsistencies."
      
      ### 2. Source Code Analysis
      - **Constants File**: Defines `ConfigMapName = "device-plugin-config"` (found in `pkg/consts/constants.go`)
      - **No Cleanup Logic**: No code found in vendor directory that handles ConfigMap cleanup on policy deletion
      - **Expected Location**: ConfigMap update logic should be in `controllers/` directory (not in vendor)
      
      ### 3. Observed Behavior
      - Policy `testcve` (resourceName: `231e810`) was deleted from Kubernetes API
      - Resource entry `231e810` still exists in `device-plugin-config` ConfigMap
      - ConfigMap resourceVersion unchanged: `897556` (not being updated)
      - New policy `e810xxv231` cannot get VF resources because `231e810` still claims them
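The same observation can be checked on any cluster by comparing the resource names in the ConfigMap against the resource names of the policies that still exist. The sketch below is an illustration under assumptions: `find_stale_resources` is a hypothetical helper, and the comparison ignores each policy's `nodeSelector`, so it only flags names that no remaining policy declares at all.

```shell
# find_stale_resources CONFIGMAP_NAMES POLICY_NAMES
# Each argument is a newline-separated list of resourceName values.
# Prints every name found in the ConfigMap that no existing policy declares.
find_stale_resources() {
  printf '%s\n' "$1" | sort -u | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "$2" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Example of gathering the inputs (not executed here; assumes the data key is
# the node name and that each value is JSON with a resourceList array):
#   NS=openshift-sriov-network-operator
#   cm_names=$(oc get configmap device-plugin-config -n "$NS" -o json \
#     | jq -r --arg n "$NODE" '.data[$n] | fromjson | .resourceList[].resourceName')
#   policy_names=$(oc get sriovnetworknodepolicy -n "$NS" -o json \
#     | jq -r '.items[].spec.resourceName // empty')
#   find_stale_resources "$cm_names" "$policy_names"
```

For the case described here, the ConfigMap lists `231e810` and `e810xxv231` while only the policy for `e810xxv231` remains, so the helper would print `231e810`.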
      
      ### 4. Device Plugin Logs
      ```
      I1112 02:08:19.787052       1 manager.go:121] Creating new ResourcePool: 231e810
      I1112 02:08:19.788930       1 manager.go:156] New resource server is created for 231e810 ResourcePool
      I1112 02:08:19.790606       1 manager.go:121] Creating new ResourcePool: e810xxv231
      I1112 02:08:19.793409       1 manager.go:142] no devices in device pool, skipping creating resource server for e810xxv231
      ```

      Device plugin creates resource pool for `231e810` (stale policy) but finds "no devices" for `e810xxv231` because `231e810` already claimed the VFs.
      
      ## Root Cause

      The operator's reconciliation logic handles:
      - ✅ Policy creation → Add to ConfigMap
      - ✅ Policy update → Update ConfigMap  
      - ❌ **Policy deletion → MISSING: Should remove from ConfigMap but doesn't**

      The ConfigMap reconciliation is incomplete - it only handles CREATE and UPDATE operations, but not DELETE.
      
      
      ## Workaround
      
      1. Manually edit ConfigMap to remove stale entries
      2. Restart operator to force reconciliation
      3. Delete and recreate the ConfigMap (not recommended)
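Workaround 1 can be scripted rather than done by hand in an editor. The following is a rough sketch under assumptions, not a supported procedure: it assumes `jq` is available, that the ConfigMap key is the node name, and that each value is a JSON document with a `resourceList` array, as the reproduce steps suggest. The function names `drop_resource` and `remove_stale_entry` are hypothetical.

```shell
# drop_resource RESOURCE_NAME
# Reads one node's device plugin config JSON on stdin and prints it with the
# named resource removed from resourceList.
drop_resource() {
  jq -c --arg r "$1" '.resourceList |= map(select(.resourceName != $r))'
}

# remove_stale_entry NODE RESOURCE_NAME
# Rewrites the node's entry in device-plugin-config without the stale resource.
# Defined only; nothing is patched until the function is called.
remove_stale_entry() {
  ns=openshift-sriov-network-operator
  new=$(oc get configmap device-plugin-config -n "$ns" -o json \
    | jq -r --arg n "$1" '.data[$n]' | drop_resource "$2") || return 1
  patch=$(jq -n --arg n "$1" --arg v "$new" '{data: {($n): $v}}')
  oc patch configmap device-plugin-config -n "$ns" --type merge -p "$patch"
}

# Example usage (not executed here):
#   remove_stale_entry anl231.sriov.openshift-qe.sdn.com 231e810
```

Building the patch with `jq -n` avoids hand-quoting the embedded JSON string. The device plugin may still need to reload its configuration before the VFs become available to the new policy; that step is not covered by this sketch.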
      
      ## Recommended Fix
      
      The operator's policy controller (likely in `controllers/` directory) needs to:
      1. Watch for `SriovNetworkNodePolicy` deletions
      2. On deletion, update `device-plugin-config` ConfigMap to remove the deleted policy's `resourceList` entry
      3. Trigger device plugin reconciliation
      
      ## Repository Reference
      
      - GitHub: https://github.com/openshift/sriov-network-operator
      - ConfigMap constant: `pkg/consts/constants.go:ConfigMapName = "device-plugin-config"`
      - Expected fix location: `controllers/` directory (policy controller)
      
      ## Environment
      
      - OpenShift Cluster
      - SR-IOV Operator namespace: `openshift-sriov-network-operator`
      - ConfigMap: `device-plugin-config`
      - Stale Policy: `231e810` (from deleted policy `testcve`)
      - Affected Node: `anl231.sriov.openshift-qe.sdn.com`  

              bnemeth@redhat.com Balazs Nemeth
              zfang@redhat.com Zhiqiang Fang