OpenShift Bugs / OCPBUGS-65542

SR-IOV Operator: Incomplete NAD Configuration


    • Bug
    • Resolution: Unresolved
    • 4.21
    • Networking / SR-IOV
    • Quality / Stability / Reliability

      Summary:

       

      The SR-IOV operator creates NetworkAttachmentDefinition (NAD) resources with *incomplete configuration*. When user pods try to attach to an SR-IOV network, the CNI plugin fails because two fields it requires, `resourceName` and `pciAddress`, are missing from the NAD's `spec.config` JSON. `resourceName` is present, but only in `metadata.annotations`.

      *Result*: Pods remain in Pending state with error: "SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required"

      This bug was discovered during comprehensive integration testing of the SR-IOV operator and manifests as pod networking attachment failures.    

       

      Description of problem:

      ### What Goes Wrong
      ```
      Expected NAD spec.config:
      {
        "resourceName": "openshift.io/cx7anl244",  ✅ CNI needs this
        "pciAddress": "0000:02:01.2",              ✅ CNI needs this
        "type": "sriov"
      }

      Actual NAD spec.config:
      {
        "type": "sriov"
        # ❌ resourceName MISSING!
        # ❌ pciAddress MISSING!
      }

      BUT: resourceName IS in metadata.annotations ✅
      ```
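The comparison above can be checked against a live NAD by pulling its `spec.config` string and looking for the two keys. The following is a minimal sketch, not part of the operator or of the reproduction script: the function name `missing_cni_fields` is invented for illustration, and it does a plain text match on the quoted key rather than a real JSON parse.

```bash
# Hypothetical helper: print which of the fields named in this report
# are absent from a CNI config string (the NAD's spec.config value).
# Rough check only: it matches the quoted key as text, not as parsed JSON.
missing_cni_fields() {
  config=$1
  missing=""
  for key in resourceName pciAddress; do
    case $config in
      *"\"$key\""*) ;;                    # key present, nothing to report
      *) missing="$missing $key" ;;       # key absent, remember it
    esac
  done
  # Strip the leading space before printing; prints an empty line if
  # nothing is missing.
  printf '%s\n' "${missing# }"
}

# Example against a cluster (NAD name and namespace are placeholders):
#   config=$(oc get net-attach-def <nad-name> -n <namespace> -o jsonpath='{.spec.config}')
#   missing_cni_fields "$config"
```

An empty result means both keys appear somewhere in the config; any printed name is a field the config lacks.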
      
      ### Impact
      - ❌ Pod attachment fails
      - ❌ SR-IOV networking is broken for every network whose NAD is newly generated
      - ❌ Tests time out waiting for pod readiness
      - ✅ Only manifests when creating NEW networks or after operator restart

      ---
      
      ## When Bug Manifests
      
      **NOT in normal operation** (pre-configured networks work fine)
      
      **YES in these situations:**
      1. Creating a NEW SriovNetwork resource
      2. After operator restart/reinstallation
      3. Integration test runs that create fresh networks
      4. When NADs are regenerated
      
      **Why**: The operator creates a NAD only when you create a SriovNetwork resource
      
      ---
      
      ## Root Cause
      
      **Location**: `bindata/manifests/cni-config/sriov/` (template files)
      
      **Issue**: Template placement logic puts `resourceName` in annotations but NOT in spec.config JSON
      
      **Go Code**: ✅ CORRECT - properly prepares `data.Data["CniResourceName"]`
      
      **Templates**: ❌ BUGGY - uses it in annotations but not in CNI config 
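One way to inspect the claim about the templates is to list every place the template directory references the variable. This is a small sketch under assumptions: `template_uses` is a hypothetical helper name, and the report does not show the templates' actual syntax, so the search is for the bare variable name `CniResourceName` only.

```bash
# Hypothetical helper: list every line under a template directory that
# references a given template variable (default: CniResourceName).
# Output format is grep's usual path:line-number:matching-line.
template_uses() {
  dir=$1
  var=${2:-CniResourceName}
  grep -rn -- "$var" "$dir"
}

# Example, run from a checkout of the operator repository:
#   template_uses bindata/manifests/cni-config/sriov/
```

If every hit falls under an annotations block and none inside the CNI config string, that matches the placement described above.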

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

      ### Auto-Reproduction Tool

      File: `reproduce_incomplete_nad_bug.sh`
      
      **What it does:**
      1. Creates test namespace
      2. Creates SriovNetwork resource
      3. Captures operator logs
      4. Attempts to create test pod
      5. Collects all evidence
      6. Shows complete NAD output
      
      **How to run:**
      ```bash
      bash reproduce_incomplete_nad_bug.sh
      # Output: Complete NAD config in /tmp/
      ```     
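The script itself is not attached to this report. The six steps listed above could be sketched roughly as follows. Everything here is an assumption made for illustration: the namespace, network, pod, and output names are placeholders, the operator namespace and deployment name are the usual OpenShift defaults rather than values taken from this issue, and the pod image is only an example.

```bash
# Sketch only: NOT the real reproduce_incomplete_nad_bug.sh.
# All names below are placeholders and can be overridden from the environment.
NS=${NS:-sriov-nad-repro}
NET=${NET:-repro-net}
RESOURCE=${RESOURCE:-cx7anl244}   # resource name seen in this report
OPERATOR_NS=${OPERATOR_NS:-openshift-sriov-network-operator}   # assumed default
OUT=${OUT:-/tmp/nad-repro}

# Emit the SriovNetwork manifest; the operator generates the NAD from it.
sriov_network_manifest() {
  cat <<EOF
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: $NET
  namespace: $OPERATOR_NS
spec:
  resourceName: $RESOURCE
  networkNamespace: $NS
EOF
}

# Emit a minimal pod that requests the network via the Multus annotation.
test_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nad-repro-pod
  namespace: $NS
  annotations:
    k8s.v1.cni.cncf.io/networks: $NET
spec:
  containers:
  - name: sleeper
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
EOF
}

reproduce() {
  mkdir -p "$OUT"
  oc create namespace "$NS"                          # 1. test namespace
  sriov_network_manifest | oc apply -f -             # 2. SriovNetwork resource
  # 3. operator logs (deployment name is an assumption)
  oc logs -n "$OPERATOR_NS" deployment/sriov-network-operator > "$OUT/operator.log"
  test_pod_manifest | oc apply -f -                  # 4. test pod
  oc get events -n "$NS" > "$OUT/events.txt"         # 5. evidence
  # 6. NAD as generated; it may take a moment to appear after step 2
  oc get net-attach-def "$NET" -n "$NS" -o yaml | tee "$OUT/nad.yaml"
}
```

Calling `reproduce` against a test cluster runs the steps in order; the two manifest functions can also be run on their own to review what would be applied.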

      Actual results:

       ### Scenario 1: Pre-configured Networks
      ```
      Production Setup:
        Networks created long ago or pre-provided
          ↓
        Operator just uses them
          ↓
        ✅ No new NAD creation needed
          ↓
        ✅ Bug doesn't manifest
      ```
      ### Scenario 2: New Network Creation (Integration Tests)
      ```
      Test Setup:
        Create FRESH SriovNetwork resource
          ↓
        ❌ Operator generates NEW NAD with incomplete config
          ↓
        Try to attach pods
          ↓
        ❌ CNI plugin fails
      ```
      ### Scenario 3: Operator Restart
      ```
      Production Scenario:
        Operator running (NADs exist)
          ↓
        Operator crashes/restarts
          ↓
        NADs regenerated
          ↓
        ❌ Regenerated NAD has incomplete config
          ↓
        ❌ Pods fail to attach
          ↓
        This is when the bug appears in production
      ```

      Expected results:

          Pods attach to the SR-IOV network successfully.

      Additional info:

          

              bnemeth@redhat.com Balazs Nemeth
              zfang@redhat.com Zhiqiang Fang