Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.10.z
Component/s: Networking / SR-IOV
Labels:
- system-test
- telco

Severity:
Moderate
Regression:
No
Story Points:
5
Sprint:
NHE Sprint 237
sprint_count:
1
Blocked:
False
Blocked Reason:
None
Target Version:

4.10.z


Description of problem:

After applying the manifests for SR-IOV configuration, the MCP fails to update.

Version-Release number of selected component (if applicable):

4.10.60

How reproducible:

100%

Steps to Reproduce:

1. Deploy baremetal OCP cluster
2. Create two MCPs: one for regular worker nodes and one for SR-IOV workloads
3. Apply manifests:
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  configDaemonNodeSelector:
    node-role.kubernetes.io/sriov: ""
  enableInjector: true
  enableOperatorWebhook: true

---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: "sriovleftdpdkmellanox"
  namespace: openshift-sriov-network-operator
spec:
  resourceName: "sriovleftdpdkmellanox"
  nodeSelector:
    node-role.kubernetes.io/sriov: ""
  mtu: 9000
  numVfs: 4
  nicSelector:
    # consider switching to PCI paths
    pfNames: ['ens4f0', 'ens5f0']
  deviceType: netdevice
  isRdma: true
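
Step 2 does not show the MCP definitions. A minimal sketch of what the SR-IOV pool could look like is given below. The pool name `sriov` is an assumption (it is not stated in this report), and the node label is taken from the `nodeSelector` used in the manifests above. The actual MCPs used in the reproduction may differ.

```yaml
# Hypothetical MachineConfigPool for the SR-IOV nodes (name is an assumption).
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: sriov
spec:
  machineConfigSelector:
    matchExpressions:
      # Inherit worker MachineConfigs and add any sriov-specific ones.
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, sriov]
  nodeSelector:
    matchLabels:
      # Same label the SriovOperatorConfig and SriovNetworkNodePolicy select on.
      node-role.kubernetes.io/sriov: ""
```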

Actual results:

The node is stuck in SchedulingDisabled and the MCP is marked as `paused`.

Expected results:

SR-IOV is configured successfully and the nodes are rebooted.

Additional info:

Notes from Sebastian:
---------------------
The problem is that we now also need to drain after a reboot when the number of devices is 0,
but we cannot acquire the drain lock because another node took the lock while this node was rebooting,
so we end up in a deadlock.

depends on

OCPBUGS-522 [release-4.11] sriov operator doesn't support golang 1.18

Closed

is cloned by

OCPBUGS-14582 [4.9.z] MCP fails to update after SR-IOV drains the node

Closed

is depended on by

OCPBUGS-14582 [4.9.z] MCP fails to update after SR-IOV drains the node

Closed

links to

openshift/sriov-network-operator#784: [release-4.10] OCPBUGS-14360: Continue node drain after reboot

openshift/sriov-network-operator#787: [release-4.9] OCPBUGS-14360: Continue node drain after reboot

Assignee: William Zhao

Reporter: Yurii Prokulevych

QA Contact: Yurii Prokulevych

Votes: 0

Watchers: 8

Created: 2023/05/31 2:55 PM

Updated: 2024/04/29 5:06 PM

Resolved: 2023/06/22 8:44 AM
