Bug
Resolution: Done
Major
None
4.14, 4.15
Important
No
0
OSDOCS Sprint 261
1
False
Release Note Not Required
In Progress
Description of problem:
The YAML under "Create a NodeFeatureDiscovery instance using the CLI" does not work. If you follow the same steps using the console, everything works fine.
Version-Release number of selected component (if applicable):
How reproducible:
Follow the instructions in the document for creating the NFD instance using the CLI, then perform the verification steps to ensure all is working correctly. It will not be; it fails 100% of the time.
Steps to Reproduce:
1. Start with the NVIDIA docs for installing the Node Feature Discovery Operator.
2. The NVIDIA docs will point you to the OpenShift docs here: https://docs.openshift.com/container-platform/4.15/hardware_enablement/psap-node-feature-discovery-operator.html. Follow the instructions for creating the operator using the CLI.
3. Go back to the NVIDIA docs and follow the steps to verify that the NFD Operator was installed correctly: https://docs.nvidia.com/datacenter/cloud-native/openshift/23.9.1/install-nfd.html. Under "Verify that the Node Feature Discovery Operator is functioning correctly", you will not get the correct label on the node. Running "oc describe node | egrep 'Roles|pci' | grep -v master" will also not generate the correct output. Finally, continuing with the installation of the NVIDIA GPU Operator will also result in issues due to the NFD instance problem. Using the console to create the NFD Operator works fine.
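To make the failed verification step concrete, here is a small Python sketch that replicates the "oc describe node | egrep 'Roles|pci' | grep -v master" filter. The sample node output is hypothetical (illustrative only, not captured from a real cluster); on a healthy NFD install the filtered lines would include the pci-10de label.

```python
import re

# Hypothetical sample of `oc describe node` output (illustrative only).
sample = """\
Roles:              control-plane,master
Roles:              worker
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
"""

# Equivalent of: egrep 'Roles|pci' | grep -v master
matches = [
    line for line in sample.splitlines()
    if re.search(r"Roles|pci", line) and "master" not in line
]
for line in matches:
    print(line)
```

On the broken CLI-created instance, no pci-10de line appears in this filtered output.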
Actual results:
The NFD instance did not deploy correctly, and the node labels were incorrect.
Expected results:
Expected the GPU Node to have this label: "feature.node.kubernetes.io/pci-10de.present=true", as referenced in the documentation.
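For reference, that expected label follows NFD's pci-<vendor>.present naming pattern, where 10de is NVIDIA's PCI vendor ID. A minimal sketch of how the label string is composed (the variable names here are mine, not from NFD's source):

```python
# NFD publishes PCI labels under this prefix; 0x10de is NVIDIA's PCI vendor ID.
NFD_PREFIX = "feature.node.kubernetes.io"
vendor_id = "10de"

label = f"{NFD_PREFIX}/pci-{vendor_id}.present=true"
print(label)  # -> feature.node.kubernetes.io/pci-10de.present=true
```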
Additional info:
Using the console to create the NFD Operator works fine, 100% of the time. I took the NFD instance YAML generated from a successful console deployment, followed the CLI install steps using that YAML, and everything worked, including the deployment of the GPU Operator. Here are the differences. The snippet below is from the NFD instance YAML that does not work:

  kernel:
    kconfigFile: "/path/to/kconfig"
    configOpts:
      - "NO_HZ"
      - "X86"
      - "DMI"
  pci:
    deviceClassWhitelist:
      - "0200"
      - "03"
      - "12"
    deviceLabelFields:
      - "class"
  customConfig:
    configData: |
      - name: "more.kernel.features"
        matchOn:
          - loadedKMod: ["example_kmod3"]

The YAML below does work. Note that most of the config is commented out:

  # - "X86"
  # - "DMI"
  pci:
    deviceClassWhitelist:
      - "0200"
      - "03"
      - "12"
    deviceLabelFields:
      # - "class"
      - "vendor"
      # - "device"
      # - "subsystem_vendor"
      # - "subsystem_device"
  # usb:
  #   deviceClassWhitelist:
  #     - "0e"
  #     - "ef"
  #     - "fe"
  #     - "ff"
  #   deviceLabelFields:
  #     - "class"
  #     - "vendor"
  #     - "device"
  # custom:
  #   - name: "my.kernel.feature"
  #     matchOn:
  #       - loadedKMod: ["example_kmod1", "example_kmod2"]
  #   - name: "my.pci.feature"
  #     matchOn:
  #       - pciId:
  #           class: ["0200"]
  #           vendor: ["15b3"]
  #           device: ["1014", "1017"]
  #       - pciId:
  #           vendor: ["8086"]
  #           device: ["1000", "1100"]
  #   - name: "my.usb.feature"
  #     matchOn:
  #       - usbId:
  #           class: ["ff"]
  #           vendor: ["03e7"]
  #           device: ["2485"]
  #       - usbId:
  #           class: ["fe"]
  #           vendor: ["1a6e"]
  #           device: ["089a"]
  #   - name: "my.combined.feature"
  #     matchOn:
  #       - pciId:
  #           vendor: ["15b3"]
  #           device: ["1014", "1017"]
  #         loadedKMod: ["vendor_kmod1", "vendor_kmod2"]
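To summarize the delta between the two configs, here is a sketch with both worker configs hand-transcribed as Python dicts (transcribed from the YAML above, not parsed from the live custom resources):

```python
# Non-working config from the OpenShift docs, transcribed as a dict.
broken = {
    "kernel": {
        "kconfigFile": "/path/to/kconfig",
        "configOpts": ["NO_HZ", "X86", "DMI"],
    },
    "pci": {
        "deviceClassWhitelist": ["0200", "03", "12"],
        "deviceLabelFields": ["class"],
    },
    "customConfig": {
        "configData": '- name: "more.kernel.features"\n'
                      '  matchOn:\n'
                      '  - loadedKMod: ["example_kmod3"]\n',
    },
}

# Working config generated by the console (active, uncommented keys only).
working = {
    "pci": {
        "deviceClassWhitelist": ["0200", "03", "12"],
        "deviceLabelFields": ["vendor"],
    },
}

# Top-level sections present only in the non-working config.
print(sorted(broken.keys() - working.keys()))  # -> ['customConfig', 'kernel']

# The one field that differs inside the shared pci section.
print(broken["pci"]["deviceLabelFields"], "->", working["pci"]["deviceLabelFields"])
```

So the working config drops the kernel and customConfig sections entirely and labels PCI devices by "vendor" instead of "class", which is consistent with the expected pci-10de (vendor-based) label.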
duplicates: OCPBUGS-32427 [enterprise-4.15] Issue in file hardware_enablement/psap-node-feature-discovery-operator.adoc (Closed)