OpenShift Bugs / OCPBUGS-59290

[4.18] iommu.passthrough for Arm64 GH nodes


    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.18
    • Node Tuning Operator
    • Quality / Stability / Reliability
    • False
    • Moderate
    • In Progress
    • Bug Fix

      Description of problem:
      In 4.16.30, on Arm64 Grace Hopper nodes, the NVIDIA GPU validator worked with a performance profile applied only if the following Tuned patch was also applied:

      apiVersion: tuned.openshift.io/v1
      kind: Tuned
      metadata:
        name: performance-patch
        namespace: openshift-cluster-node-tuning-operator
      spec:
        profile:
        - data: |
            [main]
            summary=Configuration changes profile inherited from performance created tuned
            include=openshift-node-performance-openshift-node-performance-profile
            [bootloader]
            cmdline_iommu_arm=-iommu.passthrough=1
            [service]
            service.stalld=start,enable
          name: performance-patch
        recommend:
        - machineConfigLabels:
            machineconfiguration.openshift.io/role: master
          priority: 19
          profile: performance-patch
      

      This is documented in the KCS article: https://access.redhat.com/solutions/7107635

      However, in 4.18 the patch above does not work when SR-IOV is in use, because of a recent change in the SR-IOV network operator: https://github.com/openshift/sriov-network-operator/blob/release-4.18/pkg/plugins/generic/generic_plugin.go#L441

      Instead, the following patch was required:

      data: |
             [main]
             summary=Additional Cloud 5G RAN Application tuning
             include=performance-patch
             [bootloader]
             # see https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.18/assets/performanceprofile/tuned/openshift-node-performance#L172
             cmdline_hugepages=default_hugepagesz=1G hugepagesz=1G hugepages=32
             # DOES NOT WORK: based on KCS https://access.redhat.com/solutions/7107635 for GPU operator
             # cmdline_iommu_arm=-iommu.passthrough=1
             cmdline_iommu=-iommu.passthrough=1
             cmdline_iommu=+ iommu.passthrough=0
      

      We need a single, consistent way to apply this patch so that the validator issue is not hit.

      Version-Release number of selected component (if applicable): 4.18

      How reproducible:
      100%

      Steps to Reproduce:
      1. Install OCP
      2. Install SRIOV + Performance Profile
      3. Install NVIDIA GPU Operator and Cluster policy

      Actual results:
      The GPU Operator validator fails unless the patch above is applied.

      Expected results:
      The GPU validator should pass without any additional Tuned patch.
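
      Both patches above aim to remove iommu.passthrough=1 from the kernel command line, and the 4.18 variant also adds iommu.passthrough=0. A minimal sketch of how the resulting command line could be checked is shown below. The helper name and the sample string are hypothetical, and the sketch assumes that when the parameter appears more than once, the last occurrence is the one that takes effect.

```python
from typing import Optional


def effective_iommu_passthrough(cmdline: str) -> Optional[str]:
    """Return the value of the last iommu.passthrough=<value> token, or None if absent.

    Assumption: if the parameter is repeated, the last occurrence wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "iommu.passthrough" and sep:
            value = val
    return value


# Hypothetical command line, shaped like the arguments set by the 4.18 patch.
sample = "default_hugepagesz=1G hugepagesz=1G hugepages=32 iommu.passthrough=0"
print(effective_iommu_passthrough(sample))  # prints 0
```

      On a node, the input would be the contents of /proc/cmdline. A result of "1" would indicate that the patch has not taken effect.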

      Additional info:

              msivak@redhat.com Martin Sivak
              rh-ee-bschmaus Ben Schmaus
              Andrea Panattoni, Brent Rowsell
              Mallapadi Niranjan