OpenShift Bugs: OCPBUGS-49767

Node Feature Discovery creates thousands of nodefeatures.nfd.openshift.io objects in Azure with NVIDIA GPU Operator

      Description of problem:

      With normal, expected node churn from the node autoscaler in Azure (ROSA), on a cluster running the NVIDIA GPU Operator and IBM Watson X, more than 3500 objects of type `nodefeatures.nfd.openshift.io` were created. This drove `kube-apiserver` memory utilization too high, and the node feature operator reported that it could not locate the nodes referenced in those objects.
      
      This was ultimately resolved by removing all of the objects that had not been automatically pruned when their nodes left the cluster. The problem may have been exacerbated by ArgoCD performing recursive API queries for objects, with groups of requests taking up to 30s to complete.
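
The manual cleanup described above amounts to comparing the `nodefeatures.nfd.openshift.io` objects against the nodes that still exist. The following is a minimal sketch of that comparison, not the procedure actually used on the affected cluster. It assumes the two inputs are the parsed JSON of `oc get nodefeatures.nfd.openshift.io -A -o json` and `oc get nodes -o json`, and that each object identifies its node through the `nfd.node.kubernetes.io/node-name` label used by upstream NFD, falling back to the object name. The function names are hypothetical.

```python
# Assumption: label key taken from upstream NFD; verify it on your cluster.
NODE_NAME_LABEL = "nfd.node.kubernetes.io/node-name"


def node_names_from_list(node_list):
    """Collect node names from a parsed `oc get nodes -o json` document."""
    return {item["metadata"]["name"] for item in node_list.get("items", [])}


def find_orphaned_nodefeatures(nodefeatures, live_node_names):
    """Return (namespace, name) pairs for objects whose node no longer exists.

    `nodefeatures` is the `items` list of the parsed NodeFeature listing.
    """
    orphans = []
    for obj in nodefeatures:
        meta = obj.get("metadata", {})
        labels = meta.get("labels") or {}
        # Fall back to the object name when the label is missing (assumption).
        node = labels.get(NODE_NAME_LABEL, meta.get("name"))
        if node not in live_node_names:
            orphans.append((meta.get("namespace"), meta.get("name")))
    return orphans
```

The sketch only reports candidates. Reviewing the list before deleting anything is the safer choice, since an object for a node that is still joining the cluster could otherwise be removed.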

              yshnaidm Yevgeny Shnaidman
              rhn-gps-bbeaudoi Brian Beaudoin
              Guy Gordani Guy Gordani