  1. Performance and Scale for AI Platforms
  2. PSAP-518

NFD logs crash approximately 7 hours after installation without touching the cluster


    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • July Release for PSAP
    • Alongside OpenShift 4.10
    • NFD

      Approximately 7 hours after NFD has been installed, the following error shows up in the NFD controller manager's logs:

      [E0903 02:52:07.981627 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.NodeFeatureDiscovery: the server has received too many requests and has asked us to try again later (get nodefeaturediscoveries.nfd.openshift.io)
      
      [E0903 02:52:10.566841 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.NodeFeatureDiscovery: the server has received too many requests and has asked us to try again later (get nodefeaturediscoveries.nfd.openshift.io)

      This issue is easily reproduced by letting NFD run for several hours, such as overnight. I have reproduced it both with and without using my cluster during that 7-hour period.
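
To check a saved copy of the controller manager's logs after an overnight run, something like the following sketch could be used. It is not part of NFD or the operator: the function names `find_watch_errors` and `hours_to_first_error` are made up for illustration, the regular expression assumes the klog-style header seen in the lines quoted above, and the year has to be supplied by the caller because the log header does not carry one.

```python
import re
from datetime import datetime

# Header layout assumed from the lines quoted in this report:
# severity letter, MMDD, HH:MM:SS.micros, PID, file:line], then the message.
# The optional leading "[" matches how the lines appear in this report.
KLOG_LINE = re.compile(
    r"^\s*\[?(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+\s+\S+\]\s+(?P<msg>.*)$"
)

# Substring copied verbatim from the error quoted in this report.
WATCH_ERROR = "Failed to watch *v1.NodeFeatureDiscovery"


def find_watch_errors(log_text, year):
    """Return the timestamp of every matching watch error, in log order."""
    hits = []
    for line in log_text.splitlines():
        m = KLOG_LINE.match(line)
        if m is None or m["sev"] != "E" or WATCH_ERROR not in m["msg"]:
            continue
        # The header has no year, so the caller supplies it.
        stamp = f"{year}-{m['month']}-{m['day']} {m['time']}"
        hits.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"))
    return hits


def hours_to_first_error(hits, installed_at):
    """Hours between installation and the first logged error, or None."""
    if not hits:
        return None
    return (hits[0] - installed_at).total_seconds() / 3600
```

Passing the saved log text and the installation time to these two functions gives a rough figure to compare against the 7 hours observed here. An empty result from `find_watch_errors` after a long enough run is what a fixed build would be expected to show.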

      Most likely, this issue involves the controller manager.

       

      For reference, see the GitHub issue: https://github.com/openshift/cluster-nfd-operator/issues/209

       

      Acceptance criteria:

      • NFD no longer logs the above error

              carangog Eduardo Arango (Inactive)
              cpacheco@redhat.com Courtney Pacheco (Inactive)