
Kube-scheduler panics in OCP 4.14 when a Pod has an invalid Node selector


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Undefined
    • 4.14.z, 4.15.z
    • Component: kube-scheduler
    • Severity: Moderate
    • Release note:
      * Previously, when a pod specified a node selector with no matching nodes, the kube-scheduler panicked with the following error: _Observed a panic: integer divide by zero_. With this release, the issue in the kube-scheduler code base is resolved, and the kube-scheduler no longer panics when a pod specifies a node selector with no matching nodes. (link:https://issues.redhat.com/browse/OCPBUGS-36397[*OCPBUGS-36397*])
    • Release note type: Bug Fix
    • In Progress

      Description of problem:

      When a Pod specifies a Node selector that matches no nodes, the kube-scheduler can panic with the following error:

      Observed a panic: "integer divide by zero" (runtime error: integer divide by zero)
      
      This issue appears to affect only OCP 4.14: I could not reproduce it on 4.12, 4.13, or 4.15. The likely cause is an upstream bug that was introduced and later fixed [1], with 4.14 falling after the regression but before the fix.
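      The upstream code is not quoted in this report, so the following Go sketch only illustrates the general failure mode under a stated assumption: that the scheduler computes an index modulo the number of candidate nodes, and that a Node selector matching no nodes drives that count to zero. The functions nextStartIndex, safeNextStartIndex, and tryNextStartIndex are hypothetical, not kube-scheduler code. In Go, an integer modulo by a zero variable raises a runtime panic rather than returning an error, which matches the "integer divide by zero" text above.

```go
package main

import "fmt"

// nextStartIndex is a HYPOTHETICAL stand-in for round-robin bookkeeping
// over candidate nodes. With numNodes == 0 the modulo divides by zero
// and the Go runtime panics.
func nextStartIndex(start, processed, numNodes int) int {
	return (start + processed) % numNodes
}

// safeNextStartIndex is the same computation with an explicit guard for
// an empty candidate list.
func safeNextStartIndex(start, processed, numNodes int) int {
	if numNodes == 0 {
		return 0
	}
	return (start + processed) % numNodes
}

// tryNextStartIndex calls nextStartIndex and converts a runtime panic
// into an error so the failure mode can be observed without crashing.
func tryNextStartIndex(start, processed, numNodes int) (idx int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("observed a panic: %v", r)
		}
	}()
	return nextStartIndex(start, processed, numNodes), nil
}

func main() {
	// A Node selector that matches nothing leaves zero candidate nodes.
	_, err := tryNextStartIndex(0, 0, 0)
	fmt.Println(err) // observed a panic: runtime error: integer divide by zero
	fmt.Println(safeNextStartIndex(0, 0, 0))
}
```

      The guard in safeNextStartIndex is one way to avoid the crash; the expected behavior described later in this report is instead to surface a scheduling error for the Pod.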

      Version-Release number of selected component (if applicable):

      OCP 4.14

      How reproducible:

      Always

      Steps to Reproduce:
      1. Create a Pod whose nodeAffinity selector matches no nodes in the cluster:

      apiVersion: v1
      kind: Pod
      metadata:
        name: break-kube-scheduler
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchFields:
                - key: metadata.name
                  operator: In
                  values:
                  - invalid-node # a node that doesn't exist
        containers:
        - name: main
          image: image-registry.openshift-image-registry.svc:5000/openshift/ubi:latest
          command: ["cat"]
          stdin: true
      
      

      2. Observe the logs of the kube-scheduler leader.
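      The selector in step 1 uses matchFields with key metadata.name and operator In, which a node satisfies only when its name appears in values. A minimal Go sketch of that check follows; matchesNameIn, feasibleNodes, and the node names are hypothetical and only illustrate why the Pod ends up with zero feasible nodes. This is not the scheduler's implementation.

```go
package main

import "fmt"

// matchesNameIn reports whether a node satisfies a matchFields term with
// key metadata.name and operator In: the node name must be in values.
func matchesNameIn(nodeName string, values []string) bool {
	for _, v := range values {
		if v == nodeName {
			return true
		}
	}
	return false
}

// feasibleNodes returns the subset of nodeNames that satisfy the term.
func feasibleNodes(nodeNames, values []string) []string {
	var out []string
	for _, n := range nodeNames {
		if matchesNameIn(n, values) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// HYPOTHETICAL cluster node names; "invalid-node" comes from the Pod spec.
	nodes := []string{"master-0", "worker-0", "worker-1"}
	fmt.Println(len(feasibleNodes(nodes, []string{"invalid-node"}))) // 0
}
```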

      Actual results:
      The kube-scheduler leader panics, causing this Pod and all Pods after it (possibly in alphanumeric order) to remain unscheduled.
       
      Expected results:
      The kube-scheduler logs an error indicating that no matching node exists:

      E0701 15:35:25.419149 1 schedule_one.go:158] "Error selecting node for pod" err="nodeinfo not found for node name \"invalid-node\"" pod="openshift-kube-scheduler/break-kube-scheduler"
      E0701 15:35:25.419196 1 schedule_one.go:891] "Error scheduling pod; retrying" err="nodeinfo not found for node name \"invalid-node\"" pod="openshift-kube-scheduler/break-kube-scheduler"
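      The expected log suggests that a non-panicking scheduler rejects node names it cannot find. Assuming the Node affinity pre-filtering hands back a list of node names (an assumption drawn from the error text, not from the upstream patch), the check could look like the Go sketch below. The nodeInfo type and resolvePreFilterNodes function are hypothetical; only the error message is taken from the log above.

```go
package main

import "fmt"

// nodeInfo is a HYPOTHETICAL, trimmed-down stand-in for per-node state.
type nodeInfo struct{ name string }

// resolvePreFilterNodes maps node names selected by a pre-filter step to
// entries in a snapshot. A missing name yields an error (same message as
// the expected log) instead of silently producing an empty node list.
func resolvePreFilterNodes(names []string, snapshot map[string]*nodeInfo) ([]*nodeInfo, error) {
	nodes := make([]*nodeInfo, 0, len(names))
	for _, name := range names {
		ni, ok := snapshot[name]
		if !ok {
			return nil, fmt.Errorf("nodeinfo not found for node name %q", name)
		}
		nodes = append(nodes, ni)
	}
	return nodes, nil
}

func main() {
	snapshot := map[string]*nodeInfo{"worker-0": {name: "worker-0"}}
	_, err := resolvePreFilterNodes([]string{"invalid-node"}, snapshot)
	fmt.Println(err) // nodeinfo not found for node name "invalid-node"
}
```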
      

      Additional info:

      [1] Upstream bug on kube-scheduler panic (Kubernetes #124930)

      [2] Similar issue in OCP 4.17 (OCPBUGS-34593)

              Jan Chaloupka (jchaloup@redhat.com)
              Jordan Bell (rhn-support-jorbell)
              Rama Kasturi Narra