OCPBUGS-67277

Kubernetes NMState Operator does not work on a cluster that sets defaultNodeSelector

    • Moderate
    • In Progress
    • Bug Fix
    • The `openshift-nmstate` namespace is now annotated with `openshift.io/node-selector: ""`. As a result, deployments that use a custom scheduler run correctly without any manual action.

      Description of problem:

      We noticed that some nmstate pods are stuck in the Pending state.

      $ oc get pods -n openshift-nmstate -o wide
      NAME                                    READY   STATUS    RESTARTS   AGE     IP              NODE      NOMINATED NODE   READINESS GATES
      nmstate-console-plugin-8b88fbd7-qz7hd   1/1     Running   0          4m20s   10.128.56.60    worker0   <none>           <none>
      nmstate-handler-5crh9                   0/1     Pending   0          4m22s   <none>          <none>    <none>           <none>
      nmstate-handler-6zndl                   0/1     Pending   0          4m22s   <none>          <none>    <none>           <none>
      nmstate-handler-brq44                   1/1     Running   0          4m21s   172.19.90.140   worker2   <none>           <none>
      nmstate-handler-hl9ls                   0/1     Pending   0          4m22s   <none>          <none>    <none>           <none>
      nmstate-handler-lgxq5                   0/1     Pending   0          4m22s   <none>          <none>    <none>           <none>
      nmstate-handler-mwr55                   0/1     Pending   0          4m22s   <none>          <none>    <none>           <none>
      nmstate-handler-rdkhs                   1/1     Running   0          4m22s   172.19.90.139   worker1   <none>           <none>
      nmstate-handler-sv5zm                   1/1     Running   0          4m21s   172.19.90.138   worker0   <none>           <none>
      nmstate-handler-v85qg                   0/1     Pending   0          4m22s   <none>          <none>    <none>           <none>
      nmstate-metrics-75c64559db-xm9kx        2/2     Running   0          4m22s   10.128.64.58    worker1   <none>           <none>
      nmstate-operator-7844c5895f-cc6jx       1/1     Running   0          13m     10.128.0.132    master0   <none>           <none>
      nmstate-webhook-6dccbdf6bd-54ggc        1/1     Running   0          4m22s   10.128.64.59    worker1   <none>           <none>
      nmstate-webhook-6dccbdf6bd-jc2nm        1/1     Running   0          4m22s   10.128.56.59    worker0   <none>           <none>
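
      To list only the stuck pods, the table above can be filtered on its STATUS column. This is a minimal sketch that assumes the default `oc get pods` column order (STATUS is the third column); the function name `pending_pods` is hypothetical.

```shell
# Print the names of pods whose STATUS column is "Pending".
# Reads `oc get pods` table output on stdin and skips the header row.
# Assumes the default column order: NAME READY STATUS ...
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Usage against a live cluster (requires a logged-in `oc`):
#   oc get pods -n openshift-nmstate -o wide | pending_pods
```

      Alternatively, `oc get pods -n openshift-nmstate --field-selector=status.phase=Pending` lets the API server do the filtering.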
      

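      This issue is similar to OCPBUGS-58038, CNV-71397, OCPBUGS-9767, OCPBUGS-22305, and others.
      The root cause is that the namespace lacks the `openshift.io/node-selector: ""` annotation. Without it, the cluster-wide defaultNodeSelector also applies to pods in this namespace, so handler pods meant for non-worker nodes cannot be scheduled.

      What makes this troublesome for customers is that the nmstate operator removes the annotation even when they add it themselves.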
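
      Whether the annotation is currently present can be checked from the namespace YAML. The following is a sketch only: the function name `has_node_selector_annotation` is hypothetical, and it assumes the annotation appears as its own line in the `-o yaml` output.

```shell
# Report whether namespace YAML on stdin carries the
# openshift.io/node-selector annotation (any value, including "").
# The pattern is anchored to the start of the line so that other keys
# which merely contain the annotation name do not match.
has_node_selector_annotation() {
  if grep -Eq '^[[:space:]]*openshift\.io/node-selector:'; then
    echo "annotation present"
  else
    echo "annotation missing"
  fi
}

# Usage against a live cluster (requires a logged-in `oc`):
#   oc get namespace openshift-nmstate -o yaml | has_node_selector_annotation
```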

      Version-Release number of selected component (if applicable):

      • kubernetes-nmstate-operator.4.20.0-202511181524

      How reproducible:

      Always

      Steps to Reproduce:

      Step 1. Configure defaultNodeSelector so that regular pods are placed only on worker nodes.

      $ oc edit scheduler
      apiVersion: config.openshift.io/v1
      kind: Scheduler
      metadata:
        ...
      spec:
        defaultNodeSelector: node-role.kubernetes.io/worker=
        ...
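
      For a scripted reproduction, the same setting can be applied without opening an editor. This is a sketch that assumes the cluster-wide Scheduler object has its usual name `cluster`; the function name and the `OC` override variable are hypothetical conveniences for dry runs, not part of `oc`.

```shell
# Set spec.defaultNodeSelector on the cluster-wide Scheduler config.
# OC may be overridden (for example OC=echo) to print the command
# instead of running it; that override is an assumption of this sketch.
set_default_node_selector() {
  selector="${1:-node-role.kubernetes.io/worker=}"
  "${OC:-oc}" patch scheduler cluster --type=merge \
    -p "{\"spec\":{\"defaultNodeSelector\":\"${selector}\"}}"
}

# Usage (requires cluster-admin on a live cluster):
#   set_default_node_selector
```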
      

      Step 2. Install the nmstate operator. Be sure to include `openshift.io/node-selector: ""` in the Namespace YAML.

      apiVersion: v1
      kind: Namespace
      metadata:
        annotations:
          openshift.io/node-selector: ""
        labels:
          kubernetes.io/metadata.name: openshift-nmstate
          name: openshift-nmstate
        name: openshift-nmstate
      spec:
        finalizers:
        - kubernetes
      ---
      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        annotations:
          olm.providedAPIs: NMState.v1.nmstate.io
        name: openshift-nmstate
        namespace: openshift-nmstate
      spec:
        targetNamespaces:
        - openshift-nmstate
      ---
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        labels:
          operators.coreos.com/kubernetes-nmstate-operator.openshift-nmstate: ""
        name: kubernetes-nmstate-operator
        namespace: openshift-nmstate
      spec:
        channel: stable
        installPlanApproval: Automatic
        name: kubernetes-nmstate-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      

      Step 3. Create the NMState object after the operator installation has completed.

      apiVersion: nmstate.io/v1
      kind: NMState
      metadata:
        name: nmstate
      

      Step 4. Run `oc get pods -n openshift-nmstate -o wide`.

      Actual results:

      Some pods are stuck in the Pending state.

      The `openshift.io/node-selector: ""` annotation has been removed by the operator:

      $ oc get namespace -o yaml openshift-nmstate | grep node-selector || echo "Not Found"
      Not Found
      

      Expected results:

      All pods are running.

      Additional information:

      The only workaround is to add `openshift.io/node-selector: ""` manually and then delete all pods.
      However, the operator deletes the annotation again after a few minutes, so pods go back to Pending whenever they are recreated for any reason, e.g. an OpenShift upgrade.
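
      The manual workaround can be sketched as two `oc` commands. The function name and the `OC` override variable are hypothetical conveniences for dry runs, not part of `oc`.

```shell
# Re-add the empty node-selector annotation, then delete the pods so
# that they are recreated and rescheduled.
# OC may be overridden (for example OC=echo) to print the commands
# instead of running them; that override is an assumption of this sketch.
apply_nmstate_workaround() {
  ns="${1:-openshift-nmstate}"
  "${OC:-oc}" annotate namespace "$ns" 'openshift.io/node-selector=' --overwrite &&
  "${OC:-oc}" delete pods --all -n "$ns"
}

# Usage (requires a logged-in `oc` with sufficient privileges):
#   apply_nmstate_workaround
```

      Because the operator removes the annotation again, this only helps until the next time the pods are recreated.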

      This is a serious defect for users who rely on defaultNodeSelector.
      All of our customers use defaultNodeSelector to keep their pods from being scheduled on anything other than worker nodes.

              mkowalsk@redhat.com Mat Kowalski
              rh-openshift OpenShift engineer NEC
              Ross Brattain Ross Brattain