OpenShift Bugs / OCPBUGS-28847

Windows Nodes go NotReady due to an empty CNI config file

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • 4.14.z
    • 4.16.0
    • Windows Containers
    • None
    • No
    • 0
    • WINC - Sprint 249
    • 1
    • False
    • Release Note Not Required
    • In Progress

      This is a clone of issue OCPBUGS-28700. The following is the description of the original issue:
      ---
      This is a clone of issue OCPBUGS-28575. The following is the description of the original issue:
      ---
      Description of problem:

      Windows Nodes occasionally fail to reach the Ready state because of CNI issues, specifically an empty CNI config file.
          

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Reproducible on AWS.
          

      Steps to Reproduce:

          1. SSH into the Windows Node.
          2. In PowerShell, replace the CNI config with an empty file: rm /k/cni/config/cni.conf; New-Item /k/cni/config/cni.conf -type file
          3. Wait for the node to go NotReady.
          

      Actual results:

      The Windows Node goes NotReady and does not return to Ready. The following can be seen in the logs:
      
      WMCO log:
      2024-01-29T06:09:15Z	DEBUG	wc 10.0.193.178	run	{"cmd": "C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator", "out": "I0129 06:07:50.347432    4868 cleanup.go:111] error getting services ConfigMap associated with version annotation, falling back to use latest services ConfigMap: node is missing version annotation\nI0129 06:09:15.448458    4868 cleanup.go:159] removed services: [\"kube-proxy\" \"csi-proxy\" \"hybrid-overlay-node\" \"windows_exporter\" \"kubelet\" \"containerd\"]\n"}
      
      kubelet log:
      I0129 06:06:16.789966    4484 kubelet.go:2393] "Container runtime status" status="Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
      E0129 06:06:16.789966    4484 kubelet.go:2396] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
      
      The node's Ready condition reports:
                          {
                              "lastHeartbeatTime": "2024-01-29T06:07:53Z",
                              "lastTransitionTime": "2024-01-29T05:33:40Z",
                              "message": "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized",
                              "reason": "KubeletNotReady",
                              "status": "False",
                              "type": "Ready"
                          }
          

      Expected results:

      The Windows Node is NotReady for less than 5 minutes before returning to Ready.
          

      Additional info:

      This has only been observed since https://github.com/openshift/windows-machine-config-operator/pull/1979 was merged.
      Manual workaround: SSH into the Windows VM, delete /k/cni/config/cni.conf, and wait a few minutes.
          


            OpenShift Jira Automation Bot added a comment -

            Per the announcement sent regarding the removal of "Blocker" as an option in the Priority field, this issue (which was already closed at the time of the bulk update) had Priority = "Blocker." It is being updated to Priority = Critical. No additional fields were changed.

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: Red Hat OpenShift for Windows Containers 9.0.1 security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:1203


            Aharon Rasouli added a comment -

            Verified on 9.0.1-9038172: the Windows node reached Ready status after 1 minute.

            OpenShift Jira Bot added a comment -

            Hi sebsoto,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            Russell Teague added a comment -

            In builds:
            Build NVR (bundle)  = windows-machine-config-operator-bundle-container-v9.0.1-15
            Build NVR           = windows-machine-config-operator-container-9.0.1-16

            GitLab CEE Bot added a comment -

            CPaaS Service Account mentioned this issue in a merge request of openshift-winc-midstream / openshift-winc-midstream on branch rhaos-4.14-rhel-9_upstream_31d3f047dcb14dc95f497a8c688650ec:

            Updated US source to: 9038172 Merge pull request #2053 from openshift-cherrypick-robot/cherry-pick-2052-to-release-4.14

              Sebastian Soto (rh-ee-ssoto)
              OpenShift Prow Bot (openshift-crt-jira-prow)
              Weinan Liu
              Votes: 0
              Watchers: 8

                Created:
                Updated:
                Resolved: