  1. OpenShift Node
  2. OCPNODE-1845

[UPSTREAM] Fix Evented PLEG issue in Kubelet


    • Type: Story
    • Resolution: Done
    • Priority: Major
    • Sprint: OCPNODE Sprint 243 (Blue)

      Enabling the Evented PLEG feature gate via the Machine Config Operator (MCO) causes pods to go into the "CrashLoopBackOff" or "Error" state.
      MCO Branch: https://github.com/openshift/machine-config-operator/pull/3917/files 

      The pods go into CrashLoopBackOff because duplicate containers are created and started within the same pod, and these duplicates then race to acquire the same resources (ports).

      Example: the "bind: address already in use" error is observed on many pods.

      ci-ln-09tlpi2-72292-flgxf-master-2.log:264184:Sep 18 15:39:10.886241 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         time="2023-09-18T15:39:01Z" level=fatal msg="failed to create listener: failed to listen on 0.0.0.0:5443: listen tcp 0.0.0.0:5443: bind: address already in use"
      ci-ln-09tlpi2-72292-flgxf-master-2.log:264236:Sep 18 15:39:11.154052 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         F0918 15:38:21.629025       1 cmd.go:56] failed to create listener: failed to listen on 0.0.0.0:6443: listen tcp 0.0.0.0:6443: bind: address already in use
      ci-ln-09tlpi2-72292-flgxf-master-2.log:265167:Sep 18 15:39:21.155012 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         F0918 15:38:23.445192       1 standalone_apiserver.go:120] listen tcp 0.0.0.0:8443: bind: address already in use
      ci-ln-09tlpi2-72292-flgxf-master-2.log:265182:Sep 18 15:39:21.155012 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         E0918 15:38:54.680976       1 run.go:74] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp :8443: bind: address already in use"
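
      The symptom in the logs above can be reproduced in isolation. The following is a minimal sketch, not kubelet or CRI-O code: it assumes a Linux host, and the helper names bindTwice and addrInUse are hypothetical. It binds a TCP port once and then tries to bind the same address again, which is what a duplicate container does when it starts inside the same pod network namespace as the original.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// bindTwice (hypothetical helper) opens a TCP listener on an ephemeral
// loopback port, then tries to bind the very same address a second time,
// the way a duplicate container in the same pod would.
// It returns the error from the second bind, or nil if that bind succeeded.
func bindTwice() error {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return fmt.Errorf("first listen failed unexpectedly: %w", err)
	}
	defer first.Close()

	// The first listener is still open, so this bind is expected to fail.
	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		second.Close()
		return nil
	}
	return err
}

// addrInUse (hypothetical helper) reports whether err wraps EADDRINUSE,
// the errno behind "bind: address already in use".
func addrInUse(err error) bool {
	return errors.Is(err, syscall.EADDRINUSE)
}

func main() {
	err := bindTwice()
	fmt.Println("second bind error:", err)
	fmt.Println("address already in use:", addrInUse(err))
}
```

      Containers in a pod share one network namespace, so whichever copy of a container binds the port first wins and the other exits with this error, which matches the fatal messages for ports 5443, 6443 and 8443 above.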

      The above issue has been identified and root-caused in https://issues.redhat.com/browse/OCPNODE-1818 

      Fix this issue, and also fix the flakiness of the job - https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-evented-pleg 

      Test PR with a potential fix - https://github.com/kubernetes/kubernetes/pull/120480 

            svanka@redhat.com Sai Ramesh Vanka