OpenShift Node / OCPNODE-1845

[UPSTREAM] Fix Evented PLEG issue in Kubelet


    • Type: Story
    • Resolution: Done
    • Priority: Major
    • Upstream
    • 5
    • OCPSTRAT-296 - Openshift Kubelet: Pod Lifecycle Event Generator (PLEG)
    • OCPNODE Sprint 243 (Blue)

      Enabling the Evented PLEG feature gate via the Machine Config Operator (MCO) results in pods going into the "CrashLoopBackOff" or "Error" state.
      MCO Branch: https://github.com/openshift/machine-config-operator/pull/3917/files 

      The pods go into the CrashLoopBackOff state because duplicate containers are created and started within the same pod, and those containers then race to acquire the same resources (ports).

      Example: the "bind: address already in use" error is observed on many pods.

      ci-ln-09tlpi2-72292-flgxf-master-2.log:264184:Sep 18 15:39:10.886241 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         time="2023-09-18T15:39:01Z" level=fatal msg="failed to create listener: failed to listen on 0.0.0.0:5443: listen tcp 0.0.0.0:5443: bind: address already in use"
      ci-ln-09tlpi2-72292-flgxf-master-2.log:264236:Sep 18 15:39:11.154052 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         F0918 15:38:21.629025       1 cmd.go:56] failed to create listener: failed to listen on 0.0.0.0:6443: listen tcp 0.0.0.0:6443: bind: address already in use
      ci-ln-09tlpi2-72292-flgxf-master-2.log:265167:Sep 18 15:39:21.155012 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         F0918 15:38:23.445192       1 standalone_apiserver.go:120] listen tcp 0.0.0.0:8443: bind: address already in use
      ci-ln-09tlpi2-72292-flgxf-master-2.log:265182:Sep 18 15:39:21.155012 ci-ln-09tlpi2-72292-flgxf-master-2 kubenswrapper[2308]:         E0918 15:38:54.680976       1 run.go:74] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp :8443: bind: address already in use"

      The above issue has been identified and root-caused in https://issues.redhat.com/browse/OCPNODE-1818 

      This story covers fixing that issue and also fixing the flakiness of the job https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-evented-pleg 

      Test PR with a potential fix - https://github.com/kubernetes/kubernetes/pull/120480 

              svanka@redhat.com Sai Ramesh Vanka