OpenShift Bugs · OCPBUGS-1035

Kubelet "flapping" after OSImageURL override/rpm-ostree rebase


    • Moderate

      Description of problem:

      Overriding OSImageURL via MachineConfig sometimes results in the kubelet "flapping": it starts up, times out, then starts up again.

      This appears to correlate with a mismatch between the base image used for the override container and the base image the cluster is using, but I can't prove exactly how yet.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Able to reproduce every time I've tried overriding with an "older" image.

      I can't reproduce it with an image based on the `rhel-coreos-8` image the cluster was built with.
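One way to test the suspected base-image correlation (a sketch, not a confirmed diagnostic): an override image built FROM the cluster's own base should carry the base image's layer digests as a prefix of its own layer list. The `skopeo`/`jq` pipeline in the comments and the digests below are assumptions for illustration.

```shell
#!/bin/bash
# Succeeds when every layer of the base image appears, in order, at the
# start of the derived (override) image's layer list.
layers_share_base() {
  base_file=$1 derived_file=$2
  head -n "$(wc -l < "$base_file")" "$derived_file" > derived-prefix.txt
  cmp -s "$base_file" derived-prefix.txt
}

# In a real check the layer lists would come from something like:
#   skopeo inspect docker://"$(oc adm release info --image-for=rhel-coreos-8)" | jq -r '.Layers[]'
#   skopeo inspect docker://quay.io/jkyros/derived-images:<tag> | jq -r '.Layers[]'
# Illustrative (made-up) digests stand in here:
printf 'sha256:aaa\nsha256:bbb\n'             > base-layers.txt
printf 'sha256:aaa\nsha256:bbb\nsha256:ccc\n' > same-base.txt
printf 'sha256:old\nsha256:bbb\nsha256:ccc\n' > older-base.txt

layers_share_base base-layers.txt same-base.txt  && echo "same-base: built from cluster base"
layers_share_base base-layers.txt older-base.txt || echo "older-base: different (older?) base"
```

If the override image's leading layers differ from the cluster base's, the rebase pulls in different (possibly older) package content, which would line up with the downgrades noted below.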

      Steps to Reproduce:

      1. Build an OpenShift 4.12 cluster using a nightly
      2. Apply a machineconfig like the following: 
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 99-external-image-worker
      spec:
        osImageURL: "quay.io/jkyros/derived-images:ybettan-image-that-makes-kubelet-flap"
      
      (The Dockerfile for the image is https://github.com/ybettan/coreos-layering-driver-toolkit/blob/e27b82a47dae3c46f8dfcfb9c4c8dc6c9322fc73/in-an-ocp-cluster/container-image/Dockerfile#L13; all it does is add some files to /etc/, but you can build your own, just FROM something older than your current cluster's base image.)
      
      3. Wait for your nodes to rebase to the new image and reboot 
      4. Watch for your nodes to pop in and out of "NotReady": 
      
       while true; do sleep 2; oc get node | grep NotReady; done
      
      5. Kubelet log seems to indicate repeated timeouts and restarts: 
      
      Sep 07 17:03:46.444712 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: start operation timed out. Terminating.
      Sep 07 17:03:46.445422 jkyros-with-extension-ldk7b-master-2 kubenswrapper[1561]: I0907 17:03:46.445394    1561 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
      Sep 07 17:03:46.469193 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Failed with result 'timeout'.
      Sep 07 17:03:46.469849 jkyros-with-extension-ldk7b-master-2 systemd[1]: Failed to start Kubernetes Kubelet.
      Sep 07 17:03:46.470099 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Consumed 9.343s CPU time
      Sep 07 17:03:56.694691 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
      Sep 07 17:03:56.695180 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
      Sep 07 17:03:56.695948 jkyros-with-extension-ldk7b-master-2 systemd[1]: Stopped Kubernetes Kubelet.
      Sep 07 17:03:56.696036 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Consumed 0 CPU time
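The log above shows one full timeout/restart cycle. To gauge how often a node is flapping, one could count such cycles in the kubelet journal; a minimal sketch, fed here by lines from the excerpt above (on a live node the input would come from `journalctl -u kubelet`):

```shell
#!/bin/bash
# Count kubelet flap cycles: each "Scheduled restart job" line from
# systemd marks one restart after a failed start.
count_flaps() {
  grep -c 'kubelet.service: Scheduled restart job'
}

# On a node (e.g. via `oc debug node/<node> -- chroot /host`):
#   journalctl -u kubelet --no-pager | count_flaps
# Sample input taken from the log excerpt above:
count_flaps <<'EOF'
Sep 07 17:03:46 ... systemd[1]: kubelet.service: start operation timed out. Terminating.
Sep 07 17:03:46 ... systemd[1]: kubelet.service: Failed with result 'timeout'.
Sep 07 17:03:56 ... systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Sep 07 17:03:56 ... systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
EOF
```

A steadily climbing restart counter in the real journal would confirm the flapping rather than a one-off failed start.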
        

      Actual results:

      Kubelet will "flap", with nodes going back and forth between Ready and NotReady.

      Expected results:

      Kubelet is stable. 

      Additional info:

      I don't know if it's a red herring, but every time this happens, packages have been downgraded. I haven't narrowed it down to a specific package yet, and I don't understand the exact mechanism. The kubelet itself doesn't get downgraded, but I could see cri-o being relevant.
      
      Staging deployment...done
      Downgraded:
        NetworkManager 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-cloud-setup 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-libnm 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-ovs 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-team 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-tui 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        containers-common 2:1-27.rhaos4.12.el8 -> 2:1-22.rhaos4.11.el8
        cri-o 1.25.0-51.rhaos4.12.git315a0cb.el8 -> 1.25.0-33.rhaos4.12.gitda7b5b1.el8
        cri-tools 1.25.0-1.el8 -> 1.24.2-5.el8
        curl 7.61.1-22.el8_6.4 -> 7.61.1-22.el8_6.3
        libcurl 7.61.1-22.el8_6.4 -> 7.61.1-22.el8_6.3
        open-vm-tools 11.3.5-1.el8_6.1 -> 11.3.5-1.el8
        openshift-clients 4.12.0-202209061108.p0.g427ed14.assembly.stream.el8 -> 4.12.0-202208191215.p0.ge0f8e21.assembly.stream.el8
        openshift-hyperkube 4.12.0-202209022108.p0.gebabf6d.assembly.stream.el8 -> 4.12.0-202208161547.p0.ged93380.assembly.stream.el8
        openvswitch2.17 2.17.0-37.1.el8fdp -> 2.17.0-31.el8fdp
        ostree 2022.2-5.el8 -> 2022.1-2.el8
        ostree-grub2 2022.2-5.el8 -> 2022.1-2.el8
        ostree-libs 2022.2-5.el8 -> 2022.1-2.el8
        podman 2:4.2.0-1.rhaos4.12.el8 -> 2:4.0.2-6.rhaos4.11.el8
        podman-catatonit 2:4.2.0-1.rhaos4.12.el8 -> 2:4.0.2-6.rhaos4.11.el8
        rsync 3.1.3-14.el8_6.3 -> 3.1.3-14.el8_6.2
        selinux-policy 3.14.3-95.el8_6.4 -> 3.14.3-95.el8_6.1
        selinux-policy-targeted 3.14.3-95.el8_6.4 -> 3.14.3-95.el8_6.1
        systemd 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-journal-remote 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-libs 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-pam 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-udev 239-58.el8_6.4 -> 239-58.el8_6.3
        tzdata 2022c-1.el8 -> 2022a-1.el8
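To narrow the downgrade list to the packages most plausibly involved in kubelet startup (cri-o, cri-tools, openshift-hyperkube, selinux-policy*), one could filter the "Downgraded:" section of the rpm-ostree output. A sketch, fed by a trimmed copy of the output above; on a node the input would come from an `rpm-ostree db diff` between the old and new deployments:

```shell
#!/bin/bash
# Print runtime-relevant package names from the "Downgraded:" section
# of rpm-ostree output; the filter list is an assumption based on the
# packages in the diff above.
filter_runtime_downgrades() {
  awk '/^Downgraded:/ { in_d = 1; next }
       /^[^ ]/        { in_d = 0 }
       in_d && $1 ~ /^(cri-o|cri-tools|openshift-hyperkube|selinux-policy)/ { print $1 }'
}

# Trimmed sample from the staging output above:
filter_runtime_downgrades <<'EOF'
Staging deployment...done
Downgraded:
  cri-o 1.25.0-51.rhaos4.12.git315a0cb.el8 -> 1.25.0-33.rhaos4.12.gitda7b5b1.el8
  cri-tools 1.25.0-1.el8 -> 1.24.2-5.el8
  openshift-hyperkube 4.12.0-202209022108.p0.gebabf6d.assembly.stream.el8 -> 4.12.0-202208161547.p0.ged93380.assembly.stream.el8
  systemd 239-58.el8_6.4 -> 239-58.el8_6.3
EOF
```

Rebuilding the override image with each of these pinned back to the cluster's version, one at a time, might isolate which downgrade actually triggers the flapping.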

            jkyros@redhat.com John Kyros
            Rio Liu