-
Bug
-
Resolution: Not a Bug
-
Normal
-
None
-
4.12
-
Moderate
-
None
-
False
-
Description of problem:
Overriding OSImageURL via MachineConfig sometimes seems to result in the kubelet "flapping": it starts up, times out, and starts up again. This seems to correlate with a mismatch between the base image used for the override container and the base image the cluster is using, but I can't prove exactly how yet.
Version-Release number of selected component (if applicable):
How reproducible:
Able to reproduce every time I've tried overriding with an "older" image. I can't reproduce it on an image based on the `rhel-coreos-8` image the cluster was built with.
Steps to Reproduce:
1. Build an OpenShift 4.12 cluster using a nightly
2. Apply a machineconfig like the following:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 99-external-image-worker
    spec:
      osImageURL: "quay.io/jkyros/derived-images:ybettan-image-that-makes-kubelet-flap"

   (The Dockerfile for the image should be https://github.com/ybettan/coreos-layering-driver-toolkit/blob/e27b82a47dae3c46f8dfcfb9c4c8dc6c9322fc73/in-an-ocp-cluster/container-image/Dockerfile#L13. All it does is add some files to /etc/, but you can build your own, just FROM something older than your current cluster.)
3. Wait for your nodes to rebase to the new image and reboot
4. Watch for your nodes to pop in and out of "NotReady":

    while true ; do sleep 2; oc get node | grep NotReady ; done

5. Kubelet log seems to indicate repeated timeouts and restarts:

    Sep 07 17:03:46.444712 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: start operation timed out. Terminating.
    Sep 07 17:03:46.445422 jkyros-with-extension-ldk7b-master-2 kubenswrapper[1561]: I0907 17:03:46.445394 1561 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
    Sep 07 17:03:46.469193 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Failed with result 'timeout'.
    Sep 07 17:03:46.469849 jkyros-with-extension-ldk7b-master-2 systemd[1]: Failed to start Kubernetes Kubelet.
    Sep 07 17:03:46.470099 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Consumed 9.343s CPU time
    Sep 07 17:03:56.694691 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
    Sep 07 17:03:56.695180 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
    Sep 07 17:03:56.695948 jkyros-with-extension-ldk7b-master-2 systemd[1]: Stopped Kubernetes Kubelet.
    Sep 07 17:03:56.696036 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Consumed 0 CPU time
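To put a number on the flapping seen in the kubelet log, a small script along these lines could count start timeouts and track the systemd restart counter. This is only a rough sketch: the function name `summarize_kubelet_flaps` is made up for illustration, and the patterns are based solely on the log lines quoted in this report, so they may not match other systemd versions.

```python
import re

# Sample lines copied from the kubelet log quoted in this report.
SAMPLE_LOG = """\
Sep 07 17:03:46.444712 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: start operation timed out. Terminating.
Sep 07 17:03:46.469193 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Failed with result 'timeout'.
Sep 07 17:03:56.695180 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
"""

# Patterns derived from the messages above (an assumption, not a stable format).
TIMEOUT_RE = re.compile(r"kubelet\.service: start operation timed out")
RESTART_RE = re.compile(
    r"kubelet\.service: Scheduled restart job, restart counter is at (\d+)"
)


def summarize_kubelet_flaps(log_text):
    """Return (number of start timeouts, highest restart counter seen)."""
    timeouts = 0
    max_restart = 0
    for line in log_text.splitlines():
        if TIMEOUT_RE.search(line):
            timeouts += 1
        match = RESTART_RE.search(line)
        if match:
            max_restart = max(max_restart, int(match.group(1)))
    return timeouts, max_restart


if __name__ == "__main__":
    timeouts, restarts = summarize_kubelet_flaps(SAMPLE_LOG)
    print(f"start timeouts: {timeouts}, highest restart counter: {restarts}")
```

Text saved from `journalctl -u kubelet` on an affected node could be passed to the same function in place of the sample.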
Actual results:
Kubelet will "flap", going back and forth between available and not available.
Expected results:
Kubelet is stable.
Additional info:
I don't know if it's a red herring, but every time this seems to happen, packages have been downgraded. I haven't narrowed it down to a specific package yet, and I don't understand the exact mechanism. The kubelet itself doesn't get downgraded, but I could see cri-o being relevant.

    Staging deployment...done
    Downgraded:
      NetworkManager 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
      NetworkManager-cloud-setup 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
      NetworkManager-libnm 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
      NetworkManager-ovs 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
      NetworkManager-team 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
      NetworkManager-tui 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
      containers-common 2:1-27.rhaos4.12.el8 -> 2:1-22.rhaos4.11.el8
      cri-o 1.25.0-51.rhaos4.12.git315a0cb.el8 -> 1.25.0-33.rhaos4.12.gitda7b5b1.el8
      cri-tools 1.25.0-1.el8 -> 1.24.2-5.el8
      curl 7.61.1-22.el8_6.4 -> 7.61.1-22.el8_6.3
      libcurl 7.61.1-22.el8_6.4 -> 7.61.1-22.el8_6.3
      open-vm-tools 11.3.5-1.el8_6.1 -> 11.3.5-1.el8
      openshift-clients 4.12.0-202209061108.p0.g427ed14.assembly.stream.el8 -> 4.12.0-202208191215.p0.ge0f8e21.assembly.stream.el8
      openshift-hyperkube 4.12.0-202209022108.p0.gebabf6d.assembly.stream.el8 -> 4.12.0-202208161547.p0.ged93380.assembly.stream.el8
      openvswitch2.17 2.17.0-37.1.el8fdp -> 2.17.0-31.el8fdp
      ostree 2022.2-5.el8 -> 2022.1-2.el8
      ostree-grub2 2022.2-5.el8 -> 2022.1-2.el8
      ostree-libs 2022.2-5.el8 -> 2022.1-2.el8
      podman 2:4.2.0-1.rhaos4.12.el8 -> 2:4.0.2-6.rhaos4.11.el8
      podman-catatonit 2:4.2.0-1.rhaos4.12.el8 -> 2:4.0.2-6.rhaos4.11.el8
      rsync 3.1.3-14.el8_6.3 -> 3.1.3-14.el8_6.2
      selinux-policy 3.14.3-95.el8_6.4 -> 3.14.3-95.el8_6.1
      selinux-policy-targeted 3.14.3-95.el8_6.4 -> 3.14.3-95.el8_6.1
      systemd 239-58.el8_6.4 -> 239-58.el8_6.3
      systemd-journal-remote 239-58.el8_6.4 -> 239-58.el8_6.3
      systemd-libs 239-58.el8_6.4 -> 239-58.el8_6.3
      systemd-pam 239-58.el8_6.4 -> 239-58.el8_6.3
      systemd-udev 239-58.el8_6.4 -> 239-58.el8_6.3
      tzdata 2022c-1.el8 -> 2022a-1.el8
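To help narrow the downgrade list to likely suspects, something like the following could parse the "Downgraded:" output into (package, old, new) entries and flag packages whose version tag moves from one `rhaosX.Y` stream to another. This is a sketch under assumptions: the helper names `parse_downgrades` and `stream_changed` are hypothetical, the sample is a subset of the list in this report, and comparing `rhaos` tags is only a heuristic, not a statement about which package causes the flapping.

```python
import re

# A subset of the downgrade list quoted in this report.
SAMPLE = """\
Downgraded:
  containers-common 2:1-27.rhaos4.12.el8 -> 2:1-22.rhaos4.11.el8
  cri-o 1.25.0-51.rhaos4.12.git315a0cb.el8 -> 1.25.0-33.rhaos4.12.gitda7b5b1.el8
  cri-tools 1.25.0-1.el8 -> 1.24.2-5.el8
  systemd 239-58.el8_6.4 -> 239-58.el8_6.3
"""

# "name old -> new"; works whether entries are on separate lines or one line.
ENTRY_RE = re.compile(r"(\S+)\s+(\S+)\s+->\s+(\S+)")
# Release-stream tag embedded in some version strings, e.g. "rhaos4.12".
STREAM_RE = re.compile(r"rhaos(\d+\.\d+)")


def parse_downgrades(text):
    """Return a list of (package, old_version, new_version) tuples."""
    return ENTRY_RE.findall(text)


def stream_changed(old_version, new_version):
    """True when both versions carry an rhaos tag and the tags differ."""
    old = STREAM_RE.search(old_version)
    new = STREAM_RE.search(new_version)
    return bool(old and new and old.group(1) != new.group(1))


if __name__ == "__main__":
    for name, old, new in parse_downgrades(SAMPLE):
        marker = "  <-- crosses rhaos stream" if stream_changed(old, new) else ""
        print(f"{name}: {old} -> {new}{marker}")
```

In the full list from this report, the same check would also flag `podman` and `podman-catatonit`, which go from an `rhaos4.12` build to an `rhaos4.11` build.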
- relates to
-
MCO-298 [spike] layering integration / preflight checks design
- To Do