OpenShift Bugs · OCPBUGS-1035

Kubelet "flapping" after OSImageURL override/rpm-ostree rebase


    • Moderate

      Description of problem:

      Overriding OSImageURL via MachineConfig sometimes results in the kubelet "flapping": it starts up, times out, then starts up again.

      This appears to correlate with a mismatch between the base image used for the override container and the base image the cluster is using, but I can't prove exactly how yet.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Able to reproduce every time I've tried overriding with an "older" image.

      I can't reproduce it with an image based on the `rhel-coreos-8` image the cluster was built with.
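One way to test the suspected base-image correlation (a sketch, not a confirmed diagnostic): an override image built FROM the cluster's own base should carry the base image's layer digests as a prefix of its own layer list. The `skopeo`/`jq` pipeline in the comments and the digests below are assumptions for illustration.

```shell
#!/bin/bash
# Succeeds when every layer of the base image appears, in order, at the
# start of the derived (override) image's layer list.
layers_share_base() {
  base_file=$1 derived_file=$2
  head -n "$(wc -l < "$base_file")" "$derived_file" > derived-prefix.txt
  cmp -s "$base_file" derived-prefix.txt
}

# In a real check the layer lists would come from something like:
#   skopeo inspect docker://"$(oc adm release info --image-for=rhel-coreos-8)" | jq -r '.Layers[]'
#   skopeo inspect docker://quay.io/jkyros/derived-images:<tag> | jq -r '.Layers[]'
# Illustrative (made-up) digests stand in here:
printf 'sha256:aaa\nsha256:bbb\n'             > base-layers.txt
printf 'sha256:aaa\nsha256:bbb\nsha256:ccc\n' > same-base.txt
printf 'sha256:old\nsha256:bbb\nsha256:ccc\n' > older-base.txt

layers_share_base base-layers.txt same-base.txt  && echo "same-base: built from cluster base"
layers_share_base base-layers.txt older-base.txt || echo "older-base: different (older?) base"
```

If the override image's leading layers differ from the cluster base's, the rebase pulls in different (possibly older) package content, which would line up with the downgrades noted below.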

      Steps to Reproduce:

      1. Build an OpenShift 4.12 cluster using a nightly
      2. Apply a machineconfig like the following: 
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 99-external-image-worker
      spec:
        osImageURL: "quay.io/jkyros/derived-images:ybettan-image-that-makes-kubelet-flap"
      
      (The Dockerfile for the image is https://github.com/ybettan/coreos-layering-driver-toolkit/blob/e27b82a47dae3c46f8dfcfb9c4c8dc6c9322fc73/in-an-ocp-cluster/container-image/Dockerfile#L13; all it does is add some files to /etc/, but you can build your own, just FROM something older than your current cluster's base image.)
      
      3. Wait for your nodes to rebase to the new image and reboot 
      4. Watch for your nodes to pop in and out of "NotReady": 
      
       while true; do sleep 2; oc get node | grep NotReady; done
      
      5. Kubelet log seems to indicate repeated timeouts and restarts: 
      
      Sep 07 17:03:46.444712 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: start operation timed out. Terminating.
      Sep 07 17:03:46.445422 jkyros-with-extension-ldk7b-master-2 kubenswrapper[1561]: I0907 17:03:46.445394    1561 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
      Sep 07 17:03:46.469193 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Failed with result 'timeout'.
      Sep 07 17:03:46.469849 jkyros-with-extension-ldk7b-master-2 systemd[1]: Failed to start Kubernetes Kubelet.
      Sep 07 17:03:46.470099 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Consumed 9.343s CPU time
      Sep 07 17:03:56.694691 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
      Sep 07 17:03:56.695180 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
      Sep 07 17:03:56.695948 jkyros-with-extension-ldk7b-master-2 systemd[1]: Stopped Kubernetes Kubelet.
      Sep 07 17:03:56.696036 jkyros-with-extension-ldk7b-master-2 systemd[1]: kubelet.service: Consumed 0 CPU time
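The log above shows one full timeout/restart cycle. To gauge how often a node is flapping, one could count such cycles in the kubelet journal; a minimal sketch, fed here by lines from the excerpt above (on a live node the input would come from `journalctl -u kubelet`):

```shell
#!/bin/bash
# Count kubelet flap cycles: each "Scheduled restart job" line from
# systemd marks one restart after a failed start.
count_flaps() {
  grep -c 'kubelet.service: Scheduled restart job'
}

# On a node (e.g. via `oc debug node/<node> -- chroot /host`):
#   journalctl -u kubelet --no-pager | count_flaps
# Sample input taken from the log excerpt above:
count_flaps <<'EOF'
Sep 07 17:03:46 ... systemd[1]: kubelet.service: start operation timed out. Terminating.
Sep 07 17:03:46 ... systemd[1]: kubelet.service: Failed with result 'timeout'.
Sep 07 17:03:56 ... systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Sep 07 17:03:56 ... systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
EOF
```

A steadily climbing restart counter in the real journal would confirm the flapping rather than a one-off failed start.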
        

      Actual results:

      Kubelet will "flap", with nodes going back and forth between Ready and NotReady.

      Expected results:

      Kubelet is stable. 

      Additional info:

      I don't know if it's a red herring, but every time this happens, packages have been downgraded. I haven't narrowed it down to a specific package yet, and I don't understand the exact mechanism. The kubelet itself doesn't get downgraded, but I could see cri-o being relevant.
      
      Staging deployment...done
      Downgraded:
        NetworkManager 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-cloud-setup 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-libnm 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-ovs 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-team 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        NetworkManager-tui 1:1.36.0-8.el8_6 -> 1:1.36.0-7.el8_6
        containers-common 2:1-27.rhaos4.12.el8 -> 2:1-22.rhaos4.11.el8
        cri-o 1.25.0-51.rhaos4.12.git315a0cb.el8 -> 1.25.0-33.rhaos4.12.gitda7b5b1.el8
        cri-tools 1.25.0-1.el8 -> 1.24.2-5.el8
        curl 7.61.1-22.el8_6.4 -> 7.61.1-22.el8_6.3
        libcurl 7.61.1-22.el8_6.4 -> 7.61.1-22.el8_6.3
        open-vm-tools 11.3.5-1.el8_6.1 -> 11.3.5-1.el8
        openshift-clients 4.12.0-202209061108.p0.g427ed14.assembly.stream.el8 -> 4.12.0-202208191215.p0.ge0f8e21.assembly.stream.el8
        openshift-hyperkube 4.12.0-202209022108.p0.gebabf6d.assembly.stream.el8 -> 4.12.0-202208161547.p0.ged93380.assembly.stream.el8
        openvswitch2.17 2.17.0-37.1.el8fdp -> 2.17.0-31.el8fdp
        ostree 2022.2-5.el8 -> 2022.1-2.el8
        ostree-grub2 2022.2-5.el8 -> 2022.1-2.el8
        ostree-libs 2022.2-5.el8 -> 2022.1-2.el8
        podman 2:4.2.0-1.rhaos4.12.el8 -> 2:4.0.2-6.rhaos4.11.el8
        podman-catatonit 2:4.2.0-1.rhaos4.12.el8 -> 2:4.0.2-6.rhaos4.11.el8
        rsync 3.1.3-14.el8_6.3 -> 3.1.3-14.el8_6.2
        selinux-policy 3.14.3-95.el8_6.4 -> 3.14.3-95.el8_6.1
        selinux-policy-targeted 3.14.3-95.el8_6.4 -> 3.14.3-95.el8_6.1
        systemd 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-journal-remote 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-libs 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-pam 239-58.el8_6.4 -> 239-58.el8_6.3
        systemd-udev 239-58.el8_6.4 -> 239-58.el8_6.3
        tzdata 2022c-1.el8 -> 2022a-1.el8
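To narrow the downgrade list to the packages most plausibly involved in kubelet startup (cri-o, cri-tools, openshift-hyperkube, selinux-policy*), one could filter the "Downgraded:" section of the rpm-ostree output. A sketch, fed by a trimmed copy of the output above; on a node the input would come from an `rpm-ostree db diff` between the old and new deployments:

```shell
#!/bin/bash
# Print runtime-relevant package names from the "Downgraded:" section
# of rpm-ostree output; the filter list is an assumption based on the
# packages in the diff above.
filter_runtime_downgrades() {
  awk '/^Downgraded:/ { in_d = 1; next }
       /^[^ ]/        { in_d = 0 }
       in_d && $1 ~ /^(cri-o|cri-tools|openshift-hyperkube|selinux-policy)/ { print $1 }'
}

# Trimmed sample from the staging output above:
filter_runtime_downgrades <<'EOF'
Staging deployment...done
Downgraded:
  cri-o 1.25.0-51.rhaos4.12.git315a0cb.el8 -> 1.25.0-33.rhaos4.12.gitda7b5b1.el8
  cri-tools 1.25.0-1.el8 -> 1.24.2-5.el8
  openshift-hyperkube 4.12.0-202209022108.p0.gebabf6d.assembly.stream.el8 -> 4.12.0-202208161547.p0.ged93380.assembly.stream.el8
  systemd 239-58.el8_6.4 -> 239-58.el8_6.3
EOF
```

Rebuilding the override image with each of these pinned back to the cluster's version, one at a time, might isolate which downgrade actually triggers the flapping.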

            jkyros@redhat.com John Kyros
            Rio Liu