OpenShift Node / OCPNODE-3245

Impact: restorecon failure in OCP 4.18 prevents kubelet from starting


    • Type: Spike
    • Resolution: Done
    • Priority: Critical

      Which 4.y.z to 4.y'.z' updates increase vulnerability?

      • All clusters updating to any version from 4.18.0 through 4.18.12
      • All clusters already running any version from 4.18.0 through 4.18.12
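To check whether a given version falls in the affected range, a small shell sketch can compare versions with GNU `sort -V`. The function names `version_le` and `in_affected_range` are hypothetical, and the sketch assumes plain x.y.z release versions; pre-release suffixes are not handled specially.

```shell
#!/usr/bin/env bash
# Sketch only: assumes plain x.y.z versions and GNU coreutils `sort -V`.
# Pre-release suffixes (e.g. "-rc.1") are not handled specially.

# version_le A B: succeeds if version A sorts at or before version B.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# in_affected_range VERSION: succeeds for 4.18.0 <= VERSION <= 4.18.12,
# the range listed in this issue.
in_affected_range() {
  version_le 4.18.0 "$1" && version_le "$1" 4.18.12
}
```

For example, `in_affected_range "$(oc get clusterversion version -o jsonpath='{.status.desired.version}')" && echo affected` checks the version the cluster is running or updating to; the same function can be run against an intended update target before starting an update.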

      Which types of clusters?

      • The root cause is that restorecon of /var/lib/kubelet, which the kubelet unit runs as an ExecStartPre step, takes longer than the 90-second start timeout, so systemd fails the kubelet start. It is not yet clear which conditions make relabeling that path take longer than 90 seconds; the likely candidates are a high number of mounted volumes or less performant storage backing the filesystem where /var/lib/kubelet resides.
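One way to probe the volume-count hypothesis on an affected node is to count the mount points at or below /var/lib/kubelet. This sketch (the function name `count_mounts_under` is hypothetical) reads /proc/self/mountinfo, whose fifth field is the mount point; it assumes paths without spaces, since mountinfo escapes those as \040.

```shell
#!/usr/bin/env bash
# Sketch only: counts mount points equal to PREFIX or nested below it.
# Field 5 of mountinfo is the mount point. Escaped characters such as
# "\040" (space) are compared raw, which is assumed fine for kubelet paths.

# count_mounts_under PREFIX [MOUNTINFO_FILE]
count_mounts_under() {
  local prefix=${1%/}                          # drop one trailing slash
  local mountinfo=${2:-/proc/self/mountinfo}   # default: this shell's mounts
  awk -v p="$prefix" \
    '$5 == p || index($5, p "/") == 1 { n++ } END { print n + 0 }' \
    "$mountinfo"
}
```

Run as root on the node itself, for example `count_mounts_under /var/lib/kubelet`. For a rough timing of the relabel walk without changing any labels, `time /usr/sbin/restorecon -rn /var/lib/kubelet` can be compared against the 90-second timeout; because `-n` is a dry run, the real relabel may take somewhat longer.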

      What is the impact? Is it serious enough to warrant removing update recommendations?

      • A node that hits this bug stays NotReady until it is remediated, because kubelet never starts.

      How involved is remediation?

      • Manually relabel /var/lib/kubelet by running `/usr/sbin/restorecon -rv /var/lib/kubelet/ /usr/local/bin/kubenswrapper /usr/bin/kubensenter` from an interactive shell, where the kubelet start timeout does not apply. However, even if the path is relabeled successfully, this may not fully resolve the problem.
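      • Alternatively, create a systemd drop-in at /etc/systemd/system/kubelet.service.d/99-narrow-restorecon.conf with the following content. The empty ExecStartPre= line clears the previously defined ExecStartPre commands, and the replacement restorecon call relabels only /var/lib/kubelet/pod-resources and the two binaries instead of all of /var/lib/kubelet:
          [Service]
          ExecStartPre=
          ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
          ExecStartPre=-/usr/sbin/restorecon -ri /var/lib/kubelet/pod-resources /usr/local/bin/kubenswrapper /usr/bin/kubensenter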
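      • The same drop-in can be delivered through a MachineConfig before updating to 4.18. The example below targets the worker pool, and its base64 payload decodes to the drop-in content above; other pools, such as master, need a separate MachineConfig with the matching role label.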
      ---
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 99-worker-narrow-restorecon
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
            - contents:
                source: data:text/plain;charset=utf-8;base64,W1NlcnZpY2VdCkV4ZWNTdGFydFByZT0KRXhlY1N0YXJ0UHJlPS9iaW4vbWtkaXIgLS1wYXJlbnRzIC9ldGMva3ViZXJuZXRlcy9tYW5pZmVzdHMKRXhlY1N0YXJ0UHJlPS0vdXNyL3NiaW4vcmVzdG9yZWNvbiAtcmkgL3Zhci9saWIva3ViZWxldC9wb2QtcmVzb3VyY2VzIC91c3IvbG9jYWwvYmluL2t1YmVuc3dyYXBwZXIgL3Vzci9iaW4va3ViZW5zZW50ZXIK
              mode: 0640
              overwrite: true
              path: /etc/systemd/system/kubelet.service.d/99-narrow-restorecon.conf
        osImageURL: ""
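The drop-in file and the base64 string for the MachineConfig can be produced together. This sketch assumes GNU coreutils `base64` (for `-w 0`); the function names `write_dropin` and `dropin_base64` are hypothetical, and the target directory is a parameter so the script can be tried in a scratch directory first.

```shell
#!/usr/bin/env bash
# Sketch only: writes the narrowed kubelet drop-in and prints its base64
# encoding for the MachineConfig "source: data:...;base64,..." field.
set -euo pipefail

# write_dropin [DIR]: writes 99-narrow-restorecon.conf into DIR (default: the
# real kubelet drop-in directory) and prints the resulting file path.
write_dropin() {
  local dir=${1:-/etc/systemd/system/kubelet.service.d}
  mkdir -p "$dir"
  # Empty "ExecStartPre=" resets the ExecStartPre list defined earlier;
  # the leading "-" makes systemd ignore a failing restorecon exit code.
  cat > "$dir/99-narrow-restorecon.conf" <<'EOF'
[Service]
ExecStartPre=
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=-/usr/sbin/restorecon -ri /var/lib/kubelet/pod-resources /usr/local/bin/kubenswrapper /usr/bin/kubensenter
EOF
  echo "$dir/99-narrow-restorecon.conf"
}

# dropin_base64 FILE: single-line base64 (GNU "-w 0" disables wrapping).
dropin_base64() {
  base64 -w 0 "$1"
}
```

`dropin_base64 "$(write_dropin ./scratch)"` should print the same payload as the `source:` field in the MachineConfig above. When writing the file directly on a node instead of through a MachineConfig, run `systemctl daemon-reload` before restarting kubelet so systemd picks up the new drop-in.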

      Is this a regression?

      • Yes
         

              Peter Hunt (pehunt@redhat.com)
              W. Trevor King (trking)