Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-24743

Remove CRI-O-update-triggered image wipe

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done-Errata
    • Icon: Normal Normal
    • 4.16.0
    • 4.13, 4.12, 4.11, 4.14, 4.15, 4.16
    • Node / CRI-O
    • None
    • Moderate
    • No
    • False
    • Hide

      None

      Show
      None
    • Hide
      Previously, the CRI-O removed all of the images installed between minor version upgrades of {product-name} to ensure stale payload images didn't take up space on the node. However, it was decided this was a performance penalty, and this functionality was removed. With this fix, the kubelet will still garbage collect stale images after disk usage hits a certain level. As a result, {product-title} no longer removes all images after an upgrade between minor versions. (link:https://issues.redhat.com/browse/OCPBUGS-24743[*OCPBUGS-24743*])
      Show
      Previously, the CRI-O removed all of the images installed between minor version upgrades of {product-name} to ensure stale payload images didn't take up space on the node. However, it was decided this was a performance penalty, and this functionality was removed. With this fix, the kubelet will still garbage collect stale images after disk usage hits a certain level. As a result, {product-title} no longer removes all images after an upgrade between minor versions. (link: https://issues.redhat.com/browse/OCPBUGS-24743 [* OCPBUGS-24743 *])
    • Removed Functionality

      Description of problem:

      Since many 4.y ago, before 4.11 and all the minor versions that are still supported, CRI-O has wiped images when it comes up after a node reboot and notices it has a new (minor?) version. This causes redundant pulls, as seen in this 4.11-to-4.12 update run:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1732741139229839360/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes/ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4/journal | zgrep 'Starting update from rendered-\|crio-wipe\|Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2'
      Dec 07 13:05:42.474144 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded.
      Dec 07 13:05:42.481470 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 191ms CPU time
      Dec 07 13:59:51.000686 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1498]: time="2023-12-07 13:59:51.000591203Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=a62bc972-67d7-401a-9640-884430bd16f1 name=/runtime.v1.ImageService/PullImage
      Dec 07 14:00:55.745095 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 root[101294]: machine-config-daemon[99469]: Starting update from rendered-worker-ca36a33a83d49b43ed000fd422e09838 to rendered-worker-c0b3b4eadfe6cdfb595b97fa293a9204: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
      Dec 07 14:05:33.274241 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded.
      Dec 07 14:05:33.289605 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 216ms CPU time
      Dec 07 14:14:50.277011 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1573]: time="2023-12-07 14:14:50.276961087Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=1a092fbd-7ffa-475a-b0b7-0ab115dbe173 name=/runtime.v1.ImageService/PullImage
      

      The redundant pulls cost network and disk traffic, and avoiding them should make those update-initiated reboots quicker and cheaper. The lack of update-initiated wipes is not expected to cost much, because the Kubelet's old-image garbage collection should be along to clear out any no-longer-used images if disk space gets tight.

      Version-Release number of selected component (if applicable):

      At least 4.11. Possibly older 4.y; I haven't checked.

      How reproducible:

      Every time.

      Steps to Reproduce:

      1. Install a cluster.
      2. Update to a release image with a different CRI-O (minor?) version.
      3. Check logs on the nodes.

      Actual results:

      crio-wipe entries in the logs, with reports of target-release images being pulled before and after those wipes, as I quoted in the Description.

      Expected results:

      Target-release images pulled before the reboot, and found in the local cache if that image is needed again post-reboot.

              pehunt@redhat.com Peter Hunt
              trking W. Trevor King
              Sunil Choudhary Sunil Choudhary
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Created:
                Updated:
                Resolved: