MicroShift / USHIFT-5437

MicroShift - troubleshoot containers not starting after ostree system upgrade/rollback


    • Type: Bug
    • Resolution: Can't Do
    • Priority: Major
    • Fix version: openshift-4.17
    • Affects versions: openshift-4.14, openshift-4.15, openshift-4.16, openshift-4.17
    • Category: Quality / Stability / Reliability
    • Customer Reported

      Documentation is needed for identifying and resolving the following edge case. For this case to occur, ALL of these must be true:

      • MicroShift is installed on an ostree system, such as Red Hat Device Edge.
      • A user workload is deployed onto the system via a non-ostree path, e.g. by applying manifests directly to the cluster (via Helm, oc, kubectl, etc.); see the sketch after this list.
      • The user's workload shares a container image layer with a MicroShift workload, for instance, ubi9/ubi-minimal.
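
      For illustration, a minimal deployment like the following would satisfy the second and third conditions, assuming MicroShift's own pods also pull an image built on ubi9/ubi-minimal. The namespace, resource names, and image tag here are hypothetical examples:

      $ cat <<'EOF' > my-workload.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-workload
        namespace: my-ns
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: my-workload
        template:
          metadata:
            labels:
              app: my-workload
          spec:
            containers:
            - name: app
              # Shares base layers with any MicroShift component image built on ubi9/ubi-minimal
              image: registry.access.redhat.com/ubi9/ubi-minimal:9.4
              command: ["sleep", "infinity"]
      EOF
      $ oc apply -f my-workload.yaml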

      The impact of these evil stars aligning appears when the user upgrades or downgrades MicroShift: the user's workloads fail to start after the system is rebooted.
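
      For example, on an rpm-ostree based system the failure window is entered by staging a new deployment and rebooting into it (commands shown for illustration; a rollback behaves the same way):

      $ sudo rpm-ostree upgrade
      $ sudo systemctl reboot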

      Indicators that the user has stumbled into this edge case are:

      • Pod statuses show "CreateContainerError":

      $ oc get pod -n my-ns
      NAME     READY   STATUS                 RESTARTS   AGE
      my-pod   2/2     CreateContainerError   0          24h

      AND

      • Pod describe output contains this event:

      $ oc describe pod -n my-ns my-pod
      <...omitted...>
      Warning Failed 15m (x3 over 16m) kubelet (combined from similar events): Error: failed to mount container k8s_<POD>-7685458cdf_xxxx_301a0d64-1993-45b9-a040-0a94e7fb6b5b_0(c75248228fe35a43c43b4875183d314d50d48165512659745dcf10fedb4d7f13): readlink /var/lib/containers/storage/overlay/l/KBKAXATHXY65BFU6TBGFTMOCWS: no such file or directory

      Further indicators appear in the journal output:

      $ journalctl -u crio
      Sep 19 19:17:19 edgenius crio[1408]: time="2024-09-19 19:17:19.412205267Z" level=warning msg="Can't stat lower layer \"/var/lib/containers/storage/overlay/l/QX7R7TM2AO4PWCREA35WV3KGXF\" because it does not exist. Going through storage to recreate the missing symlinks."
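
      To confirm the missing layer link, the path printed in the error can be checked directly (this assumes the default CRI-O storage root; the link name below is copied from the example errors above):

      $ sudo ls -l /var/lib/containers/storage/overlay/l/KBKAXATHXY65BFU6TBGFTMOCWS
      ls: cannot access '/var/lib/containers/storage/overlay/l/KBKAXATHXY65BFU6TBGFTMOCWS': No such file or directory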

       

      Proposed Solution:

      1. Add a troubleshooting sub-chapter that provides the above characterization to aid users in diagnosing the edge case, proposes a production-ready solution, and proposes a developer workaround.
      2. Add in-line warnings under "Embedding in a RHEL for Edge image" -> "Chapter 1. Embedding in a RHEL for Edge image using image builder" and -> "Chapter 3. Embedding in a RHEL for Edge image for offline use". These warnings should make the user aware of the risk of deploying workloads directly instead of embedding the workload container images in an ostree layer (a blueprint sketch follows below).
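
      As a sketch of the production-ready direction above: image builder blueprints can embed workload container images directly in the ostree commit, so their layers are managed by ostree and survive upgrades and rollbacks. The image reference here is an illustrative example, not a required value:

      # blueprint.toml (fragment)
      [[containers]]
      source = "registry.access.redhat.com/ubi9/ubi-minimal:9.4"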

              Assignee: Jon Cope (jcope@redhat.com)
              Reporter: Jon Cope (jcope@redhat.com)
              Shauna Diaz
              Votes: 0
              Watchers: 2

                Created:
                Updated:
                Resolved: