-
Story
-
Resolution: Unresolved
-
Normal
-
None
-
None
-
None
-
None
-
False
-
None
-
False
-
rhel-sst-container-tools
-
-
Summary of the issue
There have been a few incidents (RCMAUTO-6093, TRT-1625, likely others) where caching for:
https://registry.redhat.io/containers/sigstore/.../signature-2:
has lead to errors like those seen in this CI run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade/1783654894146686976/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-marketplace" and (tostring | contains("signature"))) | (.source | tostring) + " " + .message' {"component":"kubelet","host":"ci-op-cdfql34b-f9945-g4r4x-master-2"} Failed to pull image "registry.redhat.io/redhat/certified-operator-index:v4.16": copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/redhat/certified-operator-index@sha256=a81e6d248af48ea2071c976decc752935740b769598179756d07b8b15a4f89cc/signature-2: unrecognized signature format, starting with binary 0x3c ...
My understanding is that when the signature-retrieval function finds an invalid signature like that signature-2 (text/html, when it should have been a 404 or GPG signature), the "hey, that's not a GPG signature!" error gets bubbled up to the kubelet and on into a Kubernetes Event, without CRI-O checking to see if the successfully retrieved signature-1 was sufficient to validate the image.
This ticket is suggesting that when signature-retrieval is partially successful, CRI-O process the partial results to see if they are sufficient, and launch the cluster if they are (while continuing to fail if they are not).
Error handling gets a bit fiddly. When the partial results are sufficient to validate the image, is it worth logging and/or bubbling up "...but we did have some unexpected difficulty with signature-retrieval" warnings? And when the partial results are insufficient, do we want to return both "signature-1 wasn't trusted" and "I had trouble retrieving signature-2" in the error passed up to the kubelet? But however you go on that side of things, the partial-signature-processing pivot should increase the cluster's resiliency in the face of this kind of signature-server caching/access trouble.
- relates to
-
TRT-1625 KubePodNotReady alert in openshift-marketplace is killing payloads
-
- Closed
-
I think doing both is the clearly the best thing here.
That’s unclear…
… but, I’m afraid, it highlights a bigger issue: The signature policy enforcement is not the only place reading signatures.
Image copies c/image/copy.Image also read signatures in order to write them into the destination; and in that case we do, in general, need all of them so that the copy at the destination is correct and complete. The caller can opt out of that by setting Options.RemoveSignatures.
But ordinary “pulls” do not opt out that way: we do want to store copies of the signatures locally, because we want a local record of how the image managed to pass the signature check (e.g. to allow forensic examination of a compromised computer, if the registry serving a malicious-but-correctly-signed image disappears after a successful attack).
So that ~rules out the simple approach: We could handle partial signature lists in the signature check code, but the pull as a whole would still fail.
We would need an entire new mode, where we collect the signature blobs that were required to confirm to the policy (and other blobs we have available?), and write those to the destination, even if we can’t read everything form the source.