• Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • Affects Version/s: 4.16, 4.17
    • Component/s: ImageStreams
    • Severity: Low

      Description of problem:

      When the imagepruner is configured with "keepTagRevisions":0,"keepYoungerThanDuration":"0s", the pruner pods fail after 5 retries with errors such as
      "error deleting repository openshift/** layer link sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096 from the registry: 500 Internal Server Error
      error deleting repository openshift/** layer link sha256:7027f4e4058bde8aaa497e47562e962c293039ba16f5fbfd07ff43a0d1dbd5a2 from the registry: 500 Internal Server Error"
      This eventually causes the image registry to become Degraded.

      Version-Release number of selected component (if applicable):

          4.17.0-0.nightly-2024-08-01-213905

      How reproducible:

      always    

      Steps to Reproduce:

          1. Clean up cluster images by setting
      $oc patch imagepruner/cluster -p '{"spec":{"keepTagRevisions":0,"keepYoungerThanDuration":"0s","schedule": "* * * * *"}}' --type=merge     
          2. for i in {1..7}; do oc new-project test$i ; oc new-app httpd~https://github.com/openshift/httpd-ex.git -n test$i ; sleep 30 ; oc delete istag httpd-ex:latest ; done 
          3. Check the image pruner pods (see the sketch just below this list)
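
      For step 3, a minimal sketch of how the pruner runs can be inspected, assuming the default scheduled pruner (a CronJob named image-pruner in the openshift-image-registry namespace):

      # The scheduled pruner runs as a CronJob; each run creates a Job and a pod
      # named image-pruner-<id>.
      oc get cronjob image-pruner -n openshift-image-registry
      oc get jobs,pods -n openshift-image-registry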

      Actual results:

      'ImagePrunerDegraded: Job has reached the specified backoff limit'
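
      The Degraded condition surfaces on the image-registry cluster operator; a quick way to inspect it and the ImagePruner status behind it (standard oc commands, nothing bug-specific assumed):

      # Show the operator conditions and the pruner conditions driving them.
      oc get clusteroperator image-registry
      oc get imagepruner/cluster -o jsonpath='{.status.conditions}'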

      Expected results:

      The pruner should clean up the metadata of all unused images.
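
      A minimal way to verify the expected behavior once a prune succeeds (image metadata is stored as cluster-scoped image objects):

      # The count of image objects should drop as unreferenced images are pruned.
      oc get images --no-headers | wc -l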

      Additional info:

      The must-gather log

      https://drive.google.com/drive/folders/1DY1dkvWOCNBV8cRnjRaWLa7KPTxgVR7t?usp=drive_link 

          This only happens on 4.17 recently. I also found errors in the apiserver pods, such as 'E0806 02:48:05.974562       1 strategy.go:60] unable to parse manifest for "sha256:****": unexpected end of JSON input', but similar errors also exist in 4.16 apiserver pods where the same imagepruner settings pass.
      
          UPDATE: this also happens on 4.16.6 https://issues.redhat.com/browse/OCPBUGS-38020?focusedId=25291852&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25291852

       

            [OCPBUGS-38020] Imagepruner failed to delete layer link

            Pinned comments

            Pinned by Flavian Missi

            Flavian Missi added a comment - edited

            Here's a workaround.

            After reproducing the bug with the provided reproducing steps, do the following:

            oc patch imagepruner/cluster -p '{"spec":{"keepTagRevisions":1,"keepYoungerThanDuration":"0s","schedule": "* * * * *"}}' --type=merge 

            Then observe the pruner job:

            oc get job -w 

            If a job is currently running then it will probably fail, but the next job to run should succeed.

            I can't explain it yet, but I think that setting keepTagRevisions to 0 (and thus marking all images for deletion) causes a race condition to keep happening. Since we can't set the number of workers in the cron job (only in oc adm prune images directly), this fails.

            Speaking of the command line, using oc adm prune images --keep-tag-revisions=0 --keep-younger-than=0s --num-workers=1 --registry-url=$REGISTRY works reliably.
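
            As a hedged illustration of that command line (the route lookup assumes the registry's default route has been exposed as default-route; adjust to however your registry is reachable):

            # Resolve the registry hostname, then prune with a single worker.
            # --confirm is required to actually delete; omit it for a dry run.
            REGISTRY=$(oc get route default-route -n openshift-image-registry -o jsonpath='{.spec.host}')
            oc adm prune images --keep-tag-revisions=0 --keep-younger-than=0s --num-workers=1 \
              --registry-url="https://$REGISTRY" --confirm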

             


            All comments

            OpenShift Jira Bot added a comment

            This bug is being closed because it has not had any activity in the past 3 months. While it represents a valid problem, leaving such bugs open provides a false indication that they will be addressed. Please reopen the bug if you have additional context that would help us better understand what needs to be done.

            Flavian Missi added a comment

            I'm lowering the priority and severity of this bug for a couple of reasons: 1) there is a workaround; 2) the configuration that triggers the bug is not a very common use case for the pruner.

            rh-ee-xiuwang let me know if you think this isn't the right way forward and I'll happily re-assess.


            Flavian Missi added a comment

            Xiujuan has double checked and this bug is also present in 4.16. Either this is new in 4.14 (from the upstream distribution upgrade) or something in the AWS S3 API has changed and our version of the SDK no longer understands "Path not found" errors for what they are. I'm still digging to understand exactly why this is happening.

            I'm able to reproduce with the extra reproducing step (now added to the bug description as step #2).

            Currently working on a workaround.
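
            Since the suspicion points at the s3 storage driver, it may help to confirm which backend the registry is actually using; a minimal check against the standard operator configuration:

            # An s3 stanza in the output means the s3 driver discussed above is in play.
            oc get configs.imageregistry.operator.openshift.io/cluster -o jsonpath='{.spec.storage}'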

            XiuJuan Wang added a comment

            I could reproduce this bug on 4.16.6.

            Flavian Missi added a comment - edited

            Just noticed the bug description answers my question about whether this is a regression:
            This only happens on 4.17 recently.
            So it does seem to be. I'll see if I can find a workaround. It might be that lowering the number of workers avoids the issue - we'll have to try and see.


            Flavian Missi added a comment

            Thank you for the new logs rh-ee-xiuwang, I found the issue!

            Here's what I'm seeing in the registry logs:

            time="2024-08-08T02:27:05.10174321Z" level=error msg="Unknown error deleting blob: s3aws: Path not found: /docker/registry/v2/repositories/openshift/python/_layers/sha256/8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096/link" go.version="go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=ae09ee75-1c7a-4110-b435-fa8a274580aa http.request.method=DELETE http.request.remoteaddr="10.129.2.16:45706" http.request.uri="/v2/openshift/python/blobs/sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096" http.request.useragent="oc/4.17.0 (linux/amd64) kubernetes/4e7fcfe" openshift.auth.user="system:serviceaccount:openshift-image-registry:pruner" openshift.auth.userid=23bceead-5c32-4845-a84f-e0476be48218 vars.digest="sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096" vars.name=openshift/python

            So it seems that the image registry's DELETE endpoint does not treat a "Path not found" from the s3 driver as a 404, and instead returns it as an unknown error.

            Do you get the same result if you set the number of workers to 1? The default is 5; I suspect there is a race condition deleting the link, and subsequent attempts fail with the 500 (which is really a not found).

            Regardless of whether this is caused by a race condition, I'll have a look at the image registry DELETE code. I suspect this might be a regression in 4.14, since we haven't changed anything relevant in the image registry since then. rh-ee-xiuwang could you help me confirm that this is also present at least in 4.16 so I know how to prioritize? (We might need to make it a release blocker if it's new to 4.17, though as I said I doubt that is the case.)
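
            To separate the registry behavior from the pruner, the failing DELETE can be replayed by hand; a sketch, assuming it is run from inside the cluster and that the pruner service account token is usable for this (repository and digest taken from the log line above):

            # A healthy registry should answer 202 (deleted) or 404 (already gone), not 500.
            TOKEN=$(oc create token pruner -n openshift-image-registry)
            curl -k -s -o /dev/null -w '%{http_code}\n' -X DELETE \
              -H "Authorization: Bearer $TOKEN" \
              "https://image-registry.openshift-image-registry.svc:5000/v2/openshift/python/blobs/sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096"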

            XiuJuan Wang added a comment

            The image registry debug log has been added: https://drive.google.com/drive/folders/1RHO55Hx5QRcVMsFucXcsK2IzTRtuE9a_?usp=sharing
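
            For reference, registry debug logs like these can be produced by raising the operator log level and reading the registry pods; a hedged sketch using the standard config fields:

            # Raise the integrated registry's log level, then follow its pods.
            oc patch configs.imageregistry.operator.openshift.io/cluster -p '{"spec":{"logLevel":"Debug"}}' --type=merge
            oc logs -n openshift-image-registry deployment/image-registry --all-containers -f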

            Flavian Missi added a comment

            Aha, thanks for the update rh-ee-xiuwang! I'm having trouble getting a cluster from cluster bot today, so I haven't been able to try the new reproducing steps yet. Also, I think the image registry logs might be more helpful here, as I suspect the pruner uses the image registry to delete the images and their links.

            XiuJuan Wang added a comment

            Here is the image pruner pod log with debug level, but I didn't see any more useful info either: https://drive.google.com/file/d/138T1_FEz5N596_elw7sVsWO6dLtgRIut/view?usp=sharing
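
            Similarly, the pruner output above can be gathered from the pods of recent pruner Jobs; a minimal sketch assuming the default image-pruner-<id> naming:

            # Dump logs from every pruner pod, including failed retries.
            for pod in $(oc get pods -n openshift-image-registry -o name | grep image-pruner); do
              oc logs -n openshift-image-registry "$pod" --all-containers
            done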
