• Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • Affects Version/s: 4.16, 4.17
    • Component/s: ImageStreams
    • Severity: Low

      Description of problem:

      When the imagepruner is configured with "keepTagRevisions":0,"keepYoungerThanDuration":"0s", the pruner pods fail after 5 retries with errors such as
      "error deleting repository openshift/** layer link sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096 from the registry: 500 Internal Server Error
      error deleting repository openshift/** layer link sha256:7027f4e4058bde8aaa497e47562e962c293039ba16f5fbfd07ff43a0d1dbd5a2 from the registry: 500 Internal Server Error"
      This eventually causes the image registry to become Degraded.

      Version-Release number of selected component (if applicable):

          4.17.0-0.nightly-2024-08-01-213905

      How reproducible:

      always    

      Steps to Reproduce:

          1. Clean up cluster images by setting
      $oc patch imagepruner/cluster -p '{"spec":{"keepTagRevisions":0,"keepYoungerThanDuration":"0s","schedule": "* * * * *"}}' --type=merge     
          2. for i in {1..7}; do oc new-project test$i ; oc new-app httpd~https://github.com/openshift/httpd-ex.git -n test$i ; sleep 30 ; oc delete istag httpd-ex:latest ; done 
          3. Check the image pruner pods (see the sketch just below this list)
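
      For step 3, a minimal sketch of how the pruner runs can be inspected, assuming the default scheduled pruner (a CronJob named image-pruner in the openshift-image-registry namespace):

      # The scheduled pruner runs as a CronJob; each run creates a Job and a pod
      # named image-pruner-<id>.
      oc get cronjob image-pruner -n openshift-image-registry
      oc get jobs,pods -n openshift-image-registry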

      Actual results:

      'ImagePrunerDegraded: Job has reached the specified backoff limit'
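
      The Degraded condition surfaces on the image-registry cluster operator; a quick way to inspect it and the ImagePruner status behind it (standard oc commands, nothing bug-specific assumed):

      # Show the operator conditions and the pruner conditions driving them.
      oc get clusteroperator image-registry
      oc get imagepruner/cluster -o jsonpath='{.status.conditions}'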

      Expected results:

      The pruner should clean up the metadata of all unused images.
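
      A minimal way to verify the expected behavior once a prune succeeds (image metadata is stored as cluster-scoped image objects):

      # The count of image objects should drop as unreferenced images are pruned.
      oc get images --no-headers | wc -l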

      Additional info:

      The must-gather log

      https://drive.google.com/drive/folders/1DY1dkvWOCNBV8cRnjRaWLa7KPTxgVR7t?usp=drive_link 

          This only happens on 4.17 recently. I also found errors in the apiserver pods, such as 'E0806 02:48:05.974562       1 strategy.go:60] unable to parse manifest for "sha256:****": unexpected end of JSON input', but similar errors also exist in 4.16 apiserver pods where the same imagepruner settings pass.
      
          UPDATE: this also happens on 4.16.6 https://issues.redhat.com/browse/OCPBUGS-38020?focusedId=25291852&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25291852

       

            [OCPBUGS-38020] Imagepruner failed to delete layer link

            Pinned comments

            Pinned by Flavian Missi

            Flavian Missi added a comment - edited

            Here's a workaround.

            After reproducing the bug with the provided reproducing steps, do the following:

            oc patch imagepruner/cluster -p '{"spec":{"keepTagRevisions":1,"keepYoungerThanDuration":"0s","schedule": "* * * * *"}}' --type=merge 

            Then observe the pruner job:

            oc get job -w 

            If a job is currently running then it will probably fail, but the next job to run should succeed.

            I can't explain it yet, but I think that setting keepTagRevisions to 0 (and thus marking all images for deletion) causes a race condition to keep happening. Since we can't set the number of workers in the cron job (only in oc adm prune images directly), this fails.

            Speaking of the command line, using oc adm prune images --keep-tag-revisions=0 --keep-younger-than=0s --num-workers=1 --registry-url=$REGISTRY works reliably.
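
            As a hedged illustration of that command line (the route lookup assumes the registry's default route has been exposed as default-route; adjust to however your registry is reachable):

            # Resolve the registry hostname, then prune with a single worker.
            # --confirm is required to actually delete; omit it for a dry run.
            REGISTRY=$(oc get route default-route -n openshift-image-registry -o jsonpath='{.spec.host}')
            oc adm prune images --keep-tag-revisions=0 --keep-younger-than=0s --num-workers=1 \
              --registry-url="https://$REGISTRY" --confirm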

             


            All comments

            OpenShift Jira Bot added a comment

            This bug is being closed because it has not had any activity in the past 3 months. While it represents a valid problem, leaving such bugs open provides a false indication that they will be addressed. Please reopen the bug if you have additional context that would help us better understand what needs to be done.

            Flavian Missi added a comment

            I'm lowering the priority and severity of this bug for a couple of reasons: 1) there is a workaround; 2) the configuration that triggers the bug is not a very common use case for the pruner.

            rh-ee-xiuwang let me know if you think this isn't the right way forward and I'll happily re-assess.


            Flavian Missi added a comment

            Xiujuan has double checked and this bug is also present in 4.16. Either this is new in 4.14 (from the upstream distribution upgrade) or something in the AWS S3 API has changed and our version of the SDK no longer understands "Path not found" errors for what they are. I'm still digging to understand exactly why this is happening.

            I'm able to reproduce with the extra reproducing step (now added to the bug description as step #2).

            Currently working on a workaround.
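
            Since the suspicion points at the s3 storage driver, it may help to confirm which backend the registry is actually using; a minimal check against the standard operator configuration:

            # An s3 stanza in the output means the s3 driver discussed above is in play.
            oc get configs.imageregistry.operator.openshift.io/cluster -o jsonpath='{.spec.storage}'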

            XiuJuan Wang added a comment

            I could reproduce this bug on 4.16.6.

            Flavian Missi added a comment - edited

            Just noticed the bug description answers my question about whether this is a regression:
            This only happens on 4.17 recently.
            So it does seem to be. I'll see if I can find a workaround. It might be that lowering the number of workers avoids the issue - we'll have to try and see.


            Flavian Missi added a comment

            Thank you for the new logs rh-ee-xiuwang, I found the issue!

            Here's what I'm seeing in the registry logs:

            time="2024-08-08T02:27:05.10174321Z" level=error msg="Unknown error deleting blob: s3aws: Path not found: /docker/registry/v2/repositories/openshift/python/_layers/sha256/8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096/link" go.version="go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=ae09ee75-1c7a-4110-b435-fa8a274580aa http.request.method=DELETE http.request.remoteaddr="10.129.2.16:45706" http.request.uri="/v2/openshift/python/blobs/sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096" http.request.useragent="oc/4.17.0 (linux/amd64) kubernetes/4e7fcfe" openshift.auth.user="system:serviceaccount:openshift-image-registry:pruner" openshift.auth.userid=23bceead-5c32-4845-a84f-e0476be48218 vars.digest="sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096" vars.name=openshift/python

            So it seems that the image registry's DELETE endpoint does not treat a "Path not found" from the s3 driver as a 404, and instead returns it as an unknown error.

            Do you get the same result if you set the number of workers to 1? The default is 5; I suspect there is a race condition deleting the link, and subsequent attempts fail with the 500 (which is really a not found).

            Regardless of whether this is caused by a race condition, I'll have a look at the image registry DELETE code. I suspect this might be a regression in 4.14, since we haven't changed anything relevant in the image registry since then. rh-ee-xiuwang could you help me confirm that this is also present at least in 4.16 so I know how to prioritize? (We might need to make it a release blocker if it's new to 4.17, though as I said I doubt that is the case.)
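
            To separate the registry behavior from the pruner, the failing DELETE can be replayed by hand; a sketch, assuming it is run from inside the cluster and that the pruner service account token is usable for this (repository and digest taken from the log line above):

            # A healthy registry should answer 202 (deleted) or 404 (already gone), not 500.
            TOKEN=$(oc create token pruner -n openshift-image-registry)
            curl -k -s -o /dev/null -w '%{http_code}\n' -X DELETE \
              -H "Authorization: Bearer $TOKEN" \
              "https://image-registry.openshift-image-registry.svc:5000/v2/openshift/python/blobs/sha256:8694db102e5bd27fa30106f87d5a0f0c5ccccac0e5cc38ba56080d7559377096"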

            XiuJuan Wang added a comment

            The image registry debug log has been added: https://drive.google.com/drive/folders/1RHO55Hx5QRcVMsFucXcsK2IzTRtuE9a_?usp=sharing
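
            For reference, registry debug logs like these can be produced by raising the operator log level and reading the registry pods; a hedged sketch using the standard config fields:

            # Raise the integrated registry's log level, then follow its pods.
            oc patch configs.imageregistry.operator.openshift.io/cluster -p '{"spec":{"logLevel":"Debug"}}' --type=merge
            oc logs -n openshift-image-registry deployment/image-registry --all-containers -f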

            Flavian Missi added a comment

            Aha, thanks for the update rh-ee-xiuwang! I'm having trouble getting a cluster from cluster bot today, so I haven't been able to try the new reproducing steps yet. Also, I think the image registry logs might be more helpful here, as I suspect the pruner uses the image registry to delete the images and their links.

            XiuJuan Wang added a comment

            Here is the image pruner pod log with debug level, but I didn't see any more useful info either: https://drive.google.com/file/d/138T1_FEz5N596_elw7sVsWO6dLtgRIut/view?usp=sharing
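
            Similarly, the pruner output above can be gathered from the pods of recent pruner Jobs; a minimal sketch assuming the default image-pruner-<id> naming:

            # Dump logs from every pruner pod, including failed retries.
            for pod in $(oc get pods -n openshift-image-registry -o name | grep image-pruner); do
              oc logs -n openshift-image-registry "$pod" --all-containers
            done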
