
RFE-3660: Retry for imagestreamtag import


    • Type: Feature Request
    • Resolution: Done
    • Priority: Normal

      1. Proposed title of this feature request 

      --->>> Retry for imagestreamtag import

       

      2. What is the nature and description of the request?

      --->>> If an imagestream/imagestreamtag is created and the tag import fails because of some error, OpenShift should retry the import relatively frequently, with a configurable retry interval.

       

      3. Why does the customer need this? (List the business requirements here)

      --->>> We create our imagestreams and deployments declaratively, via YAML in a git repo. We replicate our containers and our YAML repos to assorted sites. Sometimes the YAML replication finishes before the image replication. When it does, the initial import of the imagestream fails with a NotFound error.

      Right now, the only way to get OpenShift to retry the failed import is to add the "Scheduled" option to the imagestream, as shown in the sketch below. But "Scheduled" isn't really intended as a retry mechanism, so it runs at a very low frequency, something like every 15 minutes. OpenShift engineering has already stated that they don't want "scheduled" to be tunable to a faster interval, out of concern for Internet registries, which could be bombarded by queries for all the imagestreams out there. But this request isn't really the same thing: we're just looking to retry if there is an initial failure. Once the imagestream tag is pulled successfully, the retry would no longer apply, so all the Internet registries would be safe. Note that pods already retry image pulls after an initial failure; it's only imagestreams that don't retry. We use imagestreams as a caching layer.

      This very much matters to us from a business perspective because some of our most important customers want to be able to do emergency bugfixes or upgrades. If something goes wrong with an existing deployment, they want to be able to roll out a new one. But if the YAML beats the image to the site, they could be sitting there for a while, waiting up to 15 minutes, while their critical website is down. This can result in longer downtimes for our most critical customers, and it makes the OpenShift product look bad.
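
      For reference, here is roughly what our current workaround looks like: a minimal ImageStream sketch (the resource name and registry URL are made up) that sets the existing importPolicy.scheduled flag, which re-imports the tag on the cluster's scheduled interval of roughly 15 minutes rather than retrying a failed initial import.

      apiVersion: image.openshift.io/v1
      kind: ImageStream
      metadata:
        name: my-app                                        # hypothetical name
      spec:
        tags:
        - name: latest
          from:
            kind: DockerImage
            name: registry.example.com/team/my-app:latest   # hypothetical source image
          importPolicy:
            scheduled: true   # existing workaround: periodic re-import, not a true retry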

      4. List any affected packages or components.

      --->>> The imagestream/imagestreamtag kinds.

      5. Do you have any specific timeline, dependencies and which release would you like to target this feature request?

      --->>> From our perspective, this is an ongoing problem for our biggest customer, so sooner is better.

      I'm prepping to upgrade to 4.10, so if you could target that release, it would be great. But if targeting 4.11 or 4.12 would get it to me faster, then let's do that.

      6. How would you like to achieve this? (List the functional requirements here)

      --->>> My preference would be to add a "retryTime" integer to the specification of the imagestream or imagestream tag. If the initial pull of the image fails, keep retrying the import at that frequency (see the sketch below). You might want to add a backoff, similar to the pod ImagePullBackOff behavior. If that's too difficult, a simple "retry" boolean that enables the behavior with a global tunable, or a default retry behavior with a global tunable, would also be fine.
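
      To illustrate, here is a rough sketch of what I have in mind, assuming the new field would live under the existing importPolicy stanza. The "retryTime" name, its placement, and the resource/registry names are all hypothetical; nothing here exists in the current API.

      apiVersion: image.openshift.io/v1
      kind: ImageStream
      metadata:
        name: my-app                                        # hypothetical name
      spec:
        tags:
        - name: latest
          from:
            kind: DockerImage
            name: registry.example.com/team/my-app:latest   # hypothetical source image
          importPolicy:
            retryTime: 30   # proposed: retry a failed initial import every 30 seconds
                            # until it succeeds, then stop (optionally with backoff)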

      7.  Would you be able to assist in testing this functionality, if implemented?

      --->>> Yes, I'd be willing to test.

      8.  Can you please share the urgency of this Feature Request, and what is the business impact?

      --->>> This is a frequent annoyance for our highest-visibility customer, and it raises concerns about whether the OpenShift product is operationally acceptable. So it has very high business impact.
