  1. Red Hat OpenShift Data Science
  2. RHODS-6779

[Model Serving] fallback image for ovms is not published, leading to image pull errors in upgrade scenarios

Details

    • False
    • None
    • False
    • Release Notes
    • Testable
    • No
    • 1.22.0-2
    • Yes
      == Models failed to be served after upgrading from OpenShift Data Science 1.20 to OpenShift Data Science 1.21
      When upgrading from OpenShift Data Science 1.20 to OpenShift Data Science 1.21, the `modelmesh-serving` pod attempted to pull a non-existent image, causing an image pull error. As a result, models were unable to be served using the model serving feature in OpenShift Data Science. The `odh-openvino-servingruntime-container-v1.21.0-15` image now deploys successfully.
    • Bug Fix
    • No
    • Pending
    • None
    • RHODS 1.23

    Description

      Description of problem:

      During live testing of RHODS 1.21 we've noticed that two clusters where upgrade testing was performed are unable to serve models via the Model Serving feature.
      This is because the modelmesh-serving pod tries to pull a servingruntime image that does not exist:

      qeaisrhods-ukce (GCP 4.10 upgrd): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767
      qeaisrhods-umne (AWS 4.11 upgrd): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767 

      By comparison, a freshly installed cluster is pulling a different (available and working) image:

      qeaisrhods-frkq (AWS 4.11 fresh): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6 
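
The comparison above can be sketched in Python as a small check that groups clusters by the digest they pull. The cluster names and image references are copied from this report; the helper names (`parse_image_ref`, `group_by_digest`) are hypothetical and only illustrate the idea, not any RHODS tooling.

```python
from typing import Dict, List, NamedTuple


class ImageRef(NamedTuple):
    repository: str
    digest: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split a digest-pinned reference 'repo@sha256:<hex>' into its parts."""
    repository, sep, digest = ref.strip().partition("@")
    if not sep or not digest.startswith("sha256:"):
        raise ValueError(f"not a digest-pinned image reference: {ref!r}")
    return ImageRef(repository, digest)


def group_by_digest(images: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each digest to the list of clusters that pull it."""
    groups: Dict[str, List[str]] = {}
    for cluster, ref in images.items():
        groups.setdefault(parse_image_ref(ref).digest, []).append(cluster)
    return groups


REPO = "registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8"

# Image references as observed on the three clusters in this report.
observed = {
    "qeaisrhods-ukce": REPO + "@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767",
    "qeaisrhods-umne": REPO + "@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767",
    "qeaisrhods-frkq": REPO + "@sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6",
}

groups = group_by_digest(observed)
for digest, clusters in groups.items():
    # Print the short digest suffix used in this report (e.g. e79767).
    print(digest[-6:], sorted(clusters))
```

Both upgraded clusters fall into the same group, while the fresh install is alone in the other, which matches the observation that only upgraded clusters hit the fallback image.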

      The e79767 image appears to have been defined as a fallback in this PR, and was triggered in the upgrade clusters because the model serving configmap was not correctly upgraded during the 1.20->1.21 upgrade:

      Error from server (AlreadyExists): error when creating "model-mesh/etcd-secrets.yaml": secrets "model-serving-etcd" already exists
      WARN: Model Mesh serving etcd connection secret was not created successfully.
      Error from server (AlreadyExists): error when creating "model-mesh/etcd-users.yaml": secrets "etcd-passwords" already exists
      WARN: Etcd user secret was not created successfully.
      secret/odh-segment-key unchanged 
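
As a rough aid for triaging similar upgrade logs, the sketch below extracts the resources that were skipped with `AlreadyExists`. The log text is copied from this report; the regular expression and the `find_conflicts` helper are assumptions written for this exact message shape, not part of the operator.

```python
import re
from typing import List, Tuple

# Upgrade log excerpt copied verbatim from this report.
LOG = '''\
Error from server (AlreadyExists): error when creating "model-mesh/etcd-secrets.yaml": secrets "model-serving-etcd" already exists
WARN: Model Mesh serving etcd connection secret was not created successfully.
Error from server (AlreadyExists): error when creating "model-mesh/etcd-users.yaml": secrets "etcd-passwords" already exists
WARN: Etcd user secret was not created successfully.
secret/odh-segment-key unchanged
'''

# Assumed message shape: manifest path, resource kind, resource name.
ALREADY_EXISTS = re.compile(
    r'Error from server \(AlreadyExists\): error when creating '
    r'"(?P<manifest>[^"]+)": (?P<kind>\S+) "(?P<name>[^"]+)" already exists'
)


def find_conflicts(log: str) -> List[Tuple[str, str, str]]:
    """Return (manifest, kind, name) for every AlreadyExists error in the log."""
    return [
        (m["manifest"], m["kind"], m["name"])
        for m in ALREADY_EXISTS.finditer(log)
    ]


conflicts = find_conflicts(LOG)
for manifest, kind, name in conflicts:
    print(f"{kind}/{name} already existed when applying {manifest}")
```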

      One possible solution would be to promote the odh-openvino-servingruntime-container-v1.21.0-15 image so that it can be pulled from production, which avoids a hotfix of 1.21, and to rebuild the 1.22 RC to include a fix for the configmap upgrade bug that triggered this fallback image in the first place.

      Prerequisites (if any, like setup, operators/versions):

      Cluster upgrading from RHODS 1.20 to RHODS 1.21, where the model serving configmap upgrade failed (unclear what caused it in the first place, as the same issue was NOT observed in stage or with the self-managed release)

      Steps to Reproduce

      1. Install RHODS 1.20
      2. Upgrade to RHODS 1.21
      3. Try to deploy a model via Model Serving

      Actual results:

      The runtime pod fails because of an image pull error, and the model is never deployed.

      Expected results:

      The model is correctly deployed, and the pod can pull all the needed images.

      Reproducibility (Always/Intermittent/Only Once):

      2/2 in OSD upgrade clusters; never in fresh install clusters in production. Not seen during stage testing or self-managed live testing.

      Build Details:

      RHODS 1.21

      Workaround:

      One of:

         a. Manually update serving_runtime_config.yaml with the correct version.
         b. Force the deletion of serving_runtime_config.yaml before upgrading to 1.21.
         c. Promote odh-openvino-servingruntime-container-v1.21.0-15 so it can be pulled from production.

      Live Build:

      quay.io/lferrnan/rhods-operator-live-catalog:1.22.0-rhods-ga

      PRs:

          People

            lferrnan@redhat.com Lucas Fernandez Aragon
            rhn-support-lgiorgi Luca Giorgi
            Tarun Kumar Tarun Kumar