Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: RHODS_1.22.0_GA
Affects Version/s: None
Component/s: UI
Labels:
- UI
- eng
- groomed

Blocked:
False
Blocked Reason:
None
Ready:
False
Affects:

Release Notes
Affects Testing:

Testable
Automated:
No
CDW blocker:
CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Fixed in Build:
1.22.0-2
Regression:
Yes
Release Note Text:

== Models failed to be served after upgrading from OpenShift Data Science 1.20 to OpenShift Data Science 1.21
When upgrading from OpenShift Data Science 1.20 to OpenShift Data Science 1.21, the `modelmesh-serving` pod attempted to pull a non-existent image, causing an image pull error. As a result, models were unable to be served using the model serving feature in OpenShift Data Science. The `odh-openvino-servingruntime-container-v1.21.0-15` image now deploys successfully.

Release Note Type:
Bug Fix
Target Release:

RHODS_1.22.0_GA
Test Blocker:
No
Test Coverage:

Pending
Watchlist Impact:
None
Git Pull Request:
https://github.com/red-hat-data-services/odh-dashboard/pull/259, https://github.com/red-hat-data-services/odh-deployer/pull/305, https://github.com/red-hat-data-services/odh-deployer/pull/304
Intelligence Requested:
Market:

Sprint:
RHODS 1.23

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

During live testing of RHODS 1.21 we've noticed that two clusters where upgrade testing was performed are unable to serve models via the Model Serving feature.
This is because the modelmesh-serving pod tries to pull a servingruntime image that does not exist:

qeaisrhods-ukce (GCP 4.10 upgrd): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767

qeaisrhods-umne (AWS 4.11 upgrd): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767

By comparison, a freshly installed cluster is pulling a different (available and working) image:

qeaisrhods-frkq (AWS 4.11 fresh): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6
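The comparison above can be sketched as a small script that groups clusters by the digest they pull. This is only an illustration: the helper names `split_image_ref` and `clusters_by_digest` are hypothetical, and the image references are copied from the observations in this report rather than queried from the clusters.

```python
from collections import defaultdict

REPO = "registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8"

# Image references as reported above (cluster name -> image the pod tries to pull).
OBSERVED = {
    "qeaisrhods-ukce": REPO + "@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767",
    "qeaisrhods-umne": REPO + "@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767",
    "qeaisrhods-frkq": REPO + "@sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6",
}


def split_image_ref(ref):
    """Split 'repo@sha256:<hex>' into (repo, digest)."""
    repo, sep, digest = ref.partition("@")
    if not sep or not digest.startswith("sha256:"):
        raise ValueError(f"not a digest-pinned image reference: {ref!r}")
    return repo, digest


def clusters_by_digest(observed):
    """Group cluster names by the digest of the image they reference."""
    groups = defaultdict(list)
    for cluster, ref in observed.items():
        _, digest = split_image_ref(ref)
        groups[digest].append(cluster)
    return dict(groups)


if __name__ == "__main__":
    for digest, clusters in clusters_by_digest(OBSERVED).items():
        print(digest[-6:], sorted(clusters))
```

Grouping by digest makes it visible that both upgraded clusters share the e79767 image while the fresh install uses a different one.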

The e79767 image appears to have been defined as a fallback in this PR, and it was triggered in the upgrade clusters because the model serving configmap was not correctly upgraded during the 1.20->1.21 upgrade:

Error from server (AlreadyExists): error when creating "model-mesh/etcd-secrets.yaml": secrets "model-serving-etcd" already exists
WARN: Model Mesh serving etcd connection secret was not created successfully.
Error from server (AlreadyExists): error when creating "model-mesh/etcd-users.yaml": secrets "etcd-passwords" already exists
WARN: Etcd user secret was not created successfully.
secret/odh-segment-key unchanged
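The `AlreadyExists` errors in the log are what create-only semantics produce when an object survives from the previous release. The sketch below is an in-memory simulation of that general failure mode, not the deployer's actual code: the store, the object name `serving-config`, and both helper functions are hypothetical, and whether the configmap upgrade failed for this exact reason is an assumption.

```python
class AlreadyExists(Exception):
    """Raised when a create-only call hits an existing object."""


def create(store, name, value):
    """Create-only semantics: refuse to touch an object that already exists."""
    if name in store:
        raise AlreadyExists(name)
    store[name] = value


def create_or_replace(store, name, value):
    """Idempotent semantics: the new value always wins."""
    store[name] = value


def upgrade(apply_fn):
    """Simulate an upgrade over a cluster that still holds the old object."""
    store = {"serving-config": "value-from-1.20"}  # hypothetical leftover object
    try:
        apply_fn(store, "serving-config", "value-from-1.21")
    except AlreadyExists:
        # Comparable to the WARN lines in the log: the error is swallowed
        # and the old content stays in place.
        pass
    return store["serving-config"]


if __name__ == "__main__":
    print("create-only:      ", upgrade(create))
    print("create-or-replace:", upgrade(create_or_replace))
```

With create-only semantics the stale value survives the upgrade, which would be consistent with a fallback being used later. With create-or-replace the new value is applied.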

One possible solution would be to promote the odh-openvino-servingruntime-container-v1.21.0-15 image so that it can be pulled from production, which avoids a hotfix of 1.21, and to rebuild the 1.22 RC to include a fix for the configmap upgrade bug that triggered this fallback image in the first place.

Prerequisites (if any, like setup, operators/versions):

Cluster upgrading from RHODS 1.20 to RHODS 1.21, where the model serving configmap upgrade failed (unclear what caused it in the first place, as the same issue was NOT observed in stage or with the self-managed release)

Steps to Reproduce

1. Install RHODS 1.20
2. Upgrade to RHODS 1.21
3. Try to deploy a model via Model Serving

Actual results:

The runtime pod fails because of an image pull error, and the model is never deployed.

Expected results:

The model is correctly deployed, and the pod can pull all the needed images.

Reproducibility (Always/Intermittent/Only Once):

2/2 in OSD upgrade clusters, never in other fresh install clusters in production. Not seen during stage testing nor self-managed live testing either.

Build Details:

RHODS 1.21

Workaround:

One of:

a. Manually update serving_runtime_config.yaml with the correct version.
b. Force the deletion of serving_runtime_config.yaml before upgrading to 1.21.
c. Promote odh-openvino-servingruntime-container-v1.21.0-15 so it can be pulled from production.
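Workaround (a) amounts to swapping the image digest in the config. A minimal sketch of that substitution follows, under stated assumptions: the one-line `image:` fragment is a hypothetical stand-in (the real layout of serving_runtime_config.yaml is not shown in this report), `pin_digest` is a hypothetical helper, and the "working" digest is simply the one observed on the fresh-install cluster.

```python
import re

REPO = "registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8"
# Digest pulled successfully by the fresh-install cluster (qeaisrhods-frkq).
WORKING_DIGEST = "sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6"


def pin_digest(config_text, repo, digest):
    """Replace every digest-pinned reference to `repo` with `repo@digest`.

    Returns (new_text, number_of_replacements).
    """
    pattern = re.compile(re.escape(repo) + r"@sha256:[0-9a-f]{64}")
    replacement = f"{repo}@{digest}"
    # A function is used as the replacement so the string is inserted literally.
    return pattern.subn(lambda _match: replacement, config_text)


# Hypothetical fragment; it only stands in for wherever the image is referenced.
STALE_FRAGMENT = (
    "image: " + REPO
    + "@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767\n"
)

if __name__ == "__main__":
    new_text, count = pin_digest(STALE_FRAGMENT, REPO, WORKING_DIGEST)
    print(f"{count} reference(s) updated")
    print(new_text, end="")
```

Matching on the repository name plus a full 64-character digest keeps the substitution from touching other images that might appear in the same file.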