-
Bug
-
Resolution: Done
-
Blocker
-
None
-
False
-
None
-
False
-
Release Notes
-
Testable
-
No
-
1.22.0-2
-
Yes
-
Bug Fix
-
No
-
Pending
-
None
-
RHODS 1.23
Description of problem:
During live testing of RHODS 1.21, we noticed that the two clusters used for upgrade testing are unable to serve models via the Model Serving feature.
This is because the modelmesh-serving pod tries to pull a servingruntime image that does not exist:
qeaisrhods-ukce (GCP 4.10 upgrd): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767
qeaisrhods-umne (AWS 4.11 upgrd): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767
By comparison, a freshly installed cluster is pulling a different (available and working) image:
qeaisrhods-frkq (AWS 4.11 fresh): registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6
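The comparison above can be checked mechanically. The following is a minimal sketch, not part of the original triage: it groups the three clusters by the digest they pull, using only the image references listed in this issue. The helper names (`split_image_ref`, `clusters_by_digest`) are hypothetical.

```python
from collections import defaultdict

# Image references exactly as reported for each cluster in this issue.
CLUSTER_IMAGES = {
    "qeaisrhods-ukce": "registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767",
    "qeaisrhods-umne": "registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767",
    "qeaisrhods-frkq": "registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6",
}


def split_image_ref(ref):
    """Split a digest-pinned reference 'repo@sha256:...' into (repo, digest)."""
    repo, sep, digest = ref.partition("@")
    if not sep:
        raise ValueError(f"not a digest-pinned reference: {ref}")
    return repo, digest


def clusters_by_digest(images):
    """Map each digest to the list of clusters that pull it."""
    groups = defaultdict(list)
    for cluster, ref in images.items():
        _, digest = split_image_ref(ref)
        groups[digest].append(cluster)
    return dict(groups)


if __name__ == "__main__":
    for digest, clusters in clusters_by_digest(CLUSTER_IMAGES).items():
        print(digest[-6:], sorted(clusters))
```

Both upgraded clusters fall into the same group, and the fresh install is alone in the other, which matches the observation that only the upgrade path selects the missing image.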
The image with the digest ending in e79767 appears to have been defined as a fallback in this PR, and was selected on the upgraded clusters because the model serving configmap was not correctly upgraded during the 1.20->1.21 upgrade:
Error from server (AlreadyExists): error when creating "model-mesh/etcd-secrets.yaml": secrets "model-serving-etcd" already exists
WARN: Model Mesh serving etcd connection secret was not created successfully.
Error from server (AlreadyExists): error when creating "model-mesh/etcd-users.yaml": secrets "etcd-passwords" already exists
WARN: Etcd user secret was not created successfully.
secret/odh-segment-key unchanged
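For triage on other upgraded clusters, the log output above can be scanned for the same failure signature. This is a hedged sketch only: the regular expression is derived from the two error lines quoted in this issue, and the function name `find_already_exists` is hypothetical. Other error formats would need their own patterns.

```python
import re

# Pattern derived from the AlreadyExists lines quoted in this issue.
ALREADY_EXISTS = re.compile(
    r'error when creating "(?P<manifest>[^"]+)": '
    r'(?P<kind>\w+) "(?P<name>[^"]+)" already exists'
)


def find_already_exists(log_text):
    """Return (manifest, kind, name) for every AlreadyExists error in the log."""
    return [
        (m["manifest"], m["kind"], m["name"])
        for m in ALREADY_EXISTS.finditer(log_text)
    ]


# Log excerpt copied from this issue.
SAMPLE_LOG = """\
Error from server (AlreadyExists): error when creating "model-mesh/etcd-secrets.yaml": secrets "model-serving-etcd" already exists
WARN: Model Mesh serving etcd connection secret was not created successfully.
Error from server (AlreadyExists): error when creating "model-mesh/etcd-users.yaml": secrets "etcd-passwords" already exists
WARN: Etcd user secret was not created successfully.
secret/odh-segment-key unchanged
"""

if __name__ == "__main__":
    for manifest, kind, name in find_already_exists(SAMPLE_LOG):
        print(f"{manifest}: {kind}/{name} already exists")
```

Because the pattern uses `finditer` on the whole text, it also works when the log lines have been joined into a single line.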
One possible solution would be to promote the odh-openvino-servingruntime-container-v1.21.0-15 image so that it can be pulled from production, which avoids a hotfix of 1.21, and to rebuild the 1.22 RC with a fix for the configmap upgrade bug that triggered this fallback image in the first place.
Prerequisites (if any, like setup, operators/versions):
A cluster upgraded from RHODS 1.20 to RHODS 1.21 on which the model serving configmap upgrade failed (the root cause is unclear, as the same issue was NOT observed in stage or with the self-managed release)
Steps to Reproduce
- Install RHODS 1.20
- Upgrade to RHODS 1.21
- Try to deploy a model via Model Serving
Actual results:
The runtime pod fails with an image pull error, and the model is never deployed.
Expected results:
The model is deployed correctly, and the pod can pull all the needed images.
Reproducibility (Always/Intermittent/Only Once):
2/2 in OSD upgrade clusters; never in fresh-install clusters in production. Not seen during stage testing or self-managed live testing either.
Build Details:
RHODS 1.21
Workaround:
One of:
a. Manually update serving_runtime_config.yaml with the correct version
b. Force the deletion of serving_runtime_config.yaml before upgrading to 1.21
c. Promote odh-openvino-servingruntime-container-v1.21.0-15 so it can be pulled from production
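Workaround (a) can be sketched as a plain text substitution. This is an illustration under stated assumptions, not a verified procedure. It assumes the digest pulled by the fresh install (ending in 87edc6) is the correct one. It also treats the content of serving_runtime_config.yaml as opaque text, because the file's structure is not shown in this issue. The `SAMPLE` fragment and the function name `replace_digest` are hypothetical.

```python
# Digests copied from this issue.
STALE_DIGEST = "sha256:7ef272bc7be866257b8126620e139d6e915ee962304d3eceba9c9d50d4e79767"
# Assumption: the digest pulled by the fresh install is the correct replacement.
WORKING_DIGEST = "sha256:8af20e48bb480a7ba1ee1268a3cf0a507e05b256c5fcf988f8e4a3de8b87edc6"


def replace_digest(config_text, stale=STALE_DIGEST, working=WORKING_DIGEST):
    """Swap every occurrence of the stale digest; return (new_text, count)."""
    count = config_text.count(stale)
    return config_text.replace(stale, working), count


# Hypothetical fragment; the real file layout is not shown in this issue.
SAMPLE = (
    "image: registry.redhat.io/rhods/odh-openvino-servingruntime-rhel8@"
    + STALE_DIGEST
    + "\n"
)

if __name__ == "__main__":
    updated, replaced = replace_digest(SAMPLE)
    print(f"replaced {replaced} occurrence(s)")
    print(updated, end="")
```

Returning the replacement count makes it easy to confirm that a cluster was actually affected before writing the file back: a count of zero means the stale digest was not present.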
Live Build:
quay.io/lferrnan/rhods-operator-live-catalog:1.22.0-rhods-ga
PRs: