Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: RHODS_1.20.0_GA
Affects Version/s: None
Component/s: Model Serving
Labels:
- MLOps
- groomed

Blocked: False
Blocked Reason: None
Ready: False
Affects Testing: Testable
Automated: Yes
Fixed in Build: 1.20.0-z
Regression: No
Target Release: RHODS_1.20.0_GA
Test Blocker: No
Test Coverage: Yes
Watchlist Impact: None
Git Pull Request: https://github.com/red-hat-data-services/odh-manifests/pull/274

Description of problem:

When a model server is configured in a data science project (DSP), the server pod created in the project namespace often goes into CrashLoopBackOff because the "mm" container fails to connect to the etcd pod deployed in redhat-ods-applications. Sometimes the pod recovers on its own after a restart; sometimes it recovers only after the pod is deleted manually or the deployment is scaled down and back up; and sometimes nothing seems to fix it.
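
A quick way to separate a plain network reachability problem from an etcd-level error is to probe the etcd client port from inside the cluster. This check is not part of the original report: it is a minimal sketch that assumes the etcd pod is reachable through a Service with a host name such as `etcd.redhat-ods-applications.svc` (hypothetical) on etcd's default client port 2379. It tests TCP connectivity only, not TLS or authentication, and the function name is illustrative.

```python
import socket


def etcd_port_reachable(host: str, port: int = 2379, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Only TCP reachability is tested: no etcd protocol, TLS, or credentials.
    The function name and the default port choice are illustrative assumptions.
    """
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # DNS failures, refused connections, and timeouts are all OSError subclasses.
        return False
```

For example, calling `etcd_port_reachable("etcd.redhat-ods-applications.svc")` from a debug pod in the DSP namespace while the "mm" container is crash-looping would show whether the failure coincides with the port being unreachable at all, or whether the problem lies further up the stack.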

Prerequisites (if any, like setup, operators/versions):

Latest model serving live build

Steps to Reproduce

- Create a DSP.
- Configure a model server in it.
- In the DSP namespace, check the status of the server pod.
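
The status check in the last step can be scripted. A minimal sketch, assuming the pod list has been exported with `oc get pods -n <dsp-namespace> -o json`, which produces a standard Kubernetes PodList; the function name is hypothetical. It reports every container whose current state is `waiting` with reason `CrashLoopBackOff`, which would include the "mm" container described above.

```python
import json
from typing import Dict, List


def crashlooping_containers(pod_list_json: str) -> Dict[str, List[str]]:
    """Map pod name -> names of containers currently waiting in CrashLoopBackOff.

    `pod_list_json` is assumed to be the output of `oc get pods -o json`,
    i.e. a Kubernetes PodList with an "items" array.
    """
    result: Dict[str, List[str]] = {}
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        # A crash-looping container sits in state.waiting with this reason.
        looping = [
            cs.get("name", "<unknown>")
            for cs in statuses
            if cs.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
        ]
        if looping:
            result[name] = looping
    return result
```

An empty result means no container was crash-looping at the moment the JSON was captured; because the failure is intermittent, the check may need to be repeated across several server creations.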

Actual results:

The pod is frequently in CrashLoopBackOff status (in more than 50% of attempts).

Expected results:

The pod is deployed successfully every time.

Reproducibility (Always/Intermittent/Only Once):

Intermittent but frequent

Build Details:

Workaround:

Delete the pod and wait for its ReplicaSet/Deployment to recreate it, or scale the deployment down and back up. This sometimes fixes the problem, but not always.
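
Both workarounds map onto standard `oc` commands (`oc delete pod` and `oc scale deployment/<name> --replicas=N`). The sketch below only builds and optionally runs those commands; the function names, the namespace, pod, and deployment names in any call, and the replica count to restore are hypothetical inputs that must be taken from the affected DSP. As the report notes, neither option is guaranteed to clear the error.

```python
import shlex
import subprocess
from typing import Dict, List


def workaround_commands(namespace: str, pod: str, deployment: str,
                        replicas: int) -> Dict[str, List[List[str]]]:
    """Build the `oc` invocations for the two workarounds (hypothetical helper)."""
    return {
        # Option 1: delete the pod so its ReplicaSet recreates it.
        "delete_pod": [["oc", "delete", "pod", pod, "-n", namespace]],
        # Option 2: scale to zero, then back to the original replica count.
        "rescale": [
            ["oc", "scale", f"deployment/{deployment}", "--replicas=0", "-n", namespace],
            ["oc", "scale", f"deployment/{deployment}", f"--replicas={replicas}", "-n", namespace],
        ],
    }


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print (dry run) or execute each command in order; return the command lines."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        if dry_run:
            print(line)
        else:
            # check=True stops at the first failing oc call.
            subprocess.run(cmd, check=True)
    return lines
```

Running with the default `dry_run=True` first makes it easy to confirm the namespace and names before anything is deleted or scaled.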

Additional info:

- image-2022-11-28-16-26-57-761.png (37 kB, 2022/11/28 3:26 PM)
- mmlog.txt (29 kB, 2022/11/28 3:27 PM)

Mentioned on: Merge request - Updated 2 upstream sources

Assignee: Humair Khan
Reporter: Luca Giorgi
Votes: 0
Watchers: 3
Created: 2022/11/28 3:02 PM
Updated: 2023/01/20 1:12 AM
Resolved: 2022/11/28 7:03 PM
