- Bug
- Resolution: Done-Errata
- Major
- RHOAI_2.5.0, RHOAI_2.6.0, RHOAI_2.7.0, RHOAI_2.8.0
Is your feature request related to a problem? If so, please describe.
Yes.
The etcd deployment included with modelmesh does not claim any dedicated storage. If the etcd pod restarts, or if other processes generate heavy I/O on the shared disk, etcd can respond slowly, which leads to failures to serve models.
Describe your proposed solution
Follow etcd's recommended practices and modify the etcd deployment so that it is more resilient to pod restarts and to latency added by other disk IOPS on the cluster.
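As an illustration of what "dedicated storage" could mean here, the sketch below builds a PersistentVolumeClaim manifest that an etcd pod could mount for its data directory. This is an assumption about one possible shape of the change, not the fix that was shipped: the claim name `etcd-data`, the `1Gi` size, and the optional storage class are hypothetical values chosen for the example.

```python
import json


def etcd_volume_claim(name="etcd-data", size="1Gi", storage_class=None):
    """Build a PersistentVolumeClaim manifest as a plain dict.

    All default values are hypothetical; pick the size and storage class
    that match the cluster's low-latency storage offering.
    """
    claim = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # A single etcd member only needs single-node read/write access.
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }
    if storage_class is not None:
        # Only set when a specific (for example SSD-backed) class is wanted;
        # otherwise the cluster's default storage class applies.
        claim["spec"]["storageClassName"] = storage_class
    return claim


if __name__ == "__main__":
    print(json.dumps(etcd_volume_claim(), indent=2))
```

A claim like this addresses the pod-restart half of the problem (data survives a restart). Whether it also isolates etcd from other workloads' disk I/O depends on the storage class behind it.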
Additional context
The following resources relate to an incident in which cluster disk IOPS potentially affected modelmesh etcd response times and caused a failure in modelmesh-controller.
Slack thread : https://redhat-internal.slack.com/archives/CNNFPNXBR/p1684510473409899
Cluster event logs :
Pod modelmesh-serving-model-server-rht-jramirez-dev-7d56c5f7fc94f97, Namespace rht-jramirez-dev, May 19, 2023, 3:21 PM. Generated from kubelet on ip-10-0-172-120.us-east-2.compute.internal. Exec lifecycle hook ([/opt/kserve/mmesh/stop.sh wait]) for Container "mm" in Pod "modelmesh-serving-model-server-rht-jramirez-dev-7d56c5f7fc94f97_rht-jramirez-dev(4d7b0ffb-f064-4e00-8bce-eae89e400bec)" failed - error: command '/opt/kserve/mmesh/stop.sh wait' exited with 137: , message: "waiting for litelinks process to exit after server shutdown triggered\n"
Modelmesh controller logs :
{"level":"error","ts":1684505461.8559515,"logger":"controller.predictor","msg":"Reconciler error","reconciler group":"serving.kserve.io","reconciler kind":"Predictor","name":"test2","namespace":"isvc_rht-jramirez-dev","error":"failed to SetVModel for InferenceService test2: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.130.13.77:8033: i/o timeout\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem \t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 \t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
etcd logs :
2023-04-01 08:39:32.617638 W | etcdserver: read-only range request "key:\"modelmesh-serving/mm_ns/fjuma1-dev/mm/modelmesh-serving/vmodels/\" " with result "range_response_count:0 size:4" took too long (151.354562ms) to execute
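To judge how often etcd is slow, warnings like the one above can be pulled out of a log dump and converted to milliseconds. The following is a minimal sketch that assumes the "took too long (<duration>) to execute" wording shown in the log line; the function name `slow_request_ms` is hypothetical, and durations printed at minute scale (for example `1m2s`) are deliberately not handled.

```python
import re

# Matches the duration in etcd slow-request warnings, e.g.
# "... took too long (151.354562ms) to execute".
SLOW_REQUEST = re.compile(
    r"took too long \((?P<value>\d+(?:\.\d+)?)(?P<unit>µs|ms|s)\) to execute"
)

# Conversion factors from the matched unit to milliseconds.
TO_MS = {"µs": 0.001, "ms": 1.0, "s": 1000.0}


def slow_request_ms(line):
    """Return the reported duration in milliseconds, or None if the line
    is not a slow-request warning in the expected format."""
    match = SLOW_REQUEST.search(line)
    if match is None:
        return None
    return float(match.group("value")) * TO_MS[match.group("unit")]


def slow_requests(lines, threshold_ms=0.0):
    """Yield (duration_ms, line) for warnings at or above threshold_ms."""
    for line in lines:
        duration = slow_request_ms(line)
        if duration is not None and duration >= threshold_ms:
            yield duration, line
```

Feeding the etcd pod's log lines through `slow_requests` with a threshold gives a quick count of slow reads around the time of an incident, which can then be compared against disk activity on the node.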
____________________________________________________
NOTE: This task is migrated from GitHub to Jira. See previous discussion on GH.
- links to RHBA-2024:128688 RHOAI 2.9.0 - Red Hat OpenShift AI