  1. Red Hat OpenShift AI Engineering
  2. RHOAIENG-1005

Increase etcd resources for sandbox installation


      Is your feature request related to a problem? If so, please describe.

      Yes.
      The etcd deployment included with modelmesh does not claim any dedicated storage. If the etcd pod restarts, or if other processes cause high I/O on the shared disk, etcd can respond slowly, which leads to failures to serve models.

      Describe your proposed solution

      Follow the etcd recommended practices and modify the etcd deployment so that it is more resilient to pod restarts and to latency added by other disk I/O on the cluster.
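      One possible shape for the storage part of this proposal is sketched below: a PersistentVolumeClaim for etcd plus a patch fragment that mounts it into the etcd Deployment. This is only an illustration under stated assumptions, not the actual modelmesh manifests. The claim name `etcd-data`, the namespace `modelmesh-serving`, the container name `etcd`, the `1Gi` size, and the mount path `/var/lib/etcd` are all hypothetical, and etcd's data directory would have to point at the mount path for the volume to be used.

```python
import json


def etcd_pvc(name="etcd-data", namespace="modelmesh-serving",
             size="1Gi", storage_class=None):
    """Build a PersistentVolumeClaim manifest for etcd data.

    All default values are hypothetical placeholders.
    """
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }
    # Only set a storage class when one is requested; otherwise the
    # cluster default storage class applies.
    if storage_class:
        pvc["spec"]["storageClassName"] = storage_class
    return pvc


def etcd_volume_patch(claim_name="etcd-data", container="etcd",
                      mount_path="/var/lib/etcd"):
    """Build a Deployment patch fragment that mounts the claim.

    Assumes the etcd container is named `container` and that etcd is
    configured to keep its data under `mount_path`.
    """
    return {
        "spec": {"template": {"spec": {
            "volumes": [{
                "name": claim_name,
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
            "containers": [{
                "name": container,
                "volumeMounts": [{"name": claim_name,
                                  "mountPath": mount_path}],
            }],
        }}}
    }


if __name__ == "__main__":
    # Kubernetes accepts JSON manifests, so the output can be reviewed
    # and applied by hand.
    print(json.dumps(etcd_pvc(), indent=2))
    print(json.dumps(etcd_volume_patch(), indent=2))
```

      Dedicated storage addresses the restart case directly. Whether it also isolates etcd from other disk activity depends on the storage class backing the claim, so that choice is left as a parameter here.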

      Additional context
      The following resources relate to a particular incident in which cluster disk IOPS potentially affected modelmesh etcd response times and caused a failure in modelmesh-controller.

      Slack thread : https://redhat-internal.slack.com/archives/CNNFPNXBR/p1684510473409899
      Cluster event logs :

      Pod: modelmesh-serving-model-server-rht-jramirez-dev-7d56c5f7fc94f97
      Namespace: rht-jramirez-dev
      May 19, 2023, 3:21 PM
      Generated from kubelet on ip-10-0-172-120.us-east-2.compute.internal
      Exec lifecycle hook ([/opt/kserve/mmesh/stop.sh wait]) for Container "mm" in Pod "modelmesh-serving-model-server-rht-jramirez-dev-7d56c5f7fc94f97_rht-jramirez-dev(4d7b0ffb-f064-4e00-8bce-eae89e400bec)" failed - error: command '/opt/kserve/mmesh/stop.sh wait' exited with 137: , message: "waiting for litelinks process to exit after server shutdown triggered\n"

      Modelmesh controller logs :

      {"level":"error","ts":1684505461.8559515,"logger":"controller.predictor","msg":"Reconciler error","reconciler group":"serving.kserve.io","reconciler kind":"Predictor","name":"test2","namespace":"isvc_rht-jramirez-dev","error":"failed to SetVModel for InferenceService test2: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.130.13.77:8033: i/o timeout\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem \t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 \t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}

      etcd logs :

      2023-04-01 08:39:32.617638 W | etcdserver: read-only range request "key:\"modelmesh-serving/mm_ns/fjuma1-dev/mm/modelmesh-serving/vmodels/\" " with result "range_response_count:0 size:4" took too long (151.354562ms) to execute
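
      To gauge how often etcd is slow in the attached etcd log, the "took too long" warnings can be pulled out with a short script. The sketch below assumes every warning follows the format of the line above and that durations are reported in `s`, `ms`, or `µs`; the function name `slow_requests` and the `threshold_ms` parameter are illustrative and not part of any etcd tooling.

```python
import re

# Matches e.g. "took too long (151.354562ms) to execute".
SLOW_RE = re.compile(
    r"took too long \((?P<value>[\d.]+)(?P<unit>ms|µs|us|s)\) to execute"
)
UNIT_TO_MS = {"s": 1000.0, "ms": 1.0, "µs": 0.001, "us": 0.001}


def slow_requests(lines, threshold_ms=0.0):
    """Yield (duration_ms, line) for each slow-request warning.

    Lines without a recognizable duration are skipped, as are
    warnings below `threshold_ms`.
    """
    for line in lines:
        match = SLOW_RE.search(line)
        if match is None:
            continue
        duration_ms = float(match.group("value")) * UNIT_TO_MS[match.group("unit")]
        if duration_ms >= threshold_ms:
            yield duration_ms, line.rstrip("\n")


if __name__ == "__main__":
    # The warning quoted in this issue, used as sample input.
    sample = [
        r'2023-04-01 08:39:32.617638 W | etcdserver: read-only range request '
        r'"key:\"modelmesh-serving/mm_ns/fjuma1-dev/mm/modelmesh-serving/vmodels/\" " '
        r'with result "range_response_count:0 size:4" '
        r'took too long (151.354562ms) to execute',
    ]
    for duration_ms, _ in slow_requests(sample):
        print(f"{duration_ms:.3f} ms")
```

      Running the same function over the full log file (for example with `open(path)` as `lines`) and bucketing results by timestamp would show whether the slow requests cluster around the incident window.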

      ____________________________________________________

      NOTE: This task is migrated from GitHub to Jira. See previous discussion on GH.

        1. modelmesh-controller-869b44f89c-459t7-manager.log
          2.65 MB
          Vedant Mahabaleshwarkar
        2. etcd-65c8cb4797-zrc97-etcd.log
          15.16 MB
          Vedant Mahabaleshwarkar

            vmahabal@redhat.com Vedant Mahabaleshwarkar