Red Hat OpenShift Data Science / RHODS-8812

Model Serving etcd takes too long to respond


      A user was unable to deploy a model in the dev sandbox. The only error visible to the user was in the OpenShift events:

      Pod: modelmesh-serving-model-server-rht-jramirez-dev-7d56c5f7fc94f97
      Namespace: rht-jramirez-dev
      May 19, 2023, 3:21 PM
      Generated from kubelet on ip-10-0-172-120.us-east-2.compute.internal
      Exec lifecycle hook ([/opt/kserve/mmesh/stop.sh wait]) for Container "mm" in Pod "modelmesh-serving-model-server-rht-jramirez-dev-7d56c5f7fc94f97_rht-jramirez-dev(4d7b0ffb-f064-4e00-8bce-eae89e400bec)" failed - error: command '/opt/kserve/mmesh/stop.sh wait' exited with 137: , message: "waiting for litelinks process to exit after server shutdown triggered\n" 

      Among the controllers, the modelmesh-controller logs the following error:

      {"level":"error","ts":1684505461.8559515,"logger":"controller.predictor","msg":"Reconciler error","reconciler group":"serving.kserve.io","reconciler kind":"Predictor","name":"test2","namespace":"isvc_rht-jramirez-dev","error":"failed to SetVModel for InferenceService test2: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.130.13.77:8033: i/o timeout\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem \t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 \t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"} 
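The controller error above is a TCP dial timeout against 10.130.13.77:8033. A minimal sketch of a reachability probe that distinguishes a timeout from a refused connection is shown below. The helper name `probe_tcp` and the 3-second default are illustrative assumptions, and the probe would have to run from a pod inside the cluster network, since the address is a pod IP.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and classify the outcome.

    Hypothetical triage helper, not part of modelmesh or KServe.
    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        # create_connection performs the TCP handshake only; no gRPC is spoken.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the deadline, comparable to the "i/o timeout" in the log.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else: no route to host, name resolution failure, and so on.
        return f"error: {exc}"
```

For example, `probe_tcp("10.130.13.77", 8033)` returning `"timeout"` would point to a network path or policy problem or an unresponsive pod, while `"refused"` would suggest the pod is reachable but the port is not being served.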

      The etcd logs are full of lines like the following:

      2023-04-01 08:39:32.617638 W | etcdserver: read-only range request "key:\"modelmesh-serving/mm_ns/fjuma1-dev/mm/modelmesh-serving/vmodels/\" " with result "range_response_count:0 size:4" took too long (151.354562ms) to execute 
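To quantify how often these warnings occur and how slow the requests are, the etcd log could be summarized per key prefix. The following is a rough sketch that assumes every warning has the same shape as the line above and that the duration uses a single unit (ns, µs, ms, or s); the function names `parse_slow_request` and `summarize` are illustrative, not part of any etcd tooling.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Matches the escaped key and the duration in an etcd "took too long" warning.
# Assumption: the duration is a single number followed by one unit.
SLOW_RE = re.compile(
    r'key:\\"(?P<key>[^"\\]+)\\".*?'
    r'took too long \((?P<num>\d+(?:\.\d+)?)(?P<unit>ns|µs|us|ms|s)\) to execute'
)

# Conversion factors from each unit to milliseconds.
TO_MS = {"ns": 1e-6, "µs": 1e-3, "us": 1e-3, "ms": 1.0, "s": 1000.0}


def parse_slow_request(line: str) -> Optional[Tuple[str, float]]:
    """Return (key, duration in ms) for a slow-request warning, else None."""
    match = SLOW_RE.search(line)
    if match is None:
        return None
    return match.group("key"), float(match.group("num")) * TO_MS[match.group("unit")]


def summarize(lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each key to (number of warnings, slowest duration in ms)."""
    durations = defaultdict(list)
    for line in lines:
        parsed = parse_slow_request(line)
        if parsed is not None:
            key, ms = parsed
            durations[key].append(ms)
    return {key: (len(values), max(values)) for key, values in durations.items()}
```

Passing the lines of a saved etcd log to `summarize` gives, for each key, the number of warnings and the worst observed latency, which would help show whether the slowness is spread across all sandbox namespaces or concentrated in a few.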

      I suspect that etcd is not scaling well in the sandbox.

              Assignee: Unassigned
              Reporter: Vedant Mahabaleshwarkar (vmahabal@redhat.com)