  1. Red Hat OpenShift AI Engineering
  2. RHOAIENG-1232

modelmesh controller logs connection errors when KServe and ModelMesh run in the same namespace

    • Sprint 2.6, Model Serving Sprint 2.7, Model Serving Sprint 2.8, Model Serving Sprint 2.9-1, Model Serving Sprint 2.9-2, Model Serving Sprint Q2-2, Model Serving Sprint Q2-3

      When KServe and ModelMesh are running in the same namespace, the modelmesh controller logs these errors:

      {"level":"error","ts":"2023-12-07T11:33:47Z","msg":"Reconciler error","controller":"predictor","controllerGroup":"serving.kserve.io","controllerKind":"Predictor","Predictor":{"name":"caikit-tgis-example-isvc","namespace":"isvc_kserve-demo"},"namespace":"isvc_kserve-demo","name":"caikit-tgis-example-isvc","reconcileID":"868aa907-1733-408b-a8cd-482ac234f616","error":"failed to remove corresponding VModel for deleted Predictor kserve-demo/caikit-tgis-example-isvc: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.128.0.84:8033: i/o timeout\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\ns...
      {"level":"error","ts":"2023-12-07T11:33:47Z","msg":"Reconciler error","controller":"predictor","controllerGroup":"serving.kserve.io","controllerKind":"Predictor","Predictor":{"name":"example-onnx-mnist","namespace":"isvc_kserve-demo"},"namespace":"isvc_kserve-demo","name":"example-onnx-mnist","reconcileID":"4736d0d4-e010-4915-a537-07634c94d85f","error":"failed to SetVModel for InferenceService example-onnx-mnist: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.128.0.84:8033: i/o timeout\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/contro...
       

      This happens because the kserve-demo namespace is a member of the ServiceMeshMemberRoll, so traffic from the modelmesh-controller pod to the modelmesh runtime pod is blocked. As a workaround, the following NetworkPolicy can be created in the kserve-demo namespace to allow traffic from the opendatahub namespace.

      kind: NetworkPolicy
      apiVersion: networking.k8s.io/v1
      metadata:
        name: allow-from-opendatahub-ns
        namespace: kserve-demo
      spec:
        podSelector: {}
        ingress:
          - from:
              - namespaceSelector:
                  matchLabels:
                    kubernetes.io/metadata.name: opendatahub
        policyTypes:
          - Ingress  
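
      If opening all ports to the opendatahub namespace is broader than desired, a narrower variant might restrict ingress to the port seen in the error logs (8033). This is only a sketch: the policy name is hypothetical, and it assumes 8033/TCP is the only port the modelmesh-controller needs to reach on the runtime pods, which has not been verified here.

      ```yaml
      kind: NetworkPolicy
      apiVersion: networking.k8s.io/v1
      metadata:
        # Hypothetical name, chosen for illustration only
        name: allow-from-opendatahub-ns-8033
        namespace: kserve-demo
      spec:
        # Applies to every pod in kserve-demo, same as the policy above
        podSelector: {}
        ingress:
          - from:
              - namespaceSelector:
                  matchLabels:
                    kubernetes.io/metadata.name: opendatahub
            # Assumption: 8033 is taken from the "dial tcp ...:8033" log lines
            ports:
              - protocol: TCP
                port: 8033
        policyTypes:
          - Ingress
      ```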

      See the following thread for more details:
      https://redhat-internal.slack.com/archives/C065ARTVA80/p1702293019814919?thread_ts=1701693652.733169&cid=C065ARTVA80

            vajain Vaibhav Jain
