Issue Type: Bug
Priority: Major
Resolution: Done
Affects Version: RHODS_1.20.0_GA
Target Release: 1.21.0-z
Severity: High
Sprint: ML Serving Sprint 1.22, ML Serving Sprint 1.23, ML Serving Sprint 1.24
Description of problem:
On clusters with very high resources, we observed that the odh-model-controller pod hits an OOM (out-of-memory) issue.
The performance test used was the default toolchain-e2e setup test for the sandbox, which creates a large number of users and their namespaces.
https://github.com/codeready-toolchain/toolchain-e2e/tree/master/setup
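The pod log further below shows the controller starting informers for cluster-scoped kinds such as Namespace, so its cache, and therefore its memory footprint, is expected to grow with the number of namespaces the test provisions. A quick way to see how many namespaces a run has created (the 'user' prefix is only an assumption based on the --username argument in the reproduction steps below):

oc get namespaces --no-headers | wc -l            # total namespaces in the cluster
oc get namespaces --no-headers | grep -c '^user'  # namespaces created for the test users (prefix is an assumption)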
Prerequisites (if any, like setup, operators/versions):
Steps to Reproduce
- Create a cluster with m5.12xlarge master nodes
- Install RHODS
- Run the test (see the monitoring sketch below):
go run setup/main.go --users 2000 --default 2000 --custom 0 --username "user${RANDOM_NAME}" --workloads redhat-ods-operator:rhods-operator --workloads redhat-ods-applications:rhods-dashboard --workloads redhat-ods-applications:notebook-controller-deployment --workloads redhat-ods-applications:odh-notebook-controller-manager --workloads redhat-ods-applications:modelmesh-controller --workloads redhat-ods-applications:etcd --workloads redhat-ods-applications:odh-model-controller --workloads redhat-ods-monitoring:blackbox-exporter --workloads redhat-ods-monitoring:rhods-prometheus-operator --workloads redhat-ods-monitoring:prometheus
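While the load test is running, the controller's restart count and resource consumption can be checked directly on the cluster; a minimal sketch, assuming the default redhat-ods-applications namespace and the odh-model-controller pod-name prefix:

# Restart count and status (an OOM-killed container shows up as restarts / CrashLoopBackOff)
oc get pods -n redhat-ods-applications | grep odh-model-controller
# Live CPU/memory consumption via the metrics API
oc adm top pods -n redhat-ods-applications | grep odh-model-controller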
Actual results:
Average odh-model-controller CPU Usage: 0.0012
Max odh-model-controller CPU Usage: 0.0025
Average odh-model-controller Memory Usage: 21.42 MB
Max odh-model-controller Memory Usage: 46.12 MB
Reproducibility (Always/Intermittent/Only Once):
Always
Build Details:
RHODS 1.20.0-14 ('brew.registry.redhat.io/rh-osbs/iib:395124')
Workaround:
Additional info:
Pod logs:
I1214 09:11:46.994085 1 request.go:601] Waited for 1.035737371s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/route.openshift.io/v1?timeout=32s
1.6710091098061335e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.6710091098064e+09 INFO setup starting manager
1.6710091098065906e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6710091098065984e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I1214 09:11:49.806644 1 leaderelection.go:248] attempting to acquire leader lease redhat-ods-applications/odh-model-controller...
I1214 09:12:06.712832 1 leaderelection.go:258] successfully acquired lease redhat-ods-applications/odh-model-controller
1.6710091267129855e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1beta1.InferenceService"}
1.671009126713024e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1alpha1.ServingRuntime"}
1.67100912671303e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Namespace"}
1.6710091267129595e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"redhat-ods-applications","name":"odh-model-controller","uid":"ceaa6dab-6c60-43f7-8850-a6c417dd7c4a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"609836"}, "reason": "LeaderElection", "message": "odh-model-controller-5cc9dbb6cb-n2slb_3b64dfaa-8c41-4e37-9aba-41b5392b0f62 became leader"}
1.671009126713036e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Route"}
1.6710091267130482e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.ServiceAccount"}
1.6710091267130663e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Service"}
1.6710091267130752e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Secret"}
1.671009126713082e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.ClusterRoleBinding"}
1.6710091267130685e+09 INFO Starting EventSource {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "source": "kind source: *v1.Secret"}
1.671009126713098e+09 INFO Starting Controller {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret"}
1.6710091267130897e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1alpha1.ServingRuntime"}
1.6710091267131064e+09 INFO Starting Controller {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService"}
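If the container was OOM killed, the termination reason is recorded in the pod status; a minimal check, assuming the pod name that appears in the leader-election log line above:

oc get pod odh-model-controller-5cc9dbb6cb-n2slb -n redhat-ods-applications -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Prints "OOMKilled" when the container was terminated for exceeding its memory limit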
Overall report:
provisioning users... user signups: 800/2000 (40%), idler setup: 798/2000 (40%), setup users with default template: 186/2000 (9%)
metrics error: metrics value could not be retrieved for query odh-model-controller CPU Usage
Average Cluster CPU Utilisation: 4.82 %
Max Cluster CPU Utilisation: 6.14 %
Average Cluster Memory Utilisation: 6.49 %
Max Cluster Memory Utilisation: 6.79 %
Average Node Memory Usage: 9.42 %
Max Node Memory Usage: 9.95 %
Average etcd Instance Memory Usage: 852.13 MB
Max etcd Instance Memory Usage: 1206.66 MB
Average olm-operator CPU Usage: 0.1373
Max olm-operator CPU Usage: 0.2340
Average olm-operator Memory Usage: 499.24 MB
Max olm-operator Memory Usage: 552.52 MB
Average openshift-kube-apiserver: 20088.71 MB
Max openshift-kube-apiserver: 21178.97 MB
Average apiserver CPU Usage: 0.8006
Max apiserver CPU Usage: 1.2799
Average apiserver Memory Usage: 677.55 MB
Max apiserver Memory Usage: 760.98 MB
Average host-operator-controller-manager CPU Usage: 0.0310
Max host-operator-controller-manager CPU Usage: 0.0442
Average host-operator-controller-manager Memory Usage: 122.26 MB
Max host-operator-controller-manager Memory Usage: 140.29 MB
Average member-operator-controller-manager CPU Usage: 0.0571
Max member-operator-controller-manager CPU Usage: 0.1064
Average member-operator-controller-manager Memory Usage: 322.09 MB
Max member-operator-controller-manager Memory Usage: 396.97 MB
Average rhods-operator CPU Usage: 0.0669
Max rhods-operator CPU Usage: 0.0950
Average rhods-operator Memory Usage: 278.03 MB
Max rhods-operator Memory Usage: 294.56 MB
Average rhods-dashboard CPU Usage: 0.0033
Max rhods-dashboard CPU Usage: 0.0041
Average rhods-dashboard Memory Usage: 140.12 MB
Max rhods-dashboard Memory Usage: 144.39 MB
Average notebook-controller-deployment CPU Usage: 0.0023
Max notebook-controller-deployment CPU Usage: 0.0030
Average notebook-controller-deployment Memory Usage: 70.65 MB
Max notebook-controller-deployment Memory Usage: 75.94 MB
Average odh-notebook-controller-manager CPU Usage: 0.0048
Max odh-notebook-controller-manager CPU Usage: 0.0069
Average odh-notebook-controller-manager Memory Usage: 160.38 MB
Max odh-notebook-controller-manager Memory Usage: 239.06 MB
Average modelmesh-controller CPU Usage: 0.0024
Max modelmesh-controller CPU Usage: 0.0029
Average modelmesh-controller Memory Usage: 100.38 MB
Max modelmesh-controller Memory Usage: 128.35 MB
Average etcd CPU Usage: 0.0034
Max etcd CPU Usage: 0.0038
Average etcd Memory Usage: 23.05 MB
Max etcd Memory Usage: 23.54 MB
Average odh-model-controller CPU Usage: 0.0012
Max odh-model-controller CPU Usage: 0.0025
Average odh-model-controller Memory Usage: 21.42 MB
Max odh-model-controller Memory Usage: 46.12 MB
Average blackbox-exporter CPU Usage: 0.0029
Max blackbox-exporter CPU Usage: 0.0039
Average blackbox-exporter Memory Usage: 48.92 MB
Max blackbox-exporter Memory Usage: 57.42 MB
Average rhods-prometheus-operator CPU Usage: 0.0021
Max rhods-prometheus-operator CPU Usage: 0.0029
Average rhods-prometheus-operator Memory Usage: 39.61 MB
Max rhods-prometheus-operator Memory Usage: 45.39 MB
Average prometheus CPU Usage: 0.0095
Max prometheus CPU Usage: 0.0110
Average prometheus Memory Usage: 193.33 MB
Max prometheus Memory Usage: 217.10 MB
Is duplicated by: RHODS-6273 Fix performance issues in odh model controller (Closed)