- Bug
- Resolution: Done
- Normal
- OSSM 3.0.1
- None
Observing a failure in the Kiali operator logs: during the Kiali deployment, the reconciliation crashes on the following task.
The error shown in the Kiali operator logs, which is the cause of the reconciliation failure and the reason Kiali is not installed, is this:

TASK [Get api version information from the cluster] ********************************
fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": -9}

This comes from this task: https://github.com/kiali/kiali-operator/blob/v2.4.5/roles/v2.4/kiali-deploy/tasks/main.yml#L23-L25

- name: Get api version information from the cluster
  k8s_cluster_info:
  register: api_status

So it is very simple: it just calls "k8s_cluster_info" (found in the Ansible collection "kubernetes.core") to obtain details from the cluster.

The key part in the return data is likely the return code: "rc": -9. In Python, a negative return code for a command means the process was killed by that signal. So the module process received a SIGKILL (signal 9), which could very well be the kernel handling an OOM situation. That also likely explains why no stderr or stdout was captured: the process was terminated externally and its output never got flushed.

However: all nodes have 32GB RAM and the highest utilization is 19GB RAM, so nothing appears to be oversubscribed.
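For reference, a minimal standalone playbook that runs the same module can help rule out cluster/API-server issues independently of the operator. This is only an illustrative sketch, assuming the kubernetes.core collection and valid cluster credentials are available wherever it is run; the debug step is ours and not part of the operator role:

```yaml
# Illustrative only: run the same module the operator role runs, outside the operator,
# to see whether it succeeds and what it returns. Assumes kubernetes.core is installed
# and KUBECONFIG (or in-cluster credentials) points at the affected cluster.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Get api version information from the cluster
      kubernetes.core.k8s_cluster_info:
      register: api_status

    - name: Show the cluster version data the module returned
      ansible.builtin.debug:
        var: api_status.version
```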
To debug the issue I have requested the following:
As a test, I'd like to have you create the Kiali instance again, using the following option with the Kiali operator prior to deploying the Kiali instance, so we can get profiling and debug logging on the calls: https://kiali.io/docs/faq/installation/#operator-configuration

ANSIBLE_CONFIG: must be /etc/ansible/ansible.cfg or /opt/ansible/ansible-profiler.cfg. If set to /opt/ansible/ansible-profiler.cfg, a profiler report will be dumped in the operator logs after each reconciliation run. ---> We want to set it to `/opt/ansible/ansible-profiler.cfg`.

I also have a theory that the node this operator is running on might be limited in resource capacity/expansion for the processes on the node - does the behavior change if we move the operator pod to a different host? The profiler logging, however, will tell us how long we spent on the action that is crashing out and may shed some light on where the issue is occurring.

We'd need to set this env var in the Subscription for the kiali-operator so the deployment comes up with this value injected into the operator pod; see the sketch below and https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#env

This may mean uninstalling the Kiali instance and the Kiali operator, then re-installing the operator and re-deploying the Kiali instance in istio-system so we can see what's causing that repeated crash. Best guess at present is resourcing on the hosts where the operator is running and/or a timeout in getting data back from the cluster (apiserver latency).
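A sketch of what that Subscription change could look like, assuming the operator was installed through OLM. The name, namespace, channel, and source shown here are placeholders and must match the actual kiali-operator Subscription on the cluster; only the `spec.config.env` entry is the part we actually need:

```yaml
# Hypothetical example only: metadata/channel/source values are placeholders and
# must be taken from the existing kiali-operator Subscription on the cluster.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kiali-ossm                 # placeholder - use the actual subscription name
  namespace: openshift-operators   # placeholder - use the actual namespace
spec:
  channel: stable                  # placeholder - keep whatever the subscription already uses
  name: kiali-ossm                 # placeholder package name
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
      # Turns on the Ansible profiler so each reconciliation run dumps a timing
      # report into the operator logs.
      - name: ANSIBLE_CONFIG
        value: /opt/ansible/ansible-profiler.cfg
    # A nodeSelector could also be added under config to test the "different host"
    # theory, since OLM subscription config supports it as well.
```

Patching the env entry into the existing Subscription (rather than recreating it) should cause OLM to roll the operator Deployment with the new variable injected.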
// Ask for engineering - assistance is needed in identifying why Kiali is unable to deploy (the playbook continually aborts during the Kiali instance creation). Logs, inspects, and gathers are attached below.