OpenShift Service Mesh / OSSM-9659

kiali operator fails to deploy kiali instance successfully with suspected memory error during playbook operation

    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • Fix Version/s: OSSM 3.0.3, OSSM 3.1.0
    • Affects Version/s: OSSM 3.0.1
    • Component/s: Kiali
    • Release Note Text:
      Previously, the Kiali operator used the Ansible module "k8s_cluster_info" from the "kubernetes.core" collection. In some environments this module fails in the Ansible task "Get api version information from the cluster" with a return code of -9, preventing the operator from reconciling Kiali CR resources. This fix removes the call to the "k8s_cluster_info" Ansible module, avoiding the error.
    • Release Note Type: Bug Fix
    • Release Note Status: Proposed
    • Severity: Important

      Observing a failure in the Kiali operator logs: during the Kiali deployment, the playbook crashes on the task shown below.

      The error shown in the Kiali operator logs, which is the cause of the reconciliation failure and the reason Kiali is not installed, is this:
      
      TASK [Get api version information from the cluster] ********************************
      fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": -9}
      
      
      This comes from this task: https://github.com/kiali/kiali-operator/blob/v2.4.5/roles/v2.4/kiali-deploy/tasks/main.yml#L23-L25
      
      - name: Get api version information from the cluster
        k8s_cluster_info:
        register: api_status
      
      So it is very simple: the task just calls "k8s_cluster_info" (found in the Ansible collection "kubernetes.core") to obtain details from the cluster.
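
      For reference, here is a minimal sketch of how such a registered result is typically consumed. The debug task is purely illustrative (it is not part of the operator) and assumes only the documented return keys of kubernetes.core.k8s_cluster_info:

      - name: Get api version information from the cluster
        kubernetes.core.k8s_cluster_info:
        register: api_status

      # Illustrative only: dump the version details the module gathered.
      - name: Show cluster version details from the registered result
        ansible.builtin.debug:
          var: api_status.version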
      
      The key part in the return data is likely the return code: "rc": -9. In Python, a negative rc for a command means the process was killed by that signal. So the module process received a SIGKILL (signal 9), which could very well be the kernel handling an OOM situation. This also likely explains why there is no stderr or stdout captured: the process was externally terminated and its output was never flushed.
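
      As a quick illustration of that convention, here is a hypothetical standalone task (not from the operator) in which the shell kills itself with SIGKILL. Ansible surfaces the signal death as a negative return code with empty stdout/stderr, exactly like the failing task above:

      # Hypothetical demonstration: $$ is the shell's own PID, so the process
      # dies from signal 9 and Ansible reports rc -9 with no captured output.
      - name: Demonstrate a signal death surfacing as a negative rc
        ansible.builtin.shell: kill -KILL $$
        register: result
        ignore_errors: true

      - name: Show the negative return code
        ansible.builtin.debug:
          msg: "rc={{ result.rc }}"   # prints rc=-9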
      
      However: all nodes have 32 GB of RAM and the highest utilization is 19 GB. Nothing appears to be oversubscribed at the node level.
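
      Note that node-level headroom does not rule out an OOM kill: if the operator container carries a cgroup memory limit, the kernel OOM killer will SIGKILL processes in that container as soon as the limit is exceeded, regardless of free memory on the node. A hypothetical container spec fragment that would behave this way (the actual operator deployment may not set such a limit):

      # Hypothetical: exceeding 512Mi inside the container triggers the cgroup
      # OOM killer (SIGKILL) even when the node itself has plenty of free RAM.
      resources:
        requests:
          memory: 256Mi
        limits:
          memory: 512Mi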

       

      To debug the issue, I have requested the following:

       

      As a test, I'd like you to create the Kiali instance again, using the following option with the Kiali operator prior to deploying the Kiali instance, so we can get profiling and debug logging on the calls:

      https://kiali.io/docs/faq/installation/#operator-configuration

      ANSIBLE_CONFIG: must be /etc/ansible/ansible.cfg or /opt/ansible/ansible-profiler.cfg. If set to /opt/ansible/ansible-profiler.cfg, a profiler report will be dumped in the operator logs after each reconciliation run.

      ---> we want to set it to `/opt/ansible/ansible-profiler.cfg`
      
      I also have a theory that the node this operator is running on might be limited in resource capacity/expansion for the processes on the node. Does the behavior change if we move the operator pod to a different host? One way to force that test is sketched below.
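
      The OLM Subscription's config stanza supports a nodeSelector, so the operator pod can be pinned to a chosen node for this test; the hostname below is a placeholder to replace with a real node name from the cluster:

      # Sketch: pin the operator pod to a specific node to test the host theory.
      spec:
        config:
          nodeSelector:
            kubernetes.io/hostname: worker-2   # placeholder node name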
      
      The profiler logging, however, will tell us how long we spent on the action that is crashing and may shed some light on where the issue is occurring.
      We'd need to set this env var in the Subscription for the kiali-operator so the deployment comes up with this value injected into the operator pod.
      See here:
      https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#env
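
      A sketch of that Subscription change, assuming the operator was installed via OLM. The Subscription name and namespace are placeholders that vary per cluster, and the existing spec fields (channel, name, source, etc.) are omitted:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: kiali-ossm                 # placeholder Subscription name
        namespace: openshift-operators   # placeholder namespace
      spec:
        config:
          env:
            - name: ANSIBLE_CONFIG
              value: /opt/ansible/ansible-profiler.cfg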
      
      This may mean uninstalling the Kiali instance and the Kiali operator, then re-installing the operator and re-deploying the Kiali instance in istio-system so we can see what's causing that repeated crash. The best guess at present is resourcing on the hosts where the operator is running and/or a timeout in getting data back from the cluster (apiserver latency).

      //Ask for engineering: assistance is needed in identifying why Kiali is unable to deploy (the playbook continually aborts on Kiali creation). Logs, inspects, and gathers are attached below.

              Assignee: John Mazzitelli (jmazzitelli)
              Reporter: Will Russell (rhn-support-wrussell)
              Contributors: John Mazzitelli, Joseph Phillips