OpenShift Service Mesh / OSSM-9659

kiali operator fails to deploy kiali instance successfully with suspected memory error during playbook operation

    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • Fix Version/s: OSSM 3.0.3, OSSM 3.1.0
    • Affects Version/s: OSSM 3.0.1
    • Component/s: Kiali
    • Release Note Text:
      Previously, the Kiali operator used the Ansible module "k8s_cluster_info" from the "kubernetes.core" collection. In some environments this module fails in the Ansible task "Get api version information from the cluster" with a return code of -9, preventing the operator from reconciling Kiali CR resources. This fix removes the call to the "k8s_cluster_info" Ansible module, avoiding the error.
    • Release Note Type: Bug Fix
    • Release Note Status: Proposed
    • Severity: Important

      Observing a failure in the Kiali operator logs: during the Kiali deployment, the playbook crashes on the task shown below.

      The error shown in the Kiali operator logs, which is the cause of the reconciliation failure and the reason Kiali is not installed, is this:
      
      TASK [Get api version information from the cluster] ********************************
      fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": -9}
      
      
      This comes from this task: https://github.com/kiali/kiali-operator/blob/v2.4.5/roles/v2.4/kiali-deploy/tasks/main.yml#L23-L25
      
      - name: Get api version information from the cluster
        k8s_cluster_info:
        register: api_status
      
      So it is very simple: the task just calls "k8s_cluster_info" (found in the Ansible collection "kubernetes.core") to obtain details from the cluster.
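
      For reference, here is a minimal sketch of how such a registered result is typically consumed. The debug task is purely illustrative (it is not part of the operator) and assumes only the documented return keys of kubernetes.core.k8s_cluster_info:

      - name: Get api version information from the cluster
        kubernetes.core.k8s_cluster_info:
        register: api_status

      # Illustrative only: dump the version details the module gathered.
      - name: Show cluster version details from the registered result
        ansible.builtin.debug:
          var: api_status.version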
      
      The key part in the return data is likely the return code: "rc": -9. In Python, a negative rc for a command means the process was killed by that signal. So the module process received a SIGKILL (signal 9), which could very well be the kernel handling an OOM situation. This also likely explains why there is no stderr or stdout captured: the process was externally terminated and its output was never flushed.
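
      As a quick illustration of that convention, here is a hypothetical standalone task (not from the operator) in which the shell kills itself with SIGKILL. Ansible surfaces the signal death as a negative return code with empty stdout/stderr, exactly like the failing task above:

      # Hypothetical demonstration: $$ is the shell's own PID, so the process
      # dies from signal 9 and Ansible reports rc -9 with no captured output.
      - name: Demonstrate a signal death surfacing as a negative rc
        ansible.builtin.shell: kill -KILL $$
        register: result
        ignore_errors: true

      - name: Show the negative return code
        ansible.builtin.debug:
          msg: "rc={{ result.rc }}"   # prints rc=-9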
      
      However: all nodes have 32 GB of RAM and the highest utilization is 19 GB. Nothing appears to be oversubscribed at the node level.
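
      Note that node-level headroom does not rule out an OOM kill: if the operator container carries a cgroup memory limit, the kernel OOM killer will SIGKILL processes in that container as soon as the limit is exceeded, regardless of free memory on the node. A hypothetical container spec fragment that would behave this way (the actual operator deployment may not set such a limit):

      # Hypothetical: exceeding 512Mi inside the container triggers the cgroup
      # OOM killer (SIGKILL) even when the node itself has plenty of free RAM.
      resources:
        requests:
          memory: 256Mi
        limits:
          memory: 512Mi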

       

      To debug the issue, I have requested the following:

       

      As a test, I'd like you to create the Kiali instance again, using the following option with the Kiali operator prior to deploying the Kiali instance, so we can get profiling and debug logging on the calls:

      https://kiali.io/docs/faq/installation/#operator-configuration

      ANSIBLE_CONFIG: must be /etc/ansible/ansible.cfg or /opt/ansible/ansible-profiler.cfg. If set to /opt/ansible/ansible-profiler.cfg, a profiler report will be dumped in the operator logs after each reconciliation run.

      ---> we want to set it to `/opt/ansible/ansible-profiler.cfg`
      
      I also have a theory that the node this operator is running on might be limited in resource capacity/expansion for the processes on the node. Does the behavior change if we move the operator pod to a different host? One way to force that test is sketched below.
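
      The OLM Subscription's config stanza supports a nodeSelector, so the operator pod can be pinned to a chosen node for this test; the hostname below is a placeholder to replace with a real node name from the cluster:

      # Sketch: pin the operator pod to a specific node to test the host theory.
      spec:
        config:
          nodeSelector:
            kubernetes.io/hostname: worker-2   # placeholder node name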
      
      The profiler logging, however, will tell us how long we spent on the action that is crashing and may shed some light on where the issue is occurring.
      We'd need to set this env var in the Subscription for the kiali-operator so the deployment comes up with this value injected into the operator pod.
      See here:
      https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#env
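
      A sketch of that Subscription change, assuming the operator was installed via OLM. The Subscription name and namespace are placeholders that vary per cluster, and the existing spec fields (channel, name, source, etc.) are omitted:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: kiali-ossm                 # placeholder Subscription name
        namespace: openshift-operators   # placeholder namespace
      spec:
        config:
          env:
            - name: ANSIBLE_CONFIG
              value: /opt/ansible/ansible-profiler.cfg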
      
      This may mean uninstalling the Kiali instance and the Kiali operator, then re-installing the operator and re-deploying the Kiali instance in istio-system so we can see what's causing that repeated crash. The best guess at present is resourcing on the hosts where the operator is running and/or a timeout in getting data back from the cluster (apiserver latency).

      //Ask for engineering: assistance is needed in identifying why Kiali is unable to deploy (the playbook continually aborts on Kiali creation). Logs, inspects, and gathers are attached below.

              Assignee: John Mazzitelli (jmazzitelli)
              Reporter: Will Russell (rhn-support-wrussell)
              Contributors: John Mazzitelli, Joseph Phillips