OpenShift Logging / LOG-7196

Collector (Vector) restarts impact the KubeAPI, making it unavailable


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Logging 6.3.0
    • Logging 5.8.z, Logging 5.9.z, Logging 6.0.z, Logging 6.1.z, Logging 6.2.z, Logging 6.3.z
    • Log Collection
    • Incidents & Support
    • This change configures Vector to enable the kube API server caching option and the daemonset rollout strategy `maxUnavailable` when restarting the collector pods. This is a tech-preview feature which can significantly reduce control plane memory pressure in exchange for delayed rolling updates and potentially stale kube data (a verification sketch follows this field list).
    • Bug Fix
    • Log Collection - Sprint 272, Log Collection - Sprint 273
    • Critical
    • Customer Escalated
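
      A rough way to verify both knobs mentioned in the release note on a running cluster is sketched below; the secret, key, and daemonset names assume the default openshift-logging deployment and may differ per Logging version:

      # Look for the apiserver caching option in the rendered Vector configuration:
      $ oc -n openshift-logging extract secret/collector-config --keys=vector.toml --to=- | grep -i cache
      # Check the collector daemonset rollout strategy:
      $ oc -n openshift-logging get daemonset collector -o jsonpath='{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}'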

      Description of problem:

      When the collector pods (Vector) are restarted, the control plane is impacted and becomes unavailable, because the restarting collectors sharply increase the number of requests sent to the API. The dashboards below give better visibility:

      API requests:

      CPU and memory usage in the control plane:
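
      The API request spike can also be quantified outside the dashboards; the query below uses the standard apiserver_request_total metric, while the prometheus-k8s-0 pod name and the promtool invocation are assumptions about the default monitoring stack:

      $ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
          promtool query instant http://localhost:9090 \
          'sum by (verb) (rate(apiserver_request_total{verb=~"LIST|WATCH",resource=~"pods|namespaces"}[5m]))'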

      Version-Release number of selected component (if applicable):

      How reproducible:

      Every time the collector pods are restarted, either manually or by the Logging Operator when applying a configuration change

      Some information:

      $ oc get no --no-headers | wc -l
      40
      $ oc get po -A --no-headers | wc -l
      4162
      $ oc get ns --no-headers | wc -l
      473

      Number of "inputs" in the clusterLogForwarder: 38

      Steps to Reproduce:

      In the affected environments, the KubeAPI is impacted every time the collector pods are restarted
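
      A manual restart can be triggered as follows; the daemonset name "collector" is an assumption based on the default deployment:

      $ oc -n openshift-logging rollout restart daemonset/collector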

      Actual results:

      The KubeAPI returns timeouts

      Expected results:

      The KubeAPI and control plane work normally, and restarting the collector pods does not impact the control plane (KubeAPI)

      Data needed to collect:

      • number of pods: "oc get pods -A | wc -l"
      • number of namespaces: "oc get ns | wc -l"
      • Logging Operator version: "oc get csv | grep -i logging"
      • clusterLogForwarder CR (a dump command follows this list)
      • Dashboard available in "OpenShift Console > Observe > Dashboards > Dashboard: OpenShift Logging / Collection"
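
      The clusterLogForwarder item above has no explicit command; a possible way to dump it (on Logging 6.x the CR is namespaced, so add -n <namespace>):

      $ oc get clusterlogforwarder -o yaml > clusterlogforwarder.yaml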

      Possible workaround to test until the RCA is found and resolved:

      Set the clusterLogForwarder CR to Unmanaged (a command sketch follows the list below). This prevents the operator from restarting all the collector pods when a change in the Logging configuration is applied, but it has these drawbacks:

      • The Logging stack cannot be updated, because the Operator no longer manages the resources
      • Configuration changes are not applied, because the Logging Operator is not managing the resources
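
      A sketch of the workaround and its revert, assuming the CR exposes spec.managementState and is named "instance" in openshift-logging (both depend on the Logging version):

      $ oc -n openshift-logging patch clusterlogforwarder instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
      # Revert once a fix is in place:
      $ oc -n openshift-logging patch clusterlogforwarder instance --type merge -p '{"spec":{"managementState":"Managed"}}'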

        1. screenshot-1.png
          442 kB
          Oscar Casal Sanchez
        2. image-2025-05-27-10-18-49-213.png
          169 kB
          Oscar Casal Sanchez

              cahartma@redhat.com Casey Hartman
              rhn-support-ocasalsa Oscar Casal Sanchez
              Votes: 7
              Watchers: 34
