OpenShift Logging / LOG-2217

[Vector] Loss of logs when using Vector as collector.


    Sprint: Logging (Core) - Sprint 214, Logging (Core) - Sprint 215, Logging (Core) - Sprint 216

      When sending logs at a high volume and rate, a loss of around 50% of the logs is observed.

      Steps to reproduce the issue:

      1. Install the Logging and Elasticsearch operators 5.5 - preview.
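
      The operators can be installed from OperatorHub via Subscriptions. For reference, a minimal Subscription sketch for the Cluster Logging operator is below; the channel name and catalog source are assumptions for the 5.5 preview build, and the openshift-logging namespace plus its OperatorGroup are assumed to already exist. The Elasticsearch operator is installed the same way using the elasticsearch-operator package.

      cat subscription.yaml

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: cluster-logging
        namespace: openshift-logging
      spec:
        channel: "stable-5.5"              # assumed channel for the 5.5 preview
        installPlanApproval: Automatic
        name: cluster-logging
        source: redhat-operators           # assumed catalog source
        sourceNamespace: openshift-marketplace

      oc create -f subscription.yaml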

      2. Create a ClusterLogging instance with Vector as the collector (the full instance used is shown under "CL instance" below).

      3. Create a log producer pod which sends 4500000 log lines with a line length of 1024 at a rate of 150000 (see the ConfigMap below).

      oc new-project logtesta0
      
      oc label nodes --all placement=logtest
      
      cat cm.yaml 
      
      apiVersion: v1
      data:
        ocp_logtest.cfg: |
          --num-lines 4500000 --line-length 1024 --word-length 9 --rate 150000 --fixed-line
      kind: ConfigMap
      metadata:
        name: logtest-config
        namespace: logtesta0
       
      oc create -f cm.yaml
      
      cat rc.yaml 
      
      apiVersion: v1
      kind: ReplicationController
      metadata:
        generation: 1
        labels:
          run: centos-logtest
          test: centos-logtest
        name: centos-logtest
        namespace: logtesta0
      spec:
        replicas: 1
        selector:
          run: centos-logtest
          test: centos-logtest
        template:
          metadata:
            generateName: centos-logtest-
            labels:
              run: centos-logtest
              test: centos-logtest
          spec:
            containers:
            - image: quay.io/mffiedler/ocp-logtest:latest
              imagePullPolicy: Always
              name: centos-logtest
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
              - mountPath: /var/lib/svt
                name: config
            dnsPolicy: ClusterFirst
            imagePullSecrets:
            - name: default-dockercfg-ukomu
            nodeSelector:
              placement: logtest
            restartPolicy: Always
            schedulerName: default-scheduler
            securityContext: {}
            terminationGracePeriodSeconds: 30
            volumes:
            - configMap:
                defaultMode: 420
                name: logtest-config
              name: config
      
      oc create -f rc.yaml
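
      Before waiting for ingestion, it may help to confirm the producer pod is running and writing lines (a quick check; the label selector matches the ReplicationController above):

      oc get pods -n logtesta0 -l run=centos-logtest

      oc logs -n logtesta0 -l run=centos-logtest --tail=5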

      4. Wait for all the logs from the logtest pod to be sent to the default ES instance. This should take around 40 minutes.

      Check the log count in the ES instance.

      oc rsh elasticsearch-cdm-2rb3icfi-1-97546dfd-zplpc 
      
      es_util --query=app*/_count -d '{"query":{"wildcard":{"kubernetes.pod_namespace":{"value":"logtesta0*","boost":1,"rewrite":"constant_score"}}}}' 
      
      {"count":2150366,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}
      
      sh-4.4$ es_util --query=app*/_count
      {"count":2150366,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}

      The count is around 2150366 while it should be 4500000, i.e. roughly half of the logs (about 52%) never made it to ES. We have tested with Fluentd, which works fine.
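
      To rule out slow indexing rather than actual loss, the count can be polled from inside the ES pod rsh session until it stops growing, for example (a rough sketch reusing the query above):

      while true; do
        es_util --query=app*/_count -d '{"query":{"wildcard":{"kubernetes.pod_namespace":{"value":"logtesta0*","boost":1,"rewrite":"constant_score"}}}}'
        sleep 60
      done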

      How reproducible:
      Always. This issue was reported by our Performance team. 

      Cluster config:
      Server Version: 4.10.0-0.nightly-2022-02-09-111355
      Kubernetes Version: v1.23.3+759c22b

      Storage: GP2
      Cluster size: AWS 3 masters and 3 workers m6i.xlarge

      CL instance:

      apiVersion: "logging.openshift.io/v1"
      kind: "ClusterLogging"
      metadata:
        name: "instance" 
        namespace: "openshift-logging"
      spec:
        managementState: "Managed"  
        logStore:
          type: "elasticsearch"  
          retentionPolicy: 
            application:
              maxAge: 7d
            infra:
              maxAge: 7d
            audit:
              maxAge: 7d
          elasticsearch:
            nodeCount: 3 
            storage:
              storageClassName: "gp2" 
              size: 100G
            resources: 
                requests:
                  memory: "1Gi"
            proxy: 
              resources:
                limits:
                  memory: 256Mi
                requests:
                  memory: 256Mi
            redundancyPolicy: "SingleRedundancy"
        visualization:
          type: "kibana"  
          kibana:
            replicas: 1
        collection:
          logs:
            type: "vector"  
            vector: {} 
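
      After the ClusterLogging instance is created, the collector pods can be checked before starting the test (the pod/DaemonSet naming below is an assumption based on the default Vector deployment):

      oc get pods -n openshift-logging -o wide | grep -i collector

      oc -n openshift-logging get daemonset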

      Attached are the ES and CL instance status, and the collector and ES logs.

        1. clo_instance.yaml
          3 kB
        2. collector.log
          20.96 MB
        3. elasticsearch.log
          10.30 MB
        4. es_instance.yaml
          3 kB

              sninganu@redhat.com Sachin Ninganure
              rhn-support-ikanse Ishwar Kanse