- Bug
- Resolution: Done
- Normal
- Logging 5.4.0
- False
- False
- NEW
- VERIFIED
- Logging (Core) - Sprint 214, Logging (Core) - Sprint 215, Logging (Core) - Sprint 216
When sending logs at a high volume and rate, a loss of around 50% of the logs is observed.
Steps to reproduce the issue:
1. Install the Logging and Elasticsearch operators, 5.5 preview.
2. Create a ClusterLogging instance with Vector as the collector.
3. Create a log producer pod which sends 4500000 log lines with a line length of 1024 at a rate of 150000, using the commands below.
oc new-project logtesta0
oc label nodes --all placement=logtest

cat cm.yaml
apiVersion: v1
data:
  ocp_logtest.cfg: |
    --num-lines 4500000 --line-length 1024 --word-length 9 --rate 150000 --fixed-line
kind: ConfigMap
metadata:
  name: logtest-config
  namespace: logtesta0

oc create -f cm.yaml

cat rc.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  generation: 1
  labels:
    run: centos-logtest
    test: centos-logtest
  name: centos-logtest
  namespace: logtesta0
spec:
  replicas: 1
  selector:
    run: centos-logtest
    test: centos-logtest
  template:
    metadata:
      generateName: centos-logtest-
      labels:
        run: centos-logtest
        test: centos-logtest
    spec:
      containers:
      - image: quay.io/mffiedler/ocp-logtest:latest
        imagePullPolicy: Always
        name: centos-logtest
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/svt
          name: config
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: default-dockercfg-ukomu
      nodeSelector:
        placement: logtest
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: logtest-config
        name: config

oc create -f rc.yaml
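Before waiting for the full run, it can be useful to confirm the producer pod is up and emitting lines. A minimal check, assuming the run=centos-logtest label from the ReplicationController above (standard oc commands, nothing else assumed):

oc get pods -n logtesta0 -l run=centos-logtest          # producer pod should be Running
oc logs -n logtesta0 -l run=centos-logtest --tail=5     # shows a few of the generated fixed lines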
4. Wait for all the logs from the logtest pod to be sent to the default ES instance; this should take around 40 minutes.
Check the log count in the ES instance:
oc rsh elasticsearch-cdm-2rb3icfi-1-97546dfd-zplpc es_util --query=app*/_count -d '{"query":{"wildcard":{"kubernetes.pod_namespace":{"value":"logtesta0*","boost":1,"rewrite":"constant_score"}}}}'
{"count":2150366,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}

sh-4.4$ es_util --query=app*/_count
{"count":2150366,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}
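As an additional cross-check, per-index document counts and cluster health can be inspected from the same ES pod. This is a sketch assuming es_util simply forwards the query path to the local Elasticsearch instance, as with the _count queries above:

sh-4.4$ es_util --query=_cat/indices?v          # docs.count per app-* index
sh-4.4$ es_util --query=_cluster/health?pretty  # confirm the cluster is green with no failed shards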
The count is around 2150366 while it should be 4500000. We have tested the same scenario with Fluentd, which works fine (no log loss).
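For the Fluentd comparison run, the only intended difference from the ClusterLogging instance shown below is the collection stanza; a sketch of that stanza (assuming the standard fluentd collector type, everything else unchanged):

  collection:
    logs:
      type: "fluentd"
      fluentd: {}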
How reproducible:
Always. This issue was reported by our Performance team.
Cluster config:
Server Version: 4.10.0-0.nightly-2022-02-09-111355
Kubernetes Version: v1.23.3+759c22b
Storage: GP2
Cluster size: AWS 3 masters and 3 workers m6i.xlarge
CL instance:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    retentionPolicy:
      application:
        maxAge: 7d
      infra:
        maxAge: 7d
      audit:
        maxAge: 7d
    elasticsearch:
      nodeCount: 3
      storage:
        storageClassName: "gp2"
        size: 100G
      resources:
        requests:
          memory: "1Gi"
      proxy:
        resources:
          limits:
            memory: 256Mi
          requests:
            memory: 256Mi
      redundancyPolicy: "SingleRedundancy"
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  collection:
    logs:
      type: "vector"
      vector: {}
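After creating the instance, we confirm the Vector collector daemonset is fully rolled out before starting the producer. A minimal check; the collector daemonset name and the component=collector label are assumptions and may differ between releases:

oc get daemonset collector -n openshift-logging                 # DESIRED should equal READY
oc get pods -n openshift-logging -l component=collector -o wide # one collector pod per node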
Attached are the ES and CL instance status, the collector logs, and the ES logs.
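Equivalent status and logs can be collected with commands along these lines (a sketch; the Elasticsearch pod name is the one from the count query above, and the default Elasticsearch CR name and the component=collector label are assumptions):

oc get clusterlogging instance -n openshift-logging -o yaml
oc get elasticsearch elasticsearch -n openshift-logging -o yaml
oc logs -n openshift-logging -l component=collector --tail=-1
oc logs -n openshift-logging elasticsearch-cdm-2rb3icfi-1-97546dfd-zplpc -c elasticsearch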