Bug
Resolution: Obsolete
Normal
None
Logging 5.8.z, Logging 5.9.z
Incidents & Support
5
False
False
NEW
NEW
Bug Fix
Log Storage - Sprint 257, Log Storage - Sprint 258, Log Storage - Sprint 260, Log Storage - Sprint 261, Log Storage - Sprint 262, Log Storage - Sprint 263, Log Storage - Sprint 264, Log Storage - Sprint 265, Log Storage - Sprint 266, Logging - Sprint 282
Moderate
Description of problem:
As a consequence of an issue in the storage backend, it is not possible to write to the temporary storage mounted in the Loki ingester pods at `/tmp/index` and `/tmp/wal`. The errors observed look like:
$ oc logs lokistack-logging-ingester-0 | grep -i "input/output error" | head -4
2024-07-15T07:01:18.141369178Z level=error ts=2024-07-15T07:01:18.050447873Z caller=table_manager.go:143 index-store=boltdb-shipper-2024-04-24 msg="failed to upload table" table=index_19907 err="open /tmp/loki/index/index_19907/1720033200.temp: input/output error"
2024-07-15T07:01:30.641416067Z level=error ts=2024-07-15T07:01:30.604133647Z caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
2024-07-15T07:01:44.541289292Z level=error ts=2024-07-15T07:01:44.512366666Z caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
2024-07-15T07:01:45.241215638Z level=error ts=2024-07-15T07:01:45.14634612Z caller=flush.go:143 org_id=infrastructure msg="failed to flush" err="failed to flush chunks: store put chunk: open /tmp/loki/index/index_19907/1721026800: input/output error, num_chunks: 1, labels: {kubernetes_container_name=\"ovn-controller\", kubernetes_host=\"xxxxxxxxxxxxxxxxxxxxx-infra-e820000004001a-az2-s8qk7\", kubernetes_namespace_name=\"openshift-ovn-kubernetes\", kubernetes_pod_name=\"ovnkube-node-wskq9\", log_type=\"infrastructure\"}"
However, even though Loki is not able to write to `/tmp/wal` and `/tmp/loki` (failing with input/output errors), no Loki alert exists to notify the OpenShift/Loki administrator about the situation.
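The same `grep` triage shown above can be reproduced offline against a saved copy of the ingester logs. A minimal self-contained sketch (the file name and sample log lines below are illustrative, not taken from a real cluster):

```shell
# Illustrative sample of ingester log lines; in practice, save the output of
# `oc logs <ingester-pod>` to a file first.
cat > /tmp/sample-ingester.log <<'EOF'
level=error caller=table_manager.go:143 msg="failed to upload table" err="open /tmp/loki/index/index_19907/1720033200.temp: input/output error"
level=error caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
level=info msg="healthy"
EOF

# Same filter as above: count the I/O-error lines, case-insensitively.
grep -ci "input/output error" /tmp/sample-ingester.log
```

A count greater than zero is the signal an admin currently has to find by hand, since no alert surfaces it.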
Version-Release number of selected component (if applicable):
RH Loki operator 5.9.2
How reproducible:
An error must be induced in the block storage backend. As this may not be easy to reproduce, a similar situation can be created by filling the filesystems mounted at `/tmp/loki` and `/tmp/wal`.
When this happens, errors are observed in the Loki pods, but no alert indicates that Loki is not working as expected.
Steps to Reproduce:
Since the first scenario described above is not easy to reproduce, the steps below cover the second scenario, which can help to check whether any Loki metric could be used to trigger an alert:
Let's review the PVCs assigned to the Loki `ingester-0` pod:
$ oc get pvc -n openshift-logging | grep -i ingester-0
storage-logging-loki-ingester-0   Bound   pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949   10Gi    RWO   gp3-csi   16h
wal-logging-loki-ingester-0       Bound   pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793   150Gi   RWO   gp3-csi   16h
Let's review the node where the Loki `ingester-0` pod is running:
$ oc get pod logging-loki-ingester-0 -o wide -n openshift-logging
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
logging-loki-ingester-0   1/1     Running   0          16h   10.129.2.20   ip-10-0-69-202.eu-central-1.compute.internal   <none>           <none>
Let's enter the node where the Loki `ingester-0` pod is running and see where the PVCs are mounted:
$ oc debug node/ip-10-0-10-10.eu-central-1.compute.internal
sh-4.4# mount | grep pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793
/dev/nvme1n1 on /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount type ext4 (rw,relatime,seclabel)
sh-4.4# mount | grep pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949
/dev/nvme3n1 on /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount type ext4 (rw,relatime,seclabel)
sh-4.4# ls /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
index lost+found
Let's review the size of the PVCs:
sh-4.4# df -kh /dev/nvme3n1
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme3n1    9.8G  136K  9.8G   1% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
sh-4.4# df -kh /dev/nvme1n1
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    147G   59M  147G   1% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount
Let's force the PVCs to fill up:
sh-4.4# dd if=/dev/zero of=/host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount/fulldisk bs=10G
sh-4.4# dd if=/dev/zero of=/host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount/fulldisk bs=150G
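Note that `dd bs=10G` asks dd to allocate a single 10 GiB buffer in memory, which can fail on smaller nodes. A gentler way to fill the volume is `fallocate`, sketched here against a throwaway file in `/tmp` (the `TARGET` path is a placeholder; on the node it would be the PVC mount path shown above):

```shell
# Hypothetical alternative to the dd commands above: fallocate reserves the
# space without allocating a huge in-memory buffer.
TARGET=/tmp/fulldisk-demo          # on the node: <pvc-mount-path>/fulldisk
fallocate -l 1M "$TARGET"          # on the node: fallocate -l 10G "$TARGET"
stat -c %s "$TARGET"               # prints the allocated size in bytes
rm -f "$TARGET"                    # clean up the demo file
```

Either way, the goal is simply to drive the filesystem to 100% usage so the ingester starts failing writes.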
Let's verify that there is no free space left:
sh-4.4# df -kh /dev/nvme3n1
Filesystem Size Used Avail Use% Mounted on
/dev/nvme3n1 9.8G 9.8G 0 100% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
Let's do the same exercise for the Loki `ingester-1` pod and, after that, verify that both `ingester` pods are not able to write to their PVCs:
$ oc -n openshift-logging logs logging-loki-ingester-0 | grep "failed to flush chunks" | head -1
level=error ts=2024-07-25T08:53:25.89135743Z caller=flush.go:178 component=ingester loop=31 org_id=infrastructure msg="failed to flush" retries=0 err="failed to flush chunks: store put chunk: write /tmp/loki/index/index_19929/1721897100: no space left on device, num_chunks: 1, labels: {kubernetes_container_name=\"download-server\", kubernetes_host=\"ip-10-0-10-10.eu-central-1.compute.internal\", kubernetes_namespace_name=\"openshift-console\", kubernetes_pod_name=\"downloads-86d9bcf76d-k7fn9\", log_type=\"infrastructure\", service_name=\"unknown_service\"}"
$ oc logs logging-loki-ingester-0 | grep -c "failed to flush chunks"
17
$ oc -n openshift-logging logs logging-loki-ingester-1 | grep "failed to flush chunks" | head -1
level=error ts=2024-07-25T09:11:19.024982642Z caller=flush.go:178 component=ingester loop=28 org_id=infrastructure msg="failed to flush" retries=0 err="failed to flush chunks: store put chunk: write /tmp/loki/index/index_19929/1721898000: no space left on device, num_chunks: 1, labels: {kubernetes_container_name=\"network-operator\", kubernetes_host=\"ip-10-0-10-10.eu-central-1.compute.internal\", kubernetes_namespace_name=\"openshift-network-operator\", kubernetes_pod_name=\"network-operator-7cdbf58855-zvhw6\", log_type=\"infrastructure\", service_name=\"unknown_service\"}"
Actual results:
No alert is triggered indicating that Loki is not able to write the chunks in `/tmp/wal` and the tables in `/tmp/loki/`.
Expected results:
A Loki alert is triggered indicating that Loki is not able to write the chunks/tables, helping the OpenShift/Loki admin to review what is happening and react to the issue.
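As a rough sketch of what such an alert could look like, the rule below fires when a Loki ingester PVC is almost full. This is an assumption, not an existing rule: the rule name, thresholds, and label selectors are invented for illustration. The `kubelet_volume_stats_*` metrics are standard kubelet metrics, so this would cover the disk-full reproduction above, but not the backend input/output-error scenario, which would likely need a Loki-side metric (the ingester's `loki_ingester_wal_disk_full_failures_total` counter may be one candidate for the WAL case):

```yaml
# Hypothetical PrometheusRule sketch (names and thresholds are invented):
# fire when a Loki ingester PVC in openshift-logging has <5% free space.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-ingester-storage-alerts
  namespace: openshift-logging
spec:
  groups:
    - name: loki-ingester-storage
      rules:
        - alert: LokiIngesterPVCAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{namespace="openshift-logging", persistentvolumeclaim=~".*-ingester-.*"}
              / kubelet_volume_stats_capacity_bytes{namespace="openshift-logging", persistentvolumeclaim=~".*-ingester-.*"}
              < 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Loki ingester PVC {{ $labels.persistentvolumeclaim }} has less than 5% free space"
```

Whether the final alert should be volume-based, metric-based, or both is exactly the question this bug asks the Loki operator team to answer.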
Additional info:
This bug goes in the same direction as https://issues.redhat.com/browse/LOG-5470, which also aims at better alerting for Loki.
- is related to: LOG-6238 Implement alert for failed object store flushes (Closed)
- links to