  OpenShift Logging / LOG-5882

Loki failing with input/output error writing chunks and tables to /tmp/{loki,wal}, but no alert is triggered


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal
    • Affects Versions: Logging 5.8.z, Logging 5.9.z
    • Component: Log Storage
    • Incidents & Support
    • Sprints: Log Storage - Sprint 257, Log Storage - Sprint 258, Log Storage - Sprint 260, Log Storage - Sprint 261, Log Storage - Sprint 262, Log Storage - Sprint 263, Log Storage - Sprint 264, Log Storage - Sprint 265, Log Storage - Sprint 266, Logging - Sprint 282
    • Severity: Moderate

      Description of problem:

      As a consequence of an issue in the storage backend, it is not possible to write to the temporary storage mounted in the Loki ingester pods at `/tmp/loki` and `/tmp/wal`. The errors observed look like:

      $ oc logs lokistack-logging-ingester-0 |grep -i "input/output error" |head -4
      2024-07-15T07:01:18.141369178Z level=error ts=2024-07-15T07:01:18.050447873Z caller=table_manager.go:143 index-store=boltdb-shipper-2024-04-24 msg="failed to upload table" table=index_19907 err="open /tmp/loki/index/index_19907/1720033200.temp: input/output error"
      2024-07-15T07:01:30.641416067Z level=error ts=2024-07-15T07:01:30.604133647Z caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
      2024-07-15T07:01:44.541289292Z level=error ts=2024-07-15T07:01:44.512366666Z caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
      2024-07-15T07:01:45.241215638Z level=error ts=2024-07-15T07:01:45.14634612Z caller=flush.go:143 org_id=infrastructure msg="failed to flush" err="failed to flush chunks: store put chunk: open /tmp/loki/index/index_19907/1721026800: input/output error, num_chunks: 1, labels: {kubernetes_container_name=\"ovn-controller\", kubernetes_host=\"xxxxxxxxxxxxxxxxxxxxx-infra-e820000004001a-az2-s8qk7\", kubernetes_namespace_name=\"openshift-ovn-kubernetes\", kubernetes_pod_name=\"ovnkube-node-wskq9\", log_type=\"infrastructure\"}" 

      However, even when Loki is unable to write to `/tmp/wal` and `/tmp/loki` and fails with input/output errors, no Loki alert exists that notifies the OpenShift/Loki administrator about the situation.
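      One possible shape for such an alert is a PrometheusRule that fires when an ingester PVC is nearly full, based on the standard kubelet volume metrics. This is a minimal sketch only: the rule name, threshold, and PVC name pattern below are illustrative assumptions, not an actual operator-shipped rule.

```yaml
# Sketch only: fire when a Loki ingester PVC has less than 5% free space.
# Names, threshold, and label matchers are assumptions for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-ingester-storage   # hypothetical name
  namespace: openshift-logging
spec:
  groups:
    - name: loki-storage.rules
      rules:
        - alert: LokiIngesterVolumeAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{namespace="openshift-logging",persistentvolumeclaim=~".*-ingester-.*"}
              / kubelet_volume_stats_capacity_bytes{namespace="openshift-logging",persistentvolumeclaim=~".*-ingester-.*"}
              < 0.05
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Loki ingester PVC {{ $labels.persistentvolumeclaim }} is almost full"
```

      Note that volume fullness alone would not catch the original input/output error case, which is why a Loki-side metric on failed writes or flushes would be the more general signal.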

      Version-Release number of selected component (if applicable):

      RH Loki operator 5.9.2

      How reproducible:

      An error must be induced in the block storage backend. As this is not easy to reproduce, a similar situation can be created by filling up the filesystems mounted at `/tmp/loki` and `/tmp/wal`.

      When this happens, errors are observed in the Loki pods, but no alert indicates that Loki is not working as expected.

      Steps to Reproduce:

      Since the first scenario is not easy to reproduce, the steps below describe the second scenario, which can help determine whether any Loki metric could be used to trigger an alert:

      Let's review the PVCs assigned to the Loki `ingester-0` pod:

      $ oc get pvc -n openshift-logging|grep -i ingester-0
      storage-logging-loki-ingester-0        Bound    pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949   10Gi       RWO            gp3-csi        16h
      wal-logging-loki-ingester-0            Bound    pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793   150Gi      RWO            gp3-csi        16h

      Let's review the node where the Loki `ingester-0` pod is running:

      $ oc get pod logging-loki-ingester-0 -o wide -n openshift-logging
      NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      logging-loki-ingester-0   1/1     Running   0          16h   10.129.2.20   ip-10-0-69-202.eu-central-1.compute.internal   <none>           <none>
      

      Let's enter the node where the Loki `ingester-0` pod is running and see where the PVCs are mounted:

      $ oc debug node/ip-10-0-10-10.eu-central-1.compute.internal
      
      sh-4.4# mount |grep pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793
      /dev/nvme1n1 on /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount type ext4 (rw,relatime,seclabel)
      
      sh-4.4# mount |grep pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949
      /dev/nvme3n1 on /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount type ext4 (rw,relatime,seclabel)
      
      sh-4.4# ls /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
      index  lost+found
      

      Let's review the size of the PVCs:

      sh-4.4# df -kh /dev/nvme3n1
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/nvme3n1    9.8G  136K  9.8G   1% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
      
      sh-4.4# df -kh /dev/nvme1n1
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/nvme1n1    147G   59M  147G   1% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount
      

      Let's fill the PVCs to capacity:

      sh-4.4# dd if=/dev/zero of=/host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount/fulldisk bs=10G
      
      sh-4.4# dd if=/dev/zero of=/host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount/fulldisk bs=150G
      
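      As a side note, `fallocate` reserves space instantly instead of streaming zeros through `dd`; on the node you would point it at the PVC mount paths shown above with a size matching the volume. A small self-contained sketch:

```shell
# Sketch: reserve space in one call instead of writing zeros with dd.
# Demonstrated on a 1 MiB temp file; on the node, target the PVC mount
# path from the steps above with a size matching the volume.
f=$(mktemp)
fallocate -l 1M "$f"
stat -c %s "$f"    # prints the reserved size in bytes: 1048576
rm -f "$f"
```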

      Let's verify that there is no free space left:

      sh-4.4# df -kh /dev/nvme3n1
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/nvme3n1    9.8G  9.8G     0 100% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
      

      Let's do the same exercise for the Loki `ingester-1` pod and, after that, verify that both `ingester` pods are unable to write to their PVCs:

      $ oc -n openshift-logging logs logging-loki-ingester-0|grep "failed to flush chunks" |head -1
      level=error ts=2024-07-25T08:53:25.89135743Z caller=flush.go:178 component=ingester loop=31 org_id=infrastructure msg="failed to flush" retries=0 err="failed to flush chunks: store put chunk: write /tmp/loki/index/index_19929/1721897100: no space left on device, num_chunks: 1, labels: \{kubernetes_container_name=\"download-server\", kubernetes_host=\"ip-10-0-10-10.eu-central-1.compute.internal\", kubernetes_namespace_name=\"openshift-console\", kubernetes_pod_name=\"downloads-86d9bcf76d-k7fn9\", log_type=\"infrastructure\", service_name=\"unknown_service\"}"
      
      $ oc logs logging-loki-ingester-0|grep -c "failed to flush chunks" 
      17
      
      $ oc -n openshift-logging logs logging-loki-ingester-1|grep "failed to flush chunks" |head -1
      level=error ts=2024-07-25T09:11:19.024982642Z caller=flush.go:178 component=ingester loop=28 org_id=infrastructure msg="failed to flush" retries=0 err="failed to flush chunks: store put chunk: write /tmp/loki/index/index_19929/1721898000: no space left on device, num_chunks: 1, labels: \{kubernetes_container_name=\"network-operator\", kubernetes_host=\"ip-10-0-10-10.eu-central-1.compute.internal\", kubernetes_namespace_name=\"openshift-network-operator\", kubernetes_pod_name=\"network-operator-7cdbf58855-zvhw6\", log_type=\"infrastructure\", service_name=\"unknown_service\"}"
      
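      For the alert itself, Loki's own ingester metrics would be a more direct signal than grepping logs. Two candidate PromQL expressions follow; the counter names are taken from upstream Loki as an assumption and should be verified against the ingester's /metrics endpoint in this deployment before building a rule on them:

```promql
# Assumed upstream counter: WAL writes failing because the disk is full.
rate(loki_ingester_wal_disk_full_failures_total[5m]) > 0

# Assumed upstream counter: chunk flushes to the store that failed.
increase(loki_ingester_chunks_flush_failures_total[15m]) > 0
```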

      Actual results:

      No alert is triggered indicating that Loki is unable to write the chunks in `/tmp/wal` and the tables in `/tmp/loki/`.

      Expected results:

      A Loki alert is triggered indicating that the chunks/tables cannot be written, helping the OpenShift/Loki administrator review what is happening and react to the issue.

      Additional info:

      This bug goes in the same direction as https://issues.redhat.com/browse/LOG-5470, which also aims at better alerting for Loki.

              Assignee: rojacob@redhat.com (Robert Jacob)
              Reporter: rhn-support-ocasalsa (Oscar Casal Sanchez)
              Votes: 0
              Watchers: 7