  OpenShift Logging / LOG-5882

Loki failing with input/output error writing chunks and tables to /tmp/{loki,wal}, but no alert is triggered


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal
    • Affects Versions: Logging 5.8.z, Logging 5.9.z
    • Component: Log Storage
    • Incidents & Support
    • Sprints: Log Storage - Sprint 257, Log Storage - Sprint 258, Log Storage - Sprint 260, Log Storage - Sprint 261, Log Storage - Sprint 262, Log Storage - Sprint 263, Log Storage - Sprint 264, Log Storage - Sprint 265, Log Storage - Sprint 266, Logging - Sprint 282
    • Severity: Moderate

      Description of problem:

      As a consequence of an issue in the storage backend, it is not possible to write to the temporary storage mounted in the Loki ingester pods at `/tmp/loki` and `/tmp/wal`. The errors observed look like:

      $ oc logs lokistack-logging-ingester-0 |grep -i "input/output error" |head -4
      2024-07-15T07:01:18.141369178Z level=error ts=2024-07-15T07:01:18.050447873Z caller=table_manager.go:143 index-store=boltdb-shipper-2024-04-24 msg="failed to upload table" table=index_19907 err="open /tmp/loki/index/index_19907/1720033200.temp: input/output error"
      2024-07-15T07:01:30.641416067Z level=error ts=2024-07-15T07:01:30.604133647Z caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
      2024-07-15T07:01:44.541289292Z level=error ts=2024-07-15T07:01:44.512366666Z caller=wlog.go:243 msg="Failed to calculate size of \"wal\" dir" err="lstat /tmp/wal: input/output error"
      2024-07-15T07:01:45.241215638Z level=error ts=2024-07-15T07:01:45.14634612Z caller=flush.go:143 org_id=infrastructure msg="failed to flush" err="failed to flush chunks: store put chunk: open /tmp/loki/index/index_19907/1721026800: input/output error, num_chunks: 1, labels: {kubernetes_container_name=\"ovn-controller\", kubernetes_host=\"xxxxxxxxxxxxxxxxxxxxx-infra-e820000004001a-az2-s8qk7\", kubernetes_namespace_name=\"openshift-ovn-kubernetes\", kubernetes_pod_name=\"ovnkube-node-wskq9\", log_type=\"infrastructure\"}" 

      However, even when Loki is unable to write to `/tmp/wal` and `/tmp/loki` and fails with input/output errors, no Loki alert exists that notifies the OpenShift/Loki administrator about the situation.
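      One possible shape for such an alert is a PrometheusRule that fires when an ingester PVC is nearly full, based on the standard kubelet volume metrics. This is a minimal sketch only: the rule name, threshold, and PVC name pattern below are illustrative assumptions, not an actual operator-shipped rule.

```yaml
# Sketch only: fire when a Loki ingester PVC has less than 5% free space.
# Names, threshold, and label matchers are assumptions for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-ingester-storage   # hypothetical name
  namespace: openshift-logging
spec:
  groups:
    - name: loki-storage.rules
      rules:
        - alert: LokiIngesterVolumeAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{namespace="openshift-logging",persistentvolumeclaim=~".*-ingester-.*"}
              / kubelet_volume_stats_capacity_bytes{namespace="openshift-logging",persistentvolumeclaim=~".*-ingester-.*"}
              < 0.05
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Loki ingester PVC {{ $labels.persistentvolumeclaim }} is almost full"
```

      Note that volume fullness alone would not catch the original input/output error case, which is why a Loki-side metric on failed writes or flushes would be the more general signal.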

      Version-Release number of selected component (if applicable):

      RH Loki operator 5.9.2

      How reproducible:

      An error must be induced in the block storage backend. As this is not easy to reproduce, a similar situation can be created by filling up the filesystems mounted at `/tmp/loki` and `/tmp/wal`.

      When this happens, errors are observed in the Loki pods, but no alert indicates that Loki is not working as expected.

      Steps to Reproduce:

      Since the first scenario is not easy to reproduce, the steps below describe the second scenario, which can help determine whether any Loki metric could be used to trigger an alert:

      Let's review the PVCs assigned to the Loki `ingester-0` pod:

      $ oc get pvc -n openshift-logging|grep -i ingester-0
      storage-logging-loki-ingester-0        Bound    pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949   10Gi       RWO            gp3-csi        16h
      wal-logging-loki-ingester-0            Bound    pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793   150Gi      RWO            gp3-csi        16h

      Let's review the node where the Loki `ingester-0` pod is running:

      $ oc get pod logging-loki-ingester-0 -o wide -n openshift-logging
      NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
      logging-loki-ingester-0   1/1     Running   0          16h   10.129.2.20   ip-10-0-69-202.eu-central-1.compute.internal   <none>           <none>
      

      Let's enter the node where the Loki `ingester-0` pod is running and see where the PVCs are mounted:

      $ oc debug node/ip-10-0-10-10.eu-central-1.compute.internal
      
      sh-4.4# mount |grep pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793
      /dev/nvme1n1 on /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount type ext4 (rw,relatime,seclabel)
      
      sh-4.4# mount |grep pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949
      /dev/nvme3n1 on /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount type ext4 (rw,relatime,seclabel)
      
      sh-4.4# ls /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
      index  lost+found
      

      Let's review the size of the PVCs:

      sh-4.4# df -kh /dev/nvme3n1
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/nvme3n1    9.8G  136K  9.8G   1% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
      
      sh-4.4# df -kh /dev/nvme1n1
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/nvme1n1    147G   59M  147G   1% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount
      

      Let's fill the PVCs to capacity:

      sh-4.4# dd if=/dev/zero of=/host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount/fulldisk bs=10G
      
      sh-4.4# dd if=/dev/zero of=/host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-419cdc49-6cd8-4dea-8be2-b5af06a23793/mount/fulldisk bs=150G
      
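      As a side note, `fallocate` reserves space instantly instead of streaming zeros through `dd`; on the node you would point it at the PVC mount paths shown above with a size matching the volume. A small self-contained sketch:

```shell
# Sketch: reserve space in one call instead of writing zeros with dd.
# Demonstrated on a 1 MiB temp file; on the node, target the PVC mount
# path from the steps above with a size matching the volume.
f=$(mktemp)
fallocate -l 1M "$f"
stat -c %s "$f"    # prints the reserved size in bytes: 1048576
rm -f "$f"
```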

      Let's verify that there is no free space left:

      sh-4.4# df -kh /dev/nvme3n1
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/nvme3n1    9.8G  9.8G     0 100% /host/var/lib/kubelet/pods/4d48f154-6046-4408-a04a-fb2dbc2cbe12/volumes/kubernetes.io~csi/pvc-5ba193da-ac5b-48dd-a0c9-c6dd541af949/mount
      

      Let's do the same exercise for the Loki `ingester-1` pod and, after that, verify that both `ingester` pods are unable to write to their PVCs:

      $ oc -n openshift-logging logs logging-loki-ingester-0|grep "failed to flush chunks" |head -1
      level=error ts=2024-07-25T08:53:25.89135743Z caller=flush.go:178 component=ingester loop=31 org_id=infrastructure msg="failed to flush" retries=0 err="failed to flush chunks: store put chunk: write /tmp/loki/index/index_19929/1721897100: no space left on device, num_chunks: 1, labels: \{kubernetes_container_name=\"download-server\", kubernetes_host=\"ip-10-0-10-10.eu-central-1.compute.internal\", kubernetes_namespace_name=\"openshift-console\", kubernetes_pod_name=\"downloads-86d9bcf76d-k7fn9\", log_type=\"infrastructure\", service_name=\"unknown_service\"}"
      
      $ oc logs logging-loki-ingester-0|grep -c "failed to flush chunks" 
      17
      
      $ oc -n openshift-logging logs logging-loki-ingester-1|grep "failed to flush chunks" |head -1
      level=error ts=2024-07-25T09:11:19.024982642Z caller=flush.go:178 component=ingester loop=28 org_id=infrastructure msg="failed to flush" retries=0 err="failed to flush chunks: store put chunk: write /tmp/loki/index/index_19929/1721898000: no space left on device, num_chunks: 1, labels: \{kubernetes_container_name=\"network-operator\", kubernetes_host=\"ip-10-0-10-10.eu-central-1.compute.internal\", kubernetes_namespace_name=\"openshift-network-operator\", kubernetes_pod_name=\"network-operator-7cdbf58855-zvhw6\", log_type=\"infrastructure\", service_name=\"unknown_service\"}"
      
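      For the alert itself, Loki's own ingester metrics would be a more direct signal than grepping logs. Two candidate PromQL expressions follow; the counter names are taken from upstream Loki as an assumption and should be verified against the ingester's /metrics endpoint in this deployment before building a rule on them:

```promql
# Assumed upstream counter: WAL writes failing because the disk is full.
rate(loki_ingester_wal_disk_full_failures_total[5m]) > 0

# Assumed upstream counter: chunk flushes to the store that failed.
increase(loki_ingester_chunks_flush_failures_total[15m]) > 0
```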

      Actual results:

      No alert is triggered indicating that Loki is unable to write the chunks in `/tmp/wal` and the tables in `/tmp/loki/`.

      Expected results:

      A Loki alert is triggered indicating that the chunks/tables cannot be written, helping the OpenShift/Loki administrator review what is happening and react to the issue.

      Additional info:

      This bug goes in the same direction as https://issues.redhat.com/browse/LOG-5470, which also aims at better alerting for Loki.

              Assignee: rojacob@redhat.com (Robert Jacob)
              Reporter: rhn-support-ocasalsa (Oscar Casal Sanchez)
              Votes: 0
              Watchers: 7