OpenShift Logging / LOG-7474

Log ingestion stops with ingester error autoforget unhealthy

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • None
    • Logging 6.2.3
    • Log Storage
    • Incidents & Support
    • False
    • False
    • NEW
    • NEW
    • Bug Fix
    • Important

      Description of problem:

      In Loki 6.2.3, which adds the "autoforget_unhealthy" feature, the second Loki ingester pod is 1/1, but it is not able to "autoforget" the unhealthy ingester and logs the following errors:

      $ oc logs logging-loki-ingester-1|grep "found an existing instance"|tail -1
      2025-07-10T11:08:04.944659176+05:30 level=warn ts=2025-07-10T05:38:04.847930321Z caller=lifecycler.go:295 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.131.2.107:9095 past heartbeat timeout"
      
      $ oc logs logging-loki-ingester-1|grep "autoforget have seen"|tail -1
      2025-07-10T14:36:21.933243008+05:30 level=warn ts=2025-07-10T09:06:21.871959622Z caller=ingester.go:491 component=ingester msg="autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round"
      

      Because there are only 2 Loki ingesters and the default replication factor is 2, logs are not stored in Loki while one ingester is unhealthy:

      $ oc get cm logging-loki-config -o jsonpath='{.data.config\.yaml}'|grep -i replication_factor
            replication_factor: 2
      

      The Loki distributor log indicates that there are not enough live Loki ingester replicas:

      $ oc logs logging-loki-distributor-744f448548-p9fng  |grep "at least 2 live replicas required"|tail -1
      2025-07-10T14:36:20.132251245+05:30 level=warn ts=2025-07-10T09:06:20.062295748Z caller=logging.go:128 orgID=infrastructure msg="POST /loki/api/v1/push (500) 1.789297ms Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: x.x.x.x:9095\\n\" ws: false; Accept-Encoding: identity; Content-Encoding: snappy; Content-Length: 91598; Content-Type: application/x-protobuf; User-Agent: Vector/0.37.1 (x86_64-unknown-linux-gnu); X-Forwarded-For: x.x.x.x; X-Forwarded-Prefix: /api/logs/v1/infrastructure; X-Scope-Orgid: infrastructure; "
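
      The "at least 2 live replicas required, could only find 1" message above is consistent with a quorum rule of floor(replication_factor / 2) + 1. The sketch below only illustrates that arithmetic. The rule is an assumption inferred from the error message, and the function names min_live_replicas and write_status are hypothetical helpers, not part of Loki or oc.

```shell
#!/usr/bin/env bash
# Sketch: minimum number of live ingester replicas needed for a write,
# ASSUMING the quorum rule floor(replication_factor / 2) + 1.
min_live_replicas() {
  local replication_factor=$1
  echo $(( replication_factor / 2 + 1 ))
}

# Hypothetical helper: prints "ok" if a write could reach quorum,
# "blocked" otherwise.
write_status() {
  local replication_factor=$1 live=$2
  if (( live >= $(min_live_replicas "$replication_factor") )); then
    echo ok
  else
    echo blocked
  fi
}

# Values from this report: replication_factor 2, 1 live ingester.
write_status 2 1
# For comparison only: replication_factor 3 with 2 live ingesters.
write_status 3 2
```

      Under this assumption, a replication factor of 2 tolerates no unhealthy ingester, which matches the 500 responses on /loki/api/v1/push shown above.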
      

      Version-Release number of selected component (if applicable):

      Loki 6.2.3
      Loki size: 1x.small

      How reproducible:

      Not able to reproduce so far

      Steps to Reproduce:

      Actual results:

      The second Loki ingester is not able to join the cluster and logs the following errors:

      $ oc logs logging-loki-ingester-1|grep "found an existing instance"|tail -1
      2025-07-10T11:08:04.944659176+05:30 level=warn ts=2025-07-10T05:38:04.847930321Z caller=lifecycler.go:295 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.131.2.107:9095 past heartbeat timeout"
      
      $ oc logs logging-loki-ingester-1|grep "autoforget have seen"|tail -1
      2025-07-10T14:36:21.933243008+05:30 level=warn ts=2025-07-10T09:06:21.871959622Z caller=ingester.go:491 component=ingester msg="autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round"
      

      Expected results:

      The unhealthy Loki ingester is forgotten and the second Loki ingester is able to join the cluster

      Additional info:

      *Data needed to provide*

              Assignee: Unassigned
              Reporter: Oscar Casal Sanchez (rhn-support-ocasalsa)