- Bug
- Resolution: Unresolved
- Normal
- None
- Logging 6.2.3
- Incidents & Support
- False
- False
- NEW
- NEW
- Bug Fix
- Important
Description of problem:
In Loki 6.2.3, which adds the "autoforget_unhealthy" feature, it is observed that the second Loki Ingester is 1/1 but is not able to "autoforget" the unhealthy ingester, throwing the following errors:
$ oc logs logging-loki-ingester-1|grep "found an existing instance"|tail -1
2025-07-10T11:08:04.944659176+05:30 level=warn ts=2025-07-10T05:38:04.847930321Z caller=lifecycler.go:295 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.131.2.107:9095 past heartbeat timeout"
$ oc logs logging-loki-ingester-1|grep "autoforget have seen"|tail -1
2025-07-10T14:36:21.933243008+05:30 level=warn ts=2025-07-10T09:06:21.871959622Z caller=ingester.go:491 component=ingester msg="autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round"
As there are only 2 Loki Ingesters and the replication factor defaults to 2, logs are not stored in Loki:
$ oc get cm logging-loki-config -o jsonpath='{.data.config\.yaml}'|grep -i replication_factor
replication_factor: 2
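For context, the write quorum enforced by the ring is, as an assumption based on the usual dskit quorum rule, floor(replication_factor / 2) + 1 successful ingester writes. With replication_factor: 2 and one of the two ingesters unhealthy, that quorum of 2 can never be met, which matches the "at least 2 live replicas required, could only find 1" distributor error below. A minimal sketch of the arithmetic:

```shell
# Sketch of the assumed quorum rule: quorum = floor(replication_factor / 2) + 1.
# With rf=2, two live ingesters are required, so a single unhealthy
# ingester is enough to fail every write.
rf=2                      # replication_factor from logging-loki-config
quorum=$(( rf / 2 + 1 ))
echo "live ingester replicas required: ${quorum}"
```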
The Loki distributor log indicates that there are not enough live Loki Ingester replicas:
$ oc logs logging-loki-distributor-744f448548-p9fng |grep "at least 2 live replicas required"|tail -1
2025-07-10T14:36:20.132251245+05:30 level=warn ts=2025-07-10T09:06:20.062295748Z caller=logging.go:128 orgID=infrastructure msg="POST /loki/api/v1/push (500) 1.789297ms Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: x.x.x.x:9095\\n\" ws: false; Accept-Encoding: identity; Content-Encoding: snappy; Content-Length: 91598; Content-Type: application/x-protobuf; User-Agent: Vector/0.37.1 (x86_64-unknown-linux-gnu); X-Forwarded-For: x.x.x.x; X-Forwarded-Prefix: /api/logs/v1/infrastructure; X-Scope-Orgid: infrastructure; "
Version-Release number of selected component (if applicable):
Loki 6.2.3
Loki size: 1x.small
How reproducible:
Not able to reproduce until now.
Steps to Reproduce:
Actual results:
The second Loki Ingester is not able to join the ring, failing with the following errors:
$ oc logs logging-loki-ingester-1|grep "found an existing instance"|tail -1
2025-07-10T11:08:04.944659176+05:30 level=warn ts=2025-07-10T05:38:04.847930321Z caller=lifecycler.go:295 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.131.2.107:9095 past heartbeat timeout"
$ oc logs logging-loki-ingester-1|grep "autoforget have seen"|tail -1
2025-07-10T14:36:21.933243008+05:30 level=warn ts=2025-07-10T09:06:21.871959622Z caller=ingester.go:491 component=ingester msg="autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round"
Expected results:
The unhealthy ingester is autoforgotten and the second Loki Ingester is able to join the ring.
Additional info:
*Data needed to provide*
- Enable log level debug in Loki. Follow the article https://access.redhat.com/solutions/7049665
- Get the Loki Ring status. Follow step 1 of the Workaround in the Resolution section of the article - https://access.redhat.com/solutions/7122281
- Get a partial Prometheus dump providing the chunks for the time window in which the previous steps are done. Follow the article access.redhat.com/solutions/5482971
- Provide a logging must-gather captured while the issue is occurring. Follow the documentation - https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/support#cluster-logging-must-gather-collecting_cluster-logging-support
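As a convenience for the ring-status step above, the /ring endpoint mentioned in the lifecycler warning can be inspected from a workstation. This is a hypothetical sketch: the namespace, service name, and HTTP port are assumptions and must be adjusted to the actual LokiStack deployment.

```shell
# Hypothetical sketch: inspect the ingester ring via the distributor's /ring
# endpoint (referenced in the lifecycler warning). The namespace, service
# name, and port below are assumptions; adjust them to your deployment.
oc -n openshift-logging port-forward svc/logging-loki-distributor-http 3100:3100 &
sleep 2
# The ring page lists each ingester instance with its state and last heartbeat,
# which shows whether the unhealthy instance is still registered.
curl -s http://localhost:3100/ring
kill %1
```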