Type: Bug
Status: NEW
Resolution: Obsolete
Priority: Major
Severity: Moderate
Release Note Type: Bug Fix
Customer Escalated
Description of problem:
Prometheus cannot scrape the collector (Fluentd) metrics in a dual-stack cluster.
The "CollectorNodeDown" alert fires continuously for all collectors when the fluentd collector is used in a dual-stack cluster.
Checking with ss for IPv6 listeners on port 24231:
$ oc -n openshift-logging get pods -l component=collector -o=custom-columns=:metadata.name --no-headers | xargs -r -I {} oc -n openshift-logging exec {} -c collector -- bash -c 'echo -n {}:; ss -6 -lt | grep 24231;'
collector-6pd7j:LISTEN 0 4096 [::]:24231 [::]:*
collector-9wv7l:LISTEN 0 4096 [::]:24231 [::]:*
collector-czc5r:LISTEN 0 4096 [::]:24231 [::]:*
collector-dp2pk:LISTEN 0 4096 [::]:24231 [::]:*
collector-g59z6:LISTEN 0 4096 [::]:24231 [::]:*
collector-kn6hf:LISTEN 0 4096 [::]:24231 [::]:*
collector-lqjf5:LISTEN 0 4096 [::]:24231 [::]:*
collector-t6mqs:LISTEN 0 4096 [::]:24231 [::]:*
collector-vm2hn:LISTEN 0 4096 [::]:24231 [::]:*
collector-wtk6w:LISTEN 0 4096 [::]:24231 [::]:*
Checking with ss for IPv4 listeners on port 24231:
$ oc -n openshift-logging get pods -l component=collector -o=custom-columns=:metadata.name --no-headers | xargs -r -I {} oc -n openshift-logging exec {} -c collector -- bash -c 'echo -n {}:; ss -4 -lt | grep 24231;'
collector-6pd7j:command terminated with exit code 1
collector-9wv7l:command terminated with exit code 1
collector-czc5r:command terminated with exit code 1
collector-dp2pk:command terminated with exit code 1
collector-g59z6:command terminated with exit code 1
collector-kn6hf:command terminated with exit code 1
collector-lqjf5:command terminated with exit code 1
collector-t6mqs:command terminated with exit code 1
collector-vm2hn:command terminated with exit code 1
collector-wtk6w:command terminated with exit code 1
No collector is listening on IPv4, so grep finds no match and each exec exits with code 1. Logging itself works fine; the alert fires only because Prometheus cannot connect to the collector:
$ oc project openshift-monitoring
$ oc rsh prometheus-k8s-0
sh-4.4$ curl -kv https://x.x.x.x:24231/metrics
* Trying x.x.x.x...
* TCP_NODELAY set
* connect to x.x.x.x port 24231 failed: Connection refused
* Failed to connect to x.x.x.x port 24231: Connection refused
* Closing connection 0
curl: (7) Failed to connect to x.x.x.x port 24231: Connection refused
The following workaround fixed the issue (a shell sketch follows the list):
- Set clusterlogging to Unmanaged.
- Take a backup of the collector-config ConfigMap.
- In the collector-config ConfigMap, change the line bind "#{ENV['PROM_BIND_IP']}" to bind "0.0.0.0".
- Save the ConfigMap and restart the collector pods.
- Verified that curl from Prometheus to the collector pods succeeded and the "CollectorNodeDown" alert cleared.
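A minimal shell sketch of the workaround above (assumptions: the bind line appears verbatim in the collector-config ConfigMap, deleted collector pods are recreated by their daemonset, and the sed pattern may need adjusting to the exact file contents):
# Stop the operator from reconciling the collector config.
$ oc -n openshift-logging patch clusterlogging/instance --type merge \
    -p '{"spec":{"managementState":"Unmanaged"}}'
# Back up the current ConfigMap.
$ oc -n openshift-logging get configmap collector-config -o yaml > collector-config.bak.yaml
# Rewrite the templated bind address to the IPv4 wildcard and re-apply.
$ oc -n openshift-logging get configmap collector-config -o yaml \
    | sed 's/bind "#{ENV\[.PROM_BIND_IP.\]}"/bind "0.0.0.0"/' \
    | oc apply -f -
# Restart the collectors so fluentd re-reads the config.
$ oc -n openshift-logging delete pods -l component=collector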
Version-Release number of selected component (if applicable):
The bug is present only with the fluentd collector, not with vector.
Logging versions 5.8.2 and 5.8.3.
How reproducible: 100%
Steps to Reproduce:
- Deploy a dual-stack cluster.
- Install logging operator version 5.8.2.
- Set up clusterlogging/instance with the fluentd collector.
- Check connectivity from Prometheus to the collector metrics endpoint (see the sketch after this list).
- After some time, the "CollectorNodeDown" alert starts firing.
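A hedged check for the connectivity step above, assuming Prometheus reaches the collectors on their primary (IPv4) pod IPs; a refused connection prints HTTP code 000:
$ for ip in $(oc -n openshift-logging get pods -l component=collector \
      -o jsonpath='{.items[*].status.podIP}'); do
    oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
      curl -sk -o /dev/null -w "$ip %{http_code}\n" "https://$ip:24231/metrics"
  done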
Actual results:
Prometheus is unable to curl the collector metrics endpoint, and the "CollectorNodeDown" alert fires continuously.
Expected results:
Prometheus should be able to curl the collector metrics endpoint, and the "CollectorNodeDown" alert should not fire.