- Bug
- Resolution: Not a Bug
- Normal
- None
- Logging 5.3.z
- False
- None
- False
- NEW
- NEW
- Logging (LogExp) - Sprint 216
Running OpenShift Container Platform 4 with Cluster Logging 5.3.5-20 and the configuration below:
$ oc get clusterlogging instance -n openshift-logging -o json
{
  "apiVersion": "logging.openshift.io/v1",
  "kind": "ClusterLogging",
  "metadata": {
    "creationTimestamp": "2022-03-11T12:23:26Z",
    "generation": 4,
    "name": "instance",
    "namespace": "openshift-logging",
    "resourceVersion": "46593023",
    "uid": "c8360f12-e8b8-4903-8497-fb8613e4d4ac"
  },
  "spec": {
    "collection": { "logs": { "fluentd": {}, "type": "fluentd" } },
    "logStore": {
      "elasticsearch": {
        "nodeCount": 6,
        "redundancyPolicy": "MultipleRedundancy",
        "resources": {
          "limits": { "cpu": 6, "memory": "28Gi" },
          "requests": { "cpu": 6, "memory": "16Gi" }
        },
        "storage": { "size": "400G", "storageClassName": "gp2" }
      },
      "retentionPolicy": {
        "application": { "maxAge": "30d" },
        "audit": { "maxAge": "2h" },
        "infra": { "maxAge": "10d" }
      },
      "type": "elasticsearch"
    },
    "managementState": "Managed",
    "visualization": {
      "kibana": {
        "proxy": {
          "resources": {
            "limits": { "cpu": 1, "memory": "512Mi" },
            "requests": { "cpu": "100m", "memory": "512Mi" }
          }
        },
        "replicas": 1,
        "resources": {
          "limits": { "cpu": 1, "memory": "2Gi" },
          "requests": { "cpu": "500m", "memory": "2Gi" }
        }
      },
      "type": "kibana"
    }
  },
  "status": {
    "clusterConditions": [
      { "lastTransitionTime": "2022-03-11T12:23:44Z", "status": "False", "type": "CollectorDeadEnd" },
      { "lastTransitionTime": "2022-03-11T12:23:32Z", "message": "curator is deprecated in favor of defining retention policy", "reason": "ResourceDeprecated", "status": "True", "type": "CuratorRemoved" }
    ],
    "collection": {
      "logs": {
        "fluentdStatus": {
          "daemonSet": "collector",
          "nodes": {
            "collector-5ddxs": "X.eu-west-3.compute.internal",
            "collector-7t6ht": "X.eu-west-3.compute.internal",
            "collector-b2jp8": "X.eu-west-3.compute.internal",
            "collector-bk6rw": "X.eu-west-3.compute.internal",
            "collector-cmqwc": "X.eu-west-3.compute.internal",
            "collector-hj9cz": "X.eu-west-3.compute.internal",
            "collector-jgpzz": "X.eu-west-3.compute.internal",
            "collector-m2gsz": "X.eu-west-3.compute.internal",
            "collector-rmntl": "X.eu-west-3.compute.internal",
            "collector-tvcrs": "X.eu-west-3.compute.internal",
            "collector-zqb6n": "X.eu-west-3.compute.internal"
          },
          "pods": {
            "failed": [],
            "notReady": [],
            "ready": [ "collector-5ddxs", "collector-7t6ht", "collector-b2jp8", "collector-bk6rw", "collector-cmqwc", "collector-hj9cz", "collector-jgpzz", "collector-m2gsz", "collector-rmntl", "collector-tvcrs", "collector-zqb6n" ]
          }
        }
      }
    },
    "curation": {},
    "logStore": {
      "elasticsearchStatus": [
        {
          "cluster": {
            "activePrimaryShards": 157,
            "activeShards": 467,
            "initializingShards": 0,
            "numDataNodes": 6,
            "numNodes": 6,
            "pendingTasks": 0,
            "relocatingShards": 0,
            "status": "green",
            "unassignedShards": 0
          },
          "clusterName": "elasticsearch",
          "nodeConditions": {
            "elasticsearch-cd-b742028q-1": [],
            "elasticsearch-cd-b742028q-2": [],
            "elasticsearch-cd-b742028q-3": [],
            "elasticsearch-cdm-dr5igezq-1": [],
            "elasticsearch-cdm-dr5igezq-2": [],
            "elasticsearch-cdm-dr5igezq-3": []
          },
          "nodeCount": 6,
          "pods": {
            "client": {
              "failed": [],
              "notReady": [],
              "ready": [ "elasticsearch-cd-b742028q-1-788cf68686-vn2ss", "elasticsearch-cd-b742028q-2-6f94877bf-vmmkw", "elasticsearch-cd-b742028q-3-79c6bb444d-mv92n", "elasticsearch-cdm-dr5igezq-1-759d4b84b7-qktqw", "elasticsearch-cdm-dr5igezq-2-6b8cfbf6fd-x4kv9", "elasticsearch-cdm-dr5igezq-3-68576d95df-d6hjq" ]
            },
            "data": {
              "failed": [],
              "notReady": [],
              "ready": [ "elasticsearch-cd-b742028q-1-788cf68686-vn2ss", "elasticsearch-cd-b742028q-2-6f94877bf-vmmkw", "elasticsearch-cd-b742028q-3-79c6bb444d-mv92n", "elasticsearch-cdm-dr5igezq-1-759d4b84b7-qktqw", "elasticsearch-cdm-dr5igezq-2-6b8cfbf6fd-x4kv9", "elasticsearch-cdm-dr5igezq-3-68576d95df-d6hjq" ]
            },
            "master": {
              "failed": [],
              "notReady": [],
              "ready": [ "elasticsearch-cdm-dr5igezq-1-759d4b84b7-qktqw", "elasticsearch-cdm-dr5igezq-2-6b8cfbf6fd-x4kv9", "elasticsearch-cdm-dr5igezq-3-68576d95df-d6hjq" ]
            }
          },
          "shardAllocationEnabled": "all"
        }
      ]
    },
    "visualization": {
      "kibanaStatus": [
        {
          "deployment": "kibana",
          "pods": { "failed": [], "notReady": [], "ready": [ "kibana-6558856cc5-z7fqc" ] },
          "replicaSets": [ "kibana-6558856cc5" ],
          "replicas": 1
        }
      ]
    }
  }
}
The redundancyPolicy is set to MultipleRedundancy to provide fault tolerance, so that some Elasticsearch members can be lost before the stack becomes unusable (as replicas of each shard should remain available).
When checking the indices, the following is found:
> $ es_util --query="_cat/indices?pretty&v" | grep app
> health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
> green open app-000005 H2aTFI-dS3Km4K0vTbEyPg 3 2 278885 0 719.9mb 239.9mb
> green open app-000004 STcVgd8yRd2zjTgK2JPB7g 3 2 0 0 2.2kb 783b
> green open app-000011 A84NnF9hRNKoAwnOZkLxOA 3 2 273220 0 703.2mb 234.4mb
> green open app-000013 4JMO3BByRHOO0CYoLsQ2HQ 5 2 133185 0 354.4mb 114.6mb
> green open app-000008 5cqqe2IoTiC-Rc6YTq6OJA 3 2 276628 0 712.2mb 237.4mb
> green open app-000012 lm5WNEoQRU2EC12vgfwDdg 3 2 104592 0 223mb 74.3mb
> green open app-000002 3ek9gM1DQHSikY_HzX8uHQ 3 2 0 0 2.2kb 783b
> green open app-000003 cy6uu_DcRYOxeBYS8-vSFw 3 2 135 0 1.1mb 395.1kb
> green open app-000001 Zc3cEoArQO-dFUzhO-XbLA 3 2 11 0 178.7kb 59.5kb
> green open app-000006 LwoMQw-TSvGuqxfxTZVtZg 3 2 274707 0 705.8mb 235.2mb
> green open app-000010 4vikLZL6RRuXxS3UHKszvQ 3 2 280188 0 722.4mb 240.8mb
> green open app-000009 -L1ASb2wTNqWwxK61jkhrw 3 2 277961 0 716.7mb 238.9mb
> green open app-000007 Jbk-N4KEQqG1Nj5mMsFSrw 3 2 279945 0 721.8mb 240.6mb
> $ es_util --query="_cat/indices?pretty&v" | grep infra
> health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
> green open infra-000288 by-7rVvvR_y7Logd_ywJ8g 3 2 1414527 0 2.7gb 932.5mb
> green open infra-000287 ZvNwwbPERu-tNQewMlLS8A 3 2 1397456 0 2.6gb 919.4mb
> green open infra-000270 OzAGsKmdR4WlcMklUzT0WQ 3 2 173599 0 343.9mb 114.6mb
> green open infra-000280 5l8RPnKJTfOhIsh_PUH2dw 3 2 1397862 0 2.6gb 908.2mb
> green open infra-000273 qMFkjXdpRySoOwk6Nv0Rgg 3 2 1315299 0 2.5gb 871.8mb
> green open infra-000276 pjwUY_3STdCqLZpEQLM7Jg 3 2 1453535 0 2.7gb 945.1mb
> green open infra-000283 Ojsj4YIjQAuF0qH5DemHOw 3 2 1391758 0 2.6gb 904.2mb
> green open infra-000286 7e2FvkAnRvKbkYqpuEiwWw 3 2 1624675 0 3.1gb 1gb
> green open infra-000271 uDPTy3SqR7KmQZLdRbuUNg 3 2 1292893 0 2.4gb 842.8mb
> green open infra-000277 ULK-eScaTF-JxqewbDJRYQ 3 2 1426813 0 2.7gb 930.5mb
> green open infra-000284 ANdfpAEGRyylBeWzuStmgw 3 2 1407210 0 2.6gb 917mb
> green open infra-000282 5XZPxbn1Ts-eowaG07MyyQ 3 2 1357182 0 2.5gb 883mb
> green open infra-000275 QzMlapa8RmyKC6mLDTwZgQ 3 2 1452245 0 2.7gb 943.5mb
> green open infra-000279 ej7smbMUQ3ubTtKxtdMlmw 3 2 1445210 0 2.7gb 943.2mb
> green open infra-000281 M7XmSXrYSBqDpBvZRwzSfw 3 2 1396606 0 2.6gb 909.2mb
> green open infra-000274 MhGKe1dlRe6u1fOdyXsGnw 3 2 1787044 0 3.1gb 1gb
> green open infra-000272 bUfhtkgCSCCV8fH89VRutg 3 2 1773337 0 3.2gb 1gb
> green open infra-000285 B-pqcuzaTIySAOIkTCXhDA 3 2 1391102 0 2.6gb 906.5mb
> green open infra-000278 Wu6mJxx8Q4eDFtMFWYaKlw 3 2 1442864 0 2.7gb 940.7mb
> green open infra-000289 68tTlu1gQ0mRuXPP0dS-BQ 5 2 2257040 0 4.4gb 1.4gb
> green open infra-000290 u9eLPuxISEmR4AJDO6lmzA 5 2 581765 0 1.1gb 408.3mb
As we can see, the app and infra indices have the expected number of primary shards and the corresponding number of replicas (rep 2), so an outage of an Elasticsearch node can be survived.
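To double-check that those copies are actually spread across different nodes (so that a node outage still leaves at least one copy of every shard), the allocation of a single index could be inspected with the standard _cat/shards endpoint, for example (not run here):
> $ es_util --query="_cat/shards/app-000013?v"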
But when checking the Kibana and security indices, we can see that the redundancyPolicy is not applied there, meaning the configured fault tolerance is lost for these indices.
> $ es_util --query="_cat/indices?pretty&v" | grep -v audit | grep -v app | grep -v infra
> health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
> green open .kibana_111578566_user1_1 VB-xTC0qSeCR6T6DSSHW6g 1 1 1 0 7.4kb 3.7kb
> green open .kibana_1 Cm9Hr9p1T-2G-S4Nvv54Eg 1 1 0 0 522b 261b
> green open .security YtJCu7FNSB-_e8AYABnrWA 1 1 6 2 61.9kb 30.9kb
> green open .kibana_111578567_user2_1 VU8UDs3gTrS--S2l-PKx9A 1 1 1 0 7.4kb 3.7kb
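For completeness, the same can be confirmed from the index settings themselves, e.g. via the standard _settings endpoint (not run here):
> $ es_util --query=".kibana_1/_settings?pretty"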
It is not clear whether this is expected behavior, and if it is intended, why, since it undermines the intended fault-tolerance setup.
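In case it helps triage: a possible interim workaround (not tested here, and the elasticsearch-operator might reconcile the setting back) would be to raise the replica count on the affected indices with a plain Elasticsearch _settings update. The example below assumes es_util forwards the extra curl options; otherwise the same request can be sent with curl from inside an Elasticsearch pod:
> $ es_util --query=".kibana*,.security/_settings" -X PUT -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas":2}}'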