-
Bug
-
Resolution: Done-Errata
-
Minor
-
4.13.0, 4.12.0
-
Low
-
None
-
MON Sprint 237
-
1
-
False
-
Description of problem:
On a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster, the alertmanager pods restarted once before becoming ready. This is a 4.12 regression: the alertmanager container should make sure /etc/alertmanager/config_out/alertmanager.env.yaml exists before it starts.
# oc -n openshift-monitoring get pod
NAME                  READY   STATUS    RESTARTS       AGE
alertmanager-main-0   6/6     Running   1 (118m ago)   118m
alertmanager-main-1   6/6     Running   1 (118m ago)   118m
...

# oc -n openshift-monitoring describe pod alertmanager-main-0
...
Containers:
  alertmanager:
    Container ID:  cri-o://31b6f3231f5a24fe85188b8b8e26c45b660ebc870ee6915919031519d493d7f8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=[$(POD_IP)]:9094
      --web.listen-address=127.0.0.1:9093
      --web.external-url=https:/console-openshift-console.apps.qe-daily1-412-0922.qe.azure.devcluster.openshift.com/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
      --web.config.file=/etc/alertmanager/web_config/web-config.yaml
    State:          Running
      Started:      Wed, 21 Sep 2022 19:40:14 -0400
    Last State:     Terminated
      Reason:       Error
      Message:
ts=2022-09-21T23:40:06.507Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24.0, branch=rhaos-4.12-rhel-8, revision=4efb3c1f9bc32ba0cce7dd163a639ca8759a4190)"
ts=2022-09-21T23:40:06.507Z caller=main.go:232 level=info build_context="(go=go1.18.4, user=root@b2df06f7fbc3, date=20220916-18:08:09)"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:260 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:262 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2022-09-21T23:40:07.119Z caller=main.go:329 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2022-09-21T23:40:07.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2022-09-21T23:40:07.174Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="open /etc/alertmanager/config_out/alertmanager.env.yaml: no such file or directory"
ts=2022-09-21T23:40:07.174Z caller=cluster.go:689 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=54.469985ms
      Exit Code:    1
      Started:      Wed, 21 Sep 2022 19:40:06 -0400
      Finished:     Wed, 21 Sep 2022 19:40:07 -0400
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     4m
      memory:  40Mi
    Startup:  exec [sh -c exec curl --fail http://localhost:9093/-/ready] delay=20s timeout=3s period=10s #success=1 #failure=40
...

# oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
"global":
  "resolve_timeout": "5m"
"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = critical"
  "target_matchers":
  - "severity =~ warning|info"
- "equal":
  - "namespace"
  - "alertname"
...
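The fix the description asks for is to make the rendered config exist before Alertmanager starts. A minimal sketch of that kind of guard, written as a shell wrapper around the container entrypoint (the wrapper itself is hypothetical and not the shipped fix; only the first few flag values from the Args above are repeated here):

# Hypothetical startup wrapper: block until the config-reloader sidecar has
# written the rendered config, then hand off to Alertmanager.
until [ -s /etc/alertmanager/config_out/alertmanager.env.yaml ]; do
  echo "waiting for /etc/alertmanager/config_out/alertmanager.env.yaml" >&2
  sleep 1
done
exec alertmanager \
  --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml \
  --storage.path=/alertmanager \
  --data.retention=120h

An init container performing the same wait would be an equivalent alternative, so the alertmanager container never sees the missing file.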
Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-09-20-095559   True        False         109m    Cluster version is 4.12.0-0.nightly-2022-09-20-095559
How reproducible:
always
Steps to Reproduce:
1. Install a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster (see the description above).
2. Check the restart count of the alertmanager-main pods in openshift-monitoring, as shown below.
3. Inspect the previous container log for the "no such file or directory" error.
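For steps 2 and 3, the restart count can be read straight from the container status, and the failed run's log with --previous (standard oc/jsonpath; pod name as in the output above):

# oc -n openshift-monitoring get pod alertmanager-main-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="alertmanager")].restartCount}'
1
# oc -n openshift-monitoring logs --previous -c alertmanager alertmanager-main-0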
Actual results:
The alertmanager pods restarted once before becoming ready.
Expected results:
No restarts.
Additional info:
There is no such issue with 4.11:

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-20-140029   True        False         16m     Cluster version is 4.11.0-0.nightly-2022-09-20-140029

# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0   6/6     Running   0   54m
alertmanager-main-1   6/6     Running   0   55m
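For a like-for-like comparison between the two releases, the last terminated state of the alertmanager container can be queried directly (plain oc jsonpath, no assumptions beyond the pod name):

# oc -n openshift-monitoring get pod alertmanager-main-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="alertmanager")].lastState.terminated.exitCode}'

On the 4.12 cluster this prints 1 (the exit code recorded in the describe output above); on 4.11 it prints nothing, since the container never restarted.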
- is related to: OCPBUGS-14033 Pathological test failing on reason/RecreatingFailedPod in openshift-monitoring (Closed)
- links to: RHEA-2023:5006 rpm