-
Bug
-
Resolution: Done-Errata
-
Minor
-
4.13.0, 4.12.0
-
Low
-
None
-
MON Sprint 237
-
1
-
False
-
Description of problem:
On a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster, the alertmanager pods restarted once before becoming ready. This is a 4.12 regression: the alertmanager container should make sure /etc/alertmanager/config_out/alertmanager.env.yaml exists before it starts.
# oc -n openshift-monitoring get pod
NAME                  READY   STATUS    RESTARTS       AGE
alertmanager-main-0   6/6     Running   1 (118m ago)   118m
alertmanager-main-1   6/6     Running   1 (118m ago)   118m
...

# oc -n openshift-monitoring describe pod alertmanager-main-0
...
Containers:
  alertmanager:
    Container ID:  cri-o://31b6f3231f5a24fe85188b8b8e26c45b660ebc870ee6915919031519d493d7f8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=[$(POD_IP)]:9094
      --web.listen-address=127.0.0.1:9093
      --web.external-url=https:/console-openshift-console.apps.qe-daily1-412-0922.qe.azure.devcluster.openshift.com/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
      --web.config.file=/etc/alertmanager/web_config/web-config.yaml
    State:          Running
      Started:      Wed, 21 Sep 2022 19:40:14 -0400
    Last State:     Terminated
      Reason:       Error
      Message:
ts=2022-09-21T23:40:06.507Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24.0, branch=rhaos-4.12-rhel-8, revision=4efb3c1f9bc32ba0cce7dd163a639ca8759a4190)"
ts=2022-09-21T23:40:06.507Z caller=main.go:232 level=info build_context="(go=go1.18.4, user=root@b2df06f7fbc3, date=20220916-18:08:09)"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:260 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:262 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2022-09-21T23:40:07.119Z caller=main.go:329 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2022-09-21T23:40:07.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2022-09-21T23:40:07.174Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="open /etc/alertmanager/config_out/alertmanager.env.yaml: no such file or directory"
ts=2022-09-21T23:40:07.174Z caller=cluster.go:689 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=54.469985ms
      Exit Code:    1
      Started:      Wed, 21 Sep 2022 19:40:06 -0400
      Finished:     Wed, 21 Sep 2022 19:40:07 -0400
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     4m
      memory:  40Mi
    Startup:  exec [sh -c exec curl --fail http://localhost:9093/-/ready] delay=20s timeout=3s period=10s #success=1 #failure=40
...

# oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
"global":
  "resolve_timeout": "5m"
"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = critical"
  "target_matchers":
  - "severity =~ warning|info"
- "equal":
  - "namespace"
  - "alertname"
...
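The fix the description asks for is to make the rendered config exist before Alertmanager starts. A minimal sketch of that kind of guard, written as a shell wrapper around the container entrypoint (the wrapper itself is hypothetical and not the shipped fix; only the first few flag values from the Args above are repeated here):

# Hypothetical startup wrapper: block until the config-reloader sidecar has
# written the rendered config, then hand off to Alertmanager.
until [ -s /etc/alertmanager/config_out/alertmanager.env.yaml ]; do
  echo "waiting for /etc/alertmanager/config_out/alertmanager.env.yaml" >&2
  sleep 1
done
exec alertmanager \
  --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml \
  --storage.path=/alertmanager \
  --data.retention=120h

An init container performing the same wait would be an equivalent alternative, so the alertmanager container never sees the missing file.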
Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-09-20-095559   True        False         109m    Cluster version is 4.12.0-0.nightly-2022-09-20-095559
How reproducible:
always
Steps to Reproduce:
1. Install a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster (see the description above).
2. Check the restart count of the alertmanager-main pods in openshift-monitoring, as shown below.
3. Inspect the previous container log for the "no such file or directory" error.
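For steps 2 and 3, the restart count can be read straight from the container status, and the failed run's log with --previous (standard oc/jsonpath; pod name as in the output above):

# oc -n openshift-monitoring get pod alertmanager-main-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="alertmanager")].restartCount}'
1
# oc -n openshift-monitoring logs --previous -c alertmanager alertmanager-main-0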
Actual results:
The alertmanager pods restarted once before becoming ready.
Expected results:
No restarts.
Additional info:
There is no such issue with 4.11:

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-20-140029   True        False         16m     Cluster version is 4.11.0-0.nightly-2022-09-20-140029

# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0   6/6     Running   0   54m
alertmanager-main-1   6/6     Running   0   55m
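For a like-for-like comparison between the two releases, the last terminated state of the alertmanager container can be queried directly (plain oc jsonpath, no assumptions beyond the pod name):

# oc -n openshift-monitoring get pod alertmanager-main-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="alertmanager")].lastState.terminated.exitCode}'

On the 4.12 cluster this prints 1 (the exit code recorded in the describe output above); on 4.11 it prints nothing, since the container never restarted.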
- is related to: OCPBUGS-14033 Pathological test failing on reason/RecreatingFailedPod in openshift-monitoring (Closed)
- links to: RHEA-2023:5006 rpm