OCPBUGS-1626 (OpenShift Bugs)

alertmanager pod restarted once to become ready


Details

    • Bug
    • Resolution: Done-Errata
    • Minor
    • 4.14.0
    • 4.13.0, 4.12.0
    • Monitoring
    • Low
    • MON Sprint 237
    • 1
    • False

    Description

      Description of problem:

      On a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster, each alertmanager pod restarts once before it becomes ready. This is a 4.12 regression. We should make sure that /etc/alertmanager/config_out/alertmanager.env.yaml exists before the alertmanager container starts.

      # oc -n openshift-monitoring get pod
      NAME                                                     READY   STATUS    RESTARTS       AGE
      alertmanager-main-0                                      6/6     Running   1 (118m ago)   118m
      alertmanager-main-1                                      6/6     Running   1 (118m ago)   118m
      ...
      
      # oc -n openshift-monitoring describe pod alertmanager-main-0 
      ...
      Containers:
        alertmanager:
          Container ID:  cri-o://31b6f3231f5a24fe85188b8b8e26c45b660ebc870ee6915919031519d493d7f8
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
          Ports:         9094/TCP, 9094/UDP
          Host Ports:    0/TCP, 0/UDP
          Args:
            --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
            --storage.path=/alertmanager
            --data.retention=120h
            --cluster.listen-address=[$(POD_IP)]:9094
            --web.listen-address=127.0.0.1:9093
            --web.external-url=https:/console-openshift-console.apps.qe-daily1-412-0922.qe.azure.devcluster.openshift.com/monitoring
            --web.route-prefix=/
            --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
            --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
            --cluster.reconnect-timeout=5m
            --web.config.file=/etc/alertmanager/web_config/web-config.yaml
          State:       Running
            Started:   Wed, 21 Sep 2022 19:40:14 -0400
          Last State:  Terminated
            Reason:    Error
            Message:   s=2022-09-21T23:40:06.507Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24.0, branch=rhaos-4.12-rhel-8, revision=4efb3c1f9bc32ba0cce7dd163a639ca8759a4190)"
      ts=2022-09-21T23:40:06.507Z caller=main.go:232 level=info build_context="(go=go1.18.4, user=root@b2df06f7fbc3, date=20220916-18:08:09)"
      ts=2022-09-21T23:40:07.119Z caller=cluster.go:260 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
      ts=2022-09-21T23:40:07.119Z caller=cluster.go:262 level=info component=cluster msg="will retry joining cluster every 10s"
      ts=2022-09-21T23:40:07.119Z caller=main.go:329 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
      ts=2022-09-21T23:40:07.119Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
      ts=2022-09-21T23:40:07.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
      ts=2022-09-21T23:40:07.174Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="open /etc/alertmanager/config_out/alertmanager.env.yaml: no such file or directory"
      ts=2022-09-21T23:40:07.174Z caller=cluster.go:689 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=54.469985ms      Exit Code:    1
            Started:      Wed, 21 Sep 2022 19:40:06 -0400
            Finished:     Wed, 21 Sep 2022 19:40:07 -0400
          Ready:          True
          Restart Count:  1
          Requests:
            cpu:     4m
            memory:  40Mi
          Startup:   exec [sh -c exec curl --fail http://localhost:9093/-/ready] delay=20s timeout=3s period=10s #success=1 #failure=40
      ...
      
      # oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
      "global":
        "resolve_timeout": "5m"
      "inhibit_rules":
      - "equal":
        - "namespace"
        - "alertname"
        "source_matchers":
        - "severity = critical"
        "target_matchers":
        - "severity =~ warning|info"
      - "equal":
        - "namespace"
        - "alertname"
      
      ...
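The log above shows the alertmanager container exiting because the file passed to --config.file did not exist yet on first start, while the same file is present once the pod is running. As an illustration only of "make sure the file exists first", a start wrapper could poll for the file before launching the process. This is a sketch, not the fix that shipped for this bug; the function name wait_for_file and the 30-second default timeout are assumptions made for the example.

```shell
# Hypothetical helper: block until a file exists and is non-empty.
# Usage: wait_for_file PATH [TIMEOUT_SECONDS]
# Returns 0 once the file is present, 1 if the timeout expires first.
wait_for_file() {
  path=$1
  timeout=${2:-30}   # assumed default, not taken from the bug report
  elapsed=0
  while [ ! -s "$path" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for $path" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

A wrapper could then run `wait_for_file /etc/alertmanager/config_out/alertmanager.env.yaml 60` and start alertmanager only when the call succeeds.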

      Version-Release number of selected component (if applicable):

      # oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-0.nightly-2022-09-20-095559   True        False         109m    Cluster version is 4.12.0-0.nightly-2022-09-20-095559
      

      How reproducible:

      always

      Steps to Reproduce:

      1. Install a fresh 4.12 cluster.
      2. Check the alertmanager pods in openshift-monitoring, as shown in the description.
      

      Actual results:

      alertmanager pod restarted once to become ready

      Expected results:

      no restart
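To check the expected result on a cluster, the RESTARTS column of the `oc get pod` listing shown in the description can be filtered. The following is a minimal sketch: the function name restarted_alertmanagers is an assumption, and it relies on the default column layout (NAME, READY, STATUS, RESTARTS, AGE) seen in the output above.

```shell
# Hypothetical helper: read `oc get pod` output on stdin and print the name
# and restart count of every alertmanager-main pod that restarted at least once.
# Assumes the default columns NAME READY STATUS RESTARTS AGE.
restarted_alertmanagers() {
  awk '$1 ~ /^alertmanager-main-/ && $4 + 0 > 0 { print $1, $4 + 0 }'
}
```

Usage: `oc -n openshift-monitoring get pod | restarted_alertmanagers`. Empty output means no alertmanager-main pod has restarted.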

      Additional info:

      The issue does not occur on 4.11:

      # oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.11.0-0.nightly-2022-09-20-140029   True        False         16m     Cluster version is 4.11.0-0.nightly-2022-09-20-140029
      # oc -n openshift-monitoring get pod | grep alertmanager-main
      alertmanager-main-0                                      6/6     Running   0          54m
      alertmanager-main-1                                      6/6     Running   0          55m 

          People

            janantha@redhat.com Jayapriya Pai
            juzhao@redhat.com Junqi Zhao
            Junqi Zhao Junqi Zhao