
Inadvertent peering of alertmanager instances during upgrade

    • Bug
    • Resolution: Done-Errata
    • Normal
    • 4.15.0
    • 4.12.0, 4.11.0
    • Monitoring
    • None
    • Moderate
    • No
    • MON Sprint 244
    • 1
    • Rejected
    • False
    • Previously, instances of Alertmanager for core platform monitoring and for user-defined projects could inadvertently become peered during an upgrade. This issue could occur when multiple Alertmanager instances were deployed in the same cluster. This release fixes the issue by adding a `--cluster.label` flag to Alertmanager that helps to block any traffic that is not intended for the cluster.
      (link:https://issues.redhat.com/browse/OCPBUGS-18707[*OCPBUGS-18707*])
    • Bug Fix
    • Done

      Description of problem:

      The cluster (platform) and user workload Alertmanager instances inadvertently become peered during an upgrade.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Infrequently. The customer observed this on 3 clusters out of 15.

      Steps to Reproduce:

      Deploy user workload monitoring:
      
      ~~~
       config.yaml: |
          enableUserWorkload: true
          prometheusK8s:
      ~~~
      
      Deploy the user workload Alertmanager:
      
      ~~~
        name: user-workload-monitoring-config
        namespace: openshift-user-workload-monitoring
      data:
        config.yaml: |
          alertmanager:
            enabled: true 
      ~~~
      
      Upgrade the cluster.
      Verify the state of the Alertmanager clusters:
      
      ~~~
      $ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool cluster show -o json --alertmanager.url=http://localhost:9093
      ~~~
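The check above can be turned into a pass/fail signal by extracting the peer count from the JSON. This is a minimal sketch, assuming `jq` is installed on the workstation and that the JSON output carries a top-level `peers` array that includes the queried instance itself. The helper names `peer_count` and `check_peers` are hypothetical and not part of `oc` or `amtool`.

```shell
# Print the number of gossip peers that one Alertmanager pod reports.
# Assumes the JSON has a top-level "peers" array and that jq is available.
# Usage: peer_count <namespace> <pod>
peer_count() {
  oc exec -n "$1" "$2" -- amtool cluster show -o json \
    --alertmanager.url=http://localhost:9093 | jq '.peers | length'
}

# Compare the reported peer count with the expected one (2 by default).
# Usage: check_peers <namespace> <pod> [expected]
check_peers() {
  local expected="${3:-2}" actual
  actual="$(peer_count "$1" "$2")"
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $2 reports $actual peers"
  else
    echo "MISMATCH: $2 reports $actual peers, expected $expected"
    return 1
  fi
}
```

For example, `check_peers openshift-monitoring alertmanager-main-0` should report a mismatch on an affected cluster, where 4 peers are visible instead of 2.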

      Actual results:

      Alertmanager shows 4 peers: the platform and user workload instances have merged into a single gossip cluster.

      Expected results:

      Two separate Alertmanager clusters with 2 peers each.

      Additional info:

      Mitigation steps: 
      
      Scaling one of the Alertmanager StatefulSets down to 0 and then back up restores the expected configuration (that is, 2 separate Alertmanager clusters).
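The scale-down and scale-up can be sketched as a small shell helper. The report does not list the exact commands used, so this is an illustration only: the function name `restart_alertmanager` is hypothetical, the StatefulSet name `alertmanager-user-workload` in the usage note is assumed from default naming, and the operator that manages the StatefulSet may restore the replica count on its own.

```shell
# Restart one Alertmanager cluster by scaling its StatefulSet to 0 and back.
# Usage: restart_alertmanager <namespace> <statefulset> [replicas]
restart_alertmanager() {
  local ns="$1" sts="$2" replicas="${3:-2}" i=0
  oc scale "statefulset/$sts" -n "$ns" --replicas=0
  # StatefulSet pods are named <statefulset>-<ordinal>; wait for each to go away.
  while [ "$i" -lt "$replicas" ]; do
    oc wait --for=delete "pod/${sts}-${i}" -n "$ns" --timeout=120s
    i=$((i + 1))
  done
  oc scale "statefulset/$sts" -n "$ns" --replicas="$replicas"
}
```

A possible invocation is `restart_alertmanager openshift-user-workload-monitoring alertmanager-user-workload`, followed by the `amtool cluster show` check to confirm that each cluster is back to 2 peers.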
      
      - The customer then added NetworkPolicies to prevent Alertmanager gossip between namespaces.
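The customer's actual policies are not included in this report. The following is only a sketch of what such a policy could look like, assuming that gossip uses port 9094 (Alertmanager's default cluster port), that the Alertmanager pods carry the label `app.kubernetes.io/name: alertmanager`, and that the cluster network plugin supports `endPort`. The policy name and the function name `restrict_gossip` are hypothetical.

```shell
# Apply a NetworkPolicy that keeps Alertmanager gossip (port 9094) inside one namespace.
# Usage: restrict_gossip <namespace>
restrict_gossip() {
  oc apply -n "$1" -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: alertmanager-gossip-same-namespace   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed pod label
  policyTypes:
  - Ingress
  ingress:
  # Pods in the same namespace may connect on any port, including gossip.
  - from:
    - podSelector: {}
  # Any other source may reach every TCP port except 9094.
  - ports:
    - protocol: TCP
      port: 1
      endPort: 9093
    - protocol: TCP
      port: 9095
      endPort: 65535
EOF
}
```

Because a pod selected by an ingress policy only accepts traffic that some policy allows, the second rule is there to keep the non-gossip TCP ports reachable from other namespaces. The function would need to be run once per namespace that hosts an Alertmanager, for example `restrict_gossip openshift-user-workload-monitoring`.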

              janantha@redhat.com Jayapriya Pai
              rhn-support-nigsmith Nigel Smith
              Junqi Zhao
              Brian Burt
              Simon Pasquier