Uploaded image for project: 'Data Foundation Bugs'
  1. Data Foundation Bugs
  2. DFBUGS-418

[2322173] CephNodeDown Alert not triggered when Worker node Is gracefully stopped

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • odf-4.17
    • odf-4.17
    • ceph-monitoring
    • None
    • False
    • Hide

      None

      Show
      None
    • False
    • ?
    • ?
    • If docs needed, set a value
    • None

      Description of problem (please be detailed as possible and provide log
      snippests):

      `test_stop_start_node_validate_topology` constantly fails on multiple deployments.
      CephNodeDown alert is not firing after 30s and not visible both via Prometheus and management-console. Issue detected on vSphere clusters.

      *Failed: One or multiple checks did not pass:*

      Check check_state
      -------------------------------------------------------------- -------------
      prometheus_CephNodeDown_alert_fired 0
      cluster_in_danger_state_check_pass 1
      ceph_node_down_alert_found_check_pass 0
      ceph_node_down_alert_found_on_idle_node_check_pass 1
      prometheus_CephNodeDown_alert_removed 1
      ceph_node_down_alert_found_after_node_turned_on_check_pass 1

      Version of all relevant components (if applicable):
      ODF 4.17.0-105, OCP 4.17

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      in some cases

      Is there any workaround available to the best of your knowledge?
      no

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      2

      Can this issue reproducible?
      yes

      Can this issue reproduce from the UI?
      yes

      If this is a regression, please provide more details to justify this:
      regression. We had successful test executions, but a long ago. no history saved.

      Steps to Reproduce:
      1. Turn off the node from VMWare console or in other way
      2. login to management console and verify CephNodeDown alert appeared or make a prometheus request to fetch fired alerts with req
      curl -k -H "Authorization: Bearer $TOKEN" -X GET https://$ROUTE/api/v1/alerts | jq '.data.alerts[].labels.alertname'
      3.

      Actual results:
      CephNodeDown fail to fire

      Expected results:
      CephNodeDown fires when one ceph node is down for 30+sec

      Additional info:
      OCP and OCS must-gather logs https://url.corp.redhat.com/2a64b06

              dkamboj@redhat.com Divyansh Kamboj
              rh-ee-dosypenk Daniel Osypenko
              Harish Nallur Vittal Rao Harish Nallur Vittal Rao
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

                Created:
                Updated: