OVN-K alerts: add OVN controller disconnection alert


      Alert if any OVN controller stays disconnected from the southbound database for a period of time, using the metric ovn_controller_southbound_database_connected.

      The metric updates every 2 minutes, so take this into account when creating the alert.

      If the controller is disconnected for 10 minutes, fire an alert.

      DoD: Merged to CNO and tested by QE
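
A minimal sketch of the timing described above, assuming the metric reports 1 when the controller is connected and 0 when it is not (the value semantics are an assumption, as is the helper name `first_firing_minute`). It models only the interaction between the 2-minute update interval and the 10-minute threshold; it is not the rule that was merged to CNO.

```python
from typing import Optional, Sequence

SAMPLE_INTERVAL_MIN = 2   # the metric updates every 2 minutes (from the description)
FIRE_AFTER_MIN = 10       # fire once disconnected for 10 minutes (from the description)


def first_firing_minute(samples: Sequence[int],
                        interval_min: int = SAMPLE_INTERVAL_MIN,
                        fire_after_min: int = FIRE_AFTER_MIN) -> Optional[int]:
    """Return the minute at which the alert would first fire, or None.

    samples[i] is the metric value observed at minute i * interval_min.
    Assumption: 1 means connected, 0 means disconnected.
    """
    down_since = None  # minute of the first sample in the current run of zeros
    for i, value in enumerate(samples):
        minute = i * interval_min
        if value == 1:
            down_since = None  # a reconnect resets the timer
            continue
        if down_since is None:
            down_since = minute
        if minute - down_since >= fire_after_min:
            return minute
    return None


# Connected at minutes 0 and 2, then disconnected from minute 4 onwards.
print(first_firing_minute([1, 1, 0, 0, 0, 0, 0, 0, 0]))  # 14
# A reconnect observed at minute 10 resets the timer, so nothing fires here.
print(first_firing_minute([1, 1, 0, 0, 0, 1, 0, 0, 0]))  # None
```

In this model the timer starts at the first sample that shows a disconnect, so the real disconnection may have happened up to one update interval (2 minutes) earlier than the alert logic can see.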

            [CORENET-2506] OVN-K alerts: add OVN controller disconnection alert

            Qiong Wang added a comment -

            Test case for this story:

            https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-53999 

            Test on AWS (cluster created with cluster-bot) passed.

             


            Surya Seetharaman added a comment -

            rh-ee-qiowang: Could you also make sure to test this with HyperShift clusters?

            Surya Seetharaman added a comment -

            rh-ee-qiowang: Could you please help verify this card and the alert on OCP?

            Here are the steps to verify:

            1) Create an OVN-K cluster that includes this PR; cluster-bot can be used for this.

            2) oc rsh -n openshift-ovn-kubernetes ovnkube-node-2g7gl

            3) Run `iptables -I OUTPUT -p tcp --dport 9642 -j DROP`, which blocks this controller's connections to the SBDB.

            4) We should then see the following happening in the metrics dashboard:

            5) After that, wait for 5 minutes and we should see the alert:

            6) We should also be able to see the details of the alert:
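
Steps 2 and 3 above can be sketched as a small helper that builds the commands instead of typing them by hand. The helper names `iptables_rule` and `in_pod` are hypothetical, the pod name is the one from step 2 and differs per cluster, and the `-D` cleanup command is an addition that is not part of the original steps. The commands are only printed, not executed.

```python
import shlex
from typing import List

NAMESPACE = "openshift-ovn-kubernetes"
SBDB_PORT = 9642  # port blocked in step 3


def iptables_rule(action: str, port: int = SBDB_PORT) -> List[str]:
    """Build the iptables command; action is "-I" to insert or "-D" to delete."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be -I or -D")
    return ["iptables", action, "OUTPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


def in_pod(pod: str, command: List[str], namespace: str = NAMESPACE) -> List[str]:
    """Wrap a command so that it runs inside the given pod via `oc rsh`."""
    return ["oc", "rsh", "-n", namespace, pod] + command


pod = "ovnkube-node-2g7gl"  # pod name from step 2; it differs per cluster
block = in_pod(pod, iptables_rule("-I"))
unblock = in_pod(pod, iptables_rule("-D"))  # cleanup, not in the original steps

# Printed rather than executed, so the sketch is safe to run anywhere.
print(shlex.join(block))
print(shlex.join(unblock))
```

Deleting the rule with `-D` after the test should let the controller reconnect to the SBDB, which makes it possible to check that the alert clears as well.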

            Surya Seetharaman added a comment -

            Opened a PR for this; I will create a cluster-bot cluster and see how to verify/test this.

              sseethar Surya Seetharaman
              mkennell@redhat.com Martin Kennelly