    • Type: Feature Request
    • Resolution: Unresolved
    • Product / Portfolio Work

      1. Proposed title of this feature request

      Improved network monitoring

      2. What is the nature and description of the request?

      Currently, OCP has no alerting rules for network issues such as large packet drops at the NIC, softnet, or OVS level, or large spikes in multicast/broadcast traffic (floods). Candidate expressions:

       

      # RX drop %
      rate(node_network_receive_drop_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.05
      # TX drop %
      rate(node_network_transmit_drop_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.05
      # softnet drop %
      rate(node_softnet_dropped_total[2m]) / rate(node_softnet_processed_total[2m]) > 0.05
      
      # unexpected protocol/packets
      rate(node_network_receive_nohandler_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      
      # multicast flood (not sure if a fixed threshold on multicast would be appropriate)
      rate(node_network_receive_multicast_total[2m]) > 100000
      # perhaps alert when multicast/broadcast exceeds 90% of received packets, but this will trigger on idle links (ARPs etc.)
      rate(node_network_receive_multicast_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.9

      The drop percentage could determine the severity of the alert, similar to the storage alerts, where the percentage of free space dictates the severity:

      https://github.com/prometheus-community/helm-charts/blob/211245fa1929d5ee581696305087ac551cafdcef/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/node-exporter.yaml#L300C1-L300C36
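
      As a rough illustration of the severity idea, the RX drop expression could be split into warning and critical rules, and the multicast ratio could be guarded against idle links by also requiring a minimum packet rate. This is only a sketch: the resource name, namespace, alert names, `for` durations, the 20% critical threshold, and the 1000 packets/s floor are all assumptions, not values from this request; only the 5% and 90% thresholds come from the expressions above.

      ```yaml
      # Hypothetical PrometheusRule sketch; names, durations, and the
      # 0.20 / 1000 thresholds are illustrative assumptions.
      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: node-network-drops          # hypothetical name
        namespace: openshift-monitoring   # assumed namespace
      spec:
        groups:
          - name: node-network-drops
            rules:
              # Warning tier: more than 5% of received packets dropped.
              - alert: NodeNetworkReceiveDropsHigh
                expr: |
                  rate(node_network_receive_drop_total[2m])
                    / rate(node_network_receive_packets_total[2m]) > 0.05
                for: 10m
                labels:
                  severity: warning
                annotations:
                  summary: "Over 5% of received packets dropped on {{ $labels.device }} ({{ $labels.instance }})."
              # Critical tier: same ratio, higher (assumed) threshold.
              - alert: NodeNetworkReceiveDropsCritical
                expr: |
                  rate(node_network_receive_drop_total[2m])
                    / rate(node_network_receive_packets_total[2m]) > 0.20
                for: 5m
                labels:
                  severity: critical
                annotations:
                  summary: "Over 20% of received packets dropped on {{ $labels.device }} ({{ $labels.instance }})."
              # Multicast share above 90%, ignored on near-idle links.
              - alert: NodeNetworkMulticastFlood
                expr: |
                  (
                    rate(node_network_receive_multicast_total[2m])
                      / rate(node_network_receive_packets_total[2m]) > 0.9
                  )
                  and
                  rate(node_network_receive_packets_total[2m]) > 1000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "Multicast makes up over 90% of received packets on {{ $labels.device }} ({{ $labels.instance }})."
      ```

      The `and` clause keeps only series whose total receive rate is also above the floor, which should avoid firing on links that carry nothing but ARP and similar background traffic. The TX, softnet, and nohandler expressions could be tiered the same way.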

      3. Why does the customer need this? (List the business requirements here)

      Customers are not currently notified when network interfaces are saturated for sustained periods.
      This leads to outages and connectivity issues, especially when a large portion of TX/RX packets are being dropped.

       

      4. List any affected packages or components.

      monitoring manifests

              rh-ee-rfloren Roger Florén
              rhn-support-tidawson Tim Dawson