  1. Network Observability
  2. NETOBSERV-1284

Use include-list for enabled metrics configuration

    • Type: Story
    • Resolution: Done
    • Priority: Major
    • Operator
    • OCPSTRAT-156 - Netobserv operator: Make configuration simpler
    • Release note text:
      The FlowCollector API is modified as follows:
      - the setting "processor.metrics.ignoreTags" is deprecated and will be removed in FlowCollector v1beta2
      - it is replaced with a new setting "processor.metrics.includeList", which uses the opposite approach: instead of an exclusion list, it is now an inclusion list.

      This change will allow smoother transitions in future releases, when new metrics are added, to make sure they will not cause cluster monitoring instability with too many metrics being generated unintentionally.

      This change also moves away from the metrics tagging system: instead of relying on tags to include/exclude metrics, which could end up being quite complex, the desired metric names need to be provided directly. The list of available metrics is documented.

      If "ignoreTags" is explicitly set in your FlowCollector configuration, it is recommended to remove it and define "includeList" instead, or to move back to using the default values. Otherwise, new metrics might be generated on upgrades, and you should make sure they don't increase Prometheus memory consumption too much.

      If "ignoreTags" isn't explicitly set and you don't set "includeList", the Operator will keep using the default metrics, which have a more modest impact on Prometheus.
    • NetObserv - Sprint 242, NetObserv - Sprint 243, NetObserv - Sprint 244

      Currently, metrics configuration uses a black-listing approach with a tags system. Since enabling more and more metrics increases cluster resource usage, it would be better to switch to a white-listing approach, where users select only what they need.

      This is also safer during upgrades: when users already have this setting configured explicitly, the new default won't apply and, with black-listing, new metrics could be enabled automatically without the user noticing.
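
      To illustrate the upgrade concern, here is a minimal sketch in Python. It is not the operator's code: the metric names, tags, catalogues and helper functions below are all hypothetical and only model the two selection strategies. It shows how a user-configured exclusion list can let a newly added metric through after an upgrade, while an inclusion list keeps the enabled set unchanged.

```python
# Hypothetical metric catalogue: metric name -> set of tags.
# Names and tags are illustrative assumptions, not the operator's actual list.
CATALOGUE_V1 = {
    "node_ingress_bytes_total": {"nodes", "ingress", "bytes"},
    "node_egress_bytes_total": {"nodes", "egress", "bytes"},
    "namespace_ingress_packets_total": {"namespaces", "ingress", "packets"},
}

# After a (hypothetical) upgrade, a new metric with new tags is added.
CATALOGUE_V2 = {**CATALOGUE_V1, "namespace_rtt_seconds": {"namespaces", "rtt"}}


def enabled_with_ignore_tags(catalogue, ignore_tags):
    """Exclusion list: a metric is enabled unless it carries an ignored tag."""
    ignored = set(ignore_tags)
    return sorted(name for name, tags in catalogue.items() if not tags & ignored)


def enabled_with_include_list(catalogue, include_list):
    """Inclusion list: a metric is enabled only if it is named explicitly."""
    wanted = set(include_list)
    return sorted(name for name in catalogue if name in wanted)


# The user's explicit configuration stays the same across the upgrade.
user_ignore_tags = ["egress", "packets"]
user_include_list = ["node_ingress_bytes_total"]

before_deny = enabled_with_ignore_tags(CATALOGUE_V1, user_ignore_tags)
after_deny = enabled_with_ignore_tags(CATALOGUE_V2, user_ignore_tags)
before_allow = enabled_with_include_list(CATALOGUE_V1, user_include_list)
after_allow = enabled_with_include_list(CATALOGUE_V2, user_include_list)

print("ignoreTags  before/after upgrade:", before_deny, after_deny)
print("includeList before/after upgrade:", before_allow, after_allow)
```

      In this sketch, the exclusion list silently picks up the new metric because none of its tags are ignored, whereas the inclusion list yields the same set before and after the upgrade.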

      On top of that, it's confusing to have overlap between tags. 

      We should think about more explicit tags (including an 'all' mention like 'all_namespaces', or forcing fully qualified names like 'ingress_namespaces_packets').

       

      NOTE FOR QE

      You can read the release note text for the user-facing changes. One special thing to test will be the upgrade scenario, especially after we add new metrics (such as RTT, drops... e.g. https://github.com/netobserv/network-observability-operator/pull/408): we need to make sure no unintended metric is generated beyond the defaults. This is something of a chicken-and-egg problem, as these PRs are blocked by this one, so this particular scenario will have to be tested after both are merged.

            jtakvori Joel Takvorian
            jtakvori Joel Takvorian
            Nathan Weinberg Nathan Weinberg