Observability and Data Analysis Program / OBSDA-455

RHDE Observability MVP - Tech Preview


      Feature Overview (aka. Goal Summary)

      Red Hat Device Edge customers can observe their fleet of RHDE-based edge devices from a central location with regard to metrics (base system metrics such as CPU utilization, custom metrics such as queue length, etc.), logging, events, and traces.

      Goals (aka. expected user outcomes)

      1. As an RHDE operator, I want to opt in to observing my edge devices with regard to base and custom metrics, logs, events, and traces so that I can centrally monitor the status of my edge fleet and respond to events.
      2. As an RHDE operator, I want to be able to remotely change the observability settings, e.g. to get more information for troubleshooting.
      3. As an edge solution developer, I want to be able to expose my custom metrics and events so that my solution can be operated.

      Clarification: "events" here refers to the output of "oc get events".

      Requirements

      Functional

      1. Observability must be opt-in, i.e. customers can add it if desired (e.g. “dnf install microshift-observability”).
      2. Observability must be configurable, to control the resource usage it induces. Configuration should be possible locally on the edge device or centrally (e.g. through GitOps).
      3. Configuration presets should be provided (a sketch of such presets follows after this list):
        1. Minimum (just the bare minimum, e.g. CPU, RAM, DISK, total pod usage, fatal log messages)
        2. Medium (a bit more detail)
        3. Maximum (everything)
      4. Metrics
        1. System base/hardware (CPU, RAM, disk, IO, swap, etc.)
        2. K8s layer (kubelet, pods, namespaces)
        3. Application layer (custom metrics exposed via metrics endpoints, or pushed via an API); see the custom-metrics sketch after this list
      5. Log forwarding
        1. Log forwarding of system logs
        2. Log forwarding of application logs
      6. Event forwarding
        1. Events are like logs, but with a higher priority, i.e. they must not be lost and should be transmitted first. Events are conditions such as running out of disk space, audit events, etc.
      7. All communication is initiated by the edge system; information is pushed from the edge to the core system. Rationale: the edge system might be behind a firewall and not directly reachable from the core system. Standard firewall-friendly protocols should be used (e.g. HTTPS, WebSockets). See the push sketch after this list.
      8. All communication is encrypted in transit.
      9. All communication is authenticated (e.g. using TLS client certs).
      10. No data loss during offline periods. Edge devices can be disconnected/offline for days or weeks, so observability data should be buffered locally on the edge device. The “edge local retention” needs to be configurable by duration and by size-on-disk constraints (e.g. set aside 100 GB of disk space on the edge device where observability data is buffered during offline phases; when that space is full, the oldest data is deleted). Local deletion of data needs to be logged as an error event with the exact from/to timestamps of the data that has been lost. This needs to be configurable separately for metrics, logging, and events (metrics might be less important than logs, which are less important than events). See the buffering sketch after this list.
      11. The solution has to work on offline/air-gapped/isolated networks, i.e. with no direct or indirect connection to the internet.
      12. Core system (Data Receiver / Analytics) needs to 
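
      A minimal sketch of what the configuration presets in requirement 3 could look like, written here as a Python dictionary purely for illustration; the preset names come from the list above, but every key and value is an assumption rather than a defined schema.

```python
# Illustrative only: the keys and values below are assumptions, not a defined schema.
PRESETS = {
    "minimum": {  # bare minimum: CPU, RAM, disk, total pod usage, fatal log messages
        "metrics": ["cpu", "memory", "disk", "pod_count"],
        "metrics_interval_seconds": 300,
        "log_levels": ["fatal"],
        "forward_events": True,
        "forward_traces": False,
    },
    "medium": {  # a bit more detail
        "metrics": ["cpu", "memory", "disk", "io", "swap", "pod", "namespace"],
        "metrics_interval_seconds": 60,
        "log_levels": ["fatal", "error", "warning"],
        "forward_events": True,
        "forward_traces": False,
    },
    "maximum": {  # everything, including custom application metrics and traces
        "metrics": ["*"],
        "metrics_interval_seconds": 15,
        "log_levels": ["*"],
        "forward_events": True,
        "forward_traces": True,
    },
}

def effective_config(preset: str, overrides: dict | None = None) -> dict:
    """Start from a preset and apply overrides set locally or delivered via GitOps."""
    config = dict(PRESETS[preset])
    config.update(overrides or {})
    return config
```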
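
      For requirement 4.3, a minimal sketch of an edge application exposing a custom metric (the queue-length example from the overview) on a Prometheus-style metrics endpoint using the prometheus_client library; the metric name and port are assumptions.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical custom metric for an edge workload; name and port are assumptions.
QUEUE_LENGTH = Gauge("myapp_queue_length", "Number of items waiting in the work queue")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a local scraper/collector to pick up
    while True:
        QUEUE_LENGTH.set(random.randint(0, 50))  # stand-in for the real queue depth
        time.sleep(10)
```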
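
      For requirements 7-9, a minimal sketch of an edge-initiated push over HTTPS with TLS client-certificate authentication, using the Python requests library; the receiver URL, certificate paths, and payload format are assumptions.

```python
import requests

RECEIVER_URL = "https://core.example.com/ingest/v1"        # assumption
CLIENT_CERT = ("/etc/observability/client.crt",            # assumption
               "/etc/observability/client.key")
CA_BUNDLE = "/etc/observability/ca.crt"                    # assumption

def push_batch(records: list[dict]) -> bool:
    """Push one batch of buffered observability records; the edge initiates the connection."""
    try:
        response = requests.post(
            RECEIVER_URL,
            json=records,
            cert=CLIENT_CERT,   # TLS client certificate: authenticates the edge device (req. 9)
            verify=CA_BUNDLE,   # verify the receiver's certificate: encrypted in transit (req. 8)
            timeout=30,
        )
        return response.ok
    except requests.RequestException:
        return False  # receiver unreachable; the caller keeps the batch buffered locally
```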
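
      For requirement 10, a minimal sketch of the size-on-disk part of local buffering: the oldest buffered chunks are deleted first, and every deletion is logged as an error with the from/to timestamps of the lost range. The one-file-per-chunk layout and the timestamp-based file naming are assumptions.

```python
import logging
from pathlib import Path

log = logging.getLogger("edge-retention")

def prune_buffer(buffer_dir: str, max_bytes: int) -> None:
    """Delete the oldest buffered chunks until the directory fits the disk budget.

    Assumes one chunk per file, named <unix-timestamp-of-first-record>.jsonl
    (an illustrative convention, not a defined format).
    """
    chunks = sorted(Path(buffer_dir).glob("*.jsonl"), key=lambda p: float(p.stem))
    total = sum(p.stat().st_size for p in chunks)

    lost_from = lost_to = None
    while chunks and total > max_bytes:
        oldest = chunks.pop(0)
        total -= oldest.stat().st_size
        if lost_from is None:
            lost_from = float(oldest.stem)
        lost_to = float(oldest.stem)
        oldest.unlink()

    if lost_from is not None:
        # Requirement 10: deletions are logged as an error event with the exact
        # from/to timestamps of the data that has been lost.
        log.error("Buffered observability data from %s to %s was deleted (disk budget reached)",
                  lost_from, lost_to)
```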

      Non-Functional:

      1. Observability data and protocols should be based on open standards so that customers can choose their backend system (e.g. ACM, Dynatrace, Splunk, AWS, …). Alternatively, a wide range of adapters should be available. See the OTLP sketch after this list.
      2. Podman-only deployments, i.e. deployments where MicroShift/K8s is not used, need to be supported, but they do not need to be part of this MVP.
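
      A minimal sketch of non-functional requirement 1 under the assumption that OpenTelemetry/OTLP is the open standard of choice: the opentelemetry-python SDK pushes metrics over OTLP/HTTP to whatever backend (or adapter) the customer points it at. The collector endpoint and metric names are assumptions.

```python
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Hypothetical collector endpoint; any OTLP-capable backend (ACM, Dynatrace,
# Splunk, AWS, ...) or an adapter in front of it could receive this data.
exporter = OTLPMetricExporter(endpoint="https://collector.example.com/v1/metrics")
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("rhde.edge.device")
queue_length = meter.create_up_down_counter("queue_length")
queue_length.add(5)  # stand-in for a real custom metric update
```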

      Background

       

      Other considerations

      1. Needs to be interoperable with ACM Observability; e.g. when ACM manages a mixed fleet of RHDE devices and OCP clusters, RHDE observability should fit in seamlessly.

       
