Observability and Data Analysis Program
OBSDA-455

RHDE Observability MVP - Tech Preview

      Feature Overview (aka. Goal Summary)

      Red Hat Device Edge customers can observe their fleet of RHDE-based edge devices from a central location with regard to metrics (base system metrics like CPU utilisation, custom metrics like queue length, etc.), logging, events and traces.

      Goals (aka. expected user outcomes)

      1. As an RHDE operator, I want to opt in to observing my edge devices with regard to base and custom metrics, logs, events and traces, so that I can centrally monitor the status of my edge fleet and respond to events.
      2. As an RHDE operator, I want to be able to remotely change the observability configuration, e.g. to get more detail for troubleshooting.
      3. As an edge solution developer, I want to be able to expose my custom metrics and events so that my solution can be operated.

      Clarification: "events" refers to the output of "oc get events".
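
      For illustration, a minimal sketch (Python, using the standard Kubernetes client) of reading the same event records that "oc get events" reports; the client library, the kubeconfig location and the forwarding around it are assumptions, not part of this feature:

        # Illustrative only: fetch the event records that "oc get events" reports,
        # i.e. the data this feature would forward to the core system.
        from kubernetes import client, config

        config.load_kube_config()      # or config.load_incluster_config() inside a pod
        core = client.CoreV1Api()

        for ev in core.list_event_for_all_namespaces().items:
            # Type (Normal/Warning), reason, involved object and message roughly
            # match the columns of "oc get events".
            print(ev.type, ev.reason, ev.involved_object.kind,
                  ev.involved_object.name, ev.message)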

      Requirements

      Functional

      1. Observability must be opt-in, i.e. customers can add it if wanted (e.g. “dnf install microshift-observability”)
      2. Observability must be configurable in order to control the resource usage it induces. Configuration should be possible locally on the edge device, or centrally (e.g. through GitOps).
      3. Configuration pre-sets should be provided:
        1. Minimum (just the bare minimum, e.g. CPU, RAM, DISK, total pod usage, fatal log messages)
        2. Medium (a bit more detail)
        3. Maximum (everything)
      4. Metrics
        1. System base / hardware (CPU, RAM, disk, I/O, swap, etc.)
        2. K8s layer (kubelet, pods, namespaces)
        3. Application layer (custom metrics exposed via metrics endpoints, or pushed via API)
      5. Log forwarding
        1. Log forwarding of system logs
        2. Log forwarding of application logs
      6. Event forwarding
        1. Events are like logs, but with high priority, i.e. they should not be lost and should be transmitted first. Events are conditions, e.g. running out of disk space, audit events, etc.
      7. All communication is initiated by the edge system; information is pushed from the edge to the core system. Rationale: the edge system might be behind a firewall and not actively reachable from the core system. Standard firewall-friendly protocols should be used (e.g. HTTPS, WebSockets); a push sketch follows this list.
      8. All communication is encrypted in transit.
      9. All communication is authenticated (e.g. using TLS client certs).
      10. No data loss during offline periods. Edge devices can be disconnected/offline for days or weeks, so observability data should be buffered locally on the edge device. The “edge local retention” needs to be configurable by duration and by size-on-disk constraints (e.g. set aside 100 GB of disk space on the edge device in which observability data is buffered during offline phases; when the space is full, the oldest data is deleted). Local deletion of data needs to be logged as an error event with the exact from/to timestamps of the data that has been lost. This needs to be configurable separately for metrics, logging and events (metrics might be less important than logs, which might be less important than events). A buffering sketch follows this list.
      11. The solution has to work on offline/air-gapped/isolated networks, i.e. with no direct or indirect connection to the internet.
      12. Core system (Data Receiver / Analytics) needs to 
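
      A minimal sketch of the push model from requirements 7-9, assuming a hypothetical core-side ingest endpoint and client-certificate files provisioned on the device; the endpoint URL, certificate paths and payload format below are illustrative assumptions, not part of this feature:

        # Illustrative sketch only: edge-initiated, mTLS-authenticated push over HTTPS
        # (requirements 7-9). Endpoint URL, cert paths and payload are assumptions.
        import json
        import ssl
        import urllib.request

        # Verify the core system's CA and present the device's client certificate.
        ctx = ssl.create_default_context(cafile="/etc/observability/core-ca.pem")
        ctx.load_cert_chain(certfile="/etc/observability/edge-client.pem",
                            keyfile="/etc/observability/edge-client.key")

        payload = json.dumps({"device": "edge-0042",
                              "metric": "cpu_utilisation",
                              "value": 0.37}).encode()
        req = urllib.request.Request(
            "https://core.example.com/ingest",     # hypothetical receiver endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
            print(resp.status)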
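
      The local-retention behaviour in requirement 10 could look roughly like the following sketch, which keeps a size-capped spool directory per signal type and records an error event with the exact from/to timestamps whenever old data has to be dropped; the spool path, the 100 GB budget and the file layout are assumptions:

        # Illustrative sketch only: size-capped local spool for offline buffering
        # (requirement 10). Spool path, per-signal budget and file naming are assumptions.
        import logging
        import time
        from pathlib import Path

        SPOOL = Path("/var/spool/observability/metrics")   # one spool per signal type
        MAX_BYTES = 100 * 1024**3                           # example budget: 100 GB

        def spool_size(path: Path) -> int:
            return sum(f.stat().st_size for f in path.glob("*.chunk"))

        def enforce_retention(path: Path, max_bytes: int) -> None:
            """Delete oldest chunks until the budget is met, logging exactly what was lost."""
            chunks = sorted(path.glob("*.chunk"), key=lambda f: f.stat().st_mtime)
            dropped = []
            while chunks and spool_size(path) > max_bytes:
                oldest = chunks.pop(0)
                dropped.append(oldest.stat().st_mtime)
                oldest.unlink()
            if dropped:
                # Requirement 10: deletion surfaces as an error event with from/to timestamps.
                logging.error("observability data lost: from %s to %s",
                              time.ctime(min(dropped)), time.ctime(max(dropped)))

        def buffer_locally(payload: bytes) -> None:
            SPOOL.mkdir(parents=True, exist_ok=True)
            (SPOOL / f"{time.time_ns()}.chunk").write_bytes(payload)
            enforce_retention(SPOOL, MAX_BYTES)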

      Non-Functional:

      1. Observability data and protocols should be based on open standards so that customers can choose their backend systems (e.g. ACM, Dynatrace, Splunk, AWS, …). Alternatively, a wide range of adapters should be available. A sketch follows this list.
      2. Podman-only deployments, i.e. where MicroShift/K8s is not used, need to be supported, but this does not need to be part of this MVP.
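
      As an example of the open-standards direction in non-functional requirement 1 (and of the custom application metrics in functional requirement 4.3), an edge workload could expose a custom metric through the OpenTelemetry SDK and push it over OTLP/HTTP, letting the customer point the endpoint at whichever backend or adapter they use. The endpoint URL and the metric name below are assumptions:

        # Illustrative sketch only: a custom metric exposed via OpenTelemetry and
        # pushed over OTLP/HTTP to a customer-chosen backend. Endpoint and metric
        # name are assumptions, not part of this feature.
        from opentelemetry import metrics
        from opentelemetry.sdk.metrics import MeterProvider
        from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
        from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

        exporter = OTLPMetricExporter(endpoint="https://core.example.com/v1/metrics")
        reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60_000)
        metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

        meter = metrics.get_meter("edge-solution")
        queue_length = meter.create_up_down_counter(
            "queue_length", unit="1", description="items waiting in the work queue")
        queue_length.add(5)   # e.g. five items were enqueued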

      Background

       

      Other considerations

      1. Needs to be interoperable with ACM Observability, e.g. when ACM manages a fleet of RHDE and OCP clusters, this should fit in seamlessly.

       
