Type: Initiative
Resolution: Duplicate
Priority: Critical
Summary
As Jbenchmark evolves into a customer-facing product, it must behave as a unified service that receives parameters, runs a benchmark, and produces structured, transparent, and insightful logs. When errors occur, we must be able to pinpoint their source — whether it's resource constraints during model spin-up, an authentication failure, an infrastructure issue (e.g., wrong node type, unavailable spot instance), or a bug in our own stack.
This means logs can no longer be fragmented per container. Instead, we need consolidated, contextualized logging at the workload level, with a focus on clarity, troubleshooting, and future observability.
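To make "consolidated, contextualized logging at the workload level" concrete, here is a minimal sketch using Python's stdlib logging: a JSON formatter plus a LoggerAdapter that stamps every record with workload context. The field names (workload_id, phase) and the logger name jbenchmark are illustrative assumptions, not an agreed schema.

```python
# Minimal sketch of workload-scoped structured logging (stdlib only).
# Field names (workload_id, phase) and values are assumptions for illustration.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line that log shippers can parse."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "message": record.getMessage(),
            # Context injected by the LoggerAdapter below.
            "workload_id": getattr(record, "workload_id", None),
            "phase": getattr(record, "phase", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# The adapter stamps every record with workload context, so records emitted
# from different containers can be correlated by workload_id downstream.
log = logging.LoggerAdapter(
    logging.getLogger("jbenchmark"),
    {"workload_id": "wl-123", "phase": "model_loading"},
)
log.info("loading model weights")
# -> {"timestamp": "...", "severity": "INFO", "message": "loading model weights",
#     "workload_id": "wl-123", "phase": "model_loading"}
```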
Goals
- Provide a clear, single point of truth per benchmark workload for all key events and states.
- Enable full lifecycle tracking: setup → model loading → inference → output → teardown.
- Categorize and distinguish errors (infra, runtime, auth, quota, bugs); a taxonomy sketch follows this list.
- Structure logs so they’re searchable, filterable, and queryable by both internal teams and external users.
- Expose logs through a user-friendly interface, such as Kibana, backed by a system like OpenSearch.
- Lay the foundation for proactive monitoring and future dashboards.
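The lifecycle phases and error categories named in these goals could be pinned down as closed enumerations, so every log record carries a machine-filterable value. A sketch, with names taken from the goals above; the canonical spelling would be fixed in the Phase 1 spec:

```python
# Illustrative taxonomy only; the canonical names belong in the spec document.
from enum import Enum


class Phase(Enum):
    SETUP = "setup"
    MODEL_LOADING = "model_loading"
    INFERENCE = "inference"
    OUTPUT = "output"
    TEARDOWN = "teardown"


class ErrorCategory(Enum):
    INFRA = "infra"      # e.g., wrong node type, unavailable spot instance
    RUNTIME = "runtime"  # e.g., resource constraints during model spin-up
    AUTH = "auth"        # authentication failures
    QUOTA = "quota"      # exhausted resource or API quotas
    BUG = "bug"          # defects in our own stack
```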
Phase 1: Spec and Design
Tasks:
- Write a detailed spec document, including:
  - Target logging behaviors and developer/operator expectations.
  - Logging structure, content, severity levels, and naming.
  - Log aggregation and exposure strategy (e.g., OpenSearch + Kibana).
  - Examples of filtered use cases (e.g., "show only workloads that failed due to auth").
- Propose multiple architectural alternatives:
  - Logging destinations and formats (stdout, file, system-level).
  - How logs are shipped and ingested (see the ingestion sketch after this list).
  - Options for log visualization tools.
- Discuss with team:
  - Internal design review sessions to gather input and challenge assumptions.
- Approval:
  - Submit the finalized spec and architecture for Aviran’s approval.
- Planning:
  - Aviran assigns ownership of the Epic.
  - Team breaks the scope into stories under this Epic.
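For the "shipped and ingested" alternative, one option is to keep containers writing JSON lines to stdout and run a thin shipper that forwards them to OpenSearch. A minimal sketch assuming the opensearch-py client and a cluster at localhost:9200; in practice a dedicated shipper such as Fluent Bit or Filebeat would likely replace this script:

```python
# Toy shipper: read JSON-line logs (e.g., captured container stdout) and index
# each record into OpenSearch. Host, port, and index name are assumptions.
import json
import sys

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

for line in sys.stdin:
    record = json.loads(line)
    # One document per log record; Kibana/OpenSearch Dashboards can then
    # search and filter on any field in the record.
    client.index(index="jbenchmark-logs", body=record)
```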
Phase 2: Implementation
Begins only after Phase 1 is approved and tasks are broken down.
Note:
As part of the logging overhaul, logs should not only be structured and accessible programmatically — they must also be searchable and filterable via a user-friendly interface. A recommended approach is to stream logs into a centralized system like OpenSearch, and provide a Kibana interface on top of it. This would enable both internal and external users to:
- Filter logs by workload ID, model name, date, error type, or severity (a query sketch follows this list).
- Easily visualize and investigate failures across runs.
- Gain visibility into recurring issues and usage patterns without requiring access to the raw system.
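As a concrete example of the filtering above, the "failed due to auth" use case from Phase 1 could translate into a single OpenSearch query. A sketch assuming the opensearch-py client and the illustrative severity / error_category / workload_id fields from the earlier examples:

```python
# Hypothetical query: "show only workloads that failed due to auth".
# Field names and the index name mirror the illustrative schema above.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="jbenchmark-logs",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"severity": "ERROR"}},
                    {"term": {"error_category": "auth"}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],  # newest failures first
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["workload_id"], hit["_source"]["message"])
```

The same filter would be expressible directly in the Kibana UI, which is the path external users would take.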