AI Platform Core Components / AIPCC-2057

Enhanced Logging System - Logging System Overhaul for Benchmark Workloads

      Summary

      As Jbenchmark evolves into a customer-facing product, it must behave as a unified service that receives parameters, runs a benchmark, and produces structured, transparent, and insightful logs. When errors occur, we must be able to pinpoint their source — whether it's resource constraints during model spin-up, an authentication failure, an infrastructure issue (e.g., wrong node type, unavailable spot instance), or a bug in our own stack.

      This means logs can no longer be fragmented per container. Instead, we need consolidated, contextualized logging at the workload level, with a focus on clarity, troubleshooting, and future observability.
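
      To make this concrete, below is a minimal sketch of what a consolidated, workload-level log line could look like, using Python's standard logging module. Every field name and value (workload_id, phase, error_category, the "wl-1234" ID) is an illustrative placeholder, not a finalized schema.

      import json
      import logging
      import sys

      class JsonFormatter(logging.Formatter):
          """Render each record as one JSON line carrying workload-level context."""
          def format(self, record):
              entry = {
                  "timestamp": self.formatTime(record),
                  "severity": record.levelname,
                  "message": record.getMessage(),
                  # Context attached via `extra=` on the logging call.
                  "workload_id": getattr(record, "workload_id", None),
                  "phase": getattr(record, "phase", None),
                  "error_category": getattr(record, "error_category", None),
              }
              return json.dumps(entry)

      handler = logging.StreamHandler(sys.stdout)
      handler.setFormatter(JsonFormatter())
      log = logging.getLogger("jbenchmark")
      log.addHandler(handler)
      log.setLevel(logging.INFO)

      # Example: an auth failure during model spin-up, tied to its workload.
      log.error(
          "Model spin-up failed: token rejected",
          extra={"workload_id": "wl-1234", "phase": "setup", "error_category": "auth"},
      )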

      Goals

      • Provide a clear, single point of truth per benchmark workload for all key events and states.
      • Enable full lifecycle tracking: setup → model loading → inference → output → teardown.
      • Categorize and distinguish errors (infra, runtime, auth, quota, bugs); a combined sketch of lifecycle tracking and error categorization follows this list.
      • Structure logs so they’re searchable, filterable, and queryable by both internal teams and external users.
      • Expose logs through a user-friendly interface, such as Kibana, backed by a system like OpenSearch.
      • Lay the foundation for proactive monitoring and future dashboards.
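
      The sketch below builds on the structured logger above and shows one way lifecycle tracking and error categorization could fit together. The phase names and the exception-to-category mapping are assumptions for illustration; the real classification rules would be defined in the spec document.

      import logging
      from contextlib import contextmanager

      log = logging.getLogger("jbenchmark")

      # Deliberately simplified mapping from exception type to error category.
      ERROR_CATEGORIES = {
          PermissionError: "auth",
          ConnectionError: "infra",
          MemoryError: "quota",
      }

      @contextmanager
      def phase(workload_id, name):
          """Log the start and outcome of one lifecycle phase (setup, model_loading, ...)."""
          ctx = {"workload_id": workload_id, "phase": name}
          log.info("phase started", extra=ctx)
          try:
              yield
          except Exception as exc:
              category = ERROR_CATEGORIES.get(type(exc), "bug")
              log.error("phase failed: %s", exc, extra={**ctx, "error_category": category})
              raise
          else:
              log.info("phase completed", extra=ctx)

      # Usage (hypothetical calls):
      # with phase("wl-1234", "model_loading"):
      #     load_model()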

      Tasks (Phase 1):

      1. Write a detailed spec document, including:
        • Target logging behaviors and developer/operator expectations.
        • Logging structure, content, severity levels, and naming.
        • Log aggregation and exposure strategy (e.g., OpenSearch + Kibana).
        • Examples of filtered use cases (e.g., "show only workloads that failed due to auth"); see the query sketch after this task list.
      2. Propose multiple architectural alternatives:
        • Logging destinations and formats (stdout, file, system-level).
        • How logs are shipped and ingested.
        • Options for log visualization tools.
      3. Discuss with team:
        • Internal design review sessions to gather input and challenge assumptions.
      4. Approval:
        • Submit finalized spec and architecture for Aviran’s approval.
      5. Planning:
        • Aviran assigns ownership of the Epic.
        • Team breaks the scope into stories under this Epic.
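
      For the filtered use case named in task 1 ("show only workloads that failed due to auth"), one possible query against an OpenSearch-backed log store is sketched below. The endpoint, index name, and field mappings are assumptions carried over from the earlier examples, not an agreed design.

      import requests

      # Hypothetical index and endpoint; assumes severity and error_category
      # are indexed as keyword fields.
      query = {
          "query": {
              "bool": {
                  "filter": [
                      {"term": {"severity": "ERROR"}},
                      {"term": {"error_category": "auth"}},
                  ]
              }
          }
      }

      resp = requests.get(
          "https://opensearch.example.internal:9200/jbenchmark-logs/_search",
          json=query,
          timeout=10,
      )
      for hit in resp.json()["hits"]["hits"]:
          src = hit["_source"]
          print(src.get("workload_id"), src.get("message"))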

      Phase 2: Implementation

      Begins only after Phase 1 is approved and tasks are broken down.

       

      Note:
      As part of the logging overhaul, logs should not only be structured and accessible programmatically — they must also be searchable and filterable via a user-friendly interface. A recommended approach is to stream logs into a centralized system like OpenSearch and provide a Kibana interface on top of it (a minimal ingestion sketch follows the list below). This would enable both internal and external users to:

      • Filter logs by workload ID, model name, date, error type, or severity.
      • Easily visualize and investigate failures across runs.
      • Gain visibility into recurring issues and usage patterns without requiring access to the raw system.
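
      A minimal sketch of the ingestion side, assuming each container writes JSON log lines to a file and a small shipper bulk-indexes them into OpenSearch. In practice a dedicated collector (e.g., Fluent Bit) would replace a script like this; the endpoint and index name are placeholders.

      import json
      import requests

      OPENSEARCH_URL = "https://opensearch.example.internal:9200"  # placeholder
      INDEX = "jbenchmark-logs"  # placeholder

      def ship(log_path):
          """Bulk-index every JSON log line from `log_path` into OpenSearch."""
          lines = []
          with open(log_path) as fh:
              for raw in fh:
                  raw = raw.strip()
                  if not raw:
                      continue
                  lines.append(json.dumps({"index": {"_index": INDEX}}))
                  lines.append(raw)
          body = "\n".join(lines) + "\n"  # the bulk API expects NDJSON
          resp = requests.post(
              f"{OPENSEARCH_URL}/_bulk",
              data=body,
              headers={"Content-Type": "application/x-ndjson"},
              timeout=30,
          )
          resp.raise_for_status()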

       
