Red Hat Enterprise Linux AI / RHELAI-4768

Ability to configure OpenAI client retries to tolerate teacher model failures longer than a couple of seconds


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Component: InstructLab - CLI

      Currently, only the default maximum retry count of 2 can be used for the OpenAI chat client that SDG uses to talk to the teacher model. With this default, if a teacher model backend is unavailable for more than a couple of seconds, the SDG pipeline run fails and manual intervention is required to restart the process.
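
To make the "couple of seconds" figure concrete, the following is a minimal sketch of the arithmetic behind an exponential-backoff retry budget. The `initial_delay` and `max_delay` values are assumptions chosen for illustration (they are not stated in this issue and should be checked against the client version in use), jitter is ignored, and `retry_window` is a hypothetical helper, not part of SDG or the OpenAI client.

```python
def retry_window(max_retries: int,
                 initial_delay: float = 0.5,   # assumed first backoff, in seconds
                 max_delay: float = 8.0) -> float:  # assumed backoff cap, in seconds
    """Approximate total time spent sleeping between attempts.

    Each retry waits twice as long as the previous one, capped at max_delay.
    Jitter and per-request time are deliberately left out.
    """
    return sum(min(initial_delay * 2 ** i, max_delay) for i in range(max_retries))


if __name__ == "__main__":
    for retries in (2, 5, 7):
        print(f"max_retries={retries}: about {retry_window(retries)} s of backoff")
```

Under these assumed delays, 2 retries cover roughly 1.5 seconds of backoff when each attempt fails fast, which matches the failure mode described above. If requests hang instead of failing immediately, the per-request timeout adds to this window.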

       

      SDG could let users specify this retry count. One valid use case: some users run a pool of redundant, health-checked teacher models behind a load balancer. Depending on how the load balancer's health checks are configured, it can take up to 30 seconds to remove a bad teacher model from rotation. By controlling the OpenAI client retries, these users can determine how long their pipelines keep retrying, through the combination of attempt count and timeouts (timeouts are already configurable in the pipeline), and so ensure consistently successful SDG runs.
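
The load balancer scenario can be simulated with a small sketch. Everything here is hypothetical and for illustration only: `FakeClock`, `call_teacher`, and `call_with_retries` are made-up names, the backoff delays are assumed values, and the teacher backend is modeled as failing fast until the 30-second health-check window has passed.

```python
class BackendUnavailable(Exception):
    """Raised while the bad teacher backend is still in rotation."""


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def call_teacher(clock: FakeClock, outage_seconds: float) -> str:
    # Hypothetical teacher call: fails until the load balancer drops the bad backend.
    if clock.now < outage_seconds:
        raise BackendUnavailable(f"backend down at t={clock.now}s")
    return "ok"


def call_with_retries(clock: FakeClock, outage_seconds: float, max_retries: int,
                      initial_delay: float = 0.5, max_delay: float = 8.0) -> str:
    """Try once, then retry up to max_retries times with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call_teacher(clock, outage_seconds)
        except BackendUnavailable:
            if attempt == max_retries:
                raise  # retry budget exhausted: the pipeline run would fail here
            clock.sleep(min(initial_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    clock = FakeClock()
    print(call_with_retries(clock, outage_seconds=30.0, max_retries=7), clock.now)
```

In this simulation a budget of 2 retries gives up well before the 30-second window ends, while a larger budget rides it out. The OpenAI Python client accepts a `max_retries` argument at construction; how SDG would expose that setting to users is not specified in this issue.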

       

      More info in: https://github.com/instructlab/instructlab/issues/3516

              Assignee: Unassigned
              Reporter: lisowskiibm Tyler Lisowski (Inactive)