OpenShift Hive / HIVE-2093

Hybrid SRE Phases | HIVE

    • Hybrid SRE Phases | Hive
    • False
    • False
    • To Do
    • 0% To Do, 0% In Progress, 100% Done

      About This Epic

      The purpose of this Epic is to provide a Jira Template for tracking Service Team adoption of the Hybrid SRE model [V1]. The template contains an Epic with a list of Stories and Subtasks, one corresponding to each deliverable required for transitioning to Hybrid SRE.

      Before You Get Started

      • Reach out to the Service Delivery Strategy, Enablement + Architecture (SEA) team for an "Intro to Hybrid SRE" presentation to prepare your team to adopt the Hybrid SRE model. We can also help direct any questions that you might have as you kick off this process.

      HIVE

      • What's this service: Hive is a cluster provisioning API for OpenShift 4
      • Contacts:
        • Product Manager: julim
        • Program Manager: <>
        • Engineering Lead:


      Guidelines for using this template

      1. Clone this epic (and its subtasks!) into your team's Jira project. You must check the option to clone subtasks in the clone wizard, or the subtasks will not be cloned. This template will be updated to keep up with any newer versions of this spreadsheet (similar to ROMS):
        https://docs.google.com/spreadsheets/d/1z0dt9wQnXmwix8Yu49lt-iBkTvvLT7ZkVS-bsnZ1Izc/edit#gid=1273005524
      2. Copy the Heat Map and On-Call Participants sheets from the Google Sheet above, as you will need to create versions for your team.
      3. Update Summary, Epic Name, and Description with your service information.
      4. Add any necessary labels based on your team's needs. Please do not remove any labels from the template – the SEA team will use these for tracking!
      5. Walk through the template with your team (looping in your SnP liaison as you see fit) and assign the Jira tickets as appropriate.
      6. Keep a record in the Jira ticket of what's been done before closing out.
      7. If a decision was made to not complete a task, please comment on the reasoning and risk assessment carried out.
      8. Service teams will be responsible for managing the Jira Epic/Stories/Tasks and keeping the status up to date, so that the SEA team can track progress via our Jira dashboard. The SEA team does not have someone allocated to manage or update Service Team work in Jira. 
      9. Join the #wg-hybrid-sre Internal Red Hat Slack channel. This community is a good resource for questions and knowledge-gathering. 

      Definitions:

      • SEA - Strategic Enablement and Architecture Group
      • SnP Liaison - Standards and Practices team member (an SRE in SEA) who works with each service team to accomplish their phased gates tasks.
      • Service Team - The developers, project managers, BU representatives, etc. responsible for delivering a managed service at Red Hat.
      • Hybrid SRE Working Group - The subset of the service team + SEA + SRE who will ensure the tasks in the epic are complete (see: https://issues.redhat.com/browse/SEA-26)

        1. SEA presentation to Team (BU, PGM, PM, Eng) Sub-task Closed Undefined Mariusz Mazur
        2. Services Dashboard - Onboarding/Readiness Sub-task Closed Undefined Unassigned
        3. SRE onboarding status (if not GA) Sub-task Closed Undefined Unassigned
        4. IMS Backplane Access (3 days) Sub-task Closed Undefined Unassigned
        5. Status Board Onboarding Sub-task Closed Undefined Unassigned
        6. Pager Duty Enablement (1 day) Sub-task Closed Undefined Unassigned
        7. Pager Duty Escalation Policy Design (5 days) Sub-task Closed Undefined Unassigned
        8. App Interface Enablement (1 day) Sub-task Closed Undefined Unassigned
        9. Pager Duty Access (3 days) Sub-task Closed Undefined Unassigned
        10. IMS backplane configs Sub-task Closed Undefined Unassigned
        11. Manager Tasks for Enablement Sub-task Closed Undefined Unassigned
        12. Identify On-Call Participants (Sheet 2) & Create Heat Map Sub-task Closed Undefined Unassigned
        13. OpenShift Administration Training (DO280 if needed) Sub-task Closed Undefined Unassigned
        14. ROMS status and alignment (if applicable) Sub-task Closed Undefined Unassigned
        15. Web RCA onboarding (1 day) Sub-task Closed Undefined Unassigned
        16. Complete Incident Response & SRE Training Sub-task Closed Undefined Unassigned
        17. Handover Meeting Process between Dev & Central SRE (Service Delivery) (3 days) Sub-task Closed Undefined Unassigned
        18. SRE Shadowing (10 days) [Optional but recommended] Sub-task Closed Undefined Unassigned
        19. Practice incident response with a Kubefuffle game or other incident practice scenario. Tests access and process mastery. (5 days) Sub-task Closed Undefined Unassigned
        20. Incident escalation policy between Eng and SRE (5 days) Sub-task Closed Undefined Unassigned
        21. Service Team, SRE, and CEE alignment for customer issues Sub-task Closed Undefined Unassigned
        22. Determine current Signal to Noise Ratio (2 days) Sub-task Closed Undefined Unassigned
        23. Where do we want to be compared to where we are today? (5 days) Sub-task Closed Undefined Unassigned
        24. Identify body of work to improve the Signal to Noise Ratio, ensuring alerts are meaningful to customer experience. Sub-task Closed Undefined Unassigned
        25. Define prioritization process for RFEs and bugs found during an incident Sub-task Closed Undefined Unassigned
        26. Identify a body of work and permanent process to improve the 4 key DORA metrics (if required) Sub-task Closed Undefined Unassigned
        27. Obtain current DORA metric ranking Sub-task Closed Undefined Unassigned
        28. Train team on observability, and what to do with it (1 day) Sub-task Closed Undefined Unassigned
        29. Service Maturity: Disaster Planning Office Hours session (1 day) Sub-task Closed Undefined Unassigned
        30. Process design completed and implementation begins to improve tracked service metrics (4 DORA metrics & Signal to Noise Ratio). Sub-task Closed Undefined Unassigned
        31. Contribute to Community -- Operate First (Optional) Sub-task Closed Undefined Unassigned
        32. Team handles the vast majority of alerts and is feeding alert RCAs back into their prioritization process. Sub-task Closed Undefined Unassigned
        33. Participate in "Buddy System" for new teams joining the Hybrid SRE model Sub-task Closed Undefined Unassigned
        34. Toil is manageable (below 50% of operational time) and on track to be automated where possible Sub-task Closed Undefined Unassigned
        35. Escalations to SRE are trending downward, with the goal of being a very small percentage (<10%) Sub-task Closed Undefined Unassigned
        36. Give feedback on the process and improve it for other teams in the future. Sub-task Closed Undefined Unassigned
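
The two numeric targets in the list above (toil below 50% of operational time, escalations to SRE below 10%) can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: the template does not say how toil hours or escalations are counted, so the PeriodStats fields and every function name here are hypothetical, and the escalation percentage is assumed to be escalated alerts divided by total alerts. It does not cover the "trending downward" part of the escalation goal.

```python
from dataclasses import dataclass

# Thresholds quoted from the sub-task list above.
TOIL_LIMIT = 0.50        # toil must stay below 50% of operational time
ESCALATION_LIMIT = 0.10  # escalations to SRE should be below 10%


@dataclass
class PeriodStats:
    """Hypothetical numbers a team might record for one reporting period."""
    toil_hours: float
    operational_hours: float
    alerts_total: int
    alerts_escalated_to_sre: int


def toil_fraction(stats: PeriodStats) -> float:
    """Share of operational time spent on toil."""
    if stats.operational_hours <= 0:
        raise ValueError("operational_hours must be positive")
    return stats.toil_hours / stats.operational_hours


def escalation_fraction(stats: PeriodStats) -> float:
    """Share of alerts escalated to SRE (assumed definition, see lead-in)."""
    if stats.alerts_total == 0:
        return 0.0  # no alerts means nothing was escalated
    return stats.alerts_escalated_to_sre / stats.alerts_total


def meets_targets(stats: PeriodStats) -> bool:
    """True only when both fractions are strictly below their limits."""
    return (
        toil_fraction(stats) < TOIL_LIMIT
        and escalation_fraction(stats) < ESCALATION_LIMIT
    )


if __name__ == "__main__":
    period = PeriodStats(
        toil_hours=30.0,
        operational_hours=80.0,
        alerts_total=40,
        alerts_escalated_to_sre=3,
    )
    print(f"toil: {toil_fraction(period):.1%}")
    print(f"escalations: {escalation_fraction(period):.1%}")
    print(f"meets targets: {meets_targets(period)}")
```

Both comparisons are strict because the sub-tasks say "below 50%" and "<10%", so a team sitting exactly on a limit does not yet meet the target.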

            mworthin@redhat.com Mike Worthington
            kat@redhat.com Kat Keane