HOSTEDCP-738

Hybrid SRE Phases | HOSTEDCP

      About This Epic

      The purpose of this Epic is to provide a Jira Template for tracking Service Team adoption of the Hybrid SRE model [V1]. The template contains an Epic with a list of Stories and Subtasks, one corresponding to each deliverable required for transitioning to Hybrid SRE.

      Before You Get Started

      • Reach out to the Service Delivery Strategy, Enablement + Architecture (SEA) team for an "Intro to Hybrid SRE" presentation to prepare your team to adopt the Hybrid SRE model. We can also help direct any questions that you might have as you kick off this process.

      HyperShift

      • What's this service: HyperShift is middleware for hosting OpenShift control planes at scale that solves for cost and time to provision, as well as portability across cloud service providers with strong separation of concerns between management and workloads. Clusters are fully compliant OpenShift Container Platform (OCP) clusters and are compatible with standard OCP and Kubernetes toolchains.
      • Contacts: Product Manager azaalouk, Manager asegurap1@redhat.com, Engineering Lead agarcial@redhat.com

      Guidelines for using this template

      1. Clone this epic (and its subtasks!) into your team's Jira project. You must check the option to clone subtasks in the clone wizard, or the subtasks will not be cloned. This template will be updated to keep up with newer versions of this spreadsheet (similar to ROMS).
        https://docs.google.com/spreadsheets/d/1z0dt9wQnXmwix8Yu49lt-iBkTvvLT7ZkVS-bsnZ1Izc/edit#gid=1273005524
      2. Copy the Heat Map and On-Call participants sheets from the above Google Sheet as you will need to create versions for your team.
      3. Update Summary, Epic Name, and Description with your service information.
      4. Add any necessary labels based on your team's needs. Please do not remove any labels from the template – the SEA team will use these for tracking!
      5. Walk through the template with your team (looping in your SnP liaison as you see fit) and assign the Jira tickets as appropriate.
      6. Keep a record in the Jira ticket of what's been done before closing out.
      7. If a decision was made to not complete a task, please comment on the reasoning and risk assessment carried out.
      8. Service teams will be responsible for managing the Jira Epic/Stories/Tasks and keeping the status up to date, so that the SEA team can track progress via our Jira dashboard. The SEA team does not have someone allocated to manage or update Service Team work in Jira. 
      9. Join the #wg-hybrid-sre Internal Red Hat Slack channel. This community is a good resource for questions and knowledge-gathering. 

      Definitions:

      • SEA - Strategic Enablement and Architecture Group
      • SnP Liaison - Standards and Practices team member (an SRE in SEA) who works with each service team to accomplish their phased gates tasks.
      • Service Team - The developers, Project Managers, BU representatives, etc. responsible for delivering a managed service at Red Hat.
      • Hybrid SRE Working Group - The subset of the service team + SEA + SRE who will ensure the tasks in the epic are complete (see: https://issues.redhat.com/browse/SEA-26)

        1. Phase 0: Form Service's Working Group for Hybrid SRE Sub-task Closed Undefined Unassigned
        2. Review and refine scope in Jira for requirements to accomplish Hybrid SRE Sub-task Closed Undefined Unassigned
        3. Service Deployment Discovery: State of Deployments Sub-task Closed Undefined Unassigned
        4. Phase 0: SEA presentation to Team (BU, PGM, PM, Eng) Sub-task Closed Undefined Unassigned
        5. Engineering is Self Managing Deployments to Prod Sub-task Closed Undefined Unassigned
        6. Services Dashboard - Onboarding/Readiness Sub-task Closed Undefined Unassigned
        7. SRE onboarding status (if not GA) Sub-task Closed Undefined Unassigned
        8. IMS Backplane Access (3 days) Sub-task Closed Undefined Unassigned
        9. Status Board Onboarding Sub-task Closed Undefined Unassigned
        10. Pager Duty Enablement (1 day) Sub-task Closed Undefined Unassigned
        11. Pager Duty Escalation Policy Design (5 days) Sub-task Closed Undefined Unassigned
        12. App Interface Enablement (1 day) Sub-task Closed Undefined Unassigned
        13. Pager Duty Access (3 days) Sub-task Closed Undefined Unassigned
        14. IMS backplane configs Sub-task Closed Undefined Unassigned
        15. Manager Tasks for Enablement Sub-task Closed Undefined Unassigned
        16. Identify On-Call Participants (Sheet 2) & Create Heat Map Sub-task Closed Undefined Unassigned
        17. OpenShift Administration Training (DO280 if needed) Sub-task Closed Undefined Unassigned
        18. ROMS status and alignment (if applicable) Sub-task Closed Undefined Unassigned
        19. Web RCA onboarding (1 day) Sub-task Closed Undefined Unassigned
        20. Complete Incident Response & SRE Training Sub-task Closed Undefined Unassigned
        21. Handover Meeting Process between Dev & Central SRE (Service Delivery) (3 days) Sub-task Closed Undefined Unassigned
        22. SRE Shadowing (10 days) [Optional but recommended] Sub-task Closed Undefined Unassigned
        23. Practice incident response with a Kubefuffle game or other incident practice scenario. Tests access and process mastery. (5 days) Sub-task Closed Undefined Unassigned
        24. Incident escalation policy between Eng and SRE (5 days) Sub-task Closed Undefined Unassigned
        25. Service Team, SRE, and CEE alignment for customer issues Sub-task Closed Undefined Unassigned
        26. Determine current Signal to Noise Ratio (2 days) Sub-task Closed Undefined Unassigned
        27. Where do we want to be compared to where we are today? (5 days) Sub-task Closed Undefined Unassigned
        28. Identify Body of work to decrease Signal to Noise Ratio, ensuring alerts are meaningful to customer experience. Sub-task Closed Undefined Unassigned
        29. Define Prioritization process for RFEs and bugs found during an incident Sub-task Closed Undefined Unassigned
        30. Identify a body of work and permanent process to improve 4 key DORA metrics (if required) Sub-task Closed Undefined Unassigned
        31. Obtain Current DORA metric Ranking Sub-task Closed Undefined Unassigned
        32. Phase 3: Train team on observability, and what to do with it (1 day) Sub-task Closed Undefined Unassigned
        33. Phase 3: Service Maturity: Disaster Planning Office Hours session (1 day) Sub-task Closed Undefined Unassigned
        34. Phase 4: Process design completed and implementation begins to improve tracked service metrics (4 DORA metrics & Signal to Noise Ratio). Sub-task Closed Undefined Unassigned
        35. Phase 4: Contribute to Community -- Operate First (Optional) Sub-task Closed Undefined Unassigned
        36. Phase 4: Team handles the vast majority of alerts and is feeding alert RCAs back into their prioritization process. Sub-task Closed Undefined Unassigned
        37. Phase 4: Participate in "Buddy System" for new teams joining the Hybrid SRE model Sub-task Closed Undefined Unassigned
        38. Phase 4: Toil is manageable (below 50% of operational time) and on track to be automated where possible Sub-task Closed Undefined Unassigned
        39. Phase 4: Escalations to SRE are trending downward on goal to be a very small percentage (<10%) Sub-task Closed Undefined Unassigned
        40. Phase 4: Give feedback into the process and improve it for other teams in the future. Sub-task Closed Undefined Unassigned
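
        The two numeric Phase 4 targets in the list above (toil below 50% of operational time, escalations to SRE under 10%) could be tracked with a small script. The following is only a sketch under stated assumptions: the `OperationalSnapshot` fields, the function names, and the choice to measure escalations as a share of all alerts are hypothetical and are not defined by this epic or by the SEA team.

```python
from dataclasses import dataclass

# Thresholds taken from the Phase 4 sub-task titles.
TOIL_LIMIT = 0.50        # toil should stay below 50% of operational time
ESCALATION_LIMIT = 0.10  # escalations to SRE should be under 10%


@dataclass
class OperationalSnapshot:
    """Hypothetical per-period numbers a service team might record."""
    toil_hours: float
    operational_hours: float
    total_alerts: int
    alerts_escalated_to_sre: int


def toil_fraction(s: OperationalSnapshot) -> float:
    """Share of operational time spent on toil (0.0 if no time was logged)."""
    return s.toil_hours / s.operational_hours if s.operational_hours else 0.0


def escalation_rate(s: OperationalSnapshot) -> float:
    """Share of alerts escalated to SRE (assumed definition; 0.0 if no alerts)."""
    return s.alerts_escalated_to_sre / s.total_alerts if s.total_alerts else 0.0


def phase4_check(s: OperationalSnapshot) -> dict:
    """Compare one snapshot against the two Phase 4 thresholds."""
    return {
        "toil_ok": toil_fraction(s) < TOIL_LIMIT,
        "escalations_ok": escalation_rate(s) < ESCALATION_LIMIT,
    }


if __name__ == "__main__":
    # Made-up example numbers, for illustration only.
    sprint = OperationalSnapshot(
        toil_hours=30, operational_hours=80,
        total_alerts=200, alerts_escalated_to_sre=12,
    )
    print(phase4_check(sprint))
```

        Comparing successive snapshots, rather than a single one, would be closer to the "trending downward" wording of the escalation sub-task.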

            asegurap1@redhat.com Antoni Segura Puimedon
            kat@redhat.com Kat Keane