  1. OpenShift Request For Enhancement
  2. RFE-2827

Retrieve ec2 events for machines and produce metrics and status.condition



      1. Proposed title of this feature request

      Retrieve ec2 events for machines and produce metrics and status.condition

      2. What is the nature and description of the request?
      EC2 instances can have events scheduled against them by AWS. One example is maintenance for degraded hardware, where the instance must be stopped and started to move it to new hardware. When running services like ROSA and OSD, where customers can bring their own AWS account, SRE has no way to intercept notifications of these scheduled events and is therefore blind to them. This is highly undesirable when the instance is a control plane machine.


      This RFE proposes that MAPI (machine-api-operator) be able to call DescribeInstanceStatus and publish any scheduled events it retrieves into the given machine's status.condition. Additionally, a metric should be raised indicating that machine-X has scheduled event-X.
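
The flow proposed above can be sketched roughly as follows. This is only an illustration, not MAPI code: it assumes a response dict in the shape returned by the EC2 DescribeInstanceStatus API (for example via boto3's `describe_instance_status`), and the condition type `ScheduledEventPending`, the reasons, and the metric name `mapi_machine_scheduled_event` are hypothetical names chosen for the example.

```python
from datetime import datetime, timezone

# Hypothetical names: neither this condition type nor this metric exists in MAPI.
CONDITION_TYPE = "ScheduledEventPending"
METRIC_NAME = "mapi_machine_scheduled_event"


def _fmt(ts):
    """Render a timezone-aware datetime as a UTC timestamp, or 'unknown' if absent."""
    if ts is None:
        return "unknown"
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def events_by_instance(response):
    """Map instance ID -> scheduled events that are still pending."""
    result = {}
    for status in response.get("InstanceStatuses", []):
        # Finished events stay visible for a while with a "[Completed]"
        # prefix in their description; skip those.
        result[status["InstanceId"]] = [
            e for e in status.get("Events", [])
            if not e.get("Description", "").startswith("[Completed]")
        ]
    return result


def build_condition(events):
    """Build a condition-like dict summarising the pending events of one machine."""
    if not events:
        return {"type": CONDITION_TYPE, "status": "False",
                "reason": "NoScheduledEvents", "message": ""}
    message = "; ".join(
        f"{e['Code']}: {e.get('Description', '')} "
        f"(not before {_fmt(e.get('NotBefore'))})"
        for e in events
    )
    return {"type": CONDITION_TYPE, "status": "True",
            "reason": "ScheduledEventFound", "message": message}


def metric_lines(machine_name, events):
    """Emit one Prometheus-style sample per pending event."""
    return [
        f'{METRIC_NAME}{{machine="{machine_name}",event_code="{e["Code"]}"}} 1'
        for e in events
    ]
```

In a real controller the condition would be written to the Machine object and the metric registered with a Prometheus client. The sketch only shows how one DescribeInstanceStatus response could feed both outputs.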

      3. Why does the customer need this? (List the business requirements here)

      The customer needs this so that SRE can build business logic on these metrics.

      For example, if a metric is raised for a control plane node that requires a stop/start to migrate to new hardware, SRE can coordinate this through either manual or automated intervention.


      Additionally, if the machine is a worker or infra node, SRE can employ automation to stop and start the instance to resolve the scheduled maintenance.


      Otherwise, customers' instances will be stopped when the maintenance deadline arrives. This results in alerts for SRE and lost compute for the customer, which is undesirable for both parties. As part of a managed service, SRE should proactively handle these conditions on behalf of the customer.


      This was previously explored to a degree here: https://github.com/openshift/enhancements/pull/341


      The status.condition request is there to give detail to anyone inspecting the scheduled event, so that they do not need to access AWS directly for that information.
      4. List any affected packages or components.



      cc adejong+hosted wgordon@redhat.com 

            rhn-support-dhardie Duncan Hardie
            dofinn Dominic Finn (Inactive)