Feature Overview
- The customer is experiencing a long VM restart time of ~15 minutes when a host fails.
Goals
- The expected user outcome is for the VM to restart within seconds on another node in case of a host failure, allowing applications/servers to have minimal downtime.
Requirements
Even with the improvements in FAR (90 seconds to host recovery), this is a small-footprint critical system: a failed VM needs to start on another host as quickly as possible, and the failed host needs to be rebooted.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Questions to answer
- What system checks or health checks are performed on an IPI installation with the MachineHealthCheck controller?
- With the MachineHealthCheck controller, how much time or range of time will the VM take to restart?
Background, and strategic fit
This Section: What does the person writing code, testing, or documenting need to know? What context can be provided to frame this feature?
Documentation Considerations
Questions to be addressed:
- What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
- The customer needs reference material that shows which health checks are performed when a host fails and the best practices to follow to minimize downtime
- Does this feature have doc impact?
- Yes
- What concepts do customers need to understand to be successful in [action]?
- The customer needs further information on the above Questions to Answer
- How do we expect customers will use the feature? For what purpose(s)?
- They will use this feature to minimize VM downtime due to a host failure
- What reference material might a customer want/need to complete [action]?
- Documentation listing the steps needed to remediate a host failure and how long the VM may take to restart on another node
- Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
- N/A.
- What is the doc impact (New Content, Updates to existing content, or Release Note)?
- Release Note
- is blocked by: CNV-36134 Reduce time to redeploy VM scheduled on unhealthy node on 4.15.1 (Closed)
- is related to: CNV-30903 Knowledge base article an VM recovery time on 4.14 (Closed)
- is related to: CNV-58935 spike: Research how to decrease time to node failure detection (Closed)
- relates to: CNV-60410 Faster remediation start with baremetal events (In Progress)
- links to