CoreOS OCP / COS-2344

Impact: ensure fixes land for large inodes

    • Type: Spike
    • Resolution: Done
    • Priority: Critical

      We're asking the following questions to evaluate whether or not OCPBUGS-16410 warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid recommending an update which introduces new risk or reduces cluster functionality in any way. In the absence of a declared update risk (the status quo), there is some risk that the existing fleet updates into the at-risk releases. Depending on the bug and estimated risk, leaving the update risk undeclared may be acceptable.

      Sample answers are provided to give more context and the ImpactStatementRequested label has been added to OCPBUGS-16410. When responding, please move this ticket to Code Review. The expectation is that the assignee answers these questions.

      Which 4.y.z to 4.y'.z' updates increase vulnerability?

      • reasoning: This allows us to populate from and to in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is exposed".
      • example: Customers upgrading from any 4.y (or specific 4.y.z) to 4.(y+1).z'. Use oc adm upgrade to show your current cluster version.
      • Any 4.11 -> 4.12 update.
      • Impact for updates into 4.11 with layered products is still under investigation.
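
As a rough illustration of how the from/to answer above could be applied, the sketch below checks whether a given update crosses the exposed 4.11 to 4.12 boundary. The names EXPOSED_MINOR_EDGES, parse_major_minor, and update_is_exposed are hypothetical, and the edge set encodes only the answer given here, not the still-open investigation of updates into 4.11.

```python
import re
from typing import Tuple

# Assumption: only the "any 4.11 -> 4.12" answer from this ticket is encoded.
EXPOSED_MINOR_EDGES = {(11, 12)}


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '4.11.43'."""
    match = re.match(r"^(\d+)\.(\d+)\.\d+", version)
    if match is None:
        raise ValueError(f"unrecognized release string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def update_is_exposed(source: str, target: str) -> bool:
    """Return True if the source -> target update crosses an exposed edge."""
    src_major, src_minor = parse_major_minor(source)
    dst_major, dst_minor = parse_major_minor(target)
    if src_major != 4 or dst_major != 4:
        return False
    return (src_minor, dst_minor) in EXPOSED_MINOR_EDGES
```

Under these assumptions, update_is_exposed("4.11.43", "4.12.20") is true, while an update that stays within 4.12 is not flagged.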

      Which types of clusters?

      • reasoning: This allows us to populate matchingRules in conditional update recommendations for "clusters like $THIS".
      • example: GCP clusters with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.

      Clusters with nodes with >3TB XFS filesystems mounted as root.
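
A minimal sketch of how an administrator might check a single host against the criterion above, assuming it is run directly on the node (not inside a container) and that "3TB" means 3 * 10^12 bytes. The helper names and the threshold constant are hypothetical; only os.statvfs and the /proc/mounts format are standard.

```python
import os

# Assumption: decimal terabytes; the ticket does not say TB vs TiB.
THRESHOLD_BYTES = 3 * 10**12


def root_fstype(mounts_text: str) -> str:
    """Return the filesystem type mounted at '/' from /proc/mounts-style text."""
    fstype = ""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fstype, options, dump, pass.
        if len(fields) >= 3 and fields[1] == "/":
            fstype = fields[2]  # the last matching entry wins
    return fstype


def filesystem_size_bytes(path: str = "/") -> int:
    """Total size of the filesystem containing path (Unix only)."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks


def looks_exposed(fstype: str, size_bytes: int,
                  threshold: int = THRESHOLD_BYTES) -> bool:
    """True for an XFS filesystem larger than the threshold."""
    return fstype == "xfs" and size_bytes > threshold
```

On a node this could be driven with looks_exposed(root_fstype(open("/proc/mounts").read()), filesystem_size_bytes("/")). A more conservative threshold can be passed explicitly.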


      The two questions above are sufficient to declare an initial update risk, and we would like as much detail as possible on them as quickly as you can get it. Perfectly crisp responses are nice, but are not required. For example "it seems like these platforms are involved, because..." in a day 1 draft impact statement is helpful, even if you follow up with "actually, it was these other platforms" on day 3. In the absence of a response within 7 days, we may or may not declare a conditional update risk based on our current understanding of the issue.

      If you can, answers to the following questions will make the conditional risk declaration more actionable for customers.

      What is the impact? Is it serious enough to warrant removing update recommendations?

      • reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
      • example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
      • example: Up to 90 seconds of API downtime. Check with curl ....
      • example: etcd loses quorum and you have to restore from backup. Check with ssh ....

      Potentially high for customers with >2T (to be conservative) root filesystems. Basically zero for anything below that.

      How involved is remediation?

      • reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
      • example: Issue resolves itself after five minutes.
      • example: Admin can run a single: oc ....
      • example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities.

      "Somewhat involved"

      Is this a regression?

      • reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
      • example: No, it has always been like this; we just never noticed.
      • example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1.

      Technically, no. The underlying bug has been there since 4.1, but was dormant until we added extensions (client-side installed packages).

            walters@redhat.com Colin Walters
            trking W. Trevor King