CI: fail update suite if any ClusterOperator goes Available=False

    • Type: Story
    • Resolution: Done
    • Priority: Major
    • OTA 243, OTA 244, OTA 245

      These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:

      Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      

      And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".
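
      As a rough illustration of the kind of check this card asks for, the sketch below scans monitor log lines and collects every transition to Available=False so a suite could fail on them. It is only a sketch: the regular expression is modeled on the single Degraded line quoted above, the Available=False sample line is hypothetical, and the names `conditionChange`, `parseConditionChange`, and `unavailableTransitions` are invented here, not taken from origin.

```go
package main

import (
	"fmt"
	"regexp"
)

// conditionChange is a hypothetical type describing one ClusterOperator
// condition transition parsed from a monitor log line.
type conditionChange struct {
	Operator  string
	Condition string
	Status    string
	Detail    string
}

// changeRE is an assumption: it is modeled on the one Degraded log line
// quoted in this issue and may not cover every format the monitor emits.
var changeRE = regexp.MustCompile(`clusteroperator/(\S+) changed (\w+) to (\w+)(?:: (.*))?$`)

// parseConditionChange extracts a transition from a line, if one is present.
func parseConditionChange(line string) (conditionChange, bool) {
	m := changeRE.FindStringSubmatch(line)
	if m == nil {
		return conditionChange{}, false
	}
	return conditionChange{Operator: m[1], Condition: m[2], Status: m[3], Detail: m[4]}, true
}

// unavailableTransitions returns every transition to Available=False.
// Degraded transitions are ignored here, matching the narrowed scope of this card.
func unavailableTransitions(lines []string) []conditionChange {
	var out []conditionChange
	for _, line := range lines {
		c, ok := parseConditionChange(line)
		if ok && c.Condition == "Available" && c.Status == "False" {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	lines := []string{
		// Line quoted in the issue description.
		"Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()",
		// Hypothetical line, invented for illustration only.
		"Feb 18 21:12:00.000 E clusteroperator/example-operator changed Available to False: ExampleReason: example message",
	}
	for _, c := range unavailableTransitions(lines) {
		fmt.Printf("FAIL: clusteroperator/%s went Available=False: %s\n", c.Operator, c.Detail)
	}
}
```

      A real suite would read these transitions from the cluster or the monitor's recorded events instead of matching text, but the failure condition would be the same: any operator reporting Available=False during the update.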

            [OTA-362] CI: fail update suite if any ClusterOperator goes Available=False

            Scott Dodson added a comment - I've set all related bugs to priority Major and left comments indicating we'd like to have these addressed by 4.16.

            W. Trevor King added a comment - origin#27231 landed

            Lalatendu Mohanty added a comment (edited) - rhn-engineering-dgoodwin: This is a high priority for us. We want to make progress on this before the 4.15 release. This is technical debt we should have addressed long ago. I am happy to hand it to you if you want to take it over, and we will help in any way we can. Please sync with trking about how we can pass it to you.

            Devan Goodwin added a comment - lmohanty@redhat.com trking: just wondering how this effort sits in priority, specifically https://github.com/openshift/origin/pull/27231. I'm working on correlating alerts SD struggles with against CI alerts that fire, and this one is a top offender; the change looked like a really big win. If you'd like TRT to take over that PR and wrap it up, we'd be happy to; just say the word. If it's something you all would like to complete, that's fine. I'm just looking to see where things are at and whether we can help.

            Lalatendu Mohanty added a comment (edited) - Created a card for operators going Degraded during upgrade: https://issues.redhat.com/browse/OTA-699. Changed the title of this Jira card to cover only the Available condition. We also need a new card to cover the work of communicating the Available and Degraded conditions to teams.

            Lalatendu Mohanty added a comment - Slack thread to understand what others (staff engineers, TRT) think about this card: https://coreos.slack.com/archives/CEGKQ43CP/p1654625537501139

            Lalatendu Mohanty added a comment - We should do this at the beginning of the 4.12 release cycle, so that teams can accommodate the extra work coming out of it.

            Lalatendu Mohanty added a comment - Let's take this to the pillar arch call and see what other teams have to say about it.

            W. Trevor King added a comment (edited) - Clayton floated origin#25920 for updates.

            W. Trevor King added a comment - Related, but not for update jobs: origin#25918.

              trking W. Trevor King