OpenShift Bugs / OCPBUGS-2271

Upgrade failures and MCDPivotError Alert Firing on GCP realtime kernel

Details

    • Bug
    • Resolution: Duplicate
    • Undefined
    • None
    • 4.12
    • Important
    • Proposed
    • False

    Description

      We have detected a noticeable drop in GCP upgrade success rates (from 90% to 85% over the last week).

      periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-rt-upgrade seems to be permafailing now.

      The pattern appears to be:

      Operator upgrade machine-config 	0s
      {Failed to upgrade machine-config, operator was degraded (RequiredPoolsFailed): Unable to apply 4.12.0-0.ci-2022-10-10-233202: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)]  Failed to upgrade machine-config, operator was degraded (RequiredPoolsFailed): Unable to apply 4.12.0-0.ci-2022-10-10-233202: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)]}
      

      and

      disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
      
      alert MCDPivotError fired for 1021 seconds with labels: {container="oauth-proxy", endpoint="metrics", err="error running systemd-run --unit machine-config-daemon-update-rpmostree-via-container --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged --pid=host --net=host --rm -v /:/run/host registry.ci.openshift.org/ocp/4.12-2022-10-10-233202@sha256:53afd52bd7920b4a1e9dd805a16e643937b406abe092508c37e7e089fa9a806f rpm-ostree ex deploy-from-self /run/host: Running as unit: machine-config-daemon-update-rpmostree-via-container.service\nFinished with result: exit-code\nMain processes terminated with: code=exited/status=1\nService runtime: 38.597s\nCPU time consumed: 1min 6.453s\n: exit status 1", exported_node="ci-op-rxh51fhz-6bb16-2qspl-worker-a-lswln", instance="10.0.128.3:9001", job="machine-config-daemon", namespace="openshift-machine-config-operator", node="ci-op-rxh51fhz-6bb16-2qspl-worker-a-lswln", pivot_target="registry.ci.openshift.org/ocp/4.12-2022-10-10-233202@sha256:53afd52bd7920b4a1e9dd805a16e643937b406abe092508c37e7e089fa9a806f", pod="machine-config-daemon-bv7bp", service="machine-config-daemon", severity="warning"}
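      When triaging many hits of this alert, it can help to pull the node and pivot target out of the label set programmatically. The following is a minimal sketch, not part of any OpenShift tooling: the helper name `parse_labels` is hypothetical, and the sample string is an abridged copy of the labels above with the long `err` label omitted.

```python
import re

# Abridged copy of the label set from the alert above.
# The long err="..." label is omitted here for brevity.
ALERT_LABELS = (
    '{container="oauth-proxy", endpoint="metrics", '
    'exported_node="ci-op-rxh51fhz-6bb16-2qspl-worker-a-lswln", '
    'instance="10.0.128.3:9001", job="machine-config-daemon", '
    'namespace="openshift-machine-config-operator", '
    'node="ci-op-rxh51fhz-6bb16-2qspl-worker-a-lswln", '
    'pivot_target="registry.ci.openshift.org/ocp/4.12-2022-10-10-233202'
    '@sha256:53afd52bd7920b4a1e9dd805a16e643937b406abe092508c37e7e089fa9a806f", '
    'pod="machine-config-daemon-bv7bp", service="machine-config-daemon", '
    'severity="warning"}'
)

# key="value" pairs; the value may contain backslash escapes such as \n.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(text):
    """Return the key="value" pairs found in an alert label string as a dict."""
    return dict(LABEL_RE.findall(text))


labels = parse_labels(ALERT_LABELS)
print(labels["node"])
print(labels["pivot_target"])
```

      The regular expression tolerates backslash escapes inside values, so it should also cope with the full `err` label, which contains literal `\n` sequences.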
      

      This MCDPivotError is probably very telling. Search.ci shows it firing primarily on GCP rt jobs, and for much longer durations than the occasional hits elsewhere.

      https://search.ci.openshift.org/?search=alert+MCDPivotError+fired&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=gcp&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-rt-upgrade
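      To repeat the search with a different alert name or time window, the query string can be rebuilt from its parameters. This is a sketch only: the parameter names and values are copied from the URL above, the helper `build_search_url` is hypothetical, and the assumption that `name` filters on job name is inferred from the query rather than from search.ci documentation.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://search.ci.openshift.org/"


def build_search_url(search, name, max_age="48h"):
    """Build a search.ci URL using the same parameters as the link above."""
    params = {
        "search": search,
        "maxAge": max_age,
        "context": "1",
        "type": "bug+issue+junit",  # urlencode escapes '+' as %2B
        "name": name,               # assumed to filter on job name
        "excludeName": "",
        "maxMatches": "5",
        "maxBytes": "20971520",
        "groupBy": "job",
    }
    return SEARCH_BASE + "?" + urlencode(params)


url = build_search_url("alert MCDPivotError fired", "gcp")
print(url)
```

      Called with the arguments shown, this reproduces the link above. Passing a different `max_age` or `search` string gives the corresponding variant of the query.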

            People

              team-mco Team MCO
              rhn-engineering-dgoodwin Devan Goodwin
              Rio Liu Rio Liu