Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.13.z
Component/s: RHCOS
Labels:
- mco-triaged
- osintegration

Severity:
Critical
Regression:
No
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None

SFDC Cases Counter:
SFDC Cases Links:

Description of problem:

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/44510/consoleFull

profile: 02_UPI_on_Baremetal-packet_OVN-dual-stack_Disk-encryption_Disk-mirroring_Etcd-encryption

The planned upgrade path was 4.12.41-x86_64 -> 4.13.19-x86_64, 4.14.0-x86_64.

The upgrade from 4.12.41 to 4.13.19 failed.

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.19   True        False         True       172m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.13.19   True        False         False      4h55m   
cloud-controller-manager                   4.13.19   True        False         False      4h58m   
cloud-credential                           4.13.19   True        False         False      4h59m   
cluster-autoscaler                         4.13.19   True        False         False      4h54m   
config-operator                            4.13.19   True        False         False      4h55m   
console                                    4.13.19   True        False         False      4h43m   
control-plane-machine-set                  4.13.19   True        False         False      4h55m   
csi-snapshot-controller                    4.13.19   True        False         False      4h55m   
dns                                        4.13.19   True        True          False      4h54m   DNS "default" reports Progressing=True: "Have 4 available node-resolver pods, want 6."
etcd                                       4.13.19   True        False         True       4h53m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
image-registry                             4.13.19   True        True          False      175m    Progressing: The registry is ready...
ingress                                    4.13.19   True        False         False      4h44m   
insights                                   4.13.19   True        False         False      4h49m   
kube-apiserver                             4.13.19   True        False         True       4h50m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.13.19   True        False         True       4h52m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler                             4.13.19   True        False         True       4h52m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.13.19   True        False         False      89m     
machine-api                                4.13.19   True        False         False      4h55m   
machine-approver                           4.13.19   True        False         False      4h54m   
machine-config                             4.12.41   False       True          True       69m     Cluster not available for [{operator 4.12.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)]
marketplace                                4.13.19   True        False         False      4h54m   
monitoring                                 4.13.19   False       True          True       66m     reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: Too many daemonset pods are unavailable (2 > 1 max unavailable).
network                                    4.13.19   True        True          False      4h54m   DaemonSet "/openshift-multus/multus" is not available (awaiting 2 nodes)...
node-tuning                                4.13.19   True        True          False      121m    Working towards "4.13.19"
openshift-apiserver                        4.13.19   True        False         True       4h48m   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.13.19   True        False         False      4h51m   
openshift-samples                          4.13.19   True        False         False      121m    
operator-lifecycle-manager                 4.13.19   True        False         False      4h54m   
operator-lifecycle-manager-catalog         4.13.19   True        False         False      4h54m   
operator-lifecycle-manager-packageserver   4.13.19   True        False         False      4h48m   
service-ca                                 4.13.19   True        False         False      4h55m   
storage                                    4.13.19   True        False         False      4h55m
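To pick the problem operators out of a table like the one above, a minimal Python sketch could filter on the AVAILABLE and DEGRADED columns. It assumes the plain-text column layout shown here; `unhealthy_operators` and `SAMPLE` are illustrative names, not part of `oc`, and the sample rows are abbreviated from the output above.

```python
def unhealthy_operators(table: str) -> list[tuple[str, str]]:
    """Return (name, message) for operators that are unavailable or degraded.

    Assumes the column order shown in the report:
    NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
    """
    results = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 6)  # MESSAGE may contain spaces or be absent
        if len(fields) < 6:
            continue  # not a data row
        name, _version, available, _progressing, degraded = fields[:5]
        message = fields[6] if len(fields) > 6 else ""
        if available != "True" or degraded == "True":
            results.append((name, message))
    return results


# Abbreviated sample rows copied from the report (messages shortened).
SAMPLE = """\
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
baremetal        4.13.19   True        False         False      4h55m
etcd             4.13.19   True        False         True       4h53m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
machine-config   4.12.41   False       True          True       69m     Cluster not available for [{operator 4.12.41}]
"""

for name, message in unhealthy_operators(SAMPLE):
    print(f"{name}: {message}")
```

Splitting with a maximum of six splits keeps the free-text MESSAGE column intact even though it contains spaces.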

The machine config pools (mcp) are stuck in Updating:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4abc568f4584b75998ea779c818f03c4   False     True       False      3              0                   0                     0                      4h56m
worker   rendered-worker-b9f20eedfe223ba9bb8cd926ea93db77   False     True       False      3              1                   1                     0                      4h56m
$ oc get node
NAME                                                 STATUS                        ROLES                  AGE     VERSION
master-00.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   control-plane,master   4h59m   v1.25.14+31e0558
master-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
master-02.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
worker-00.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h44m   v1.25.14+31e0558
worker-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h46m   v1.26.9+636f2be
worker-02.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker                 4h45m   v1.25.14+31e0558
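In the node list above, only worker-01 reports the v1.26.9 kubelet, while master-00 and worker-02 are NotReady. A small sketch like the following could summarize that from the text output. It assumes the five-column layout shown here; `summarize_nodes` is a hypothetical helper name.

```python
from collections import defaultdict


def summarize_nodes(table: str):
    """Group node names by kubelet version and collect nodes that are NotReady."""
    by_version = defaultdict(list)
    not_ready = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        name, status, _roles, _age, version = line.split()
        by_version[version].append(name)
        # STATUS is a comma-separated list, e.g. "NotReady,SchedulingDisabled"
        if "NotReady" in status.split(","):
            not_ready.append(name)
    return dict(by_version), not_ready


# Three rows copied from the report.
SAMPLE = """\
NAME                                                 STATUS                        ROLES                  AGE     VERSION
master-00.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   control-plane,master   4h59m   v1.25.14+31e0558
worker-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h46m   v1.26.9+636f2be
worker-02.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker                 4h45m   v1.25.14+31e0558
"""

versions, not_ready = summarize_nodes(SAMPLE)
print("NotReady:", not_ready)
for version, names in versions.items():
    print(version, names)
```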

ClusterVersion details:

$ oc get clusterversion version -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2023-10-27T07:39:05Z"
  generation: 3
  name: version
  resourceVersion: "191974"
  uid: 6b3d8f94-8948-4756-a0f9-0b9fb39bb91d
spec:
  channel: stable-4.12
  clusterID: 0b4919c9-3b65-4c77-9e0c-cea5933494ca
  desiredUpdate:
    force: true
    image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
    version: ""
status:
  availableUpdates: null
  capabilities:
    enabledCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - NodeTuning
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
    knownCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - NodeTuning
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
  conditions:
  - lastTransitionTime: "2023-10-27T07:39:07Z"
    message: 'Unable to retrieve available updates: currently reconciling cluster
      version 4.13.19 not found in the "stable-4.12" channel'
    reason: VersionNotFound
    status: "False"
    type: RetrievedUpdates
  - lastTransitionTime: "2023-10-27T10:10:34Z"
    message: Capabilities match configured spec
    reason: AsExpected
    status: "False"
    type: ImplicitlyEnabledCapabilities
  - lastTransitionTime: "2023-10-27T07:39:07Z"
    message: Payload loaded version="4.13.19" image="quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a"
      architecture="amd64"
    reason: PayloadLoaded
    status: "True"
    type: ReleaseAccepted
  - lastTransitionTime: "2023-10-27T08:03:40Z"
    message: Done applying 4.12.41
    status: "True"
    type: Available
  - lastTransitionTime: "2023-10-27T11:15:58Z"
    message: Cluster operators etcd, kube-apiserver are degraded
    reason: ClusterOperatorsDegraded
    status: "True"
    type: Failing
  - lastTransitionTime: "2023-10-27T10:10:20Z"
    message: 'Unable to apply 4.13.19: wait has exceeded 40 minutes for these operators:
      etcd, kube-apiserver'
    reason: ClusterOperatorsDegraded
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-10-27T11:14:04Z"
    message: 'Cluster operator machine-config should not be upgraded between minor
      versions: One or more machine config pools are updating, please see `oc get
      mcp` for further details'
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  desired:
    image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
    url: https://access.redhat.com/errata/RHSA-2023:6130
    version: 4.13.19
  history:
  - acceptedRisks: |-
      Forced through blocking failures: Multiple precondition checks failed:
      * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
      * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.41 to 4.13.19 is unknown.
    completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
    startedTime: "2023-10-27T10:10:20Z"
    state: Partial
    verified: true
    version: 4.13.19
  - completionTime: "2023-10-27T08:03:40Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:59c93fdfff4ecca2ca6d6bb0ec722bca2bb08152252ae10ce486a9fc80c82dcf
    startedTime: "2023-10-27T07:39:07Z"
    state: Completed
    verified: false
    version: 4.12.41
  observedGeneration: 3
  versionHash: K84dirQ2oDM=
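The conditions that matter most in the dump above are Failing and Progressing. As a rough sketch, the same object could be fetched as JSON (for example with `oc get clusterversion version -o json`) and reduced to the conditions that are currently True. The helper name `active_conditions` and the trimmed sample document are assumptions for illustration; the messages are copied from the report.

```python
import json


def active_conditions(cluster_version: dict, watched=("Failing", "Progressing")) -> dict:
    """Map each watched condition type whose status is "True" to its message."""
    conditions = cluster_version.get("status", {}).get("conditions", [])
    return {
        c["type"]: c.get("message", "")
        for c in conditions
        if c.get("type") in watched and c.get("status") == "True"
    }


# Trimmed-down ClusterVersion object holding three of the reported conditions.
SAMPLE_JSON = """
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True",
       "message": "Done applying 4.12.41"},
      {"type": "Failing", "status": "True", "reason": "ClusterOperatorsDegraded",
       "message": "Cluster operators etcd, kube-apiserver are degraded"},
      {"type": "Progressing", "status": "True", "reason": "ClusterOperatorsDegraded",
       "message": "Unable to apply 4.13.19: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver"}
    ]
  }
}
"""

for ctype, message in active_conditions(json.loads(SAMPLE_JSON)).items():
    print(f"{ctype}: {message}")
```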

The same issue also appears in a Prow job. From https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.19 , follow the "Failed" link for the 4.12.41 to 4.13.19 upgrade and check its logs:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1717511870073540608/build-log.txt

Oct 26 14:31:54.647: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:04.645: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:14.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.832: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
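To estimate how long a pool kept reporting Updating in a build log like this, one option is to parse the timestamps of the "still reporting" lines. The sketch below assumes the exact line format shown above. The log lines carry no year, so the `year` parameter (defaulting to 2023, the year this issue was filed) is an assumption, and `stuck_duration` is a hypothetical helper name.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d+): INFO: "
    r"Pool (?P<pool>\S+) is still reporting "
    r"\(Updated: (?P<updated>\w+), Updating: (?P<updating>\w+), Degraded: (?P<degraded>\w+)\)"
)


def stuck_duration(log_text: str, pool: str, year: int = 2023) -> float:
    """Seconds between the first and last 'still reporting' line with Updating: true."""
    stamps = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m["pool"] == pool and m["updating"] == "true":
            # Prepend the assumed year so the timestamp parses unambiguously.
            stamps.append(datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f"))
    if len(stamps) < 2:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds()


# The five log lines quoted in the report.
SAMPLE_LOG = """\
Oct 26 14:31:54.647: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:04.645: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:14.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.832: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
"""

print(stuck_duration(SAMPLE_LOG, "worker"))
```

The excerpt covers only the tail of the wait, so the result describes these five lines, not the full time the pool was stuck.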

Version-Release number of selected component (if applicable):

4.12.41 to 4.13.19 upgrade

How reproducible:

Not always (intermittent)

Steps to Reproduce:

1. Upgrade a cluster from 4.12.41 to 4.13.19.

Actual results:

The upgrade fails.

Expected results:

The upgrade should succeed.

Additional info:

A separate run upgraded successfully along the path 4.11.52-x86_64 -> 4.12.41-x86_64, 4.13.19-x86_64, 4.14.0-x86_64:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/44490/

So the issue does not appear to happen on every upgrade.

mentioned on

Merge request - [4.10-4.15] [Packet] Fixing 'COPY_NETWORK' backported conditional typo

Assignee: Unassigned

Reporter: Junqi Zhao

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 12

Created: 2023/10/27 12:46 PM

Updated: 2024/04/18 3:52 PM
