OpenShift Bugs / OCPBUGS-20122

Failure on OCP Upgrade from 4.11 to 4.12 Due to etcd Operator Issues


Details

    • Critical
    • ShiftStack Sprint 244, ShiftStack Sprint 245, ShiftStack Sprint 246
    • Upgrade to 4.12 on the OpenStack platform could fail when the master nodes were attached to additional networks, due to a known race condition when switching from the in-tree cloud provider to the external cloud provider: during the upgrade there is a short window in which both providers are active at the same time and can report different node IPs. The fix adds an annotation that causes both providers to report the same primary node IP, preventing node IP flapping.
    • Bug Fix
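
      One illustrative way to check for the node IP flapping described in the release note above is to watch a master node's reported addresses and annotations while both cloud providers are active. The node name below is taken from this report and the commands are plain read-only oc calls, not part of the fix itself:

      $ oc get node ostest-ttvx4-master-0 -o jsonpath='{.status.addresses}{"\n"}'
      $ oc get node ostest-ttvx4-master-0 -o jsonpath='{.metadata.annotations}{"\n"}'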

    Description

      Description of problem:

      During the OCP upgrade multijob starting from version 4.10, with the OVNKubernetes network type on OSP 16.2, the upgrade failed on the 4.11 to 4.12 step: the etcd cluster operator became unavailable. The node ostest-ttvx4-master-2 is in SchedulingDisabled status, and in the openshift-etcd namespace the etcd-ostest-ttvx4-master-0 pod is repeatedly reporting errors. The logs point to problems with etcd members and their data directories.
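
      The symptoms above can be confirmed with standard oc commands; the node and pod names are the ones from this report, and the container name assumes the errors come from the etcd container of the static pod:

      $ oc get nodes                                      # ostest-ttvx4-master-2 shows SchedulingDisabled
      $ oc get co etcd                                    # etcd operator reports Available=False
      $ oc get pods -n openshift-etcd                     # etcd-ostest-ttvx4-master-0 is in Error
      $ oc logs -n openshift-etcd etcd-ostest-ttvx4-master-0 -c etcd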

      Version-Release number of selected component (if applicable):

      OCP 4.11.50 to 4.12.36
      RHOS-16.2-RHEL-8-20230510.n.1

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Begin the OCP upgrade process starting from version 4.10.
      2. Upgrade from 4.10 to 4.11.
      3. Upgrade from 4.11 to 4.12 (a sketch of the oc commands follows).
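
      For reference, a minimal sketch of steps 2 and 3 as oc commands, assuming the exact versions from this report; the CI multijob may drive the upgrade differently:

      $ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.11"}}'
      $ oc adm upgrade --to=4.11.50
      # wait for the 4.11 upgrade to complete, then:
      $ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.12"}}'
      $ oc adm upgrade --to=4.12.36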
      

      Actual results:

      The upgrade fails on the 4.11 to 4.12 step, pointing to issues with the etcd operator: the operator reports that it is unavailable and that specific etcd members are unhealthy.

      Expected results:

      Smooth upgrade from 4.11 to 4.12 without any issues.

      Additional info:

      $ oc get co
      NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
      authentication 4.12.36 True False True 4h17m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      baremetal 4.12.36 True False False 8h
      ...
      ...
      csi-snapshot-controller 4.12.36 True False False 8h
      dns 4.12.36 True False False 8h
      etcd 4.12.36 False True True 4h33m EtcdMembersAvailable: 2 of 4 members are available, NAME-PENDING-172.17.5.228 has not started, ostest-ttvx4-master-0 is unhealthy
      .....
      machine-approver 4.12.36 True False False 8h
      machine-config 4.11.50 True True True 6h14m Unable to apply 4.12.36: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]
      marketplace 4.12.36 True False False 8h
      monitoring 4.12.36 True False False 4h15m
      network 4.12.36 True False False 8h
      node-tuning 4.12.36 True False False 5h15m
      openshift-apiserver 4.12.36 True False True 4h19m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      ...
      operator-lifecycle-manager-packageserver 4.12.36 True False False 8h
      service-ca 4.12.36 True False False 8h
      storage 4.12.36 True False False 8h 
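      When the etcd ClusterOperator reports missing members as above, more detail is usually available from the operator resources themselves; these are generic read-only checks, not specific to this bug:

      $ oc describe clusteroperator etcd
      $ oc get etcd cluster -o yaml    # operator.openshift.io etcd CR; status conditions describe member health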
      $ oc get pods -n openshift-etcd
      NAME READY STATUS RESTARTS AGE
      etcd-guard-ostest-ttvx4-master-0 0/1 Running 0 4h33m
      etcd-guard-ostest-ttvx4-master-1 1/1 Running 0 4h22m
      etcd-guard-ostest-ttvx4-master-2 1/1 Running 0 5h40m
      etcd-ostest-ttvx4-master-0 3/4 Error 58 (5m10s ago) 4h25m
      etcd-ostest-ttvx4-master-1 4/4 Running 0 4h25m
      etcd-ostest-ttvx4-master-2 4/4 Running 2 (4h28m ago) 4h48m
      installer-25-ostest-ttvx4-master-0 0/1 Completed 0 4h47m
      installer-26-ostest-ttvx4-master-0 0/1 Completed 0 4h44m
      installer-27-ostest-ttvx4-master-0 0/1 Completed 0 4h34m
      revision-pruner-25-ostest-ttvx4-master-0 0/1 Completed 0 4h47m
      revision-pruner-26-ostest-ttvx4-master-0 0/1 Completed 0 4h44m
      revision-pruner-26-ostest-ttvx4-master-1 0/1 Completed 0 4h34m
      revision-pruner-27-ostest-ttvx4-master-0 0/1 Completed 0 4h34m
      revision-pruner-27-ostest-ttvx4-master-1 0/1 Completed 0 4h34m
      $ oc logs etcd-ostest-ttvx4-master-0 -n openshift-etcd
      1a4f2630e5f2296f, unstarted, , https://172.17.5.228:2380, , true
      2f6c4ca331daa2de, started, ostest-ttvx4-master-2, https://10.196.2.249:2380, https://10.196.2.249:2379, false
      752ca6c9953eff21, started, ostest-ttvx4-master-1, https://10.196.1.187:2380, https://10.196.1.187:2379, false
      a6d1d802202a55e3, started, ostest-ttvx4-master-0, https://10.196.2.93:2380, https://10.196.2.93:2379, false
      #### attempt 0
            member={name="", peerURLs=[https://172.17.5.228:2380}, clientURLs=[]
            member={name="ostest-ttvx4-master-2", peerURLs=[https://10.196.2.249:2380}, clientURLs=[https://10.196.2.249:2379]
            member={name="ostest-ttvx4-master-1", peerURLs=[https://10.196.1.187:2380}, clientURLs=[https://10.196.1.187:2379]
            member={name="ostest-ttvx4-master-0", peerURLs=[https://10.196.2.93:2380}, clientURLs=[https://10.196.2.93:2379]
            target={name="ostest-ttvx4-master-0", peerURLs=[https://10.196.2.93:2380}, clientURLs=[https://10.196.2.93:2379]
      member "https://10.196.2.93:2380" dataDir has been destroyed and must be removed from the cluster
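
      The last log line indicates that the member whose data directory was destroyed has to be removed from the cluster before the node can rejoin. Below is a minimal sketch of the documented unhealthy-member removal, run from one of the healthy etcd pods; the pod name and member ID are taken from the listing above purely as an illustration, not as a statement of how this cluster was actually recovered:

      $ oc rsh -n openshift-etcd etcd-ostest-ttvx4-master-1
      sh-4.4# etcdctl member list -w table
      sh-4.4# etcdctl member remove a6d1d802202a55e3    # ID of the ostest-ttvx4-master-0 member flagged above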

            People

              Assignee: Martin André (maandre@redhat.com)
              Reporter: Yaakov Khodorkovski (ykhodork)
