OCPBUGS-49434

etcd restore fails on the OCP cluster when following the official documented procedure.


      Description of problem:

      Restoring etcd by following the documented procedure fails when the backup is more than one day old.

      Version-Release number of selected component (if applicable):

      4.16.11

      How reproducible:

      Restoring from a backup that is more than one day old fails consistently.
      

      Steps to Reproduce:

      1) Took an etcd backup by using the procedure described in Reference [1].
      2) Immediately after taking the backup, restored it by using the procedure described in Reference [2]; the restore completed without any errors.
      3) Restored the backup taken in step 1 by using the same procedure as in step 2, but with at least one day between the backup and the restore. As recorded in Reference [4] (Restore procedure.txt), just before step 16 one of the ovnkube-node pods became stuck in the Terminating state:
      
      ~~~
      [core@default-5fpfh-master-0 ~]$ oc -n openshift-ovn-kubernetes get pod -o wide
      NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES...
      ovnkube-node-jcdfn 8/8 Terminating 0 2d22h 10.0.2.0 default-5fpfh-worker-0-nhv4x <none> <none>...
      ~~~
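
      (Step 4 below clears this by following the KCS article in Reference [3]. The exact steps of that article are not reproduced here; as a generic illustration only, a pod stuck in Terminating is usually cleared by force deleting it so that its DaemonSet can recreate it. The pod name is taken from the output above.)

      ~~~
      # Generic sketch only; follow the supported procedure in Reference [3].
      oc -n openshift-ovn-kubernetes delete pod ovnkube-node-jcdfn --force --grace-period=0
      ~~~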
      
      4) Resolved the Terminating state of the ovnkube-node pod by following the procedure in Reference [3], as described in case 03674178, and resumed the restore from step 16.
      5) An error occurred while monitoring the platform Operators in step 23 of Restore procedure.txt in Reference [4]:
      ~~~
      [core@default-5fpfh-master-0 ~]$ oc adm wait-for-stable-cluster
      clusteroperators/machine-config degraded at 2024-11-14T06:37:47Z
      clusteroperators/monitoring is unavailable, in progress, and degraded at 2024-11-14T06:37:47Z
      clusteroperators/network is still in progress at 2024-11-14T06:37:47Z
      clusteroperators/machine-config is still in a degraded state after 59m50s
      clusteroperators/monitoring is still in a degraded state after 59m50s, and is still unavailable and in progress
      clusteroperators/network is still in progress after 59m50s
      Error: Waiting for timeout
      ~~~
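
      (Step 23 of the restore procedure in Reference [2] waits for the platform Operators with oc adm wait-for-stable-cluster, which gives up after roughly an hour, as seen above. A minimal sketch of re-running the wait with an explicit minimum stable period once the degraded Operators have been investigated; the flag value is illustrative and not taken from [2].)

      ~~~
      # Re-run the stability wait after investigating the degraded Operators.
      # The minimum stable period below is an example value.
      oc adm wait-for-stable-cluster --minimum-stable-period 5m
      ~~~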
      
        
      
      
      [Request]
      1) Please tell us how to successfully restore etcd using etcd backup data that is more than one day old.

      [Version Information]
      OCP 4.16.11

      [Reference information]

      [1] 5.1.1. Backing up etcd data
      https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/backup_and_restore/control-plane-backup-and-restore#backing-up-etcd-data_backup-etcd

      [2] 5.3.2.2. Restoring the cluster to its previous state
      https://docs.openshift.com/container-platform/4.16/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state

      [3] ETCD restore process OVN network Pod hangs with exit status
      https://access.redhat.com/mt/ja/solutions/7034476

      [4] Reference material: results of executing the procedure in [2] (Restore procedure.txt)

      Describe the impact to you or the business:
      We cannot present the backup and restore operations to the business company as an operational procedure manual.

      In what environment are you experiencing this behavior?
      Offline cluster of OCP 4.16.11 built in an OSP environment (default configuration: 3 masters, 2 workers).

      How frequently does this behavior occur? Does it occur repeatedly or at certain times?
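
      (For completeness, the backup in Reference [1] comes down to running the cluster-backup.sh script from a root shell on one control plane node; a minimal sketch of that step, with the node name taken from the outputs in this report.)

      ~~~
      # Open a shell on a control plane node and run the documented backup script.
      oc debug node/default-5fpfh-master-0
      chroot /host
      /usr/local/bin/cluster-backup.sh /home/core/assets/backup
      ~~~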
      
      
      

      Actual results:

      [Current status]

      Restoring from a backup taken more than one day earlier fails, and even after applying the KCS article in Reference [3], some of the ClusterOperators and pods do not start normally:
      
      ~~~
      [core@default-5fpfh-master-0 ~]$ oc get co
      NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
      authentication 4.16.11 True False False 100m
      cloud-controller-manager 4.16.11 True False False 8d
      cloud-credential 4.16.11 True False False 8d
      cluster-autoscaler 4.16.11 True False False 8d
      config-operator 4.16.11 True False False 8d
      console 4.16.11 True False False 8d
      control-plane-machine-set 4.16.11 True False False 8d
      dns 4.16.11 True False False 8d
      etcd 4.16.11 True False False 8d
      image-registry 4.16.11 True False False 98m
      ingress 4.16.11 True False False 8d
      kube-apiserver 4.16.11 True False False 8d
      kube-controller-manager 4.16.11 True False False 8d
      kube-scheduler 4.16.11 True False False 8d
      kube-storage-version-migrator 4.16.11 True False False 8d
      machine-api 4.16.11 True False False 8d
      machine-approver 4.16.11 True False False 8d
      machine-config 4.16.11 True False True 8d Resync failed. 4.16.11: An error occurred during synchronization. RequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
      marketplace 4.16.11 True False False 8d
      monitoring 4.16.11 False True True 2m10s UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
      network 4.16.11 True True False 8d DaemonSet "/openshift-multus/network-metrics-daemon" is not available (1 node waiting)...
      openshift-apiserver 4.16.11 True False False 100m
      openshift-controller-manager 4.16.11 True False False 8d
      operator-lifecycle-manager 4.16.11 True False False 8d
      operator-lifecycle-manager-catalog 4.16.11 True False False 8d
      operator-lifecycle-manager-packageserver 4.16.11 True False False 8d
      service-ca 4.16.11 True False False 8d
      storage 4.16.11 True False False 8d

      [core@default-5fpfh-master-0 ~]$ oc get pod -A -o wide | grep -v -e Running -e Completed
      NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
      openshift-dns dns-default-snsk2 0/2 ContainerCreating 0 106m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-ingress-canary ingress-canary-l97q5 0/1 ContainerCreating 0 106m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring alertmanager-main-0 0/6 Init:0/1 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring metrics-server-67f5fc4cb7-4hwqn 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring monitoring-plugin-55655946c6-6hsqz 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring prometheus-k8s-0 0/6 Init:0/1 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring prometheus-operator-admission-webhook-76776d7749-84cbv 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring thanos-querier-6fcdfd8d4-zlhqd 0/6 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-multus network-metrics-daemon-7fdll 0/2 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-network-diagnostics ...

      NAME READY STATUS RESTARTS AGE
      etcd-default-5fpfh-master-0 4/4 Running 0 82m
      etcd-default-5fpfh-master-1 4/4 Running 0 80m
      etcd-default-5fpfh-master-2 4/4 Running 0 79m
      [core@default-5fpfh-master-0 ~]$
      ~~~
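
      (Every pod stuck in ContainerCreating above sits on default-5fpfh-worker-0-nhv4x, and the machine-config Operator reports one unavailable node in the master pool. The standard checks below are the obvious follow-up; they are not part of the documented restore procedure.)

      ~~~
      # Inspect the node that hosts all of the stuck pods and its ovnkube-node pod.
      oc get nodes
      oc describe node default-5fpfh-worker-0-nhv4x
      oc -n openshift-ovn-kubernetes get pods -o wide | grep default-5fpfh-worker-0-nhv4x
      # Inspect the machine config pool reported as degraded.
      oc get mcp
      oc describe mcp master
      ~~~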
      Expected results:
      etcd is restored successfully, even from a backup that is more than one day old.
      

      Additional info:

      Including the versions before 4.16 that we have deployed so far, this has occurred 100% of the time ever since OVN-Kubernetes was adopted as the CNI.
