Issue Type: Bug
Resolution: Done
Priority: Major
Affects Version: 4.16
Impact: Quality / Stability / Reliability
Severity: Moderate
Sprint: OSDOCS Sprint 267, OSDOCS Sprint 268
Description of problem:
Even when etcd is restored by following the documented procedure, the restore fails.
Version-Release number of selected component (if applicable):
4.16.11
How reproducible:
Restoring from an etcd backup that is at least one day old fails every time.
Steps to Reproduce:
1) Took an etcd backup using the procedure described in Reference [1] (a command-level sketch is included after this section).
2) Immediately after the backup was taken, restored it using the procedure described in Reference [2]; the restore completed without any errors.
3) Restored the backup taken in step 1 using the same procedure as in step 2, but with at least one day between the backup and the restore. Just before step 16 of the restore procedure (Restore procedure.txt in Reference [4]), one of the ovnkube-node Pods became stuck in the Terminating state:
~~~
[core@default-5fpfh-master-0 ~]$ oc -n openshift-ovn-kubernetes get pod -o wide
NAME                 READY   STATUS        RESTARTS   AGE     IP         NODE                           NOMINATED NODE   READINESS GATES
...
ovnkube-node-jcdfn   8/8     Terminating   0          2d22h   10.0.2.0   default-5fpfh-worker-0-nhv4x   <none>           <none>
...
~~~
4) Resolved the Terminating state of the ovnkube-node Pod by following the procedure in Reference [3], as described in case 03674178, and resumed the restore from step 16.
5) An error occurred while monitoring the platform Operators in step 23 of the Restore procedure.txt in Reference [4]:
~~~
[core@default-5fpfh-master-0 ~]$ oc adm wait-for-stable-cluster
clusteroperators/machine-config degraded at 2024-11-14T06:37:47Z
clusteroperators/monitoring is unavailable, in progress, and degraded at 2024-11-14T06:37:47Z
clusteroperators/network is still in progress at 2024-11-14T06:37:47Z
clusteroperators/machine-config is still in a degraded state after 59m50s
clusteroperators/monitoring is still in a degraded state after 59m50s, and is still unavailable and in progress
clusteroperators/network is still in progress after 59m50s
Error: Waiting for timeout
~~~

[Request]
1) Please tell us how to restore etcd successfully using etcd backup data that is more than one day old.

[Version Information]
OCP 4.16.11

[Reference information]
[1] 5.1.1. Backing up etcd data
    https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/backup_and_restore/control-plane-backup-and-restore#backing-up-etcd-data_backup-etcd
[2] 5.3.2.2. Restoring the cluster to its previous state
    https://docs.openshift.com/container-platform/4.16/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state
[3] ETCD restore process OVN network Pod hangs with exit status
    https://access.redhat.com/mt/ja/solutions/7034476
[4] Restore procedure.txt: the results of executing the restore based on [2]

Describe the impact to you or the business:
Can backup and restore operations be presented to the business company as an operational procedure manual?

In what environment are you experiencing this behavior?
A disconnected (offline) OCP 4.16.11 cluster built in an OSP environment (default configuration: 3 masters, 2 workers).

How frequently does this behavior occur? Does it occur repeatedly or at certain times?
(See Additional info below.)
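For reference, here is a minimal command-level sketch of steps 1-4 as we understand them from [1] and [2]. The backup directory and node name are the documentation defaults and placeholders, and the force-delete at the end is a generic workaround for a Pod stuck in Terminating; it may not match the exact procedure in Reference [3] / case 03674178.

~~~
# Step 1 (Reference [1]): take the etcd backup from a debug shell on a control plane node.
oc debug node/<control_plane_node>     # <control_plane_node> is a placeholder
chroot /host
sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup

# Steps 2-3 (Reference [2]): restore from the saved backup directory on the recovery host.
sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup

# Step 4: generic force-delete of the ovnkube-node Pod stuck in Terminating
# (sketch only; the exact steps in Reference [3] may differ).
oc -n openshift-ovn-kubernetes delete pod ovnkube-node-jcdfn --grace-period=0 --force
~~~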
Actual results:
[Current status] A restore performed with a backup that is more than one day old fails; even after applying the KCS in [3], some of the ClusterOperators and Pods do not start normally.
~~~
[core@default-5fpfh-master-0 ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.16.11   True        False         False      100m
cloud-controller-manager                   4.16.11   True        False         False      8d
cloud-credential                           4.16.11   True        False         False      8d
cluster-autoscaler                         4.16.11   True        False         False      8d
config-operator                            4.16.11   True        False         False      8d
console                                    4.16.11   True        False         False      8d
control-plane-machine-set                  4.16.11   True        False         False      8d
dns                                        4.16.11   True        False         False      8d
etcd                                       4.16.11   True        False         False      8d
image-registry                             4.16.11   True        False         False      98m
ingress                                    4.16.11   True        False         False      8d
kube-apiserver                             4.16.11   True        False         False      8d
kube-controller-manager                    4.16.11   True        False         False      8d
kube-scheduler                             4.16.11   True        False         False      8d
kube-storage-version-migrator              4.16.11   True        False         False      8d
machine-api                                4.16.11   True        False         False      8d
machine-approver                           4.16.11   True        False         False      8d
machine-config                             4.16.11   True        False         True       8d      Resync failed. 4.16.11: An error occurred during synchronization. RequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
marketplace                                4.16.11   True        False         False      8d
monitoring                                 4.16.11   False       True          True       2m10s   UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
network                                    4.16.11   True        True          False      8d      DaemonSet "/openshift-multus/network-metrics-daemon" is not available (1 node waiting)...
openshift-apiserver                        4.16.11   True        False         False      100m
openshift-controller-manager               4.16.11   True        False         False      8d
operator-lifecycle-manager                 4.16.11   True        False         False      8d
operator-lifecycle-manager-catalog         4.16.11   True        False         False      8d
operator-lifecycle-manager-packageserver   4.16.11   True        False         False      8d
service-ca                                 4.16.11   True        False         False      8d
storage                                    4.16.11   True        False         False      8d
~~~
~~~
[core@default-5fpfh-master-0 ~]$ oc get pod -A -o wide | grep -v -e Running -e Completed
NAMESPACE                       NAME                                                      READY   STATUS              RESTARTS   AGE    IP       NODE                           NOMINATED NODE   READINESS GATES
openshift-dns                   dns-default-snsk2                                         0/2     ContainerCreating   0          106m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-ingress-canary        ingress-canary-l97q5                                      0/1     ContainerCreating   0          106m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-monitoring            alertmanager-main-0                                       0/6     Init:0/1            0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-monitoring            metrics-server-67f5fc4cb7-4hwqn                           0/1     ContainerCreating   0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-monitoring            monitoring-plugin-55655946c6-6hsqz                        0/1     ContainerCreating   0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-monitoring            prometheus-k8s-0                                          0/6     Init:0/1            0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-monitoring            prometheus-operator-admission-webhook-76776d7749-84cbv   0/1     ContainerCreating   0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-monitoring            thanos-querier-6fcdfd8d4-zlhqd                            0/6     ContainerCreating   0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-multus                network-metrics-daemon-7fdll                              0/2     ContainerCreating   0          107m   <none>   default-5fpfh-worker-0-nhv4x   <none>           <none>
openshift-network-diagnostics   ...
~~~
~~~
NAME                          READY   STATUS    RESTARTS   AGE
etcd-default-5fpfh-master-0   4/4     Running   0          82m
etcd-default-5fpfh-master-1   4/4     Running   0          80m
etcd-default-5fpfh-master-2   4/4     Running   0          79m
[core@default-5fpfh-master-0 ~]$
~~~
Expected results:
etcd is restored successfully.
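All of the Pods stuck in ContainerCreating or Init above are scheduled on default-5fpfh-worker-0-nhv4x, which suggests the pod network on that worker never became ready after the restore. A hedged diagnostic sketch using only standard oc queries (not part of the documented restore procedure):

~~~
# Check the node state and the OVN-Kubernetes Pods running on the affected worker.
oc get node default-5fpfh-worker-0-nhv4x
oc -n openshift-ovn-kubernetes get pods -o wide \
  --field-selector spec.nodeName=default-5fpfh-worker-0-nhv4x

# Inspect the events of one stuck Pod for the underlying CNI/sandbox error.
oc -n openshift-multus describe pod network-metrics-daemon-7fdll
~~~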
Additional info:
Including the releases before 4.16 that we have deployed so far, this has reproduced 100% of the time ever since OVN-Kubernetes was adopted as the CNI.
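Because the failure is tied to OVN-Kubernetes being the CNI, a quick way to confirm which network plugin a given cluster is using (standard oc query; prints OVNKubernetes on the affected clusters):

~~~
oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'
~~~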