Bug
Resolution: Done
Major
4.16
Quality / Stability / Reliability
False
2
Moderate
OSDOCS Sprint 267, OSDOCS Sprint 268
2
Description of problem:
Even when etcd is restored by following the documented procedure, the restore fails.
Version-Release number of selected component (if applicable):
4.16.11
How reproducible:
Restoring from a backup that is at least one day old fails.
Steps to Reproduce:
1) Followed the procedure in reference [1] to take an etcd backup.
2) Immediately after taking the backup, followed the procedure in reference [2] to restore from it; the restore completed without any errors (a command-level sketch of these steps follows this section).
3) Restored the backup taken in 1) using the same procedure as in 2), but with at least one day between the backup and the restore. Just before step 16 of "Restore procedure.txt" in reference [4], one of the ovnkube-node pods became stuck in the Terminating state:
~~~
[core@default-5fpfh-master-0 ~]$ oc -n openshift-ovn-kubernetes get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
ovnkube-node-jcdfn 8/8 Terminating 0 2d22h 10.0.2.0 default-5fpfh-worker-0-nhv4x <none> <none>
...
~~~
4) Resolved the Terminating state of the ovnkube-node pod by following the procedure in reference [3], as described in case 03674178, and resumed the restore from step 16.
5) An error occurred while monitoring the platform operators in step 23 of "Restore procedure.txt" in reference [4]:
~~~
[core@default-5fpfh-master-0 ~]$ oc adm wait-for-stable-cluster
clusteroperators/machine-config degraded at 2024-11-14T06:37:47Z
clusteroperators/monitoring is unavailable, in progress, and degraded at 2024-11-14T06:37:47Z
clusteroperators/network is still in progress at 2024-11-14T06:37:47Z
clusteroperators/machine-config is still in a degraded state after 59m50s
clusteroperators/monitoring is still in a degraded state after 59m50s, and is still unavailable and in progress
clusteroperators/network is still in progress after 59m50s
Error: Waiting for timeout
~~~

[Request]
1) Please tell us how to restore etcd successfully using etcd backup data that is more than one day old.

[Version information]
OCP 4.16.11

[Reference information]
[1] 5.1.1. Backing up etcd data
https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/backup_and_restore/control-plane-backup-and-restore#backing-up-etcd-data_backup-etcd
[2] 5.3.2.2. Restoring the cluster to its previous state
https://docs.openshift.com/container-platform/4.16/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state
[3] ETCD restore process OVN network Pod hangs with exit status
https://access.redhat.com/mt/ja/solutions/7034476
[4] Reference material: "Restore procedure.txt", the results of executing the restore based on [2]

Describe the impact to you or the business:
We need to provide the business with backup and restore operations as an operational procedure manual.

In what environment are you experiencing this behavior?
A disconnected (offline) OCP 4.16.11 cluster built in an OSP environment (default configuration: 3 masters, 2 workers).

How frequently does this behavior occur? Does it occur repeatedly or at certain times?
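For reference, steps 1, 2, and 5 correspond to the following commands. This is a minimal sketch based on the documented procedures in [1] and [2]; the backup path is the documented example and the node names are illustrative, so adjust them to your environment:
~~~
# Step 1: take an etcd backup on a control plane node (per [1]).
[core@default-5fpfh-master-0 ~]$ sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup

# Steps 2/3: run the restore script on the recovery control plane node (per [2]),
# pointing at the directory that contains the snapshot and static pod resources.
[core@default-5fpfh-master-0 ~]$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup

# Step 5: wait for the platform operators to stabilize after the restore.
[core@default-5fpfh-master-0 ~]$ oc adm wait-for-stable-cluster
~~~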
Actual results:
[Current status]
After taking a backup, restoring from that backup more than one day later fails, and even after applying the KCS in [3], some of the ClusterOperators and pods do not start normally:
~~~
[core@default-5fpfh-master-0 ~]$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.16.11 True False False 100m
cloud-controller-manager 4.16.11 True False False 8d
cloud-credential 4.16.11 True False False 8d
cluster-autoscaler 4.16.11 True False False 8d
config-operator 4.16.11 True False False 8d
console 4.16.11 True False False 8d
control-plane-machine-set 4.16.11 True False False 8d
dns 4.16.11 True False False 8d
etcd 4.16.11 True False False 8d
image-registry 4.16.11 True False False 98m
ingress 4.16.11 True False False 8d
kube-apiserver 4.16.11 True False False 8d
kube-controller-manager 4.16.11 True False False 8d
kube-scheduler 4.16.11 True False False 8d
kube-storage-version-migrator 4.16.11 True False False 8d
machine-api 4.16.11 True False False 8d
machine-approver 4.16.11 True False False 8d
machine-config 4.16.11 True False True 8d Resync failed. 4.16.11: An error occurred during synchronization. RequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
marketplace 4.16.11 True False False 8d
monitoring 4.16.11 False True True 2m10s UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
network 4.16.11 True True False 8d DaemonSet "/openshift-multus/network-metrics-daemon" is not available (1 node waiting)...
openshift-apiserver 4.16.11 True False False 100m
openshift-controller-manager 4.16.11 True False False 8d
operator-lifecycle-manager 4.16.11 True False False 8d
operator-lifecycle-manager-catalog 4.16.11 True False False 8d
operator-lifecycle-manager-packageserver 4.16.11 True False False 8d
service-ca 4.16.11 True False False 8d
storage 4.16.11 True False False 8d
[core@default-5fpfh-master-0 ~]$ oc get pod -A -o wide | grep -v -e Running -e Completed
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openshift-dns dns-default-snsk2 0/2 ContainerCreating 0 106m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-ingress-canary ingress-canary-l97q5 0/1 ContainerCreating 0 106m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-monitoring alertmanager-main-0 0/6 Init:0/1 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-monitoring metrics-server-67f5fc4cb7-4hwqn 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-monitoring monitoring-plugin-55655946c6-6hsqz 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-monitoring prometheus-k8s-0 0/6 Init:0/1 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-monitoring prometheus-operator-admission-webhook-76776d7749-84cbv 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-monitoring thanos-querier-6fcdfd8d4-zlhqd 0/6 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-multus network-metrics-daemon-7fdll 0/2 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
openshift-network-diagnostics
NAME READY STATUS RESTARTS AGE
etcd-default-5fpfh-master-0 4/4 Running 0 82m
etcd-default-5fpfh-master-1 4/4 Running 0 80m
etcd-default-5fpfh-master-2 4/4 Running 0 79m
[core@default-5fpfh-master-0 ~]$
~~~
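The workaround in step 4 resolved the ovnkube-node pod stuck in Terminating; a common way to clear such a stuck pod is to force-delete it, sketched below for illustration only (the pod name comes from the output in step 3, and the supported steps are the ones in KCS [3]). The last commands are basic post-restore health checks:
~~~
# Force-delete the pod stuck in Terminating (illustrative; follow KCS [3] for the supported procedure).
[core@default-5fpfh-master-0 ~]$ oc -n openshift-ovn-kubernetes delete pod ovnkube-node-jcdfn --force --grace-period=0

# Basic post-restore checks: etcd static pods, node status, and ClusterOperator status.
[core@default-5fpfh-master-0 ~]$ oc -n openshift-etcd get pods -l app=etcd
[core@default-5fpfh-master-0 ~]$ oc get nodes
[core@default-5fpfh-master-0 ~]$ oc get co
~~~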
Expected results:
etcd is restored successfully, even from a backup that is more than one day old.
Additional info:
Including the releases before 4.16 that we have deployed so far, this has occurred 100% of the time ever since OVN-Kubernetes was adopted as the CNI.