OCPBUGS-49434

etcd restore fails on the OCP cluster when following the official documented procedure.


      Description of problem:

      Restoring etcd by following the documented procedure fails when the backup is more than one day old.

      Version-Release number of selected component (if applicable):

      4.16.11

      How reproducible:

      Restoring from a backup that is more than one day old fails consistently.
      

      Steps to Reproduce:

      1) Took an etcd backup by using the procedure described in Reference [1].
      2) Immediately after taking the backup, restored it by using the procedure described in Reference [2]; the restore completed without any errors.
      3) Restored the backup taken in step 1 by using the same procedure as in step 2, but with at least one day between the backup and the restore. As recorded in Reference [4] (Restore procedure.txt), just before step 16 one of the ovnkube-node pods became stuck in the Terminating state:
      
      ~~~
      [core@default-5fpfh-master-0 ~]$ oc -n openshift-ovn-kubernetes get pod -o wide
      NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES...
      ovnkube-node-jcdfn 8/8 Terminating 0 2d22h 10.0.2.0 default-5fpfh-worker-0-nhv4x <none> <none>...
      ~~~
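
      (Step 4 below clears this by following the KCS article in Reference [3]. The exact steps of that article are not reproduced here; as a generic illustration only, a pod stuck in Terminating is usually cleared by force deleting it so that its DaemonSet can recreate it. The pod name is taken from the output above.)

      ~~~
      # Generic sketch only; follow the supported procedure in Reference [3].
      oc -n openshift-ovn-kubernetes delete pod ovnkube-node-jcdfn --force --grace-period=0
      ~~~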
      
      4) Resolved the Terminating state of the ovnkube-node pod by following the procedure in Reference [3], as described in case 03674178, and resumed the restore from step 16.
      5) An error occurred while monitoring the platform Operators in step 23 of Restore procedure.txt in Reference [4]:
      ~~~
      [core@default-5fpfh-master-0 ~]$ oc adm wait-for-stable-cluster
      clusteroperators/machine-config degraded at 2024-11-14T06:37:47Z
      clusteroperators/monitoring is unavailable, in progress, and degraded at 2024-11-14T06:37:47Z
      clusteroperators/network is still in progress at 2024-11-14T06:37:47Z
      clusteroperators/machine-config is still in a degraded state after 59m50s
      clusteroperators/monitoring is still in a degraded state after 59m50s, and is still unavailable and in progress
      clusteroperators/network is still in progress after 59m50s
      Error: Waiting for timeout
      ~~~
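
      (Step 23 of the restore procedure in Reference [2] waits for the platform Operators with oc adm wait-for-stable-cluster, which gives up after roughly an hour, as seen above. A minimal sketch of re-running the wait with an explicit minimum stable period once the degraded Operators have been investigated; the flag value is illustrative and not taken from [2].)

      ~~~
      # Re-run the stability wait after investigating the degraded Operators.
      # The minimum stable period below is an example value.
      oc adm wait-for-stable-cluster --minimum-stable-period 5m
      ~~~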
      
        
      
      
      [Request]
      1) Please tell us how to successfully restore etcd using etcd backup data that is more than one day old.

      [Version Information]
      OCP 4.16.11

      [Reference information]

      [1] 5.1.1. Backing up etcd data
      https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/backup_and_restore/control-plane-backup-and-restore#backing-up-etcd-data_backup-etcd

      [2] 5.3.2.2. Restoring the cluster to its previous state
      https://docs.openshift.com/container-platform/4.16/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state

      [3] ETCD restore process OVN network Pod hangs with exit status
      https://access.redhat.com/mt/ja/solutions/7034476

      [4] Reference material: results of executing the procedure in [2] (Restore procedure.txt)

      Describe the impact to you or the business:
      We cannot present the backup and restore operations to the business company as an operational procedure manual.

      In what environment are you experiencing this behavior?
      Offline cluster of OCP 4.16.11 built in an OSP environment (default configuration: 3 masters, 2 workers).

      How frequently does this behavior occur? Does it occur repeatedly or at certain times?
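
      (For completeness, the backup in Reference [1] comes down to running the cluster-backup.sh script from a root shell on one control plane node; a minimal sketch of that step, with the node name taken from the outputs in this report.)

      ~~~
      # Open a shell on a control plane node and run the documented backup script.
      oc debug node/default-5fpfh-master-0
      chroot /host
      /usr/local/bin/cluster-backup.sh /home/core/assets/backup
      ~~~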
      
      
      

      Actual results:

      [Current status]

      Restoring from a backup taken more than one day earlier fails, and even after applying the KCS article in Reference [3], some of the ClusterOperators and pods do not start normally:
      
      ~~~
      [core@default-5fpfh-master-0 ~]$ oc get co
      NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
      authentication 4.16.11 True False False 100m
      cloud-controller-manager 4.16.11 True False False 8d
      cloud-credential 4.16.11 True False False 8d
      cluster-autoscaler 4.16.11 True False False 8d
      config-operator 4.16.11 True False False 8d
      console 4.16.11 True False False 8d
      control-plane-machine-set 4.16.11 True False False 8d
      dns 4.16.11 True False False 8d
      etcd 4.16.11 True False False 8d
      image-registry 4.16.11 True False False 98m
      ingress 4.16.11 True False False 8d
      kube-apiserver 4.16.11 True False False 8d
      kube-controller-manager 4.16.11 True False False 8d
      kube-scheduler 4.16.11 True False False 8d
      kube-storage-version-migrator 4.16.11 True False False 8d
      machine-api 4.16.11 True False False 8d
      machine-approver 4.16.11 True False False 8d
      machine-config 4.16.11 True False True 8d Resync failed. 4.16.11: An error occurred during synchronization. RequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
      marketplace 4.16.11 True False False 8d
      monitoring 4.16.11 False True True 2m10s UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
      network 4.16.11 True True False 8d DaemonSet "/openshift-multus/network-metrics-daemon" is not available (1 node waiting)...
      openshift-apiserver 4.16.11 True False False 100m
      openshift-controller-manager 4.16.11 True False False 8d
      operator-lifecycle-manager 4.16.11 True False False 8d
      operator-lifecycle-manager-catalog 4.16.11 True False False 8d
      operator-lifecycle-manager-packageserver 4.16.11 True False False 8d
      service-ca 4.16.11 True False False 8d
      storage 4.16.11 True False False 8d

      [core@default-5fpfh-master-0 ~]$ oc get pod -A -o wide | grep -v -e Running -e Completed
      NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
      openshift-dns dns-default-snsk2 0/2 ContainerCreating 0 106m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-ingress-canary ingress-canary-l97q5 0/1 ContainerCreating 0 106m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring alertmanager-main-0 0/6 Init:0/1 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring metrics-server-67f5fc4cb7-4hwqn 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring monitoring-plugin-55655946c6-6hsqz 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring prometheus-k8s-0 0/6 Init:0/1 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring prometheus-operator-admission-webhook-76776d7749-84cbv 0/1 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-monitoring thanos-querier-6fcdfd8d4-zlhqd 0/6 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-multus network-metrics-daemon-7fdll 0/2 ContainerCreating 0 107m <none> default-5fpfh-worker-0-nhv4x <none> <none>
      openshift-network-diagnostics ...

      NAME READY STATUS RESTARTS AGE
      etcd-default-5fpfh-master-0 4/4 Running 0 82m
      etcd-default-5fpfh-master-1 4/4 Running 0 80m
      etcd-default-5fpfh-master-2 4/4 Running 0 79m
      [core@default-5fpfh-master-0 ~]$
      ~~~
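
      (Every pod stuck in ContainerCreating above sits on default-5fpfh-worker-0-nhv4x, and the machine-config Operator reports one unavailable node in the master pool. The standard checks below are the obvious follow-up; they are not part of the documented restore procedure.)

      ~~~
      # Inspect the node that hosts all of the stuck pods and its ovnkube-node pod.
      oc get nodes
      oc describe node default-5fpfh-worker-0-nhv4x
      oc -n openshift-ovn-kubernetes get pods -o wide | grep default-5fpfh-worker-0-nhv4x
      # Inspect the machine config pool reported as degraded.
      oc get mcp
      oc describe mcp master
      ~~~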
      Expected results:
      etcd is restored successfully, even from a backup that is more than one day old.
      

      Additional info:

      Including the versions before 4.16 that we have deployed so far, this has occurred 100% of the time ever since OVN-Kubernetes was adopted as the CNI.
