Type: Story
Resolution: Unresolved
Work Type: Product / Portfolio Work
Status: ToDo
Issue Summary:
This story addresses a critical performance regression identified in upstream Velero that affects OADP backup operations. Users are experiencing dramatic performance degradation where backups that previously took 30 minutes now take 6+ hours for the same workload.
Upstream Issue: https://github.com/vmware-tanzu/velero/issues/9169
Problem Description:
- Velero v1.11.1: ~300k objects backed up in ~30 minutes (CPU: 1 core, Memory: 3Gi)
- Velero v1.16.2: Same 300k objects now take ~6 hours (CPU: 3.5 cores, Memory: 4.5Gi)
- Performance starts fast (~5k objects in the first few seconds), then drops to ~3 objects/sec
- Resource increases and configuration tuning have not resolved the issue
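A quick back-of-envelope calculation, using only the figures reported above, makes the scale of the regression concrete:

```python
# Average throughput implied by the reported backup times (figures from this issue).
objects = 300_000

v1_11_1_seconds = 30 * 60   # ~30 minutes on Velero v1.11.1
v1_16_2_seconds = 6 * 3600  # ~6 hours on Velero v1.16.2

rate_old = objects / v1_11_1_seconds  # objects/sec on v1.11.1
rate_new = objects / v1_16_2_seconds  # objects/sec on v1.16.2

print(f"v1.11.1: {rate_old:.1f} obj/s")         # ~166.7 obj/s
print(f"v1.16.2: {rate_new:.1f} obj/s")         # ~13.9 obj/s
print(f"slowdown: {rate_old / rate_new:.0f}x")  # 12x on average

# Note: the observed steady-state rate (~3 obj/s) is well below even the
# 6-hour average, consistent with a fast start followed by a long slow tail.
```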
Configuration Details:
- Snapshots and filesystem backup disabled
- Backup schedule: Daily at 4 AM
- Includes all namespaces and resources
- Storage location: default
- TTL: 888h0m0s (37 days)
Attempted Mitigations (unsuccessful):
- Increased resource requests to 4 cores/6Gi
- Increased clientPageSize to 700
- Increased itemBlockWorkerCount to 5
- Increased clientQPS to 100
- Increased clientBurst to 100
- Increased uploaderConfig.parallelFilesUpload to 30
Impact on OADP:
This regression directly affects OADP users running similar backup workloads and must be investigated for the OADP 1.6.0 release to ensure acceptable backup performance.
Environment:
- Kubernetes: v1.32.4-gke.1767000
- Cloud: Google Cloud GKE
- OS: Container-Optimized OS from Google
Acceptance Criteria:
- Investigate the root cause of performance degradation in newer Velero versions
- Identify if this affects OADP's Velero integration
- Implement fixes or workarounds for OADP 1.6.0 if needed
- Ensure backup performance meets acceptable standards for large workloads
- Document any configuration recommendations for optimal performance