Bug
Resolution: Unresolved
Critical
4.17.z
CFE Sprint 260
To my knowledge, the affected version is not relevant, as the operator is not part of the OCP release cycle.
Description of problem:
We're using a DNS-01 ClusterIssuer (Let's Encrypt) for which the _acme recordsets are created in Route 53 during certificate creation. cert-manager repeatedly runs into the following error:
E0518 16:10:22.720047 1 controller.go:167] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="Time limit exceeded. Last error: " "key"="ocm-production-id/cluster-api-cert-zxkfj-2403688073-3486302731
This error comes from here in certmanager-operator. It occurs when the _acme record change does not transition to INSYNC within 2 minutes, see here. When that happens, a new ChangeResourceRecordSets call is triggered with a new change ID. We never check whether the record already exists from a previous change; we just create new changes over and over, without waiting for the earlier ones to become eventually consistent. We should likely call ChangeResourceRecordSets only once and then keep checking that change for a configurable amount of time (currently hardcoded to 2 minutes).
Version-Release number of selected component (if applicable):
app.kubernetes.io/name=cert-manager app.kubernetes.io/version=v1.11.4
How reproducible:
Intermittent; depends on how long AWS takes for changes to become INSYNC.
Steps to Reproduce:
1. 2. 3.
Actual results:
Certificate creation sometimes takes more than 30 minutes because we create new changes instead of waiting for the initial change to complete.
Expected results:
Certificate creation is delayed only by the time it takes for an initial record creation to become INSYNC.
Additional info: