Red Hat Advanced Cluster Management / ACM-5555

Hosted Cluster API Certificate remains in a failed state for an hour


    • False
    • None
    • False
    • Critical
    • No

      We have seen two cases so far where a hosted cluster's cluster-api-cert certificate gets stuck in a Ready: False state with this error in the cert-manager logs:

      E0518 12:55:37.906768       1 sync.go:545] cert-manager/orders "msg"="failed to finalize Order resource due to bad request, marking Order as failed" "error"="404 urn:ietf:params:acme:error:malformed: Certificate not found" "resource_kind"="Order" "resource_name"="cluster-api-cert-fbjsx-23424524" "resource_namespace"="ocm-staging-23pt21ffpvgu3hrl4q9fr0vhme8jdsoh" "resource_version"="v1" 

      and then a re-issue is triggered one hour later:

      I0518 13:55:37.000634       1 trigger_controller.go:200] cert-manager/certificates-trigger "msg"="Certificate must be re-issued" "key"="ocm-staging-23pt21ffpvgu3hrl4q9fr0vhme8jdsoh/cluster-api-cert" "message"="Issuing certificate as Secret does not exist" "reason"="DoesNotExist" 

      Ideally issuance never fails, but when it does we should look into retrying sooner than the current one-hour delay, since our install-time SLO is 10 minutes total.
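
The size of the gap can be read directly off the timestamps of the two log lines quoted above. The following is a minimal sketch, not part of cert-manager: it assumes the standard klog header layout (`Lmmdd hh:mm:ss.uuuuuu`), and because that header carries no year, the `year=2023` default is an assumption. The helper name `klog_time` and the constant `INSTALL_SLO` are hypothetical names introduced for illustration, and the log lines are truncated copies of the ones in this report.

```python
import re
from datetime import datetime, timedelta

# klog header: severity letter, month+day, then wall-clock time with microseconds.
KLOG_HEADER = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")

# Install-time SLO stated in this report.
INSTALL_SLO = timedelta(minutes=10)


def klog_time(line: str, year: int = 2023) -> datetime:
    """Parse the timestamp from a klog line. The year is an assumption,
    since klog headers do not include one."""
    match = KLOG_HEADER.match(line.strip())
    if match is None:
        raise ValueError("line does not start with a klog header")
    month_day, clock = match.groups()
    return datetime.strptime(f"{year}{month_day} {clock}", "%Y%m%d %H:%M:%S.%f")


# Truncated copies of the two log lines from this report.
failure_line = 'E0518 12:55:37.906768       1 sync.go:545] cert-manager/orders "msg"="failed to finalize Order resource due to bad request, marking Order as failed"'
reissue_line = 'I0518 13:55:37.000634       1 trigger_controller.go:200] cert-manager/certificates-trigger "msg"="Certificate must be re-issued"'

gap = klog_time(reissue_line) - klog_time(failure_line)
slo_breached = gap > INSTALL_SLO

print(f"time between failure and re-issue: {gap}")
print(f"exceeds the 10-minute install SLO: {slo_breached}")
```

For the two lines above, the gap comes out just under one hour, roughly six times the install-time budget, so a single failed Order is enough to miss the SLO.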


      Ref:

      1. First case: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1684336046316879
      2. Second case: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1684417470777779 

              cdoan@redhat.com Christopher Doan
              mshen.openshift Michael Shen
              David Huynh David Huynh