OCPBUGS-925 (OpenShift Bugs)

[2109731] - alertmanager-main pods failing to start due to startupprobe timeout


    • Bug
    • Resolution: Done-Errata
    • Undefined
    • 4.11.0
    • 4.10
    • Monitoring
    • None
    • Moderate
    • None
    • False
    • Customer Facing

      Description of problem:
      During a fresh installation on a BareMetal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the "alertmanagers" are not in a ready state (5/6).

      Logs from the alertmanager:

      level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
      level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"
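The two log lines above contain only the startup banner and no error message. When triaging a longer capture, it can help to split such logfmt-style lines into fields so they can be filtered by level or timestamp. The following is a minimal sketch, assuming the lines follow the key=value layout shown above; `parse_logfmt` is a hypothetical helper, not part of Alertmanager or OpenShift.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    # shlex.split keeps quoted values (including spaces) in one token
    # and strips the surrounding quotes.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# First log line from the report, copied verbatim.
line = (
    'level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 '
    'msg="Starting Alertmanager" '
    'version="(version=0.23.0, branch=rhaos-4.10-rhel-8, '
    'revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"'
)
fields = parse_logfmt(line)
print(fields["level"], fields["ts"], fields["msg"])
```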

      The alertmanager-main pods are failing to start due to a startupProbe timeout; this seems related to BZ 2037073.
      We tried to manually increase the startupProbe timers, but this was not possible.
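For context on what increasing the probe timers would change: in Kubernetes, a container has roughly initialDelaySeconds + periodSeconds × failureThreshold seconds to pass its startup probe before the kubelet restarts it. The sketch below computes that budget. The field defaults are the Kubernetes defaults; the second set of values is hypothetical and is not taken from the alertmanager-main StatefulSet, whose actual probe settings are not given in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupProbe:
    """Subset of Kubernetes probe fields that bound the startup window."""
    initial_delay_seconds: int = 0   # Kubernetes default
    period_seconds: int = 10         # Kubernetes default
    failure_threshold: int = 3       # Kubernetes default


def startup_budget_seconds(probe):
    """Approximate time a container gets before the startup probe gives up."""
    return probe.initial_delay_seconds + probe.period_seconds * probe.failure_threshold


# Budget with Kubernetes defaults.
default_budget = startup_budget_seconds(StartupProbe())

# Hypothetical values for illustration only (assumption, not from the report).
hypothetical = StartupProbe(initial_delay_seconds=20, period_seconds=10, failure_threshold=4)
hypothetical_budget = startup_budget_seconds(hypothetical)

print(default_budget, hypothetical_budget)
```

A slow-starting container needs this budget to exceed its real startup time, which is why raising periodSeconds or failureThreshold is the usual first thing to try.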

      Version-Release number of selected component (if applicable):
      OCP 4.10.10

      How reproducible:
      OCP IPI bare-metal install on HPE ProLiant BL460c Gen10. The customer (CU) tried several times to redeploy, always with the same outcome.

      Actual results:
      CMO is not being deployed

      Expected results:
      CMO deploys without errors

      Additional info:

      • CU is deploying OCP 4.10 IPI on a disconnected bare-metal cluster
      • The cluster has 3 nodes with schedulable masters

              sthaha@redhat.com Sunil Thaha
              hongyli@redhat.com Hongyan Li
              Hongyan Li