OpenShift GitOps / GITOPS-6287

OpenShift GitOps server pod is crashing

      Before this update, the Argo CD components that access the Redis instance (server, repo-server, and application controller) could crash during network or DNS instability within the cluster. The cause was a race condition in the go-redis client library, triggered when multiple connections in a connection pool called the dial hook function. This update ships an updated go-redis client library that removes the race condition when calling the dial hook function and gracefully handles and recovers from network and DNS errors in the cluster.

      Description of Problem

      The openshift-gitops-server pod is crashing, and we observe the following error in the logs:

      2025-02-20T14:36:29.472639108+05:30 fatal error: sync: unlock of unlocked mutex 

      The issue appears to be related to the upstream issue reported at https://github.com/argoproj/argo-cd/issues/20824

       

      Slack thread for more context: https://redhat-internal.slack.com/archives/CMP95ST2N/p1739866014459279 

      Definition of Done

      • Code Complete:
        • All code has been written, reviewed, and approved.
      • Tested:
        • Unit tests have been written and passed.
        • Ensure code coverage is not reduced with the changes.
        • Integration tests have been automated.
        • System tests have been conducted, and all critical bugs have been fixed.
        • Tested and merged on OpenShift either upstream or downstream on a local build.
      • Documentation:
        • User documentation or release notes have been written (if applicable).
      • Build:
        • Code has been successfully built and integrated into the main repository / project.
        • Midstream changes (if applicable) are done, reviewed, approved and merged.
      • Review:
        • Code has been peer-reviewed and meets coding standards.
        • All acceptance criteria defined in the user story have been met.
        • Tested by reviewer on OpenShift.
      • Deployment:
        • The feature has been deployed on OpenShift cluster for testing.

              rh-ee-anjoseph Anand Francis Joseph
              rhn-support-jyarora Jyotsana Arora