FlightPath / FLPATH-3194

x2a-convertor migrate agent enters an infinite loop attempting to fix unfixable ansible-lint issues


    • Type: Bug
    • Resolution: Unresolved

      Summary

      The x2a-convertor migrate command enters an infinite loop during Ansible code generation when it encounters ansible-lint issues it cannot resolve. The LLM agent repeatedly rewrites the same file and reruns ansible-lint without ever clearing the issues, until the run aborts on the LangGraph recursion limit of 500.

      Steps to Reproduce

      1. Clone the x2ansible chef-examples repo:
        git clone https://github.com/x2ansible/chef-examples.git
        cd chef-examples
        
      2. Run the init and analyze commands for the nginx-multisite cookbook
      3. Run the migrate command:
        podman run --rm -ti \
          -v $(pwd)/:/app/source:Z \
          -e LLM_MODEL=anthropic.claude-3-7-sonnet-20250219-v1:0 \
          -e AWS_REGION=eu-west-2 \
          -e AWS_BEARER_TOKEN_BEDROCK=$AWS_BEARER_TOKEN_BEDROCK \
          quay.io/x2ansible/x2a-convertor:latest \
          migrate --source-dir /app/source/ --source-technology Chef \
          --high-level-migration-plan migration-plan.md \
          --module-migration-plan migration-plan-nginx-multisite.md \
          "Convert the 'nginx-multisite' module"
        

      Expected Behavior

      The migrate agent should detect when it has repeatedly failed to fix the same ansible-lint issues within a reasonable number of retries (e.g., 3-5 attempts) and then do one of the following:

      • Accept the code with known lint warnings and proceed
      • Report the unfixable issues and move on to the next task
      • Bail out gracefully with a clear error message
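The three fallback options above could be expressed as a small, configurable policy switch. The sketch below is hypothetical: `StuckLintPolicy`, `handle_unfixable_lint`, `MigrationReport` and `LintRetriesExhausted` are illustrative names, not part of x2a-convertor, and it assumes the caller has already decided that the retry budget for the file is spent.

```python
from dataclasses import dataclass, field
from enum import Enum


class StuckLintPolicy(Enum):
    """Hypothetical policy names mirroring the three expected behaviors."""
    ACCEPT_WITH_WARNINGS = "accept"  # keep the file, record the warnings, continue
    REPORT_AND_SKIP = "skip"         # record the issues as unresolved, next task
    ABORT = "abort"                  # stop the run with a clear error


class LintRetriesExhausted(RuntimeError):
    """Raised under the ABORT policy so the run fails fast and readably."""


@dataclass
class MigrationReport:
    """Hypothetical summary of files that finished with unresolved lint issues."""
    accepted_with_warnings: dict[str, list[str]] = field(default_factory=dict)
    skipped: dict[str, list[str]] = field(default_factory=dict)


def handle_unfixable_lint(
    path: str,
    issues: list[str],
    attempts: int,
    policy: StuckLintPolicy,
    report: MigrationReport,
) -> bool:
    """Apply the chosen policy once the retry budget for `path` is spent.

    Returns True if the agent should continue with its remaining tasks;
    raises LintRetriesExhausted under the ABORT policy.
    """
    if policy is StuckLintPolicy.ACCEPT_WITH_WARNINGS:
        report.accepted_with_warnings[path] = list(issues)
        return True
    if policy is StuckLintPolicy.REPORT_AND_SKIP:
        report.skipped[path] = list(issues)
        return True
    raise LintRetriesExhausted(
        f"ansible-lint still reports {len(issues)} issue(s) in {path} "
        f"after {attempts} fix attempts: {', '.join(issues)}"
    )
```

Keeping the policy configurable would let a CI job such as the Jenkins run below choose ABORT for a fast, explicit failure, while an interactive run could accept the warnings instead.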

      Actual Behavior

      The agent gets stuck in a loop on ansible/roles/nginx_multisite/tasks/security.yml:

      1. Generates or rewrites the file using the ansible_write tool
      2. Runs the ansible_lint tool, which finds 2 issues
      3. Rewrites the file to try to fix the issues
      4. Runs ansible_lint again, which still finds 2 issues
      5. Repeats this cycle hundreds of times
      6. Eventually hits the LangGraph recursion limit of 500:
        Error: Recursion limit of 500 reached without hitting a stop condition.
        For troubleshooting, visit: https://docs.langchain.com/oss/python/langgraph/errors/GRAPH_RECURSION_LIMIT
        

      Additionally, ansible-lint itself logs an internal error on every iteration:

      ERROR:ansiblelint.skip_utils: Error trying to append skipped rules
      RuntimeError: Error in matching skip comment to a task
      

      This suggests the remaining lint issues may be caused by an ansible-lint bug in skip-comment handling, in which case the LLM cannot fix them by rewriting the Ansible code.
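If the remaining issues do come from an ansible-lint internal failure, one option is for the lint tool wrapper to recognise that case and tell the agent to stop rewriting. This is a minimal sketch, assuming the lint output reaches the wrapper as plain text; the patterns match only the two log lines quoted above, and `is_lint_tool_failure` and `lint_feedback_for_agent` are hypothetical names rather than existing x2a-convertor functions.

```python
import re

# Patterns derived from the log lines quoted in this issue. Other internal
# ansible-lint failures would need their own patterns; this list is an assumption.
TOOL_ERROR_PATTERNS = (
    re.compile(r"ERROR:ansiblelint\.\S+:"),
    re.compile(r"RuntimeError: Error in matching skip comment to a task"),
)


def is_lint_tool_failure(lint_output: str) -> bool:
    """Return True if the output shows ansible-lint itself failed internally."""
    return any(pattern.search(lint_output) for pattern in TOOL_ERROR_PATTERNS)


def lint_feedback_for_agent(lint_output: str) -> str:
    """Hypothetical wrapper deciding what the LLM sees after an ansible_lint call."""
    if is_lint_tool_failure(lint_output):
        # The problem is in the linter, not the YAML, so discourage more rewrites.
        return (
            "ansible-lint reported an internal error; rewriting this file is "
            "unlikely to help. Stop retrying and report the issue."
        )
    # Ordinary lint findings are passed through unchanged.
    return lint_output
```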

      Impact

      • Wasted LLM API credits: Hundreds of Bedrock API calls for the same failing operation
      • CI pipeline failure: The Jenkins job fails after running for an extended period
      • No usable output: The migration does not complete, so no Ansible code is produced for the module

      Environment

      • x2a-convertor image: quay.io/x2ansible/x2a-convertor:latest
      • LLM Model: anthropic.claude-3-7-sonnet-20250219-v1:0
      • Source cookbook: nginx-multisite from https://github.com/x2ansible/chef-examples.git
      • Jenkins job: flightpath-x2ansible build #7

      Suggested Fix

      Add a maximum retry counter for lint-fix cycles in the migrate agent. After N failed attempts to fix the same lint issues on the same file, the agent should move on rather than retrying indefinitely.
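A minimal sketch of such a counter, keyed on the file path plus the set of reported lint issues so that "the same issues on the same file" can be detected. The `write` and `lint` callables stand in for the ansible_write and ansible_lint tools; `LintRetryGuard`, `fix_until_clean_or_give_up`, and the default limits (3 attempts per identical issue set, 10 per file) are assumptions for illustration, not existing x2a-convertor code.

```python
from collections import defaultdict
from typing import Callable, Iterable


class LintRetryGuard:
    """Count failed lint-fix attempts per (file, issue set) and per file.

    Hypothetical helper: the migrate agent's real state is not shown in this
    issue, so this only illustrates the counting logic.
    """

    def __init__(self, max_attempts: int = 3, max_total: int = 10):
        self.max_attempts = max_attempts  # cap for the same issues on the same file
        self.max_total = max_total        # cap for any issues on the same file
        self._attempts: dict[tuple[str, frozenset[str]], int] = defaultdict(int)
        self._per_file: dict[str, int] = defaultdict(int)

    def record(self, path: str, issues: Iterable[str]) -> bool:
        """Record one failed lint run; return True if another fix is allowed."""
        key = (path, frozenset(issues))
        self._attempts[key] += 1
        self._per_file[path] += 1
        return (
            self._attempts[key] < self.max_attempts
            and self._per_file[path] < self.max_total
        )


def fix_until_clean_or_give_up(
    path: str,
    write: Callable[[str], None],
    lint: Callable[[str], list[str]],
    guard: LintRetryGuard,
) -> list[str]:
    """Rewrite/lint loop that stops once the guard's budget is spent.

    Returns the remaining issues; an empty list means the file is clean.
    """
    while True:
        write(path)
        issues = lint(path)
        if not issues:
            return []
        if not guard.record(path, issues):
            return issues  # give up and hand these to the fallback policy
```

With the defaults, a file that keeps returning the same two issues gets three write/lint cycles before the loop returns them. The per-file total covers files whose issue set changes on every rewrite, which a counter keyed only on identical issues would miss. In the agent itself the counter would presumably need to live in the graph state or the tool wrapper so that it persists across LLM turns.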

              People: Eloy Coto (eloycoto), Gary Harden (gharden1)
              Votes: 0
              Watchers: 2