OpenShift GitOps / GITOPS-7253

Argo CD Agent: Chaos Testing for development/test environments

    • Chaos Testing Methodologies for development/test environments
    • XL
    • False
    • False
    • To Do
    • SECFLOWOTL-145 - Multi-cluster: chaos testing methodologies
    • 0% To Do, 100% In Progress, 0% Done

      Epic Goal

      • The parent feature asks us to 'integrate tests that simulate network conditions which we want to be resilient against.'
      • This epic focuses on the ability to enable and run these simulated network conditions as part of the test/development process.
      • Identify code changes or tools we can use to simulate unreliable network/node conditions
      • Ability to enable chaos testing configuration on the Argo CD agent running in a development/test environment.
        • NOTE: IMHO we don't need to include chaos testing in the tests that run as part of PRs.
      • Ability to run Argo CD agent E2E tests with chaos testing configuration enabled.
        • For example: a good way to check that the Argo CD agent is resilient is to enable a chaotic environment (for example, dropping connections randomly every 30 seconds), and then run the E2E tests in a loop to see that they still pass.
      • New E2E tests for any new scenarios that we want to cover.
      • Out of scope:
        • Providing guidelines for running in production. The parent feature currently also requests 'metrics and best-practice recommendations for various networking scenarios'; that can be handled by a separate epic.
          • But hopefully the work we do as part of this epic will help inform that.
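
The "run E2E tests in a loop" idea could be sketched roughly as below. This is only an illustration: `run_in_loop` is a hypothetical helper, and the command to run is a placeholder, since the actual E2E entry point is not named here.

```python
import subprocess
import sys


def run_in_loop(command, iterations, timeout_seconds=None):
    """Run `command` repeatedly; return the 1-based iterations that failed."""
    failures = []
    for i in range(1, iterations + 1):
        try:
            result = subprocess.run(command, timeout=timeout_seconds)
            ok = result.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False  # a hung run counts as a failure
        if not ok:
            failures.append(i)
    return failures


if __name__ == "__main__":
    # Placeholder: pass the real E2E command on the command line.
    # With no arguments, a no-op Python process stands in for it.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    failed = run_in_loop(command, iterations=10)
    print(f"failed iterations: {failed}")
    sys.exit(1 if failed else 0)
```

Recording which iterations failed, rather than stopping at the first failure, makes it easier to tell a rare flake from a systematic break while the chaos configuration is active.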

      Scenarios

      Brainstorming some types of unreliability we can implement...

      Unreliable TCP/IP connections:

      • Dropping TCP/IP connections every X seconds.
      • Drop connections between:
        • principal <-> agent (autonomous)
        • principal <-> agent (managed)
        • principal <-> k8s API
        • agent <-> k8s API
        • principal <-> Argo CD Redis
        • agent <-> Argo CD Redis
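
One possible way to drop connections on any of these links is to route the traffic through a small TCP proxy that can sever live connections on demand. The sketch below is an assumption about how that might look, not an existing tool: `ChaosProxy`, `pipe`, and `demo` are hypothetical names, and the demo uses a local echo server in place of a real principal, Kubernetes API, or Redis endpoint.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF or a connection error."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except OSError:
        pass  # the peer went away, or drop_all() severed the link
    finally:
        writer.close()


class ChaosProxy:
    """Forward TCP traffic to a target and sever live connections on demand."""

    def __init__(self, target_host, target_port):
        self.target_host = target_host
        self.target_port = target_port
        self._pairs = set()  # (client_writer, upstream_writer) per live link
        self._server = None

    async def start(self, host="127.0.0.1", port=0):
        """Start listening; port=0 lets the OS pick. Returns the bound port."""
        self._server = await asyncio.start_server(self._handle, host, port)
        return self._server.sockets[0].getsockname()[1]

    async def _handle(self, client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                self.target_host, self.target_port)
        except OSError:
            client_writer.close()
            return
        pair = (client_writer, up_writer)
        self._pairs.add(pair)
        try:
            await asyncio.gather(pipe(client_reader, up_writer),
                                 pipe(up_reader, client_writer))
        finally:
            self._pairs.discard(pair)

    def drop_all(self):
        """Abort every live connection; return how many links were dropped."""
        pairs = list(self._pairs)
        for client_writer, up_writer in pairs:
            client_writer.transport.abort()
            up_writer.transport.abort()
        return len(pairs)

    async def drop_every(self, interval_seconds):
        """Sever all live connections every interval_seconds until cancelled."""
        while True:
            await asyncio.sleep(interval_seconds)
            self.drop_all()

    def stop(self):
        self._server.close()
        self.drop_all()


async def demo():
    """Echo through the proxy, drop the link, then reconnect and echo again."""
    backend = await asyncio.start_server(pipe, "127.0.0.1", 0)  # echo server
    proxy = ChaosProxy("127.0.0.1", backend.sockets[0].getsockname()[1])
    proxy_port = await proxy.start()

    async def round_trip(payload):
        reader, writer = await asyncio.open_connection("127.0.0.1", proxy_port)
        writer.write(payload)
        await writer.drain()
        return reader, writer, await reader.readexactly(len(payload))

    reader, writer, first = await round_trip(b"ping")
    dropped = proxy.drop_all()
    try:  # the client sees either EOF or a reset once the link is severed
        severed = await asyncio.wait_for(reader.read(1), timeout=5) == b""
    except ConnectionError:
        severed = True
    writer.close()
    _, writer2, second = await round_trip(b"pong")
    writer2.close()
    proxy.stop()
    backend.close()
    return first, dropped, severed, second


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a test environment, the component under test would be pointed at the proxy's address instead of the real endpoint, and `drop_every(X)` would run as a background task for the duration of the E2E run.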

      Fully disconnected connection:

      • Fully disconnect principal/agent for several minutes, then reconnect.
      • On reconnect, the principal/agent should always move back into sync.
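
The "move back into sync" expectation could be checked with a polling helper along these lines. This is a minimal sketch: `wait_until` is a hypothetical name, and the predicate that decides whether principal and agent are in sync is left to the caller, since how sync is observed is not specified here.

```python
import time


def wait_until(predicate, timeout_seconds, interval_seconds=0.5):
    """Poll predicate() until it returns True or the timeout elapses.

    Returns True if the predicate succeeded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)


if __name__ == "__main__":
    # Stand-in for a real sync check: reports "in sync" on the third poll.
    polls = []

    def in_sync():
        polls.append(1)
        return len(polls) >= 3

    print(wait_until(in_sync, timeout_seconds=5, interval_seconds=0.1))
```

A test would restore the connection, then assert that `wait_until(in_sync, ...)` returns True within whatever recovery window the team considers acceptable.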

      Unreliable process/node:

      • Every X seconds, restart the container running the Argo CD agent or Argo CD.
      • For example:
        • Randomly every X seconds, restart the principal while the autonomous agent is still running.
        • Randomly every X seconds, restart the autonomous agent while the principal is still running.
        • Randomly every X seconds, restart the managed agent while the principal is still running.
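
The random restarts above might be driven by a loop like the following. This is a sketch under assumptions: `chaos_restart_loop` and `restart_command` are hypothetical helpers, the restart is done with `kubectl rollout restart` as one possible mechanism, and the deployment and namespace names in the example are placeholders.

```python
import random
import subprocess
import time


def restart_command(deployment, namespace):
    """Build a kubectl command that restarts the pods of one deployment."""
    return ["kubectl", "rollout", "restart",
            f"deployment/{deployment}", "-n", namespace]


def chaos_restart_loop(targets, restart, rounds, min_seconds, max_seconds,
                       rng=None, sleep=time.sleep):
    """Restart one randomly chosen target per round, after a random delay.

    `restart` is any callable taking a target; returns the targets restarted.
    """
    rng = rng or random.Random()
    history = []
    for _ in range(rounds):
        sleep(rng.uniform(min_seconds, max_seconds))
        target = rng.choice(targets)  # the other components keep running
        restart(target)
        history.append(target)
    return history


if __name__ == "__main__":
    # Placeholder names; substitute the deployments of the test environment.
    namespace = "argocd"

    def kubectl_restart(deployment):
        subprocess.run(restart_command(deployment, namespace), check=True)

    chaos_restart_loop(["principal-placeholder", "agent-placeholder"],
                       kubectl_restart, rounds=3,
                       min_seconds=20, max_seconds=40)
```

Passing a seeded `random.Random` as `rng` makes a failing sequence of restarts reproducible, which helps when a chaos run uncovers a bug.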

      Brainstorming potential tools we can use to implement this.

      Definition of Ready

      • The epic has been broken down into stories.
      • Stories have been scoped.
      • The epic has been stack ranked.

      Definition of Done

      • Code Complete:
        • All code has been written, reviewed, and approved.
      • Tested:
        • Unit tests have been written and passed.
        • Integration tests have been completed.
        • System tests have been conducted, and all critical bugs have been fixed.
        • Tested on OpenShift either upstream or downstream on a local build.
      • Documentation:
        • User documentation or release notes have been written.
      • Build:
        • Code has been successfully built and integrated into the main repository / project.
      • Review:
        • Code has been peer-reviewed and meets coding standards.
        • All acceptance criteria defined in the user story have been met.
        • Tested by reviewer on OpenShift.
      • Deployment:
        • The feature has been deployed on OpenShift cluster for testing.
      • Acceptance:
        • Product Manager or stakeholder has reviewed and accepted the work.

              jpitman63 John Pitman
              jgwest Jonathan West
              Scarlet