-
Epic
-
Resolution: Obsolete
-
GHA Migration
-
To Do
-
0% To Do, 0% In Progress, 100% Done
Background
The Zuul instance on which we rely for nearly all of our CI has become largely unusable. We lack the resources and knowledge to address this problem, and as such, we are unable to maintain the collections for which we are responsible (for example, see https://github.com/ansible-collections/kubernetes.core/pull/575). This is not a new problem, but it has reached a state that can no longer be ignored. Even if we could find a way to fix the current problems, history strongly suggests we will eventually find ourselves back in a similar scenario.
Our current CI has grown organically over years in an environment that lacked any clear overarching responsibility for the service. It is the product of the hard work of many people trying to solve problems as they have arisen. Unfortunately, this kind of growth has led to the situation we now find ourselves in. Broadly speaking, the problems we face are:
- A lack of knowledge and expertise - Zuul is a complicated tool. We do not have the expertise to address problems with it in a timely manner.
- A high rate of change - The configuration for our CI jobs is frequently subject to change. This makes it harder to keep documentation up to date, and harder for people on the team to keep up with the current state of CI configuration.
- Large variance across repos - The list of which jobs are run, how they are run, and the environment in which they are run varies greatly from repo to repo. This increases the complexity and maintenance burden of CI management.
- Frequent instability - The hosted Zuul service we use frequently suffers from problems such as expired certs, node failures, networking outages, and general performance issues.
In an environment where CI maintenance is a burden shared across the whole team, we have to employ a strategy that makes it easy for everyone to take part. We cannot have a single CI expert, because that position does not exist. Our approach to CI should account for this reality by:
- Offloading the burden of managing infrastructure
- Reducing the rate of change, and better managing how changes are made and communicated
- Minimizing the complexity by standardizing tooling and processes across collections
- Balancing the features we choose to implement in our CI chain with the level of resources we have available
- Writing and maintaining better documentation
In addition to moving our own repos off of Zuul, we need to account for the fact that a large number of community repos, and some supported repos, are also still on Zuul. The Cloud Content team is the de facto maintainer of Zuul at this point. While we cannot be responsible for migrating every repo that currently exists on Zuul, we do need to ensure that we are communicating with those who may be affected by this.
All work should follow the process outlined in the Cloud Content Handbook (https://github.com/ansible-collections/cloud-content-handbook). Specifically, changes to CI should be documented before they are implemented. The next steps document will serve as an overall guide for this CI work.
For reference, the list of repos that need to be migrated:
- https://github.com/ansible-collections/amazon.aws
- https://github.com/ansible-collections/amazon.cloud
- https://github.com/ansible-collections/cloud.common
- https://github.com/ansible-collections/cloud.terraform
- https://github.com/ansible-collections/community.aws
- https://github.com/openshift/community.okd
- https://github.com/ansible-collections/community.vmware
- https://github.com/ansible-collections/kubernetes.core
- https://github.com/ansible-collections/vmware.vmware_rest
- https://github.com/redhat-cop/cloud.aws_ops
- https://github.com/redhat-cop/cloud.aws_troubleshooting
- https://github.com/redhat-cop/cloud.azure_ops
- https://github.com/redhat-cop/cloud.gcp_ops
Definition of Done
- All CI processes that currently run in Zuul for our repos are running on GitHub Actions (GHA)
- Our CI implementation is fully documented
- Ongoing maintenance of Zuul is either transferred to someone else, or those who remain on Zuul have a path forward for migrating off it