Bug
Resolution: Unresolved
Major
None
4.22.0
None
We continue to see an outage similar to OCPBUGS-67318: at the exact same point in testing, we seem to lose the apiserver. We see disruption, CPU spikes (mostly iowait), etcd thrashing, and random tests starting to fail. It's much less common now after the revert for the other bug, but it's still happening, and it didn't really happen before.
It still shows up only in the same micro upgrade job. It's rare, but it's out there.
Using the above 4.21 run, I see node_vmstat_pgmajfault spike at the time of the outage, which again fits our theory that etcd is choking on memory and getting swapped out.
etcd_mvcc_db_total_size_in_bytes / 1024 / 1024
This shows a big, immediate spike from about 140MB to 225MB at the time of our outage, a jump of roughly 60% in etcd database size.
I got suspicious about the Feature:ProjectAPI tests, which have been showing up as flaky for a long time. Claude thinks they could explain this. I will post its analysis in a comment, as I don't yet know if it's BS.
I'm going to try using this increase in db size as a way to pin down which test is causing it, starting with the ProjectAPI tests. I can see the spike even in successful job runs, just not quite as big. There's an even bigger spike much earlier in all job runs, but we'll deal with that later.