During the 2022-08-25 OSUS incident, diagnosis was hampered by the lack of access logging in Envoy. There was no way to know where requests were coming from. Therefore, we had a hard time determining whether the increased traffic was from a malicious actor, a bug in the systems legitimately using OSUS, or something else.
Enabling access logging in Envoy is evidently just a configuration change. You'll have to decide how to log: as text, as JSON, to Elasticsearch, etc.
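As a rough illustration of what that configuration change could look like, the sketch below builds an `access_log` fragment for Envoy's HTTP connection manager that writes JSON lines to stdout, and then tallies such lines by client address. It is a minimal sketch, not the OSUS configuration. The JSON key names (`client_address`, `x_forwarded_for`, and so on) and both function names are assumptions made up for this example. The `%...%` operators and the `StdoutAccessLog` type come from Envoy's access logging documentation. If Envoy sits behind a load balancer, the downstream address will be the balancer's, so the `X-Forwarded-For` header is captured as well.

```python
import json
from collections import Counter

# Keys on the left are arbitrary labels chosen for this sketch (assumptions);
# the %...% values are Envoy access-log command operators.
JSON_FORMAT = {
    "start_time": "%START_TIME%",
    "method": "%REQ(:METHOD)%",
    "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
    "response_code": "%RESPONSE_CODE%",
    "duration_ms": "%DURATION%",
    "client_address": "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%",
    "x_forwarded_for": "%REQ(X-FORWARDED-FOR)%",
    "user_agent": "%REQ(USER-AGENT)%",
}


def access_log_config():
    """Return an `access_log` list for Envoy's HTTP connection manager."""
    return [
        {
            "name": "envoy.access_loggers.stdout",
            "typed_config": {
                "@type": "type.googleapis.com/envoy.extensions."
                         "access_loggers.stream.v3.StdoutAccessLog",
                "log_format": {"json_format": JSON_FORMAT},
            },
        }
    ]


def top_clients(log_lines, n=5):
    """Count requests per `client_address` in JSON access-log lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines mixed into the stream
        if isinstance(entry, dict):
            counts[entry.get("client_address", "unknown")] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Envoy accepts JSON as well as YAML, so this output can be merged
    # into the listener's HTTP connection manager configuration.
    print(json.dumps({"access_log": access_log_config()}, indent=2))
```

With logs in this shape, a question like "which clients drove the traffic increase" becomes a matter of feeding the collected lines to `top_clients`. Text format or a remote sink would need a different `typed_config`.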
Envoy access logging has already been enabled for other platforms.... It can be left on permanently, rather than only during an incident. The computational cost of enabling it is unclear.
Slack channel: #incident_osus_high_latency_timeout (link: https://coreos.slack.com/archives/C03UQ5U2CP9)
RCA document: link
cc: lmohanty-ota rvazquez@redhat.com pratikam rhn-it-bhushan rporresm