Uploaded image for project: 'Red Hat Internal Developer Platform'
  1. Red Hat Internal Developer Platform
  2. RHIDP-4642

Audit Log file write errors (e.g., when the audit log volume is full) do not seem to be handled

Create Doc EPIC for Fe...Prepare for Y ReleasePrepare for Z ReleaseRemove QuarterXMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Won't Do
    • Icon: Major Major
    • 1.4
    • 1.3
    • Audit Log
    • None
    • 2
    • False
    • Hide

      None

      Show
      None
    • False
    • RHDH Security 3265

      Description of problem:

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      1. Deploy the app (using the Helm Chart here because it creates an audit log PVC out of the box since RHIDP-3240 (https://github.com/redhat-developer/rhdh-chart/pull/36)), but I guess we'll get the same issue regardless of where the audit log files are stored
      $ git clone https://github.com/redhat-developer/rhdh-chart.git && cd rhdh-chart
      $ helm upgrade --install my-backstage-with-audit-log charts/backstage \
        --set upstream.backstage.appConfig.auditLog.rotateFile.frequency=5m \
        --set global.clusterRouterBase=$(oc get ingress.config.openshift.io/cluster '-o=jsonpath={.spec.domain}')
      1. Wait until the application starts
      2. Check the audit log PVC volume is mounted in the app container, e.g.:
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- df -hT /var/log/redhat-developer-hub/audit
      
      Filesystem     Type  Size  Used Avail Use% Mounted on
      /dev/nvme7n1   ext4  974M   32K  958M   1% /var/log/redhat-developer-hub/audit 
      1. Interact with the application and make sure the audit log volume is being written to, e.g.:
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- ls -lh /var/log/redhat-developer-hub/audit/
      
      total 108K
      drwxrws---. 2 root       1000780000 16K Oct 22 09:02 lost+found
      -rw-rw-r--. 1 1000780000 1000780000 87K Oct 22 09:31 redhat-developer-hub-audit-2024-10-22.log
       
      1. Also make sure audit logs are being displayed in the pod logs
      $ kubectl logs deploy/my-backstage-with-audit-log | grep isAuditLog
      
      [...]
      {"actor":{"actorId":"user:default/rm3l","hostname":"my-backstage-with-audit-log-my-ns.apps.rosa.wg27p-yympk-d2y.1khx.p3.openshiftapps.com","ip":"::ffff:10.128.0.15","userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"},"eventName":"BulkImportFindAllImports","isAuditLog":true,"level":"info","message":"'GET /imports' endpoint hit by user:default/rm3l","meta":{},"plugin":"bulk-import","request":{"body":{},"method":"GET","params":{},"query":{"pagePerIntegration":"1","search":"","sizePerIntegration":"5"},"url":"/api/bulk-import/imports?pagePerIntegration=1&sizePerIntegration=5&search="},"response":{"status":304},"service":"backstage","stage":"completion","status":"succeeded","timestamp":"2024-10-22 09:35:47"}
      [...]
       
      1. Now manually fill the audit log volume with fake data (just to make sure the disk is full). This is done so to simulate a disk being full before the audit log file is rotated with the default settings (midnight local system time), as defined in https://github.com/redhat-developer/rhdh-chart/pull/45/files#diff-8060d9a38501197eddb026fced7fe56f104f4f5143210c6d47781b831a7097f4R50-R52
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- fallocate -l 1G /var/log/redhat-developer-hub/audit/my-big-file        
      
      fallocate: fallocate failed: No space left on device
      command terminated with exit code 1 
      1. Verify that the disk is full
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- df -hT /var/log/redhat-developer-hub/audit
      
      Filesystem     Type  Size  Used Avail Use% Mounted on
      /dev/nvme7n1   ext4  974M  958M     0 100% /var/log/redhat-developer-hub/audit 
      1. Interact with the application and make sure the audit logs are still being displayed in the pod logs
      $ kubectl logs deploy/my-backstage-with-audit-log | grep isAuditLog
      [...]
      {"actor":{"actorId":"user:default/rm3l","hostname":"my-backstage-with-audit-log-my-ns.apps.rosa.wg27p-yympk-d2y.1khx.p3.openshiftapps.com","ip":"::ffff:10.128.0.15","userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"},"eventName":"BulkImportFindImportStatusByRepo","isAuditLog":true,"level":"info","message":"'GET /import/by-repo' endpoint hit by user:default/rm3l","meta":{},"plugin":"bulk-import","request":{"body":{},"method":"GET","params":{},"query":{"defaultBranch":"main","repo":"https://github.com/lemra-org-ent-2/test-starter"},"url":"/api/bulk-import/import/by-repo?repo=https://github.com/lemra-org-ent-2/test-starter&defaultBranch=main"},"response":{"status":304},"service":"backstage","stage":"completion","status":"succeeded","timestamp":"2024-10-22 09:51:10"} 
      
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- date
      Tue Oct 22 09:51:11 UTC 2024
      1. If we look at the audit log file in the volume, we can see that it is no longer being written to (last audit log message was written at 2024-10-22 09:45:57), and there is no new audit log file (actually this would not even be possible because the audit log volume is full due to an unrelated big file)
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- ls -lh /var/log/redhat-developer-hub/audit/
      
      drwxrws---. 2 root       1000780000  16K Oct 22 09:02 lost+found
      -rw-r--r--. 1 1000780000 1000780000 958M Oct 22 09:45 my-big-file
      -rw-rw-r--. 1 1000780000 1000780000 172K Oct 22 09:45 redhat-developer-hub-audit-2024-10-22.log 
      
      $ kubectl exec -it deploy/my-backstage-with-audit-log -c backstage-backend -- tail /var/log/redhat-developer-hub/audit/redhat-developer-hub-audit-2024-10-22.log
      
      [...]
      {"actor":{"actorId":"plugin:bulk-import","hostname":"localhost","ip":"::1","userAgent":"node-fetch/1.0 (+https://github.com/bitinn/node-fetch)"},"eventName":"CatalogLocationFetch","isAuditLog":true,"level":"info","message":"Fetch attempt of locations by plugin:bulk-import succeeded","meta":{},"plugin":"catalog","request":{"body":{},"method":"GET","params":{},"query":{},"url":"/api/catalog/locations"},"response":{"status":200},"service":"backstage","stage":"completion","status":"succeeded","timestamp":"2024-10-22 09:45:57"}
      {"actor":{"actorId":"user:default/rm3l","hostname":"my-backstage-with-audit-log-my-ns.apps.rosa.wg27p-yympk-d2y.1khx.p3.openshiftapps.com","ip":"::ffff:
      •  

      Actual results:

      • New audit log messages no longer written to the file
        • It looks like rotation is never triggered. So the file is never rotated/overwritten even with a frequency set to 5 minutes as above. At least, it was not clear how to check if rotation did happen or not.
        • Worse, I observed that audit logs cease to be written to the file even after the disk is cleaned up and no longer full (after being full once)
      • No warning message in the pod logs that the audit log volume is full

      Expected results:

      • Maybe try to rotate/overwrite the audit log file if possible
      • Audit logs should continue to be written to the file after the disk is cleaned up and now has capacity
      • I would expect at least the application to handle audit log file writing errors and surface the errors in the main container logs?

      Reproducibility (Always/Intermittent/Only Once):

      Always, when the audit log volume is full before the audit log file is rotated (which might happen if the application is being used extensively by users).

      Also, right now, this can be impactful since I think there is currently no way to override the audit log PVC size (and storage class), at least using the Helm Chart.

      Build Details:

      Additional info (Such as Logs, Screenshots, etc):

      Slack thread: https://redhat-internal.slack.com/archives/C06SGFD65NC/p1729160334651819?thread_ts=1729095907.039489&cid=C06SGFD65NC

              rh-ee-jhe Jessica He
              rh-ee-asoro Armel Soro
              Corey Daley
              RHIDP - Security
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated:
                Resolved: