  1. OpenShift Pipelines
  2. SRVKP-9107

Add data sanitization for LLM analysis using existing secret hiding capabilities


    • Type: Story
    • Priority: Undefined
    • Resolution: Unresolved
    • Component: Pipelines as Code

      Story (Required)

      As a security-conscious platform administrator trying to prevent sensitive data leakage to external LLM APIs, I want automatic sanitization of secrets and PII before context is sent to LLMs, so that *confidential information never leaves our infrastructure*.

      *This feature leverages Pipelines-as-Code's existing secret sanitization capabilities (used for error snippets and logs) to automatically redact sensitive data before sending pipeline context to external LLM providers.
      All secrets attached to PipelineRuns are automatically detected and replaced, with optional configuration for additional patterns like PII, internal hostnames, and organization-specific sensitive data.*

      Background (Required)

      LLM analysis sends pipeline logs, error messages, commit diffs, and other context to external APIs (OpenAI, Gemini). Without sanitization, this creates risks:

      • Secrets (API keys, tokens, passwords) could be leaked to external providers
      • PII (emails, IP addresses) sent to third-party APIs
      • Internal infrastructure details exposed (hostnames, database URLs)
      • Compliance violations (GDPR, SOC2, industry regulations)
      • Customer data accidentally shared externally

      Pipelines-as-Code already has robust secret sanitization in pkg/secrets/ that:

      • Extracts all secrets attached to PipelineRuns
      • Replaces secret values with *** in logs and error messages
      • Handles edge cases like secrets with common prefixes (sorts by longest first)

      This story extends that capability to sanitize LLM context before sending to external APIs.
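      The longest-first ordering mentioned above matters when one secret value is a prefix of another: replacing the shorter value first would leave the tail of the longer one exposed. A minimal sketch of that behavior (illustrative only, not the actual pkg/secrets implementation):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// replaceSecretsInText redacts every secret value in text, longest value
// first, so a secret that is a prefix of another secret is never
// partially leaked. Sketch only; the real logic lives in pkg/secrets.
func replaceSecretsInText(text string, secrets []string) string {
	sorted := append([]string(nil), secrets...)
	sort.Slice(sorted, func(i, j int) bool {
		return len(sorted[i]) > len(sorted[j])
	})
	for _, s := range sorted {
		if s == "" {
			continue
		}
		text = strings.ReplaceAll(text, s, "***")
	}
	return text
}

func main() {
	// "tok" is a prefix of "token123": replacing "tok" first would
	// leave the residue "***en123" in the output.
	out := replaceSecretsInText("auth=token123 short=tok", []string{"tok", "token123"})
	fmt.Println(out) // auth=*** short=***
}
```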

      Related:

      • Existing secret sanitization: pkg/secrets/secrets.go
      • Current LLM implementation: pkg/llm/analyzer.go

      Out of scope

      • Redaction of secrets NOT attached to PipelineRun (environment-level secrets)
      • Automatic detection of all possible sensitive patterns (users must configure additional patterns)
      • Post-analysis sanitization (only pre-analysis input sanitization)
      • Secret scanning of repository code files (only runtime data like logs)
      • Integration with external secret scanning services

      Approach (Required)

      High-level technical approach:

      • Reuse the existing pkg/secrets/GetSecretsAttachedToPipelineRun to extract all secrets from the PipelineRun
      • Reuse the existing pkg/secrets/ReplaceSecretsInText to sanitize logs, errors, and diffs before LLM analysis
      • Extend sanitization to support additional configurable patterns beyond secrets:
        • PII patterns (emails, IP addresses, credit cards, SSNs)
        • Organization-specific patterns (customer IDs, internal hostnames)
        • Token patterns (JWTs, Bearer tokens not stored in secrets)
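      The additional patterns could be modeled as regex/replacement pairs applied after secret replacement. A hedged sketch of that idea; the pattern names, default regexes, and replacement markers below are placeholders, not the shipped configuration schema:

```go
package main

import (
	"fmt"
	"regexp"
)

// redactionPattern pairs a compiled regex with its replacement marker.
// These defaults are illustrative assumptions for this story.
type redactionPattern struct {
	re          *regexp.Regexp
	replacement string
}

var defaultPatterns = []redactionPattern{
	{regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), "[EMAIL-REDACTED]"},
	{regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`), "[IP-REDACTED]"},
	{regexp.MustCompile(`\bBearer\s+[A-Za-z0-9._~+/-]+=*`), "[TOKEN-REDACTED]"},
}

// applyPatterns runs every configured pattern over the text in order.
func applyPatterns(text string, patterns []redactionPattern) string {
	for _, p := range patterns {
		text = p.re.ReplaceAllString(text, p.replacement)
	}
	return text
}

func main() {
	in := "user alice@example.com from 10.0.0.5 sent Bearer eyJhbGciOi.payload.sig"
	fmt.Println(applyPatterns(in, defaultPatterns))
	// user [EMAIL-REDACTED] from [IP-REDACTED] sent [TOKEN-REDACTED]
}
```

      Running patterns after secret replacement keeps the two stages independent: secrets are exact-string matches, patterns are regexes.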

      • Apply sanitization to all LLM context items:
        • Container logs
        • Error messages
        • Commit diffs
        • PR descriptions
        • Environment variables
      • Provide audit logging showing what was redacted and from where
      • Enable sanitization by default (opt-out intended for testing only)
      • Ensure sanitized data is never stored permanently (held in memory only during analysis)

      The feature builds on proven secret handling already in production for PipelineRun error reporting.
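      Putting the pieces together, the per-item sanitization plus audit counting could look like the sketch below. The item names and audit shape are assumptions for illustration; the real integration would reuse pkg/secrets for extraction and replacement:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeContext redacts secret values from every LLM context item and
// returns a per-item redaction count for audit logging. Sketch only:
// field names and the audit structure are hypothetical.
func sanitizeContext(items map[string]string, secrets []string) (map[string]string, map[string]int) {
	clean := make(map[string]string, len(items))
	audit := make(map[string]int, len(items))
	for name, text := range items {
		count := 0
		for _, s := range secrets {
			if s == "" {
				continue
			}
			count += strings.Count(text, s)
			text = strings.ReplaceAll(text, s, "***")
		}
		clean[name] = text
		if count > 0 {
			audit[name] = count
		}
	}
	return clean, audit
}

func main() {
	items := map[string]string{
		"container-logs": "push with ghp_abc123 ... retry with ghp_abc123",
		"pr-description": "routine dependency bump",
	}
	clean, audit := sanitizeContext(items, []string{"ghp_abc123"})
	fmt.Println(clean["container-logs"]) // push with *** ... retry with ***
	fmt.Printf("redactions: %v\n", audit)
}
```

      Because only the sanitized map is handed to the LLM client and nothing is written to disk, this also satisfies the in-memory-only requirement above.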

      Dependencies

      • Existing pkg/secrets infrastructure for secret extraction and replacement
      • LLM analysis context assembly (pkg/llm/context)
      • Repository CRD must support sanitization configuration
      • May benefit from integration with secret scanning libraries (gitleaks, trufflehog) for pattern detection
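      The Repository CRD dependency implies a sanitization stanza in the repository settings. A hypothetical shape for that configuration; every field name below is a placeholder, since the actual schema is defined by this story:

```yaml
apiVersion: "pipelinesascode.tekton.dev/v1alpha1"
kind: Repository
metadata:
  name: my-repo
spec:
  settings:
    # Hypothetical sanitization settings for illustration only.
    sanitization:
      enabled: true          # default; opt-out intended for testing only
      patterns:
        - name: email
          regex: "[\\w.+-]+@[\\w-]+\\.[\\w.]+"
          replacement: "[EMAIL-REDACTED]"
        - name: customer-id
          regex: "CUST-[0-9]{8}"
          replacement: "[CUSTOMER-ID-REDACTED]"
```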

      Acceptance Criteria (Mandatory)

      Given a PipelineRun with secrets attached via SecretKeyRef, When LLM analysis runs, Then all secret values are replaced with *** in the context sent to LLM

      Given container logs contain a GitHub token, When the logs are sent to LLM, Then the token is redacted and does not appear in the API request

      Given sanitization configuration with email pattern, When error messages contain email addresses, Then emails are replaced with [EMAIL-REDACTED]

      Given multiple secrets with common prefixes, When sanitizing text, Then longest secrets are replaced first to prevent partial leakage

      Given sanitization completes, When reviewing audit logs, Then logs show count of redactions and which patterns matched

      Given sanitization is enabled (default), When any LLM analysis runs, Then automatic secret redaction occurs without additional configuration

      Given organization-specific patterns configured, When internal customer IDs appear in logs, Then they are redacted according to custom rules

      Edge cases to consider:

      • Secrets that appear multiple times in different contexts
      • Very long secrets or large volumes of log data
      • Performance impact of regex pattern matching on large contexts
      • Secrets with special regex characters
      • Base64-encoded secrets that decode to reveal values
      • Partial secret matches or substrings
      • Unicode or non-ASCII characters in secrets

              Assignee: Unassigned
              Reporter: Chmouel Boudjnah (cboudjna@redhat.com)