Story
Resolution: Unresolved
Story (Required)
As a security-conscious platform administrator trying to prevent sensitive data leakage to external LLM APIs, I want automatic sanitization of secrets and PII before sending context to LLMs, so that *confidential information never leaves our infrastructure*.
*This feature leverages Pipelines-as-Code's existing secret sanitization capabilities (already used for error snippets and logs) to automatically redact sensitive data before pipeline context is sent to external LLM providers. All secrets attached to PipelineRuns are automatically detected and replaced, with optional configuration for additional patterns such as PII, internal hostnames, and organization-specific sensitive data.*
Background (Required)
LLM analysis sends pipeline logs, error messages, commit diffs, and other context to external APIs (OpenAI, Gemini). Without sanitization, this creates risks:
- Secrets (API keys, tokens, passwords) could be leaked to external providers
- PII (emails, IP addresses) could be sent to third-party APIs
- Internal infrastructure details (hostnames, database URLs) could be exposed
- Compliance violations (GDPR, SOC 2, industry regulations) could result
- Customer data could be accidentally shared externally
Pipelines-as-Code already has robust secret sanitization in pkg/secrets/ that:
- Extracts all secrets attached to PipelineRuns
- Replaces secret values with *** in logs and error messages
- Handles edge cases like secrets with common prefixes (sorts by longest first)
This story extends that capability to sanitize LLM context before sending to external APIs.
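The longest-first ordering mentioned above can be sketched as below. Note that `replaceSecretsInText` is an illustrative stand-in for `pkg/secrets/ReplaceSecretsInText`; the exact signature is an assumption, not the real API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// replaceSecretsInText mirrors the behavior described for
// pkg/secrets/ReplaceSecretsInText (signature is assumed): secrets are
// sorted longest-first so that a short secret that is a prefix of a
// longer one never leaves a partial value behind.
func replaceSecretsInText(text string, secrets []string) string {
	sort.Slice(secrets, func(i, j int) bool {
		return len(secrets[i]) > len(secrets[j])
	})
	for _, s := range secrets {
		if s == "" {
			continue
		}
		text = strings.ReplaceAll(text, s, "***")
	}
	return text
}

func main() {
	logs := "auth with ghp-token-extended then fallback ghp-token"
	fmt.Println(replaceSecretsInText(logs, []string{"ghp-token", "ghp-token-extended"}))
	// Without longest-first ordering, "ghp-token-extended" would be left
	// as "***-extended", leaking part of the longer secret.
}
```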
Related:
- Existing secret sanitization: pkg/secrets/secrets.go
- Current LLM implementation: pkg/llm/analyzer.go
Out of scope
- Redaction of secrets NOT attached to PipelineRun (environment-level secrets)
- Automatic detection of all possible sensitive patterns (users must configure additional patterns)
- Post-analysis sanitization (only pre-analysis input sanitization)
- Secret scanning of repository code files (only runtime data like logs)
- Integration with external secret scanning services
Approach (Required)
High-level technical approach:
- Reuse the existing pkg/secrets/GetSecretsAttachedToPipelineRun to extract all secrets from the PipelineRun
- Reuse the existing pkg/secrets/ReplaceSecretsInText to sanitize logs, errors, and diffs before LLM analysis
- Extend sanitization to support additional configurable patterns beyond secrets:
  * PII patterns (emails, IP addresses, credit cards, SSNs)
  * Organization-specific patterns (customer IDs, internal hostnames)
  * Token patterns (JWTs, Bearer tokens not stored in secrets)
- Apply sanitization to all LLM context items:
  * Container logs
  * Error messages
  * Commit diffs
  * PR descriptions
  * Environment variables
- Provide audit logging showing what was redacted and from where
- Make sanitization enabled by default (opt-out for testing only)
- Ensure sanitized data is never stored permanently (held only in memory during analysis)
The feature builds on proven secret handling already in production for PipelineRun error reporting.
Dependencies
- Existing pkg/secrets infrastructure for secret extraction and replacement
- LLM analysis context assembly (pkg/llm/context)
- Repository CRD must support sanitization configuration
- May benefit from integration with secret scanning libraries (gitleaks, trufflehog) for pattern detection
Acceptance Criteria (Mandatory)
- Given a PipelineRun with secrets attached via SecretKeyRef, When LLM analysis runs, Then all secret values are replaced with *** in the context sent to the LLM
- Given container logs containing a GitHub token, When the logs are sent to the LLM, Then the token is redacted and does not appear in the API request
- Given a sanitization configuration with an email pattern, When error messages contain email addresses, Then the emails are replaced with [EMAIL-REDACTED]
- Given multiple secrets with common prefixes, When sanitizing text, Then the longest secrets are replaced first to prevent partial leakage
- Given sanitization completes, When reviewing audit logs, Then the logs show the count of redactions and which patterns matched
- Given sanitization is enabled (the default), When any LLM analysis runs, Then automatic secret redaction occurs without additional configuration
- Given organization-specific patterns are configured, When internal customer IDs appear in logs, Then they are redacted according to the custom rules
Edge cases to consider:
- Secrets that appear multiple times in different contexts
- Very long secrets or large volumes of log data
- Performance impact of regex pattern matching on large contexts
- Secrets with special regex characters
- Base64-encoded secrets that decode to reveal values
- Partial secret matches or substrings
- Unicode or non-ASCII characters in secrets
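The "secrets with special regex characters" edge case bites any implementation that feeds raw secret values into a regex engine. In Go, `regexp.QuoteMeta` escapes metacharacters so the secret is matched literally; `redactLiteral` below is an illustrative helper, not an existing function:

```go
package main

import (
	"fmt"
	"regexp"
)

// redactLiteral builds a safe regex from a raw secret value using
// regexp.QuoteMeta, so metacharacters like ( ) . + in the secret are
// matched literally instead of being interpreted as regex syntax.
func redactLiteral(text, secret string) string {
	re := regexp.MustCompile(regexp.QuoteMeta(secret))
	return re.ReplaceAllString(text, "***")
}

func main() {
	// Without QuoteMeta, "p@ss.word(1)" would be an invalid or
	// wrongly-matching pattern because of "(", ")" and ".".
	fmt.Println(redactLiteral("key=p@ss.word(1) ok", "p@ss.word(1)"))
}
```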