  1. OpenShift Pipelines
  2. SRVKP-9107

Add data sanitization for LLM analysis using existing secret hiding capabilities


    • Type: Story
    • Priority: Undefined
    • Resolution: Unresolved
    • Component: Pipelines as Code

      Story (Required)

      As a security-conscious platform administrator trying to prevent sensitive data leakage to external LLM APIs, I want automatic sanitization of secrets and PII before context is sent to LLMs, so that *confidential information never leaves our infrastructure*.

      *This feature leverages Pipelines-as-Code's existing secret sanitization capabilities (used for error snippets and logs) to automatically redact sensitive data before sending pipeline context to external LLM providers.
      All secrets attached to PipelineRuns are automatically detected and replaced, with optional configuration for additional patterns like PII, internal hostnames, and organization-specific sensitive data.*

      Background (Required)

      LLM analysis sends pipeline logs, error messages, commit diffs, and other context to external APIs (OpenAI, Gemini). Without sanitization, this creates risks:

      • Secrets (API keys, tokens, passwords) could be leaked to external providers
      • PII (emails, IP addresses) sent to third-party APIs
      • Internal infrastructure details exposed (hostnames, database URLs)
      • Compliance violations (GDPR, SOC2, industry regulations)
      • Customer data accidentally shared externally

      Pipelines-as-Code already has robust secret sanitization in pkg/secrets/ that:

      • Extracts all secrets attached to PipelineRuns
      • Replaces secret values with *** in logs and error messages
      • Handles edge cases like secrets with common prefixes (sorts by longest first)

      This story extends that capability to sanitize LLM context before sending to external APIs.
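      The longest-first ordering mentioned above matters when one secret value is a prefix of another: replacing the shorter value first would leave the tail of the longer one exposed. A minimal sketch of that behavior (illustrative only, not the actual pkg/secrets implementation):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// replaceSecretsInText redacts every secret value in text, longest value
// first, so a secret that is a prefix of another secret is never
// partially leaked. Sketch only; the real logic lives in pkg/secrets.
func replaceSecretsInText(text string, secrets []string) string {
	sorted := append([]string(nil), secrets...)
	sort.Slice(sorted, func(i, j int) bool {
		return len(sorted[i]) > len(sorted[j])
	})
	for _, s := range sorted {
		if s == "" {
			continue
		}
		text = strings.ReplaceAll(text, s, "***")
	}
	return text
}

func main() {
	// "tok" is a prefix of "token123": replacing "tok" first would
	// leave the residue "***en123" in the output.
	out := replaceSecretsInText("auth=token123 short=tok", []string{"tok", "token123"})
	fmt.Println(out) // auth=*** short=***
}
```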

      Related:

      • Existing secret sanitization: pkg/secrets/secrets.go
      • Current LLM implementation: pkg/llm/analyzer.go

      Out of scope

      • Redaction of secrets NOT attached to PipelineRun (environment-level secrets)
      • Automatic detection of all possible sensitive patterns (users must configure additional patterns)
      • Post-analysis sanitization (only pre-analysis input sanitization)
      • Secret scanning of repository code files (only runtime data like logs)
      • Integration with external secret scanning services

      Approach (Required)

      High-level technical approach:

      • Reuse the existing pkg/secrets/GetSecretsAttachedToPipelineRun to extract all secrets from the PipelineRun
      • Reuse the existing pkg/secrets/ReplaceSecretsInText to sanitize logs, errors, and diffs before LLM analysis
      • Extend sanitization to support additional configurable patterns beyond secrets:
        • PII patterns (emails, IP addresses, credit cards, SSNs)
        • Organization-specific patterns (customer IDs, internal hostnames)
        • Token patterns (JWTs, Bearer tokens not stored in secrets)
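      The additional patterns could be modeled as regex/replacement pairs applied after secret replacement. A hedged sketch of that idea; the pattern names, default regexes, and replacement markers below are placeholders, not the shipped configuration schema:

```go
package main

import (
	"fmt"
	"regexp"
)

// redactionPattern pairs a compiled regex with its replacement marker.
// These defaults are illustrative assumptions for this story.
type redactionPattern struct {
	re          *regexp.Regexp
	replacement string
}

var defaultPatterns = []redactionPattern{
	{regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), "[EMAIL-REDACTED]"},
	{regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`), "[IP-REDACTED]"},
	{regexp.MustCompile(`\bBearer\s+[A-Za-z0-9._~+/-]+=*`), "[TOKEN-REDACTED]"},
}

// applyPatterns runs every configured pattern over the text in order.
func applyPatterns(text string, patterns []redactionPattern) string {
	for _, p := range patterns {
		text = p.re.ReplaceAllString(text, p.replacement)
	}
	return text
}

func main() {
	in := "user alice@example.com from 10.0.0.5 sent Bearer eyJhbGciOi.payload.sig"
	fmt.Println(applyPatterns(in, defaultPatterns))
	// user [EMAIL-REDACTED] from [IP-REDACTED] sent [TOKEN-REDACTED]
}
```

      Running patterns after secret replacement keeps the two stages independent: secrets are exact-string matches, patterns are regexes.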

      • Apply sanitization to all LLM context items:
        • Container logs
        • Error messages
        • Commit diffs
        • PR descriptions
        • Environment variables
      • Provide audit logging showing what was redacted and from where
      • Enable sanitization by default (opt-out intended for testing only)
      • Ensure sanitized data is never stored permanently (held in memory only during analysis)

      The feature builds on proven secret handling already in production for PipelineRun error reporting.
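      Putting the pieces together, the per-item sanitization plus audit counting could look like the sketch below. The item names and audit shape are assumptions for illustration; the real integration would reuse pkg/secrets for extraction and replacement:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeContext redacts secret values from every LLM context item and
// returns a per-item redaction count for audit logging. Sketch only:
// field names and the audit structure are hypothetical.
func sanitizeContext(items map[string]string, secrets []string) (map[string]string, map[string]int) {
	clean := make(map[string]string, len(items))
	audit := make(map[string]int, len(items))
	for name, text := range items {
		count := 0
		for _, s := range secrets {
			if s == "" {
				continue
			}
			count += strings.Count(text, s)
			text = strings.ReplaceAll(text, s, "***")
		}
		clean[name] = text
		if count > 0 {
			audit[name] = count
		}
	}
	return clean, audit
}

func main() {
	items := map[string]string{
		"container-logs": "push with ghp_abc123 ... retry with ghp_abc123",
		"pr-description": "routine dependency bump",
	}
	clean, audit := sanitizeContext(items, []string{"ghp_abc123"})
	fmt.Println(clean["container-logs"]) // push with *** ... retry with ***
	fmt.Printf("redactions: %v\n", audit)
}
```

      Because only the sanitized map is handed to the LLM client and nothing is written to disk, this also satisfies the in-memory-only requirement above.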

      Dependencies

      • Existing pkg/secrets infrastructure for secret extraction and replacement
      • LLM analysis context assembly (pkg/llm/context)
      • Repository CRD must support sanitization configuration
      • May benefit from integration with secret scanning libraries (gitleaks, trufflehog) for pattern detection
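      The Repository CRD dependency implies a sanitization stanza in the repository settings. A hypothetical shape for that configuration; every field name below is a placeholder, since the actual schema is defined by this story:

```yaml
apiVersion: "pipelinesascode.tekton.dev/v1alpha1"
kind: Repository
metadata:
  name: my-repo
spec:
  settings:
    # Hypothetical sanitization settings for illustration only.
    sanitization:
      enabled: true          # default; opt-out intended for testing only
      patterns:
        - name: email
          regex: "[\\w.+-]+@[\\w-]+\\.[\\w.]+"
          replacement: "[EMAIL-REDACTED]"
        - name: customer-id
          regex: "CUST-[0-9]{8}"
          replacement: "[CUSTOMER-ID-REDACTED]"
```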

      Acceptance Criteria (Mandatory)

      Given a PipelineRun with secrets attached via SecretKeyRef, When LLM analysis runs, Then all secret values are replaced with *** in the context sent to LLM

      Given container logs contain a GitHub token, When the logs are sent to LLM, Then the token is redacted and does not appear in the API request

      Given sanitization configuration with email pattern, When error messages contain email addresses, Then emails are replaced with [EMAIL-REDACTED]

      Given multiple secrets with common prefixes, When sanitizing text, Then longest secrets are replaced first to prevent partial leakage

      Given sanitization completes, When reviewing audit logs, Then logs show count of redactions and which patterns matched

      Given sanitization is enabled (default), When any LLM analysis runs, Then automatic secret redaction occurs without additional configuration

      Given organization-specific patterns configured, When internal customer IDs appear in logs, Then they are redacted according to custom rules

      Edge cases to consider:

      • Secrets that appear multiple times in different contexts
      • Very long secrets or large volumes of log data
      • Performance impact of regex pattern matching on large contexts
      • Secrets with special regex characters
      • Base64-encoded secrets that decode to reveal values
      • Partial secret matches or substrings
      • Unicode or non-ASCII characters in secrets

              Assignee: Unassigned
              Reporter: Chmouel Boudjnah (cboudjna@redhat.com)