Anonymize Logs Before You Regret It
LLM-powered automations often rely on detailed logs for debugging and monitoring. But if you’re logging everything – including user prompts, PII, or internal messages – you may be creating a privacy risk waiting to blow up in your face.
Today’s tip: Anonymize your logs before they go to disk.
Why This Matters
When a user interacts with your AI system, their inputs often include sensitive data: names, email addresses, locations, or confidential business info. If you’re logging that data directly (e.g. prompts or full conversations), you’re opening yourself up to:
- Compliance violations (GDPR, HIPAA, CCPA, etc.)
- Security risks (exposed secrets or tokens)
- Embarrassing breaches (internal data accidentally logged)
None of this means you should stop logging. Detailed logs with the right metadata still help you spot patterns, fix bugs, and improve model performance over time. The goal is to keep that visibility while stripping out the sensitive parts before anything hits disk.
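Secrets are a particularly easy win here, because credential formats are predictable enough to catch with a couple of extra regex patterns. The two patterns below are illustrative only (an OpenAI-style "sk-" key and an AWS access key ID); swap in whatever credential formats your stack actually handles.

import re

# Illustrative patterns only: extend this list to match the credentials
# your own services use.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),  # OpenAI-style API keys
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),     # AWS access key IDs
]

def redact_secrets(text: str) -> str:
    """Replace anything that looks like a credential before it is logged."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED_SECRET]", text)
    return text

print(redact_secrets("Use sk-abc123def456ghi789jkl012 against staging"))
# -> "Use [REDACTED_SECRET] against staging"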
What to Do
Create a log wrapper that sanitizes sensitive fields from structured data like prompts and responses. You can:
- Replace names/emails with placeholder tokens
- Mask or hash session/user IDs (a hashing sketch follows below)
- Redact known PII formats (emails, phone numbers, etc.)
This balances observability with compliance and safety, a must for any production AI stack.
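For the hashing option, a salted hash keeps IDs correlatable across log lines without ever storing the raw values. A minimal sketch, assuming the salt arrives via an environment variable (LOG_HASH_SALT is a name chosen for this example, not anything the standard library requires):

import hashlib
import os

# Salt the hash so pseudonyms can't be reversed by hashing guessed IDs.
# Keep the salt in your secret manager, not in code.
SALT = os.environ.get("LOG_HASH_SALT", "change-me")

def pseudonymize_id(raw_id: str) -> str:
    """Return a stable, non-reversible token for a user or session ID."""
    digest = hashlib.sha256((SALT + raw_id).encode("utf-8")).hexdigest()
    return f"anon_{digest[:12]}"  # short prefix keeps logs readable

print(pseudonymize_id("user_12345"))  # same input -> same token; raw ID never stored

Because the same input always maps to the same token, you can still group one user’s interactions for debugging without the original ID ever touching disk.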
Production Tip
Logging everything might help you debug, but it can also get you sued. Anonymize your logs before storing them.
Code Example
This is a modification of the logging code from the Metrics Matter article. It logs structured prompt/response data, but anonymizes emails and user IDs before writing to disk.
import logging
import json
import re

# Set up log file output
logging.basicConfig(
    filename="llm_output.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)

EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w+\b")
USER_ID_RE = re.compile(r"user_\d+")

def anonymize(data: dict) -> dict:
    data_copy = json.loads(json.dumps(data))  # Deep copy

    # Redact email addresses
    if "prompt" in data_copy:
        data_copy["prompt"] = EMAIL_RE.sub("[REDACTED_EMAIL]", data_copy["prompt"])

    # Redact emails and user IDs in metadata
    if "metadata" in data_copy:
        for key, value in data_copy["metadata"].items():
            if isinstance(value, str):
                value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
                value = USER_ID_RE.sub("[REDACTED_USER_ID]", value)
                data_copy["metadata"][key] = value

    return data_copy

def log_prompt_response(prompt: str, response: str, metadata: dict):
    log_data = {
        "prompt": prompt,
        "response": response,
        "metadata": metadata
    }
    safe_data = anonymize(log_data)
    logging.info(json.dumps(safe_data, indent=2))

# Example usage
log_prompt_response(
    prompt="My email is jane.doe@example.com and I want a summary.",
    response="Summary: You asked for a summary.",
    metadata={"user_id": "user_12345", "session_id": "abc123"}
)
Example Output
Here’s an example of what one of those entries might look like inside llm_output.log (the timestamp and level prefix added by the logging format are omitted here):
{
  "prompt": "My email is [REDACTED_EMAIL] and I want a summary.",
  "response": "Summary: You asked for a summary.",
  "metadata": {
    "user_id": "[REDACTED_USER_ID]",
    "session_id": "abc123"
  }
}
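Redaction logic tends to break quietly when someone edits a regex, so it’s worth keeping a small test next to the wrapper. A minimal sketch, assuming the code above lives in a module called llm_logging (that module name is just for illustration):

import json
from llm_logging import anonymize  # hypothetical module holding the code above

def test_anonymize_strips_pii():
    record = {
        "prompt": "Contact me at jane.doe@example.com",
        "response": "Sure.",
        "metadata": {"user_id": "user_12345", "session_id": "abc123"},
    }
    flattened = json.dumps(anonymize(record))
    assert "jane.doe@example.com" not in flattened
    assert "user_12345" not in flattened
    assert "abc123" in flattened  # non-sensitive fields pass through untouched

Run it with pytest (or any test runner) in CI so a regex change that stops matching real PII fails loudly instead of silently.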
Going Further
As you scale:
- Use more advanced PII detection tools (e.g. presidio, pii-extract)
- Redact phone numbers, addresses, credit cards, etc.
- Normalize log formats using JSON loggers (e.g. structlog, loguru; see the sketch after this list)
- Use log processors like Fluentd or Logstash to enforce sanitization before forwarding logs
- Combine with filtering/alerting from your log aggregation system (see: Log Aggregation & Alerting)
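As one concrete direction, the same redaction logic can run as a structlog processor, so every event gets sanitized on its way out no matter who logged it. A minimal sketch, reusing regex patterns like the ones above (redact_pii and the patterns are this example’s, not part of structlog itself):

import re
import structlog

EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w+\b")
USER_ID_RE = re.compile(r"user_\d+")

def redact_pii(logger, method_name, event_dict):
    """structlog processor: scrub string values in every event before rendering."""
    for key, value in event_dict.items():
        if isinstance(value, str):
            value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
            value = USER_ID_RE.sub("[REDACTED_USER_ID]", value)
            event_dict[key] = value
    return event_dict

structlog.configure(
    processors=[
        redact_pii,  # sanitize first...
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),  # ...then render as JSON
    ]
)

log = structlog.get_logger()
log.info("llm_interaction", prompt="My email is jane.doe@example.com", user_id="user_12345")

Putting redaction in a processor (rather than at each call site) means a new code path can’t quietly bypass it, which is the same idea the Fluentd/Logstash bullet applies one level up, at the log pipeline.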
Final Thought
It only takes one leaked email in a log file to lose user trust. You don’t need perfection, but you do need a plan. Start with anonymization to make your AI project more production-ready.