Metrics Matter: Log What the LLM Says

AI automation may impress in a demo, but it often fails in production, not because the model is wrong, but because no one knows what it said when it mattered. If you don’t log LLM behavior, your automation is a black box.

In this article, you’ll implement a simple, powerful step toward observability: log every LLM interaction as structured JSON to a file.

Why This Matters

When a failure happens in production, these are the questions you’ll want to answer:

  • What was the user asking?
  • Which prompt template ran?
  • What did the LLM respond with?
  • Was that response slow, expensive, or weird?
  • Has this happened before?

If you can’t answer those questions, you can’t fix the system. But if you log every interaction with the right metadata, you’ll see patterns, fix bugs, and even improve model performance over time.

What to Do

At minimum, log:

  • Timestamp
  • Prompt template (or at least its name)
  • User input
  • Full prompt (after templating)
  • Model response
  • Session ID
  • Latency
  • Token usage
  • Success/failure flag

Keep it structured so it’s easy to query later.
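
For instance, if each interaction is written as one JSON object per line (as the code example below does), a few lines of Python are enough to answer questions like "which calls were slowest?" Here's a minimal sketch; it assumes a file named llm_interactions.log with one JSON record per line and the latency_seconds and status fields listed above.

import json

# Load one JSON record per line, skipping blank lines
with open('llm_interactions.log') as f:
    records = [json.loads(line) for line in f if line.strip()]

# Show the five slowest calls
slowest = sorted(records, key=lambda r: r.get('latency_seconds', 0), reverse=True)[:5]
for r in slowest:
    print(r['timestamp'], r['status'], r.get('latency_seconds'))

# Count failures across the whole log
failures = sum(1 for r in records if r.get('status') == 'failure')
print(f"Failures: {failures} of {len(records)} calls")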

Production Tip


Logging in structured JSON format makes it easier to send data to log analysis tools like Logstash, Fluentd, or Datadog later. Even if you’re starting small, use structured logs from day one. And instead of printing to the console, write logs to a dedicated log file in append mode.

Code Example: Log LLM Prompts and Responses to a File


Here’s a minimal working version that writes LLM interactions to a log file.

import json
import logging
import time
import uuid

from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

# Set up logging to a file with structured JSON format
logging.basicConfig(
    filename='llm_interactions.log',
    level=logging.INFO,
    format='%(message)s',
    filemode='a'  # Append mode
)

def call_llm(prompt_template, user_input):
    session_id = str(uuid.uuid4())
    start_time = time.time()

    prompt = prompt_template.format(user_input=user_input)

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        latency = round(time.time() - start_time, 2)
        completion = response.choices[0].message.content
        tokens_used = response.usage.total_tokens

        log_data = {
            "timestamp": time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
            "session_id": session_id,
            "prompt_template": "basic_question",
            "input": user_input,
            "prompt": prompt,
            "response": completion,
            "tokens": tokens_used,
            "latency_seconds": latency,
            "status": "success"
        }

        logging.info(json.dumps(log_data))
        return completion

    except Exception as e:
        latency = round(time.time() - start_time, 2)
        error_log = {
            "timestamp": time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
            "session_id": session_id,
            "prompt_template": "basic_question",
            "input": user_input,
            "error": str(e),
            "latency_seconds": latency,
            "status": "failure"
        }

        logging.error(json.dumps(error_log))
        return "Something went wrong."

# Example usage
response = call_llm("What's a short summary of this question: '{user_input}'", "How do LLMs differ from traditional ML?")
print(response)

Example Output


Here’s an example of what one of those log entries might look like inside llm_interactions.log:

{
  "timestamp": "2025-08-03T11:52:00Z",
  "session_id": "8b96fbe4-3925-4f85-9a8e-1f746d71f558",
  "prompt_template": "basic_question",
  "input": "How do LLMs differ from traditional ML?",
  "prompt": "What's a short summary of this question: 'How do LLMs differ from traditional ML?'",
  "response": "LLMs use massive text data and transformer architecture to generate language-based outputs, unlike traditional ML models which are typically trained for specific structured tasks.",
  "tokens": 78,
  "latency_seconds": 1.87,
  "status": "success"
}

Going Further

As you scale:

  • Pipe logs to Logstash, Fluentd, or a cloud logging service.
  • Add request IDs to correlate logs across systems.
  • Set alerts for latency spikes or error patterns (see the sketch after this list).
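
As a starting point for that last bullet, here's a rough sketch of a check you could run on a schedule. It assumes the llm_interactions.log format from the example above, and the 10% error-rate and 5-second p95 thresholds are arbitrary placeholders to tune for your workload.

import json

# Load the structured log produced by the example above
with open('llm_interactions.log') as f:
    records = [json.loads(line) for line in f if line.strip()]

if records:
    error_rate = sum(r.get('status') == 'failure' for r in records) / len(records)
    latencies = sorted(r.get('latency_seconds', 0) for r in records)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]

    # Placeholder thresholds; wire these up to your alerting tool of choice
    if error_rate > 0.10:
        print(f"ALERT: error rate {error_rate:.0%} exceeds 10%")
    if p95 > 5.0:
        print(f"ALERT: p95 latency {p95:.2f}s exceeds 5s")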

Final Thought

Observability isn’t just for engineers. It’s for AI, too. Logging LLM interactions helps your team detect bugs, optimize prompts, control costs, and prevent surprises. It’s one of the lowest-effort, highest-impact improvements you can make today.

So don’t guess what your AI is doing. Log it.
