Metrics Matter: Log What the LLM Says

AI automation may impress in a demo, but it often fails in production, not because the model is wrong, but because no one knows what it said when it mattered. If you don’t log LLM behavior, your automation is a black box.

In this article, you’ll implement a simple, powerful step toward observability: log every LLM interaction as structured JSON to a file.

Why This Matters

When a failure happens in production, these are the questions you’ll want to answer:

  • What was the user asking?
  • Which prompt template ran?
  • What did the LLM respond with?
  • Was that response slow, expensive, or weird?
  • Has this happened before?

If you can’t answer those questions, you can’t fix the system. But if you log every interaction with the right metadata, you’ll see patterns, fix bugs, and even improve model performance over time.

What to Do

At minimum, log:

  • Timestamp
  • Prompt template (or at least its name)
  • User input
  • Full prompt (after templating)
  • Model response
  • Session ID
  • Latency
  • Token usage
  • Success/failure flag

Keep it structured so it’s easy to query later.
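
For instance, if each interaction is written as one JSON object per line (as the code example below does), a few lines of Python are enough to answer questions like "which calls were slowest?" Here's a minimal sketch; it assumes a file named llm_interactions.log with one JSON record per line and the latency_seconds and status fields listed above.

import json

# Load one JSON record per line, skipping blank lines
with open('llm_interactions.log') as f:
    records = [json.loads(line) for line in f if line.strip()]

# Show the five slowest calls
slowest = sorted(records, key=lambda r: r.get('latency_seconds', 0), reverse=True)[:5]
for r in slowest:
    print(r['timestamp'], r['status'], r.get('latency_seconds'))

# Count failures across the whole log
failures = sum(1 for r in records if r.get('status') == 'failure')
print(f"Failures: {failures} of {len(records)} calls")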

Production Tip


Logging in structured JSON format makes it easier to send data to log analysis tools like Logstash, Fluentd, or Datadog later. Even if you’re starting small, use structured logs from day one. And instead of printing to the console, write logs to a dedicated log file in append mode.

Code Example: Log LLM Prompts and Responses to a File


Here’s a minimal working version that writes LLM interactions to a log file.

import json
import logging
import time
import uuid

from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

# Set up logging to a file with structured JSON format
logging.basicConfig(
    filename='llm_interactions.log',
    level=logging.INFO,
    format='%(message)s',
    filemode='a'  # Append mode
)

def call_llm(prompt_template, user_input):
    session_id = str(uuid.uuid4())
    start_time = time.time()

    prompt = prompt_template.format(user_input=user_input)

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        latency = round(time.time() - start_time, 2)
        completion = response.choices[0].message.content
        tokens_used = response.usage.total_tokens

        log_data = {
            "timestamp": time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
            "session_id": session_id,
            "prompt_template": "basic_question",
            "input": user_input,
            "prompt": prompt,
            "response": completion,
            "tokens": tokens_used,
            "latency_seconds": latency,
            "status": "success"
        }

        logging.info(json.dumps(log_data))
        return completion

    except Exception as e:
        latency = round(time.time() - start_time, 2)
        error_log = {
            "timestamp": time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
            "session_id": session_id,
            "prompt_template": "basic_question",
            "input": user_input,
            "error": str(e),
            "latency_seconds": latency,
            "status": "failure"
        }

        logging.error(json.dumps(error_log))
        return "Something went wrong."

# Example usage
response = call_llm("What's a short summary of this question: '{user_input}'", "How do LLMs differ from traditional ML?")
print(response)

Example Output


Here’s an example of what one of those log entries might look like inside llm_interactions.log:

{
  "timestamp": "2025-08-03T11:52:00Z",
  "session_id": "8b96fbe4-3925-4f85-9a8e-1f746d71f558",
  "prompt_template": "basic_question",
  "input": "How do LLMs differ from traditional ML?",
  "prompt": "What's a short summary of this question: 'How do LLMs differ from traditional ML?'",
  "response": "LLMs use massive text data and transformer architecture to generate language-based outputs, unlike traditional ML models which are typically trained for specific structured tasks.",
  "tokens": 78,
  "latency_seconds": 1.87,
  "status": "success"
}

Going Further

As you scale:

  • Pipe logs to Logstash, Fluentd, or a cloud logging service.
  • Add request IDs to correlate logs across systems.
  • Set alerts for latency spikes or error patterns (see the sketch after this list).
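
As a starting point for that last bullet, here's a rough sketch of a check you could run on a schedule. It assumes the llm_interactions.log format from the example above, and the 10% error-rate and 5-second p95 thresholds are arbitrary placeholders to tune for your workload.

import json

# Load the structured log produced by the example above
with open('llm_interactions.log') as f:
    records = [json.loads(line) for line in f if line.strip()]

if records:
    error_rate = sum(r.get('status') == 'failure' for r in records) / len(records)
    latencies = sorted(r.get('latency_seconds', 0) for r in records)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]

    # Placeholder thresholds; wire these up to your alerting tool of choice
    if error_rate > 0.10:
        print(f"ALERT: error rate {error_rate:.0%} exceeds 10%")
    if p95 > 5.0:
        print(f"ALERT: p95 latency {p95:.2f}s exceeds 5s")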

Final Thought

Observability isn’t just for engineers. It’s for AI, too. Logging LLM interactions helps your team detect bugs, optimize prompts, control costs, and prevent surprises. It’s one of the lowest-effort, highest-impact improvements you can make today.

So don’t guess what your AI is doing. Log it.
