
How to Measure AI Efficiency in a Team? Tools for Observability

A practical overview of tools for monitoring the costs and performance of AI agents in your workflow.

Why AI Observability Matters

AI systems degrade silently. Without proper monitoring, you’ll notice failures only when users complain. Observability lets you:

  • Track cost per query
  • Detect performance drift
  • Audit output quality

Key metrics to instrument:

  • Latency (ms/request)
  • Token usage/cost
  • Embedding similarity scores
  • Error rates
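As a rough illustration, most of the metrics above can be derived from per-request records. The sketch below assumes a hypothetical `RequestRecord` shape and made-up token prices; embedding similarity is left out because it depends on the embedding model in use.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015


@dataclass
class RequestRecord:
    """One logged request (illustrative shape, not a library type)."""
    latency_ms: float
    input_tokens: int
    output_tokens: int
    failed: bool


def request_cost(rec: RequestRecord) -> float:
    """Cost of one request in USD under the assumed prices."""
    return (rec.input_tokens / 1000 * PRICE_PER_1K_INPUT
            + rec.output_tokens / 1000 * PRICE_PER_1K_OUTPUT)


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate latency, token usage, cost per query, and error rate."""
    if not records:
        raise ValueError("no records to summarize")
    return {
        "avg_latency_ms": mean(r.latency_ms for r in records),
        "total_tokens": sum(r.input_tokens + r.output_tokens for r in records),
        "cost_per_query_usd": mean(request_cost(r) for r in records),
        "error_rate": sum(r.failed for r in records) / len(records),
    }


records = [
    RequestRecord(latency_ms=420.0, input_tokens=1000, output_tokens=2000, failed=False),
    RequestRecord(latency_ms=580.0, input_tokens=1000, output_tokens=0, failed=True),
]
summary = summarize(records)
print(summary)
```

Keeping the raw records and computing aggregates on demand makes it easy to slice the same data per agent or per day later.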

Open-Source Tools

LangSmith (for LLMs)

Trace LLM calls like API requests:

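import uuid
from datetime import datetime, timezone

from langsmith import Client

client = Client()  # reads the API key from the environment
run_id = uuid.uuid4()  # create_run returns None, so generate the ID up front
client.create_run(
    id=run_id,
    name="support-bot-answer",  # illustrative run name
    run_type="llm",
    project_name="customer-support",
    inputs={"question": "How do I reset my password?"},
)
# Log outputs/errors later; end_time marks the run as finished
client.update_run(
    run_id,
    outputs={"response": "Visit settings > security..."},
    end_time=datetime.now(timezone.utc),
)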

**Pros**: Free tier, fine-grained tracing

**Cons**: Only for LLMs. The hosted platform is proprietary; only the client SDK is open source.

Datasette

Analyze SQLite logs with a web UI:

datasette serve queries.db --metadata metadata.json

**Use case**:

  1. Log all AI agent inputs/outputs to SQLite
  2. Define canned queries in `metadata.json` for recurring reports (charts and dashboards need a plugin)
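A minimal sketch of generating such a `metadata.json`, assuming the database file is `queries.db` (so Datasette names the database `queries`) and that it contains a `logs` table with `timestamp`, `input`, `output`, and `tokens` columns. The query name `recent_interactions` and the titles are illustrative.

```python
import json
from pathlib import Path

# Canned queries are keyed by database name; Datasette derives "queries"
# from the file name queries.db.
metadata = {
    "title": "AI agent logs",
    "databases": {
        "queries": {
            "queries": {
                # Hypothetical canned query; rename and adjust to taste.
                "recent_interactions": {
                    "title": "Most recent interactions",
                    "sql": (
                        "SELECT timestamp, input, output, tokens "
                        "FROM logs ORDER BY timestamp DESC LIMIT 50"
                    ),
                }
            }
        }
    },
}

metadata_json = json.dumps(metadata, indent=2)

if __name__ == "__main__":
    # Write the file next to queries.db so the serve command can pick it up.
    Path("metadata.json").write_text(metadata_json, encoding="utf-8")
```

With this file in place, the canned query should appear as a named, linkable page under the `queries` database in the Datasette UI.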

Prometheus + Grafana

For infrastructure monitoring:

  • GPU utilization
  • Memory leaks
  • API uptime
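Prometheus scrapes metrics as plain text, so a signal such as API uptime can be exposed without much machinery. The sketch below renders a single gauge in the Prometheus text exposition format using only the standard library. The metric name `api_up` and the `service` label are assumptions for illustration, and in practice the official client library would usually handle this.

```python
from typing import Optional


def escape_label(value: str) -> str:
    """Escape backslashes, quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, value: float,
                 labels: Optional[dict[str, str]] = None) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{escape_label(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


# Hypothetical health-check result: 1.0 means the API answered, 0.0 means it did not.
text = render_gauge(
    "api_up",
    "Whether the API answered the last health check (1 = up).",
    1.0,
    labels={"service": "support-bot"},
)
print(text)
```

Serving this text from a `/metrics` endpoint is what lets Prometheus collect it, and Grafana can then chart the resulting series.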

Enterprise Solutions

| Tool     | Best For   | Pricing Model |
|----------|------------|---------------|
| Arize AI | LLM evals  | Per inference |
| WhyLabs  | Data drift | Monthly SaaS  |
| Datadog  | Full-stack | Usage-based   |

**Actionable tip**: Start with open-source, then scale to enterprise if you need:

  • SOC2 compliance
  • Custom SLAs
  • Team collaboration

Implementation Example

**Case Study**: Monitoring a customer support bot with Datasette

  1. Log every interaction:
import sqlite3

conn = sqlite3.connect('queries.db')
# One row per interaction; tokens holds the token count for that call
conn.execute('''CREATE TABLE IF NOT EXISTS logs
             (id TEXT, timestamp DATETIME, input TEXT, output TEXT, tokens INT)''')
conn.commit()
  2. Analyze top failure modes:
-- Treat replies containing "sorry" as a rough proxy for failures
SELECT input, COUNT(*) AS errors
FROM logs
WHERE output LIKE '%sorry%'
GROUP BY input
ORDER BY errors DESC
LIMIT 5;
  3. Set up alerts for anomalies (e.g., token usage spikes).
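The three steps above can be sketched end to end. This is one possible shape, not a prescribed implementation: the helper names (`open_log`, `log_interaction`, `token_spike`), the in-memory database, the sample interactions, and the 3x spike threshold are all assumptions made for illustration.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open the log database and create the logs table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS logs
           (id TEXT, timestamp DATETIME, input TEXT, output TEXT, tokens INT)"""
    )
    return conn


def log_interaction(conn: sqlite3.Connection, user_input: str,
                    output: str, tokens: int) -> None:
    """Step 1: store one interaction with a fresh ID and a UTC timestamp."""
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), datetime.now(timezone.utc).isoformat(),
         user_input, output, tokens),
    )
    conn.commit()


def token_spike(conn: sqlite3.Connection, factor: float = 3.0) -> bool:
    """Step 3: flag the latest call if it used more than `factor` times
    the average token count of all earlier calls."""
    rows = conn.execute("SELECT tokens FROM logs ORDER BY rowid").fetchall()
    if len(rows) < 2:
        return False  # not enough history for a baseline
    *history, latest = [row[0] for row in rows]
    baseline = sum(history) / len(history)
    return latest > factor * baseline


conn = open_log()  # pass 'queries.db' to persist logs for Datasette
log_interaction(conn, "How do I reset my password?", "Visit settings > security", 120)
log_interaction(conn, "Cancel my order", "Sorry, I can't help with that.", 100)
spike_before = token_spike(conn)
log_interaction(conn, "Summarize my account history", "Here is a summary of your account.", 900)
spike_after = token_spike(conn)

# Step 2: the failure-mode query from the case study.
top_failures = conn.execute(
    """SELECT input, COUNT(*) AS errors FROM logs
       WHERE output LIKE '%sorry%'
       GROUP BY input ORDER BY errors DESC LIMIT 5"""
).fetchall()
print(spike_before, spike_after, top_failures)
```

A fixed multiple of the historical average is a crude baseline. It is easy to reason about, but a rolling window or percentile threshold would likely cope better with traffic that changes over time.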

**Key takeaway**: Instrument early. Even basic SQL logging beats no visibility.

Final Checklist

  • [ ] Log inputs/outputs
  • [ ] Track costs per agent
  • [ ] Set up 1 critical alert (e.g., error rate >5%)
  • [ ] Review logs weekly for drift