Why AI Observability Matters
AI systems degrade silently. Without proper monitoring, you’ll notice failures only when users complain. Observability lets you:
- Track cost per query
- Detect performance drift
- Audit output quality
Key metrics to instrument:
- Latency (ms/request)
- Token usage/cost
- Embedding similarity scores
- Error rates
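As a minimal sketch of how three of these metrics (latency, token usage/cost, and errors) could be captured around a single call: the wrapper below is illustrative only. `instrumented_call`, `CallRecord`, `fake_model`, and the price constant are hypothetical names and values, not part of any library, and the model function is assumed to return its text together with a token count.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical price, for illustration only; substitute your provider's real rate.
PRICE_PER_1K_TOKENS_USD = 0.002


@dataclass
class CallRecord:
    latency_ms: float
    tokens: int
    cost_usd: float
    error: Optional[str]


def instrumented_call(model_fn: Callable[[str], Tuple[str, int]], prompt: str) -> CallRecord:
    """Run model_fn(prompt) and record latency, token usage, cost, and any error."""
    start = time.perf_counter()
    tokens, error = 0, None
    try:
        _, tokens = model_fn(prompt)  # model_fn is assumed to return (text, tokens_used)
    except Exception as exc:  # record the failure instead of crashing the caller
        error = repr(exc)
    latency_ms = (time.perf_counter() - start) * 1000
    cost_usd = tokens / 1000 * PRICE_PER_1K_TOKENS_USD
    return CallRecord(latency_ms, tokens, cost_usd, error)


def fake_model(prompt: str) -> Tuple[str, int]:
    # Stand-in for a real LLM client; pretends each word costs one token.
    return "ok", len(prompt.split())


record = instrumented_call(fake_model, "How do I reset my password?")
print(record)
```

Returning a plain record keeps the measurement separate from storage, so the same wrapper can feed a SQLite table, a tracing service, or a metrics endpoint.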
Open-Source Tools
LangSmith (for LLMs)
Trace LLM calls like API requests:
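```python
import uuid

from langsmith import Client

client = Client()  # reads the API key from the environment

# create_run returns None, so generate the run ID up front to reference it later
run_id = uuid.uuid4()
client.create_run(
    name="answer-question",  # example run name
    run_type="llm",
    id=run_id,
    project_name="customer-support",
    inputs={"question": "How do I reset my password?"},
)

# Log outputs/errors later
client.update_run(run_id, outputs={"response": "Visit settings > security..."})
```
**Pros**: Free tier, fine-grained tracing
**Cons**: Only for LLMs; the hosted platform is proprietary (only the SDK is open source)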
Datasette
Analyze SQLite logs with a web UI:
```shell
datasette serve queries.db --metadata metadata.json
```
**Use case**:
- Log all AI agent inputs/outputs to SQLite
- Define canned queries in `metadata.json` to build dashboard-style views
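To make the `metadata.json` step concrete, here is a small sketch that generates such a file with one canned query. It assumes Datasette's canned-query layout (`databases` → database name → `queries`), a database file named `queries.db` (which Datasette exposes as `queries`), and a `logs` table with `input` and `output` columns. The query name `apology_responses` and the titles are made up for illustration, and the layout is worth checking against the Datasette version you run.

```python
import json

# Assumed layout: queries.db is served as the database "queries" and
# contains a `logs` table with `input` and `output` columns.
metadata = {
    "title": "AI agent logs",
    "databases": {
        "queries": {
            "queries": {
                # Hypothetical canned query name, shown as a saved view in the web UI
                "apology_responses": {
                    "title": "Inputs that most often produce an apology",
                    "sql": (
                        "SELECT input, COUNT(*) AS errors FROM logs "
                        "WHERE output LIKE '%sorry%' "
                        "GROUP BY input ORDER BY errors DESC LIMIT 5"
                    ),
                }
            }
        }
    },
}

with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```

Generating the file from Python rather than editing it by hand keeps the SQL in one place if the same query is also used in scripts.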
Prometheus + Grafana
For infrastructure monitoring:
- GPU utilization
- Memory leaks
- API uptime
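Prometheus collects these values by scraping a plain-text endpoint, and Grafana charts what Prometheus stores. The sketch below only illustrates what one scraped sample looks like in the text exposition format; the metric name, label, and reading are hypothetical, and in practice an official client library such as `prometheus_client` would normally produce this output.

```python
from typing import Dict


def render_gauge(name: str, help_text: str, value: float, labels: Dict[str, str]) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{{{label_str}}} {value}\n"
    )


# Hypothetical metric name and reading; a real exporter would query the GPU driver.
sample = render_gauge("gpu_utilization_ratio", "Fraction of GPU in use.", 0.87, {"gpu": "0"})
print(sample)
```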
Enterprise Solutions
| Tool | Best For | Pricing Model |
|------------|-------------------|---------------|
| Arize AI | LLM evals | Per inference |
| WhyLabs | Data drift | Monthly SaaS |
| Datadog | Full-stack | Usage-based |
**Actionable tip**: Start with open-source tools, then move to an enterprise product if you need:
- SOC2 compliance
- Custom SLAs
- Team collaboration
Implementation Example
**Case Study**: Monitoring a customer support bot with Datasette
- Log every interaction:
```python
import sqlite3

conn = sqlite3.connect('queries.db')
# One row per interaction
conn.execute('''CREATE TABLE IF NOT EXISTS logs
    (id TEXT, timestamp DATETIME, input TEXT, output TEXT, tokens INT)''')
```
- Analyze top failure modes:
```sql
-- Treats any output containing "sorry" as a failure (a rough heuristic)
SELECT input, COUNT(*) AS errors
FROM logs
WHERE output LIKE '%sorry%'
GROUP BY input
ORDER BY errors DESC
LIMIT 5;
```
- Set up alerts for anomalies (e.g., token usage spikes).
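One possible way to detect a token usage spike against the `logs` table is sketched below. The rule (newest row exceeds three times the mean of the preceding rows) and the function name `token_spike` are assumptions chosen for illustration, and the demo rows are made-up sample data in an in-memory database.

```python
import sqlite3
from statistics import mean


def token_spike(conn: sqlite3.Connection, window: int = 20, factor: float = 3.0) -> bool:
    """Return True if the newest row's tokens exceed factor x the mean of the prior window."""
    rows = conn.execute(
        "SELECT tokens FROM logs ORDER BY timestamp DESC LIMIT ?", (window + 1,)
    ).fetchall()
    if len(rows) < 2:
        return False  # not enough history to compare against
    latest, history = rows[0][0], [r[0] for r in rows[1:]]
    return latest > factor * mean(history)


# Demo on an in-memory database with the same schema as the case study.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (id TEXT, timestamp DATETIME, input TEXT, output TEXT, tokens INT)"
)
samples = [100, 110, 90, 105, 950]  # the last interaction is an obvious spike
for i, tokens in enumerate(samples):
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?)",
        (str(i), f"2024-01-01T00:00:{i:02d}", "q", "a", tokens),
    )

if token_spike(conn):
    print("ALERT: token usage spike")
```

A check like this can run on a schedule (cron, for example) and post to whatever channel the team already watches.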
**Key takeaway**: Instrument early. Even basic SQL logging beats no visibility.
Final Checklist
- [ ] Log inputs/outputs
- [ ] Track costs per agent
- [ ] Set up 1 critical alert (e.g., error rate >5%)
- [ ] Review logs weekly for drift
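The "error rate >5%" alert from the checklist can start as a few lines of code. This is a sketch under stated assumptions: `error_rate_alert` is a hypothetical helper, each request is reduced to a single failed/succeeded flag, and the sample data is invented.

```python
from typing import Iterable


def error_rate_alert(outcomes: Iterable[bool], threshold: float = 0.05) -> bool:
    """Return True when the share of failed requests exceeds the threshold.

    outcomes: one boolean per request, True meaning the request failed.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return False  # no traffic, nothing to alert on
    return sum(outcomes) / len(outcomes) > threshold


# Made-up sample: 3 failures out of 40 requests is 7.5%, above the 5% threshold.
recent = [True] * 3 + [False] * 37
print(error_rate_alert(recent))
```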