Traditional software fails in binary ways. A request returns a 500 or it does not. A database query times out or it completes. The monitoring tools we have built over the past two decades -- Datadog, New Relic, PagerDuty -- are designed for this world. They watch for hard failures: error rates, latency spikes, resource exhaustion.
AI systems fail differently. A language model that returns a 200 OK with confidently wrong information has not failed by any metric your APM tool tracks. A model that produces valid JSON containing a hallucinated statistic will pass every health check. A system that costs three times more than expected because a prompt redesign doubled token consumption will show green across every dashboard until the invoice arrives.
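To make the point concrete, here is a minimal sketch (field names and the response itself are illustrative, not from any real API) of why a structural health check passes even when the content is wrong:

```python
# Hypothetical sketch: a model response that is structurally valid but
# factually wrong. The check below mirrors a typical "is it healthy?"
# probe -- it validates shape, not truth.
import json

response_body = '{"answer": "The Eiffel Tower is 1,024 meters tall.", "confidence": 0.97}'

def passes_health_check(raw: str) -> bool:
    """Naive check: parses as JSON and has the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return "answer" in data and isinstance(data.get("confidence"), (int, float))

print(passes_health_check(response_body))  # True -- yet the answer is wrong
```

Every signal a traditional APM tool sees here is green: the request returned, the payload parsed, the fields are present. The failure lives entirely in the semantics.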
AI observability requires monitoring five dimensions that traditional tools either ignore or handle poorly: latency distribution, token usage, output quality, error classification, and cost per request. Each dimension has its own instrumentation requirements, its own failure signatures, and its own alerting thresholds. This article covers all five, with implementation details for each.
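As a starting point, the five dimensions can be captured in a single per-request record. The sketch below is an assumption about shape, not a prescription: all names are illustrative, the pricing table is a made-up example (real rates vary by model and provider), and quality scoring is left as a field to be filled in by a separate evaluator.

```python
# Minimal sketch of a per-request record covering the five dimensions:
# latency, token usage, output quality, error classification, and cost.
# All identifiers and prices here are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMRequestRecord:
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    quality_score: Optional[float] = None  # filled in later by an async evaluator
    error_class: Optional[str] = None      # e.g. "timeout", "refusal", "schema_violation"

    # Assumed example pricing per 1K tokens -- replace with your provider's rates.
    PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}

    @property
    def cost_usd(self) -> float:
        return (self.prompt_tokens / 1000 * self.PRICE_PER_1K["prompt"]
                + self.completion_tokens / 1000 * self.PRICE_PER_1K["completion"])

record = LLMRequestRecord(model="example-model", latency_ms=820.0,
                          prompt_tokens=1200, completion_tokens=400)
print(f"{record.cost_usd:.4f}")  # 0.0096
```

Emitting one such record per request, then aggregating, is what makes the alerting thresholds discussed below possible: latency becomes a distribution rather than a point, and cost becomes a per-request metric rather than a monthly surprise.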