Master LLM cost monitoring by tracking token usage at the feature level. Learn to implement granular accounting and per-feature budgeting for your AI apps.
Last month, our cloud bill spiked by nearly 40% because a new "Summarize" feature was being abused by scrapers. We had no idea which feature drove the usage because our logs only showed total tokens spent per API key, not per functional module.
If you’re building production AI apps, you can't rely on high-level dashboard metrics from OpenAI or Anthropic. You need to build your own internal accounting layer to make sense of the chaos.
When you start integrating LLMs, the first thing you do is log the total token count returned by the API. That’s fine for a prototype, but it fails the moment you have multiple features—like chat, summarization, and data extraction—sharing the same backend infrastructure.
We first tried adding a simple feature_name tag to our logs. It worked until we realized that asynchronous background jobs were firing requests without context, and our middleware was stripping metadata. We were left with "unknown" as our top-spending feature, which is the worst possible answer for a CFO asking where the money went.
To fix this, we moved away from passive logging and toward an active interceptor pattern. You need a centralized service—or at least a shared library—that wraps your LLM client and forces a context injection.
In our Python backend (using langchain and openai 1.30.0), we implemented a decorator that attaches a feature_id to every request. Here is the gist of how we capture that data:
```python
import time
from functools import wraps


def track_token_usage(feature_id):
    """Attach a feature_id to every LLM call made by the wrapped coroutine."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            response = await func(*args, **kwargs)
            # Extract token counts from the response object
            usage = response.usage
            # log_usage_to_db is our own persistence helper (not shown here)
            log_usage_to_db(
                feature=feature_id,
                input_tokens=usage.prompt_tokens,
                output_tokens=usage.completion_tokens,
                duration=time.perf_counter() - start,
            )
            return response
        return wrapper
    return decorator
```
Because feature_id is a required argument of the decorator, any call that goes through it cannot be logged without a feature attached, so developers can no longer forget to report usage. This data is then piped into a time-series or analytics store such as Prometheus or ClickHouse.
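The background-job problem mentioned earlier (requests firing without context) can also be attacked by carrying the feature in an explicit context rather than in log metadata. The following is a minimal sketch, assuming Python's standard `contextvars` module; `current_feature`, `feature_scope`, and `require_feature` are illustrative names, not part of any library.

```python
import asyncio
import contextvars
from contextlib import contextmanager

# Holds the feature for the current task; None means "nobody set it".
current_feature = contextvars.ContextVar("current_feature", default=None)


@contextmanager
def feature_scope(feature_id):
    """Set the feature for everything called inside the with-block."""
    token = current_feature.set(feature_id)
    try:
        yield
    finally:
        current_feature.reset(token)


def require_feature():
    """Return the active feature, or fail loudly instead of logging 'unknown'."""
    feature_id = current_feature.get()
    if feature_id is None:
        raise RuntimeError("LLM call made outside a feature_scope")
    return feature_id


async def background_job():
    # asyncio tasks copy the context they were created in,
    # so the feature set by the caller is visible here.
    return require_feature()


async def main():
    with feature_scope("summarize"):
        task = asyncio.create_task(background_job())
    return await task


print(asyncio.run(main()))  # prints: summarize
```

Raising an error when no feature is set trades a noisy failure in development for never seeing "unknown" in the cost report. Work handed to a separate process or job queue does not inherit this context, so in that case the feature_id would have to travel in the job payload.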
Once you have the raw data, you need to turn it into something readable. We use a simple normalization script that calculates cost based on the model version.
For instance, gpt-4o is priced differently from gpt-4-turbo, so every usage record also has to store which model served the request. Your accounting layer must be aware of price changes as well. Hardcoding prices is a mistake; we keep a JSON config file that we update whenever the provider releases a pricing change.
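As a rough illustration of that normalization step, here is a sketch that converts token counts to dollars from a JSON price table. The prices are placeholders, not real provider prices, and the key names (`input_per_1m`, `output_per_1m`) and the `cost_usd` helper are assumptions made for this example.

```python
import json
from decimal import Decimal

# Placeholder prices in USD per 1M tokens. These are NOT real provider
# prices; in practice this JSON would be loaded from the config file.
PRICING_JSON = """
{
  "gpt-4o":      {"input_per_1m": "1.00", "output_per_1m": "2.00"},
  "gpt-4-turbo": {"input_per_1m": "3.00", "output_per_1m": "4.00"}
}
"""

PRICING = json.loads(PRICING_JSON)


def cost_usd(model, input_tokens, output_tokens):
    """Convert token counts to dollars using the per-model price table."""
    try:
        price = PRICING[model]
    except KeyError:
        # Fail loudly so a new model never gets silently billed at $0.
        raise ValueError(f"no pricing configured for model {model!r}")
    million = Decimal(1_000_000)
    return (
        Decimal(input_tokens) * Decimal(price["input_per_1m"])
        + Decimal(output_tokens) * Decimal(price["output_per_1m"])
    ) / million


print(cost_usd("gpt-4o", 500_000, 250_000))
```

Prices are stored as strings and parsed into `Decimal` so that summing many small amounts does not accumulate floating-point error.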
When you have this granular data, you can start building on top of it, for example per-feature budgets.
Tracking tokens is only half the battle. You should also look at "LLM Cost Control: Implementing Per-User Quotas and Rate Limiting" to prevent runaway costs from affecting your overall infrastructure stability. Once you have usage under control, you might also consider "LLM Routing for Production: Dynamic Task Classification & Scaling" to ensure you're using the most cost-effective model for the task at hand.
We’ve found that using cheaper, smaller models for simple classification tasks and reserving the "heavy" models for complex reasoning is the single most effective way to optimize AI spend.
This isn't a "set it and forget it" system. We still struggle with edge cases. For example, streaming responses make token tracking tricky because you don't know the final token count until the stream finishes. We had to implement a buffer that collects the full response chunks before committing the token count to our database. It adds about 20-30ms of latency, but the accounting accuracy is worth the trade-off.
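For comparison, here is a sketch of a pass-through variant: instead of holding chunks back, it forwards each one immediately and commits usage only after the stream ends. It assumes the provider reports usage on a final chunk (OpenAI's Chat Completions API can do this when `stream_options={"include_usage": True}` is set; check that your SDK version supports it). `track_stream`, `fake_stream`, and the `record` callback are hypothetical names, and the chunk objects are stand-ins rather than real SDK types.

```python
import asyncio
from types import SimpleNamespace


async def track_stream(stream, feature_id, record):
    """Yield chunks unchanged, then record usage once the stream ends."""
    usage = None
    async for chunk in stream:
        # Assumption: at most one chunk (normally the last) carries usage.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        yield chunk
    if usage is not None:
        record(feature_id, usage.prompt_tokens, usage.completion_tokens)


async def fake_stream():
    """Stand-in for a provider stream; the final chunk holds the usage."""
    yield SimpleNamespace(text="Hel", usage=None)
    yield SimpleNamespace(text="lo", usage=None)
    yield SimpleNamespace(
        text="", usage=SimpleNamespace(prompt_tokens=12, completion_tokens=2)
    )


async def demo():
    recorded = []
    text = ""
    async for chunk in track_stream(
        fake_stream(), "chat", lambda *row: recorded.append(row)
    ):
        text += chunk.text
    return text, recorded


print(asyncio.run(demo()))  # prints: ('Hello', [('chat', 12, 2)])
```

One caveat with this shape: if the consumer abandons the stream early, the code after the loop never runs and nothing is recorded, so interrupted streams would need separate handling.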
I’m still not satisfied with how we handle retries. Currently, if an API call fails and we retry, our system counts both the failed tokens and the successful ones as separate events. It’s an area I plan to refactor next month to ensure we only bill for successful completions.
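One direction that refactor could take, sketched under assumptions rather than taken from a working system: give every logical call a shared request ID, log each attempt against it, and let the billing query count only successful attempts. The `Attempt` record and the `billable_tokens` helper below are hypothetical, and the token numbers are made up for the example.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Attempt:
    request_id: str   # shared by every retry of one logical call
    attempt: int
    succeeded: bool
    input_tokens: int
    output_tokens: int


def billable_tokens(attempts):
    """Sum tokens per logical request, counting only successful attempts."""
    totals = {}
    for a in attempts:
        if not a.succeeded:
            continue
        prev_in, prev_out = totals.get(a.request_id, (0, 0))
        totals[a.request_id] = (
            prev_in + a.input_tokens,
            prev_out + a.output_tokens,
        )
    return totals


request_id = str(uuid.uuid4())
events = [
    Attempt(request_id, 1, False, 900, 0),    # failed first try
    Attempt(request_id, 2, True, 900, 150),   # successful retry
]
print(billable_tokens(events)[request_id])  # prints: (900, 150)
```

Keeping the failed rows in the log, rather than dropping them at write time, means the raw spend on retries stays visible even though it is excluded from per-feature billing.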
Building a robust system for LLM cost monitoring is effectively a FinOps exercise. You’re moving from "we hope this doesn't cost too much" to "we know exactly what every feature costs to run." It’s tedious work, but it’s the only way to scale an AI-powered product without bleeding cash.