Mahamudul Hasan Rubel
AI/ML | June 23, 2026 | 4 min read

LLM Cost Monitoring: A Guide to Granular Token Accounting

Master LLM cost monitoring by tracking token usage at the feature level. Learn to implement granular accounting and per-feature budgeting for your AI apps.

Tags: LLM, AI, FinOps, Observability, Python, Costs, Infrastructure, RAG, Prompt Engineering

Last month, our cloud bill spiked by nearly 40% because a new "Summarize" feature was being abused by scrapers. We had no idea which feature drove the usage because our logs only showed total tokens spent per API key, not per functional module.

If you’re building production AI apps, you can't rely on high-level dashboard metrics from OpenAI or Anthropic. You need to build your own internal accounting layer to make sense of the chaos.

The Problem with High-Level Observability

When you start integrating LLMs, the first thing you do is log the total token count returned by the API. That’s fine for a prototype, but it fails the moment you have multiple features—like chat, summarization, and data extraction—sharing the same backend infrastructure.

We first tried adding a simple feature_name tag to our logs. It worked until we realized that asynchronous background jobs were firing requests without context, and our middleware was stripping metadata. We were left with "unknown" as our top-spending feature, which is the worst possible answer for a CFO asking where the money went.

Implementing LLM Cost Monitoring

To fix this, we moved away from passive logging and toward an active interceptor pattern. You need a centralized service—or at least a shared library—that wraps your LLM client and forces a context injection.
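One way to make context injection survive asynchronous work is Python's contextvars module. The sketch below is an assumption about how such a layer could look, not the code from our backend: feature_context, require_feature, and fake_llm_call are hypothetical names, and the LLM call is a stand-in.

```python
import asyncio
import contextvars
from contextlib import contextmanager

# Hypothetical context variable holding the feature that owns the current call.
_current_feature = contextvars.ContextVar("current_feature", default=None)


@contextmanager
def feature_context(feature_id):
    """Set the active feature for everything called inside the block."""
    token = _current_feature.set(feature_id)
    try:
        yield
    finally:
        _current_feature.reset(token)


def require_feature():
    """Return the active feature, or fail loudly instead of logging 'unknown'."""
    feature_id = _current_feature.get()
    if feature_id is None:
        raise RuntimeError("LLM call made without a feature context")
    return feature_id


async def fake_llm_call(prompt):
    # Stand-in for a real client call; only the attribution matters here.
    feature_id = require_feature()
    await asyncio.sleep(0)
    return {"feature": feature_id, "prompt": prompt}


async def main():
    with feature_context("summarize"):
        # Tasks created here copy the current context, so the background
        # task still knows which feature it belongs to.
        task = asyncio.create_task(fake_llm_call("tl;dr this page"))
        return await task


result = asyncio.run(main())
```

Raising on a missing context turns "unknown" spend into a failing test instead of a line item. Note that this only covers tasks inside one process; jobs handed to a separate worker process would need the feature_id carried in the job payload.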

In our Python backend (using langchain and openai 1.30.0), we implemented a decorator that tags the usage record of every wrapped call with a feature_id. Here is the gist of how we capture that data:

PYTHON
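import time
from functools import wraps

def track_token_usage(feature_id):
    """Log token usage for every call to the wrapped async LLM function."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            response = await func(*args, **kwargs)
            duration = time.perf_counter() - start

            # Non-streaming chat completions expose token counts on
            # response.usage; it can be missing (for example on streams),
            # so guard against None instead of crashing the request.
            # log_usage_to_db is our persistence helper, defined elsewhere.
            usage = getattr(response, "usage", None)
            if usage is not None:
                log_usage_to_db(
                    feature=feature_id,
                    input_tokens=usage.prompt_tokens,
                    output_tokens=usage.completion_tokens,
                    duration=duration,
                )
            return response
        return wrapper
    return decorator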

Because the decorator takes feature_id as a required argument, no call that goes through it can be logged without attribution. The usage records are then piped into a metrics or analytics store such as Prometheus or ClickHouse.
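The log_usage_to_db helper is not shown above. A minimal sketch of what it could look like, using SQLite as a stand-in for the real store: the table layout and the extra model argument are assumptions added for illustration.

```python
import sqlite3
import time

# In-memory database for illustration; a real deployment would point at
# whatever store the team already runs.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE llm_usage (
        ts            REAL    NOT NULL,
        feature       TEXT    NOT NULL,
        model         TEXT    NOT NULL,
        input_tokens  INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        duration      REAL    NOT NULL
    )
    """
)


def log_usage_to_db(feature, input_tokens, output_tokens, duration, model="unknown"):
    """Insert one usage row; `model` is an assumed extra field for pricing."""
    conn.execute(
        "INSERT INTO llm_usage VALUES (?, ?, ?, ?, ?, ?)",
        (time.time(), feature, model, input_tokens, output_tokens, duration),
    )
    conn.commit()


def tokens_by_feature():
    """Return {feature: (input_tokens, output_tokens)} summed over all rows."""
    rows = conn.execute(
        "SELECT feature, SUM(input_tokens), SUM(output_tokens) "
        "FROM llm_usage GROUP BY feature ORDER BY feature"
    ).fetchall()
    return {feature: (inp, out) for feature, inp, out in rows}


log_usage_to_db("summarize", 1200, 300, 0.84, model="gpt-4o")
log_usage_to_db("summarize", 800, 200, 0.61, model="gpt-4o")
log_usage_to_db("chat", 150, 90, 0.35, model="gpt-4o")
totals = tokens_by_feature()
```

Storing input and output tokens in separate columns, along with the model name, keeps the raw rows usable when prices change later.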

Granular Token Accounting at Scale

Once you have the raw data, you need to turn it into something readable. We use a simple normalization script that calculates cost based on the model version.

For instance, gpt-4o is priced differently from gpt-4-turbo, and input and output tokens are billed at different rates. Your accounting layer must track these prices. Hardcoding them in application code is a mistake; we keep a JSON config file that we update whenever the provider changes its pricing.
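A minimal sketch of that normalization step, assuming a JSON layout with per-million-token prices keyed by model name. The layout, the model names, and the prices are placeholders chosen for the example, not real provider rates.

```python
import json
from decimal import Decimal

# Placeholder prices in USD per 1M tokens. These numbers are made up for
# the example and are NOT real provider prices.
PRICING_JSON = """
{
  "model-large": {"input_per_1m": "10.00", "output_per_1m": "30.00"},
  "model-small": {"input_per_1m": "0.50",  "output_per_1m": "1.50"}
}
"""


def load_pricing(raw):
    """Parse the pricing config, keeping prices as Decimal to avoid float drift."""
    return {
        model: {key: Decimal(value) for key, value in prices.items()}
        for model, prices in json.loads(raw).items()
    }


def cost_usd(pricing, model, input_tokens, output_tokens):
    """Return the cost of one call in USD, failing loudly on unknown models."""
    if model not in pricing:
        raise KeyError(f"no pricing entry for model {model!r}")
    prices = pricing[model]
    total = (
        Decimal(input_tokens) * prices["input_per_1m"]
        + Decimal(output_tokens) * prices["output_per_1m"]
    )
    return total / Decimal(1_000_000)


pricing = load_pricing(PRICING_JSON)
example_cost = cost_usd(pricing, "model-large", 2000, 500)
```

Failing on an unknown model is deliberate in this sketch: a silent zero would hide a newly deployed model from the cost report.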

When you have this granular data, you can start building:

  1. Per-feature budgets: If the "Summarization" feature hits $500/month, we trigger a soft alert.
  2. Unit cost per user: You can finally answer, "How much does it cost us for a user to summarize a 50-page PDF?"
  3. Anomaly detection: A sudden spike in usage for a specific feature is a signal of a bug or a malicious actor, not just a general traffic increase.
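The first and third items can be sketched in a few lines. The budgets, the threshold, and the function names here are hypothetical, and the spike check is a plain z-score over recent daily spend rather than anything more sophisticated.

```python
from statistics import mean, pstdev

# Hypothetical monthly budgets in USD per feature.
BUDGETS_USD = {"summarize": 500.0, "chat": 1200.0}


def budget_alerts(month_to_date_usd, budgets=BUDGETS_USD):
    """Return the features whose month-to-date spend has reached their budget."""
    return sorted(
        feature
        for feature, spent in month_to_date_usd.items()
        if feature in budgets and spent >= budgets[feature]
    )


def is_spike(history, today, threshold=3.0):
    """Flag today's spend if it sits more than `threshold` standard
    deviations above the mean of the trailing daily history."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold


alerts = budget_alerts({"summarize": 512.40, "chat": 300.00})
spike = is_spike([10, 12, 11, 9, 10], today=40)
normal = is_spike([10, 12, 11, 9, 10], today=11)
```

Running the spike check per feature, not on the total bill, is what separates a scraper hitting one endpoint from ordinary traffic growth.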

Connecting Costs to Infrastructure

Tracking tokens is only half the battle. You should also look at "LLM Cost Control: Implementing Per-User Quotas and Rate Limiting" to prevent runaway costs from affecting your overall infrastructure stability. Once you have usage under control, you might also consider "LLM Routing for Production: Dynamic Task Classification & Scaling" to ensure you're using the most cost-effective model for the task at hand.

We’ve found that using cheaper, smaller models for simple classification tasks and reserving the "heavy" models for complex reasoning is the single most effective way to optimize AI spend.
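In its simplest form, that split is a static lookup from task type to model tier. The task labels and model names below are placeholders, and a production router would likely classify tasks dynamically.

```python
# Hypothetical task-to-tier routing table; labels and tiers are assumptions.
TIER_BY_TASK = {
    "classification": "small",
    "extraction": "small",
    "summarization": "large",
    "reasoning": "large",
}

# Placeholder model names, one per tier.
MODEL_BY_TIER = {"small": "small-model", "large": "large-model"}


def pick_model(task_type):
    """Return the model for a task; unknown tasks fall back to the large tier."""
    tier = TIER_BY_TASK.get(task_type, "large")
    return MODEL_BY_TIER[tier]


cheap = pick_model("classification")
heavy = pick_model("reasoning")
fallback = pick_model("something-new")
```

Defaulting unknown tasks to the large tier trades cost for quality; defaulting to the small tier is the opposite bet, and per-feature cost data is what tells you which one is affordable.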

The Reality of Implementation

This isn't a "set it and forget it" system. We still struggle with edge cases. For example, streaming responses make token tracking tricky because you don't know the final token count until the stream finishes. We had to implement a buffer that collects the full response chunks before committing the token count to our database. It adds about 20-30ms of latency, but the accounting accuracy is worth the trade-off.
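For comparison, here is a pass-through variant rather than the full buffering described above: chunks are forwarded as they arrive and usage is committed once the stream is exhausted. It assumes the provider reports usage on a final chunk (OpenAI's Chat Completions API does this when stream_options={"include_usage": True} is set); the Chunk and Usage classes and the fake stream are stand-ins for real client objects.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int


@dataclass
class Chunk:
    text: str
    usage: Optional[Usage] = None  # only set on the final chunk


recorded = []  # stand-in for the usage database


def log_usage_to_db(feature, input_tokens, output_tokens):
    recorded.append((feature, input_tokens, output_tokens))


async def track_stream(feature_id, stream):
    """Yield chunks unchanged and commit usage once the stream is exhausted."""
    usage = None
    async for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage
        yield chunk
    if usage is not None:
        log_usage_to_db(feature_id, usage.prompt_tokens, usage.completion_tokens)


async def fake_stream():
    # Simulated provider stream: two text chunks, then a usage-only chunk.
    yield Chunk("Hello")
    yield Chunk(", world")
    yield Chunk("", usage=Usage(prompt_tokens=12, completion_tokens=3))


async def main():
    return "".join([c.text async for c in track_stream("chat", fake_stream())])


text = asyncio.run(main())
```

The catch is that a consumer who stops reading early never reaches the usage chunk, so nothing is logged for that call. Buffering avoids that gap at the price of the extra latency.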

I’m still not satisfied with how we handle retries. Currently, if an API call fails and we retry, our system counts both the failed tokens and the successful ones as separate events. It’s an area I plan to refactor next month to ensure we only bill for successful completions.
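One possible direction for that refactor, sketched under assumptions: give every logical request one ID, record each attempt against it, and separate successful from failed tokens at reporting time. The names call_with_retries and summarize_attempts are hypothetical, and the call is simulated.

```python
import uuid
from collections import defaultdict

attempt_log = []  # stand-in for the usage database


def call_with_retries(call, max_attempts=3):
    """Run `call` until it succeeds, logging every attempt under one request ID.

    `call` is assumed to return a dict with the keys "ok", "input_tokens",
    and "output_tokens".
    """
    request_id = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        result = call()
        attempt_log.append({"request_id": request_id, "attempt": attempt, **result})
        if result["ok"]:
            break
    return request_id


def summarize_attempts(log):
    """Collapse attempts into one row per request, splitting tokens from
    successful attempts ("billable") and failed ones ("wasted")."""
    summary = defaultdict(lambda: {"billable": 0, "wasted": 0, "attempts": 0})
    for row in log:
        entry = summary[row["request_id"]]
        tokens = row["input_tokens"] + row["output_tokens"]
        entry["billable" if row["ok"] else "wasted"] += tokens
        entry["attempts"] += 1
    return dict(summary)


# Simulated flaky call: the first attempt fails, the second succeeds.
_results = iter([
    {"ok": False, "input_tokens": 100, "output_tokens": 0},
    {"ok": True, "input_tokens": 100, "output_tokens": 40},
])
request_id = call_with_retries(lambda: next(_results))
summary = summarize_attempts(attempt_log)
```

Keeping failed attempts in their own column, rather than dropping them, preserves the option to reconcile internal numbers against the provider invoice.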

Building a robust system for LLM cost monitoring is effectively a FinOps exercise. You’re moving from "we hope this doesn't cost too much" to "we know exactly what every feature costs to run." It’s tedious work, but it’s the only way to scale an AI-powered product without bleeding cash.
