Databases | June 23, 2026 | 4 min read

Database Consistency via Read-Repair: Solving Cache Inconsistency

Database consistency is often sacrificed for speed, but read-repair patterns help you bridge the gap. Learn how to fix stale cache data without global locks.

Tags: databases, redis, performance, system-design, caching, backend, PostgreSQL, MySQL

Last month, I spent about three days chasing a bug where users were seeing "updated" profile settings that reverted to old values five seconds later. We were relying on a standard write-through cache, but a network blip between our primary database and Redis caused a race condition that left the cache in a perpetually stale state. We didn't want to implement heavy distributed locking, so I refactored our data access layer to use a read-repair pattern.

Understanding the Read-Repair Pattern

In a distributed system, eventual consistency is often the only realistic goal. Instead of trying to force perfect synchronization across your entire stack at the moment of write, you accept that the cache might lag. The read-repair pattern shifts the burden of synchronization to the read path.

When your application requests a record, it doesn't just trust the cache. It performs a lightweight version check or timestamp comparison against the source of truth. If the cache is stale, the application updates it on the fly.
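As a rough illustration of that version check, here is a minimal in-memory sketch. The `Record`, `Source`, and `Cache` types and the `Read` function are hypothetical stand-ins for the database and Redis, not code from our service. The idea it shows is that the read path asks the source of truth only for the current version, and reloads the full record only when the cached copy is missing or behind.

```go
package main

import "fmt"

// Record is a hypothetical cached entity carrying an increasing version number.
type Record struct {
	ID      string
	Value   string
	Version int64
}

// Source stands in for the database (the source of truth).
type Source struct {
	rows map[string]Record
}

// Version returns only the version, which is cheaper than loading the full row.
func (s *Source) Version(id string) (int64, bool) {
	r, ok := s.rows[id]
	return r.Version, ok
}

// Load returns the full record.
func (s *Source) Load(id string) (Record, bool) {
	r, ok := s.rows[id]
	return r, ok
}

// Cache stands in for Redis.
type Cache struct {
	entries map[string]Record
}

// Read returns the record for id, repairing the cache if it is missing or stale.
// The second return value reports whether a repair happened.
func Read(c *Cache, s *Source, id string) (Record, bool, error) {
	current, ok := s.Version(id)
	if !ok {
		delete(c.entries, id) // the row is gone, so drop any leftover cache entry
		return Record{}, false, fmt.Errorf("record %q not found", id)
	}
	if cached, hit := c.entries[id]; hit && cached.Version >= current {
		return cached, false, nil // cache is up to date
	}
	fresh, ok := s.Load(id)
	if !ok {
		return Record{}, false, fmt.Errorf("record %q not found", id)
	}
	c.entries[id] = fresh // repair on the read path
	return fresh, true, nil
}

func main() {
	src := &Source{rows: map[string]Record{"42": {ID: "42", Value: "dark-mode", Version: 3}}}
	cache := &Cache{entries: map[string]Record{"42": {ID: "42", Value: "light-mode", Version: 2}}}

	rec, repaired, _ := Read(cache, src, "42")
	fmt.Println(rec.Value, repaired) // prints "dark-mode true": the stale entry was repaired

	rec, repaired, _ = Read(cache, src, "42")
	fmt.Println(rec.Value, repaired) // prints "dark-mode false": served from the repaired cache
}
```

In a real deployment the version lookup would be a narrow query (for example, selecting only a version column by primary key), which is what keeps the check lightweight compared with loading the whole row.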

This is fundamentally different from the approach in "API Design Caching Strategies: Mastering Read-Through and Consistency," where you typically hope the write-side invalidation logic is perfect. With read-repair, you’re building a self-healing mechanism into your queries.

The Implementation Strategy

We were using PostgreSQL 15 and Redis 7. Our initial approach was simple: fetch from cache, and if it's a miss, fetch from DB and write to cache. The problem was that the cache entry remained stale if a write failed to invalidate it correctly.
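To make that failure mode concrete, here is a small sketch of the naive read path using plain maps in place of Redis and PostgreSQL. The `Profile` type and `getCacheAside` function are made up for this illustration. It shows that once an invalidation is lost, every later read keeps returning the old entry.

```go
package main

import "fmt"

// Profile is a hypothetical record type used only for this illustration.
type Profile struct {
	Theme   string
	Version int
}

// getCacheAside is the naive read path: trust any hit, fill the cache on a miss.
func getCacheAside(cache, db map[string]Profile, id string) (Profile, bool) {
	if p, hit := cache[id]; hit {
		return p, true // returned as-is, even if the database has moved on
	}
	p, ok := db[id]
	if ok {
		cache[id] = p
	}
	return p, ok
}

func main() {
	db := map[string]Profile{"u1": {Theme: "light", Version: 1}}
	cache := map[string]Profile{}

	getCacheAside(cache, db, "u1") // miss: fills the cache with version 1

	// A write reaches the database, but the cache invalidation is lost.
	db["u1"] = Profile{Theme: "dark", Version: 2}

	p, _ := getCacheAside(cache, db, "u1")
	fmt.Printf("served version %d, database has version %d\n", p.Version, db["u1"].Version)
	// prints "served version 1, database has version 2"
}
```

Nothing on this read path ever looks at the database again for a key that is already cached, so the entry stays stale until it expires or a later write invalidates it successfully.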

Here is how I adjusted the read logic to ensure better database consistency:


Go
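// redis and db are the application's own wrappers (assumed to return *User);
// redis.Get is assumed to return nil on a cache miss.
func GetUserProfile(ctx context.Context, userID string) (*User, error) {
    key := "user:" + userID

    // 1. Fetch from cache. A cache error is treated as a miss so reads keep working.
    cachedUser, err := redis.Get(ctx, key)
    if err != nil {
        cachedUser = nil
    }

    // 2. Fetch from DB (source of truth)
    dbUser, err := db.Query("SELECT * FROM users WHERE id = $1", userID)
    if err != nil {
        return nil, err
    }

    // 3. Compare and repair: fill on a miss, overwrite when the cached version is behind
    if cachedUser == nil || cachedUser.Version < dbUser.Version {
        redis.Set(ctx, key, dbUser)
        return dbUser, nil
    }

    return cachedUser, nil
}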

This approach adds latency and database load because we hit the database even on a cache hit. To mitigate this, we implemented a "probabilistic" check: we verify the version against the database on only 10% of cache hits, or whenever a "dirty" flag for the key is set in a secondary metadata store. The trade-off is that a stale entry without a dirty flag can survive until a sampled read catches it.
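The sampling decision can be sketched roughly as follows. The `Verifier` type, its fields, and the in-memory `Dirty` map are assumptions made for this example, with the map standing in for the secondary metadata store. The random source is injectable so the behavior can be tested deterministically.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Verifier decides whether a cache hit should be checked against the database.
type Verifier struct {
	SampleRate float64         // fraction of hits to verify, e.g. 0.10
	Dirty      map[string]bool // stand-in for the secondary metadata store
	Rand       func() float64  // returns a value in [0, 1); injectable for tests
}

// ShouldVerify reports whether the read for key must consult the database.
func (v *Verifier) ShouldVerify(key string) bool {
	if v.Dirty[key] {
		return true // a writer flagged this key as possibly stale
	}
	return v.Rand() < v.SampleRate
}

// MarkClean clears the dirty flag once a read has repaired the entry.
func (v *Verifier) MarkClean(key string) {
	delete(v.Dirty, key)
}

func main() {
	v := &Verifier{
		SampleRate: 0.10,
		Dirty:      map[string]bool{"user:7": true},
		Rand:       rand.Float64,
	}

	fmt.Println(v.ShouldVerify("user:7")) // true: dirty keys are always verified

	checked := 0
	for i := 0; i < 10000; i++ {
		if v.ShouldVerify("user:8") {
			checked++
		}
	}
	fmt.Printf("verified %d of 10000 clean hits\n", checked) // roughly 1000
}
```

A read path would call `ShouldVerify` on each cache hit, run the version comparison only when it returns true, and call `MarkClean` after a successful repair.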

Why Avoid Global Locks?

During my investigation, I considered using SELECT ... FOR UPDATE or distributed locks via Redlock. I quickly abandoned that idea. Distributed locks put a lock manager on the hot path of every request; if that lock manager goes down or experiences latency, your entire application grinds to a halt.

By choosing read-repair over locking, we kept our throughput high. We accepted that a user might see stale data for a few hundred milliseconds, but the system would always eventually converge to the correct state without risking a deadlock or a complete service outage.

When to Use This Pattern

You should consider this pattern if:

Your system has high read volume and moderate write volume.
You are seeing intermittent "ghost" data issues after updates.
You want to avoid the complexity of distributed transaction managers.

It’s worth noting that if your application requires strongly consistent results for every single read, this pattern won't suffice. You might need to look into "Database Caching: Implementing Redis Write-Through for Consistency" to tighten your write-side guarantees, but even then, read-repair acts as a useful safety net for when invalidation events are dropped.

Lessons Learned

If I were to rebuild this today, I’d be more careful about the versioning schema. We used a simple integer, but it caused issues when we migrated to a multi-master database setup, where a single incrementing counter is hard to keep consistent across nodes and timestamps turned out to be more reliable for us.
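One possible shape for a timestamp-based version is sketched below. The `Stamp` type and its `Newer` method are hypothetical, and the sketch assumes node clocks are reasonably well synchronized; clock skew can still mis-order writes that land close together. Ties on the timestamp are broken by node ID so that every reader picks the same winner.

```go
package main

import (
	"fmt"
	"time"
)

// Stamp is a hypothetical version marker: last-write time plus the writing node.
type Stamp struct {
	UpdatedAt time.Time
	NodeID    string
}

// Newer reports whether a should replace b.
// Equal timestamps are ordered by NodeID to keep the comparison deterministic.
func (a Stamp) Newer(b Stamp) bool {
	if !a.UpdatedAt.Equal(b.UpdatedAt) {
		return a.UpdatedAt.After(b.UpdatedAt)
	}
	return a.NodeID > b.NodeID
}

func main() {
	t0 := time.Date(2026, 6, 1, 12, 0, 0, 0, time.UTC)
	cached := Stamp{UpdatedAt: t0, NodeID: "node-a"}
	fromDB := Stamp{UpdatedAt: t0.Add(250 * time.Millisecond), NodeID: "node-b"}

	fmt.Println(fromDB.Newer(cached)) // true: the database copy was written later
	fmt.Println(cached.Newer(fromDB)) // false
}
```

In the repair step, a comparison like `dbStamp.Newer(cachedStamp)` would take the place of the integer check `cachedUser.Version < dbUser.Version`.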

Also, don't forget that read-repair isn't a silver bullet for distributed caching issues. It works best when the cost of a slightly stale read is low. If you're dealing with financial transactions, stick to strong consistency models. For user profiles, settings, or content feeds, read-repair is an elegant way to keep things synchronized without the overhead of global locks.

FAQ

Does read-repair increase database load? Yes, it does. By checking the database on a cache hit, you increase the number of read queries. You should use a sampling strategy or a "dirty flag" to verify only when necessary.

How is this different from cache invalidation? Cache invalidation is a proactive attempt to delete or update data when a write occurs. Read-repair is a reactive, defensive mechanism that fixes data when it is accessed.

Can this handle race conditions? It reduces the window of inconsistency significantly. While it doesn't prevent all race conditions, it ensures that the system doesn't stay in an inconsistent state indefinitely.
