When “sporadic” means “mathematically inevitable”
In RedisStore.php line 185:
[Symfony\Component\Lock\Exception\LockConflictedException]
One line. No stack trace. No context. Just a LockConflictedException showing up in our crunz scheduler logs — every single day. Sometimes once, sometimes a dozen times. No clear pattern, no obvious trigger. A textbook non-deterministic bug.
I ignored it for a while. Then I did the math.
Turns out this wasn’t random at all. It was a probabilistic certainty baked into crunz’s lock refresh mechanism — a bug that hits roughly 1 in every 550 lock operations. With 42 Redis-locked cron jobs running on schedules throughout the day, the question was never if it happens, but how often.
What’s a Lock, and Why Does It Expire?
Permalink to "What’s a Lock, and Why Does It Expire?"If you’re already comfortable with distributed locking and mutual exclusion, skip ahead. For everyone else — a quick mental model.
Think of a lock like a parking meter. You feed it coins (a TTL — time to live), and as long as there’s time on the meter, the parking spot is yours. Nobody else can take it. But the moment it runs out, the spot is fair game — even if you’re still in the store buying groceries.
Redis locks work the same way. Crunz — a PHP cron job scheduler — uses the Symfony Lock component with a RedisStore to prevent overlapping task execution. When a task starts, it acquires a lock with a time limit via ZADD (Redis’s “sorted set add” command — it stores a unique token with a score representing the expiration timestamp). As long as the task keeps refreshing that lock before it expires, it’s safe. But if the refresh comes too late — even by a fraction of a second — Redis considers the lock gone. The next refresh attempt finds an empty spot and throws LockConflictedException.
The twist: crunz doesn’t refresh the lock reliably. It uses a probabilistic approach. More on that in a moment.
Redis Distributed Locks in PHP
Permalink to "Redis Distributed Locks in PHP"A distributed lock with Redis ensures that only one process runs a critical section at a time — even across multiple servers. In PHP, the Symfony Lock component makes this straightforward. Here’s a basic Redis lock example:
use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\Store\RedisStore;

$redis = new \Predis\Client(['host' => 'localhost', 'port' => 6379]);
$store = new RedisStore($redis);
$factory = new LockFactory($store);

$lock = $factory->createLock('my-task', ttl: 300);

if ($lock->acquire()) {
    try {
        // critical section — only one process runs this
        doExpensiveWork();
    } finally {
        $lock->release();
    }
}
Under the hood, RedisStore uses ZADD with a unique token and a TTL. The lock lives as a Redis sorted set member — when the TTL expires, the member is removed and the lock is gone. For long-running tasks, you call $lock->refresh() to extend the TTL before it runs out.
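Here’s a sketch of what that refresh pattern looks like for a long-running job. The chunked $workItems loop, the processItem() helper, and the 10-second cushion are placeholders for illustration, not anything Symfony or crunz prescribes; only createLock(), acquire(), getRemainingLifetime(), refresh(), and release() are real Symfony Lock API:

$lock = $factory->createLock('nightly-import', ttl: 30);

if ($lock->acquire()) {
    try {
        foreach ($workItems as $item) { // hypothetical chunked workload
            processItem($item);         // hypothetical unit of work

            // Top up the lock well before the 30s TTL runs out.
            $remaining = $lock->getRemainingLifetime();
            if ($remaining !== null && $remaining < 10) {
                $lock->refresh(); // resets the TTL back to the original 30s
            }
        }
    } finally {
        $lock->release();
    }
}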
This is exactly what crunz does. But its refresh mechanism has a problem.
Crunz: PHP’s Cron Job Scheduler
Permalink to "Crunz: PHP’s Cron Job Scheduler"Crunz is a PHP cron job scheduler that lets you define scheduled tasks in code instead of crontab files. You write task definitions in PHP, and a single cron entry (* * * * * crunz schedule:run) executes them on their configured schedules.
$schedule->run('php artisan reports:generate')
    ->daily()
    ->at('03:00')
    ->preventOverlapping($redisStore);
That preventOverlapping() call is the key. It tells the crunz scheduler to acquire a distributed lock before running the task — so if the same crunz cron job is still running from the previous schedule, the new instance backs off instead of creating duplicates or race conditions.
The lock store can be anything Symfony Lock supports: Redis, filesystem (FlockStore), PostgreSQL, or Memcached. Our application uses Redis for 42 of its 43 scheduled tasks. And that’s where the trouble starts.
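For context, here’s roughly how those stores get wired up. The connection details and the example command are placeholders of my own; the store classes (RedisStore, FlockStore, PdoStore) are the real Symfony Lock ones:

use Symfony\Component\Lock\Store\FlockStore;
use Symfony\Component\Lock\Store\PdoStore;
use Symfony\Component\Lock\Store\RedisStore;

// Redis-backed and TTL-based: the store this article is about.
$redisStore = new RedisStore(new \Predis\Client(['host' => 'localhost']));

// Filesystem-backed, no TTL: released when the process ends.
$flockStore = new FlockStore(sys_get_temp_dir());

// Database-backed (PostgreSQL via PDO), also TTL-based.
// (Expects its lock table to exist; setup and credentials omitted.)
$pdoStore = new PdoStore('pgsql:host=localhost;dbname=app');

// Any of them can back preventOverlapping() in a crunz task file,
// e.g. the single filesystem-locked task mentioned later could use:
$schedule->run('php artisan translations:import') // hypothetical command
    ->daily()
    ->preventOverlapping($flockStore);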
Tracing the Call Path
Permalink to "Tracing the Call Path"The exception comes from RedisStore::putOffExpiration() (in vendor/symfony/lock/Store/RedisStore.php). Inside that method, a Lua script performs an atomic check-and-extend — atomic because Redis executes Lua scripts as a single operation, preventing race conditions between the “does my lock still exist?” check and the “extend it” update:
if not redis.call("ZSCORE", key, uniqueToken) then
return false -- lock is gone, triggers LockConflictedException
end
When the script returns false, Symfony Lock throws the exception. The lock has already expired in Redis — our token was removed because the TTL ran out.
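You can reproduce that one-liner outside crunz. A minimal sketch, reusing the $factory from the earlier example and a deliberately tiny TTL so the lock lapses on its own:

use Symfony\Component\Lock\Exception\LockConflictedException;

$lock = $factory->createLock('demo-task', ttl: 2); // deliberately short TTL
$lock->acquire();

sleep(3); // let the TTL lapse without a refresh, just like crunz sometimes does

try {
    $lock->refresh(); // RedisStore::putOffExpiration() finds no token...
} catch (LockConflictedException $e) {
    // ...and this is the one-line log entry from the top of the article
}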
Tracing back through crunz’s source (under vendor/crunzphp/crunz/src/):
schedule_run.sh
└─ crunz schedule:run
└─ EventRunner::handle()
└─ EventRunner::manageStartedEvents()
├─ 10% chance: Event::refreshLock()
│ └─ Lock::refresh()
│ └─ RedisStore::putOffExpiration() ← throws here
└─ usleep(250000) // 250ms between iterations
The lock is created with $ttl = 30 seconds — hardcoded in Event::createLockObject(). The EventRunner monitors running tasks in a loop, checking every 250ms. But it doesn’t refresh the lock on every iteration. It rolls the dice.
The Refresh Gamble
Permalink to "The Refresh Gamble"Here’s where it gets interesting. Crunz’s EventRunner doesn’t refresh locks deterministically. It uses a two-layer probabilistic gate:
Layer 1 — The dice roll. Each loop iteration (every 250ms), there’s only a 10% chance of even attempting a refresh:
// EventRunner::manageStartedEvents()
if (mt_rand(1, 100) <= 10) {
    $event->refreshLock();
}
Layer 2 — The time check. Even when selected, refreshLock() only acts when less than 15 seconds remain on the lock:
// Event::refreshLock()
$remainingLifetime = $this->lock->getRemainingLifetime();
if ($remainingLifetime < 15) {
    $this->lock->refresh();
}
Back to our parking meter: imagine you need to feed it every 30 seconds. But instead of watching the clock, you roll a ten-sided die every quarter second. Only on a 1 do you walk over and check. And even then, you only actually insert a coin if the meter shows less than 15 seconds remaining.
Most of the time, this works fine. But “most of the time” isn’t “always.” This isn’t a race condition in the traditional sense — it’s a probability game.
Here’s what a failed lock refresh looks like on a timeline: the task enters the danger zone with 15 seconds left on the lock, and every 250ms the dice roll skips the refresh. After 60 consecutive skips, the TTL hits zero and the next refresh attempt throws.
The Math That Explains Everything
Permalink to "The Math That Explains Everything"The lock starts with a 30-second TTL. The refresh only kicks in during the last 15 seconds — that’s the critical window.
In those 15 seconds, the loop runs 60 times (15s / 0.25s). Each iteration has a 10% chance of refreshing, which means a 90% chance of not refreshing.
In plain English: the lock gets 60 chances to be refreshed, but each chance has a 90% probability of being skipped. The probability that all 60 get skipped:
0.9^60 = 0.00179...
0.18% per lock operation. Roughly 1 in 550.
That sounds rare — until you consider scale. With 42 Redis-locked tasks running on various schedules throughout the day, many of them multiple times per hour, we’re looking at hundreds of lock operations daily. At 0.18% each, multiple daily failures aren’t a fluke. They’re the expected outcome.
Think of it like a loaded coin that lands heads 90% of the time. Flip it 60 times in a row — what are the odds of getting heads every single time? Just 0.18%. That’s the lock surviving. The other 99.82% of the time, at least one refresh gets through. But 0.18% of hundreds of daily operations adds up fast.
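The arithmetic is easy to sanity-check in a few lines of PHP. The 1,000 lock operations per day below is my own rough assumption for 42 tasks running throughout the day, not a measured figure:

$pSkip = 0.9;      // chance a single 250ms iteration skips the refresh
$iterations = 60;  // 15s danger zone / 0.25s per iteration

$pExpire = $pSkip ** $iterations;
printf("P(lock expires) = %.5f (~1 in %d)\n", $pExpire, (int) round(1 / $pExpire));
// P(lock expires) = 0.00180 (~1 in 556)

$opsPerDay = 1000; // assumed scale, for illustration only
printf("Expected failures per day = %.1f\n", $opsPerDay * $pExpire);
// Expected failures per day = 1.8

// The same formula for the longer TTLs in the table that follows:
foreach ([30 => 60, 60 => 180, 300 => 1140] as $ttl => $n) {
    printf("TTL %3ds: P(all %4d refresh chances missed) = %.2e\n", $ttl, $n, $pSkip ** $n);
}
// TTL  30s: P(all   60 refresh chances missed) = 1.80e-3
// TTL  60s: P(all  180 refresh chances missed) = 5.80e-9
// TTL 300s: P(all 1140 refresh chances missed) = 6.86e-53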
When the exception fires, the task aborts — no overlapping execution, no data corruption. It’s a fail-safe crash, not a silent corruption. But losing scheduled runs daily is still unacceptable.
The fix is obvious once you see the numbers — increase the TTL. Because the < 15 threshold is hardcoded, a longer TTL means the lock takes longer to drop into the danger zone, giving the probabilistic refresh more iterations to succeed:
| TTL | Refresh Window | Iterations | P(miss all) | Failure Rate |
|---|---|---|---|---|
| 30s | 15s | 60 | 0.9^60 | ~1 in 550 |
| 60s | 45s | 180 | 0.9^180 | ~1 in 10^8 |
| 300s | 285s | 1140 | 0.9^1140 | effectively 0 |
At 60 seconds, the failure rate drops to roughly 1 in 100 million — practically zero. At 300 seconds, 0.9^1140 produces a number so small it’s effectively zero for any real-world scenario.
One important detail: this issue only affects TTL-based stores like RedisStore. File-based locking with FlockStore isn’t affected — those locks don’t expire on a timer. They’re released when the process ends. If you’ve seen LockConflictedException with Redis but never with filesystem locks, now you know why. (For more on Redis locking behavior in Symfony, see Redis Session Locking Pitfalls in Symfony — same locking mechanism, different context.)
Why Does Crunz Throw LockConflictedException?
Permalink to "Why Does Crunz Throw LockConflictedException?"This isn’t an isolated problem. AzuraCast — the open-source radio automation platform — hit the exact same exception (#5424, #5937, #7207). Same symptoms, same confusion, same root cause: short TTL combined with background workers that don’t refresh fast enough. The Symfony Lock repository has related discussions in #38541 and #31426. In crunz itself? Nothing. No issues, no documentation. The code is identical from v3.4.1 through v3.9.0.
How to Fix Redis Lock Expiration in Crunz
Permalink to "How to Fix Redis Lock Expiration in Crunz"The hardcoded $ttl = 30 lives in Event::createLockObject(). My first instinct was to just change it to 300 and call it a day. But that felt wrong — other users might have legitimate reasons for a shorter TTL.
The better approach: make it configurable. Three changes to Event.php:
1. Add a property:
private int $lockTtl = 30;
2. Accept it in preventOverlapping():
public function preventOverlapping(?object $store = null, int $lockTtl = 30)
{
    $this->lockTtl = $lockTtl;
    // ... existing logic
}
3. Use it in createLockObject():
$ttl = $this->lockTtl; // was: $ttl = 30;
The default stays at 30 seconds — fully backward compatible. But now each task can opt into a longer TTL:
// Before: rolls the dice with 30s
$event->preventOverlapping($lockStore);
// After: 300s TTL, practically zero failure chance
$event->preventOverlapping($lockStore, 300);
I applied this via cweagans/composer-patches so it survives composer update:
{
    "extra": {
        "patches": {
            "crunzphp/crunz": {
                "Make preventOverlapping lock TTL configurable": "patches/crunz-increase-lock-ttl.patch"
            }
        }
    }
}
Updated all 42 Redis-locked tasks from preventOverlapping($lockStore) to preventOverlapping($lockStore, 300). The single FlockStore task — a translations import using filesystem locks — stayed untouched since it’s not affected.
A unit test confirms the behavior:
public function testDefaultTtlRemainsThirty(): void
{
    $event = new Event('test-mutex', 'php -v');
    $event->preventOverlapping();

    $this->assertSame(30, $event->getLockTtl());
}

public function testCustomTtlIsStored(): void
{
    $event = new Event('test-mutex', 'php -v');
    $event->preventOverlapping(null, 300);

    $this->assertSame(300, $event->getLockTtl());
}
Since deploying the patch across all 42 Redis-locked tasks: zero LockConflictedException in over two weeks of production traffic. Not one.
The Trade-off: Zombie Locks
Permalink to "The Trade-off: Zombie Locks"A longer TTL isn’t free. If a worker dies hard — SIGKILL, OOM, or a server crash — the lock stays in Redis until the TTL expires. With a 300-second TTL, that means up to 5 minutes where no other process can acquire the lock, potentially blocking the next scheduled run.
For our use case, this trade-off is worth it. A 5-minute delay after a crash beats dozens of daily LockConflictedException failures during normal operation. If your tasks are short-lived (under 30 seconds) and crashes are your bigger concern, a TTL of 60 seconds gives you the best of both worlds — near-zero failure rate with only a 1-minute zombie lock window.
A more robust alternative: replace the probabilistic refresh entirely with a deterministic heartbeat — refresh unconditionally every N seconds. But that requires patching the EventRunner loop itself, not just the TTL parameter.
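As a rough sketch of what that could look like (my own illustration, not crunz code; the 10-second interval is an arbitrary choice for a 30-second TTL):

use Symfony\Component\Lock\LockInterface;

// Deterministic heartbeat: called from the existing 250ms monitoring loop,
// it refreshes the lock every $intervalSeconds with no dice roll involved.
final class LockHeartbeat
{
    private float $lastRefresh;

    public function __construct(
        private LockInterface $lock,
        private float $intervalSeconds = 10.0,
    ) {
        $this->lastRefresh = microtime(true);
    }

    public function tick(): void
    {
        if (microtime(true) - $this->lastRefresh >= $this->intervalSeconds) {
            $this->lock->refresh(); // resets the TTL unconditionally
            $this->lastRefresh = microtime(true);
        }
    }
}

With a 10-second interval on a 30-second TTL, the remaining lifetime never drops much below 20 seconds; the cost is one extra Redis round trip every 10 seconds per running task.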
“Sporadic” is a dangerous word in bug reports. It sounds like “rare” and “unpredictable,” which your brain translates to “not worth investigating.” But sporadic often just means the probability is low enough to look random — while being high enough to guarantee occurrence at scale. Next time an exception shows up without a pattern, don’t reach for the retry logic. Reach for a calculator.
