Celery Worker OOM Detection: Find Silent Memory Kills

Sluice Team · 7 min read


There's a particular flavor of production incident that makes your stomach drop: you didn't get an alert, nobody saw an exception, the worker just... stopped. No traceback in Sentry, no task_failure signal, no log line at all. The task is frozen in STARTED state like nothing happened. But something very much did happen — Linux's OOM killer decided your Celery worker was consuming too much memory and terminated it with SIGKILL, a signal so brutal that no cleanup handler, no exception hook, no graceful shutdown logic gets to run.

You probably found out about it the worst way possible — a customer reporting stale data, a queue that silently backed up for hours, or a colleague asking why a critical nightly job never finished. This post walks through why OOM kills happen, why they're invisible to Celery's normal error handling, and what to do about them.

The Invisible Killer

When Linux's OOM killer terminates a Celery worker, the task_failure signal never fires — the task remains stuck in STARTED state with no traceback or error message. Here's the sequence:

  1. A worker picks up a task and records it as STARTED (assuming task_track_started=True, which is off by default).
  2. The task allocates memory — loading a dataset, processing an image, accumulating API results.
  3. The system runs low on memory. The OOM killer selects the biggest offender and sends SIGKILL.
  4. The worker process evaporates. No cleanup, no signal, no log line.
  5. The task stays in STARTED indefinitely — or reverts to PENDING when the result TTL expires, which really means "unknown."

If you don't have task_track_started enabled, you might not even know the task was picked up.
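A minimal config sketch, assuming an existing Celery app object named `app` (the `result_expires` value is illustrative):

```python
app.conf.update(
    task_track_started=True,  # record STARTED so a killed worker leaves a visible stalled task
    result_expires=86400,     # keep results for a day so STARTED doesn't silently revert to PENDING
)
```

Without `task_track_started`, a task that died with its worker is indistinguishable from one that never ran.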

Why Celery Workers Get OOM-Killed

Memory pressure in Celery workers usually comes from a few predictable places.

Large task arguments. Passing megabytes of JSON or base64-encoded files as task args means that data lives in worker memory for the entire execution. With the prefork pool, it gets serialized and deserialized per worker process. Pass references (database IDs, S3 keys) instead.
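The pattern looks like this. A hypothetical sketch using plain functions (in real code they would be `@app.task`-decorated) and a dict standing in for S3 or a database:

```python
DOCUMENT_STORE = {}  # stand-in for S3 / a database

def process_document_by_value(document_bytes):
    # Anti-pattern: megabytes of payload travel through the broker and
    # sit in worker memory for the entire task execution.
    return len(document_bytes)

def process_document_by_reference(document_id):
    # Pattern: the task argument is a small key; the worker fetches the
    # data itself and can stream or chunk it as needed.
    document_bytes = DOCUMENT_STORE[document_id]
    return len(document_bytes)
```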

Database result accumulation. A task that loads an entire SELECT * result set into a Python list can consume gigabytes. SQLAlchemy's default is to buffer all rows before you iterate — use .yield_per() or server-side cursors to stream in chunks.
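The same streaming idea, shown with the stdlib's `sqlite3` so it runs anywhere (SQLAlchemy's equivalents are `.yield_per(n)` or `execution_options(stream_results=True)` for a server-side cursor):

```python
import sqlite3

def stream_rows(conn, query, chunk_size=1000):
    # Pull rows in fixed-size chunks instead of fetchall(), which would
    # buffer the entire result set in worker memory at once.
    cur = conn.execute(query)
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        yield from rows

# usage sketch
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(5000)])
total = sum(1 for _ in stream_rows(conn, "SELECT id FROM items", chunk_size=500))
```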

C extension memory leaks. NumPy, Pandas, and Pillow allocate through C's malloc, outside Python's garbage collector. These allocations are invisible to gc.collect() but show up in RSS.

The fork bomb effect. Celery's prefork pool forks the parent process. If the parent uses 500MB, each child starts with a COW mapping of that 500MB. Python's reference counting mutates object headers constantly, triggering COW faults and inflating real memory. Four workers from a 500MB parent can easily hit 2GB.

Image/file processing and long-running tasks. A 10MB JPEG decompresses to 100MB+ of raw pixel data — JPEG compression ratios are typically 10:1 or higher, so even moderate files balloon in memory. Tasks that hold resources for hours accumulate memory that's never freed until completion.

Celery's Built-In Memory Controls

Celery ships with two relevant settings. They help with gradual leaks but can't prevent a single-task memory spike.

app.conf.update(
    worker_max_memory_per_child=400_000,  # KiB: recycle child after its current task if RSS exceeds ~390 MiB
    worker_max_tasks_per_child=1000,       # Recycle worker after 1000 tasks
)

Celery's worker_max_memory_per_child setting only checks memory after each task completes. It cannot prevent an OOM kill that happens during task execution. If a single task allocates 8GB, the OOM killer strikes long before the post-task check runs. There's also a Linux subtlety: RSS includes shared COW pages, so reported memory can be misleadingly high — potentially triggering unnecessary recycling.

worker_max_tasks_per_child recycles workers after N tasks regardless of memory. It's effective against slow C extension leaks, but each recycle forks a new process and re-imports your app. Set it too low and you thrash; too high and leaks accumulate.

Neither setting addresses the core problem: a task that allocates too much in a single execution.

Detecting OOM Kills

Since the worker can't report its own death, detection has to come from outside.

Check system logs

The OOM killer always leaves evidence in the kernel log:

dmesg | grep -i "oom\|killed process"
journalctl -k | grep -i oom

In Kubernetes, OOM kills surface as exit code 137 (128 + SIGKILL):

kubectl describe pod <celery-worker-pod> | grep -A5 "Last State"
# Reason: OOMKilled, Exit Code: 137

Monitor worker heartbeats

The most reliable way to detect OOM-killed Celery workers is heartbeat monitoring — missing heartbeats indicate a dead worker regardless of the cause of death. Workers send heartbeats every 2 seconds by default (worker_heartbeat_interval). When they stop, something is wrong — OOM kill, hardware failure, network partition. Silence is the signal.

The catch: Celery emits heartbeat events but doesn't monitor them itself. You need an external system watching for gaps. Tools like Flower show you live worker state, but they don't alert on missing heartbeats — you have to be looking at the dashboard at the right moment.
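The gap-detection logic itself is simple. A transport-free sketch: in a real deployment the `beat()` calls would be driven by a handler for `worker-heartbeat` events registered on `app.events.Receiver`:

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat per worker and flag silent ones."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout  # seconds of silence before a worker is suspect
        self.last_seen = {}     # hostname -> last heartbeat timestamp

    def beat(self, hostname, now=None):
        # Called for each worker-heartbeat event.
        self.last_seen[hostname] = time.time() if now is None else now

    def silent_workers(self, now=None):
        # Workers whose last heartbeat is older than the timeout.
        now = time.time() if now is None else now
        return [h for h, t in self.last_seen.items() if now - t > self.timeout]
```

Silence past the timeout covers every cause of death at once: OOM kill, hardware failure, network partition.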

Track stalled tasks

Tasks stuck in STARTED beyond their expected duration are likely victims of a killed worker. This requires task_track_started=True (off by default). Compare time-in-STARTED against task_time_limit — if it exceeds the limit without a failure event, the worker was killed externally. If you're also running scheduled tasks, keep an eye on your Beat schedules — a dead worker can silently break periodic jobs too.
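The comparison reduces to a timestamp check. A sketch assuming you record each task's STARTED time yourself, e.g. from a `task_prerun` signal handler writing into Redis (that wiring is hypothetical, not a built-in Celery feature):

```python
def stalled_tasks(started_at, now, time_limit):
    """Return task ids stuck in STARTED longer than task_time_limit.

    started_at: {task_id: timestamp recorded when the task entered STARTED}
    A task past the limit with no failure event means the worker died
    externally - Celery itself would have raised TimeLimitExceeded.
    """
    return [tid for tid, t in started_at.items() if now - t > time_limit]
```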

Container memory limits

resources:
  limits:
    memory: "512Mi"  # Clean exit code 137 if exceeded
  requests:
    memory: "256Mi"

Container limits give you predictability that the Linux OOM killer doesn't. Instead of the kernel choosing a victim heuristically, the runtime guarantees this process dies when its limit is breached.

Prevention Strategies

1. Set memory limits at both layers: worker_max_memory_per_child for gradual leaks, container limits for spikes.

2. Chunk large data — server-side cursors for DB queries, fixed-size chunks for file processing, pagination for API responses.
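For file processing, fixed-size chunking keeps peak memory flat regardless of file size. A stdlib-only sketch (hashing chosen just as a stand-in for per-chunk work):

```python
import hashlib

def checksum_file(path, chunk_size=1024 * 1024):
    # Read in 1 MiB chunks instead of f.read(), which would pull the
    # whole file into worker memory at once.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```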

3. Use acks_late with reject_on_worker_lost — the most important Celery-specific mitigation:

@app.task(acks_late=True, reject_on_worker_lost=True)
def process_large_dataset(dataset_id):
    # If this worker gets OOM-killed, the task is requeued
    ...

Without acks_late, Celery acknowledges the task on receipt — if the worker dies, the task is gone forever. With both flags, the broker requeues the message so another worker picks it up.

Warning: reject_on_worker_lost has no retry counter at the broker level. If the task deterministically OOMs — because the dataset is genuinely too large for available memory — it will be requeued, OOM-kill the next worker, be requeued again, and loop infinitely. Pair this with task_time_limit to break the cycle, and ensure the task can succeed on at least one worker class before enabling requeue.
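One way to break the cycle at the application level (a hypothetical pattern, not a Celery or broker feature): count redeliveries yourself and dead-letter the task after a cap. A dict stands in here for what would be a Redis `INCR` in a real deployment:

```python
def should_requeue(requeue_counts, task_id, max_requeues=3):
    # The broker redelivers with reject_on_worker_lost indefinitely, so
    # track a per-task requeue budget and give up past the cap.
    requeue_counts[task_id] = requeue_counts.get(task_id, 0) + 1
    return requeue_counts[task_id] <= max_requeues
```

When `should_requeue` returns False, the task can be routed to a dead-letter queue or marked failed for manual inspection instead of OOM-killing another worker.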

4. Profile memory: tracemalloc for Python allocations, psutil.Process().memory_info().rss for C extension leaks.

5. Separate queues by resource — route memory-heavy tasks to dedicated workers with higher limits.
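A routing sketch (the task names are hypothetical):

```python
# Route memory-heavy tasks to a dedicated queue served by workers
# provisioned with higher memory limits.
task_routes = {
    "reports.process_large_dataset": {"queue": "memory_heavy"},
    "media.render_video": {"queue": "memory_heavy"},
    # everything else falls through to the default queue
}
# In a Celery app: app.conf.task_routes = task_routes
# Dedicated worker: celery -A proj worker -Q memory_heavy
```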

6. Monitor trends — rising RSS over time means a leak; sudden spikes point to a specific task type.
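Peak RSS can be sampled with only the stdlib, useful for logging a trend line from inside the worker (`psutil.Process().memory_info().rss` gives the current rather than peak value, where psutil is installed):

```python
import resource
import sys

def peak_rss_mib():
    # ru_maxrss is the process's peak resident set size: kilobytes on
    # Linux, bytes on macOS - normalise to MiB for trend logging.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024  # bytes -> KiB
    return peak / 1024  # KiB -> MiB
```

Logging this at task boundaries makes the two failure shapes easy to tell apart: a slow climb across many tasks is a leak, a jump tied to one task name is a spike.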

How Monitoring Closes the Gap

The core problem with OOM kills is the detection gap — the time between the worker dying and someone noticing. Without monitoring, that gap is hours or days. You find out from customer complaints, not alerts.

Heartbeat monitoring collapses that gap to seconds. When a worker stops sending heartbeats, Sluice marks it as offline and surfaces the tasks it was processing — you see the dead worker and its stalled tasks in a single view instead of piecing the story together from dmesg and Redis key scans. If you're running Celery without monitoring today, that gap is costing you more than you think.

Stalled task detection is the other half. Even after catching the dead worker, you need to know which tasks were affected and whether reject_on_worker_lost requeued them — or whether you need to intervene manually.

FAQ

What exit code means OOM kill?

Exit code 137 (128 + 9). In Kubernetes, the pod shows Reason: OOMKilled.

Does worker_max_memory_per_child prevent OOM?

Not during execution. It checks memory after each task completes, so it catches gradual leaks but can't stop a single task from spiking.

How do I profile Celery worker memory?

tracemalloc for Python allocations, memory_profiler for line-by-line traces, and psutil for total RSS including C extensions.

Should I use prefork or gevent?

Prefork uses more memory (separate processes) but gives true parallelism for CPU work. Gevent uses green threads in a single process — dramatically less memory if your tasks are I/O-bound. If the fork bomb effect is your problem, gevent can help. If a single task is allocating too much, neither pool type saves you.

What happens to the task when a worker is OOM-killed?

With acks_late=True and reject_on_worker_lost=True, it's requeued. Without those flags, the task sits in STARTED until the result TTL expires and silently becomes PENDING — meaning no one knows it failed.


OOM kills are the failure mode that separates teams who run Celery from teams who operate it. The difference is detection speed — knowing within seconds that a worker died, which tasks were on it, and whether they were safely requeued. If you're finding out about dead workers from customer complaints, that gap is worth closing. For more on debugging task failures and why we built Sluice, see the rest of the blog.