How to Monitor Celery Beat Schedules (And Catch Missed Jobs)

Sluice Team · 8 min read


Celery Beat is one of those processes that runs so quietly you forget it exists -- until the day it stops, and you discover that your nightly billing reconciliation hasn't fired in three days. Nobody noticed because Beat doesn't raise an alarm when it dies. It just stops scheduling, and the silence looks exactly the same as everything working correctly.

This is a known gap in the Celery ecosystem. Flower won't tell you Beat is down. celery-exporter doesn't monitor Beat. Neither does Leek or Kanchi. If your periodic tasks stopped running an hour ago, the most common way to find out is when a customer asks why their daily report never arrived.

This guide covers how Celery Beat works under the hood, the failure modes that catch production teams off guard, and the monitoring patterns that actually close the gap.


The Single-Process Problem

Celery Beat runs as a single process. Unlike Celery workers -- which you can scale horizontally across dozens of machines -- Beat is designed to have exactly one instance running at any given time. If you run two, you get duplicate task dispatches. If you run zero, every periodic task in your system goes silent.

That architectural constraint wouldn't be so dangerous if Beat had a built-in health check, an API endpoint, or even a way to signal that it's fallen behind schedule. It doesn't have any of those things. Beat either runs, or it doesn't, and there's no native mechanism to detect the difference from outside.

The failure mode is insidious because it's not noisy. A crashed worker leaves tasks in the broker queue, which creates visible backlog. A crashed Beat leaves... nothing. No queued tasks, no error messages, no trace. The absence of periodic tasks looks identical to "everything is fine but nothing is due yet." You have to already know what should be running to notice that it isn't.


How Celery Beat Actually Works

Before you can monitor Beat effectively, it helps to understand what it's actually doing on each tick of its internal loop.

Beat reads a schedule definition from one of three places: a static beat_schedule dictionary in your Celery config, a database table via django-celery-beat, or a Redis-backed store via Redbeat. Regardless of the storage backend, the runtime behavior is the same: Beat maintains an in-memory copy of the schedule, loops through it on a regular interval, and dispatches any tasks whose scheduled time has arrived.

Critically, Beat only dispatches tasks to the broker -- it doesn't execute them. When Beat fires a periodic task, it publishes a message to a queue exactly like task.delay() would. A worker still has to pick up that message, execute the function, and report the result. This means Beat monitoring and worker monitoring are two separate problems. Beat can be running perfectly while every task it dispatches rots in a queue because all the workers consuming that queue are down.

Schedule Persistence

By default, Beat persists its schedule state to a local celerybeat-schedule file using Python's shelve module. This file tracks when each task was last dispatched, so Beat doesn't re-fire everything from scratch after a restart. The file format is fragile -- shelve databases are platform-specific and version-sensitive, and corruption is a documented cause of Beat misbehavior.
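If you suspect a stale or corrupted state file, you can peek inside it with shelve directly. A hedged sketch (dump_beat_state is a hypothetical helper; the 'entries' key mapping task names to objects with a .last_run_at attribute matches the default PersistentScheduler layout, and unpickling real entries requires Celery to be importable):

```python
import glob
import shelve

def dump_beat_state(path="celerybeat-schedule"):
    """Return {task_name: last_run_at} from Beat's local state file,
    or None if no state file exists at that path."""
    # shelve may write path, path.db, or path.dir/.dat depending on the
    # dbm backend, so glob for any of them before opening
    if not glob.glob(path + "*"):
        return None
    with shelve.open(path, flag="r") as db:
        entries = db.get("entries", {})
        return {name: entry.last_run_at for name, entry in entries.items()}
```

If every last_run_at reads as "just now" right after a restart, that's the file-corruption symptom described above.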

django-celery-beat moves schedule persistence to a relational database, which makes it possible to add, modify, and remove periodic tasks through Django admin without restarting Beat. Redbeat does the same thing with Redis as the backend, and adds a distributed lock mechanism that makes it safer (though not trivial) to run Beat in an HA configuration.


Beat Failure Modes

There are four categories of Beat failure, and they all look like the same thing from the outside: tasks stop running.

1. Beat Process Crashes Silently

The most straightforward failure: the Beat process exits, and nobody notices. This can happen from an unhandled exception during schedule evaluation, a misconfigured crontab expression that throws a parse error, or a database connection failure in django-celery-beat when Beat tries to reload the schedule.

Without process supervision (systemd, Supervisor, or a Kubernetes liveness probe), Beat stays dead until someone manually restarts it. Even with process supervision, the restart might not work if the underlying issue persists -- for example, a corrupted celerybeat-schedule file that causes Beat to crash on startup in a loop.

2. Beat Is Running but Not Dispatching

This is the more deceptive failure, because Beat's logs will show that the scheduler is active and healthy. The process is alive, the PID file exists, and a basic liveness check would pass -- but no tasks are actually being dispatched.

Common causes include timezone mismatches between Beat and workers (UTC vs. local time, or a server clock that drifted after a bad NTP sync) and celerybeat-schedule file corruption that makes Beat believe every task was just dispatched. A third cause is multiple Beat instances dispatching duplicate tasks: vanilla Celery Beat has no distributed lock, so running two instances means every periodic task fires twice. (Redbeat solves this with Redis-based locking; django-celery-beat uses row-level SELECT FOR UPDATE locking during schedule reads, but that doesn't prevent multiple Beat instances from dispatching independently.)

3. Beat Dispatches but Tasks Never Execute

Beat might be perfectly healthy and dispatching on schedule, while none of the tasks actually run. This happens when tasks are routed to a queue that no worker is consuming (a queue name mismatch between Beat config and worker startup flags), when all workers consuming the target queue are down, or when the broker itself has hit a memory limit and is silently dropping or refusing messages.

This failure mode is especially tricky because Beat-level monitoring will show everything is fine. The problem lives downstream in the broker or worker layer. The only way to catch it is to monitor the full pipeline from dispatch to completion.
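One cheap preflight check for this failure mode is to compare where each Beat entry is routed against the queues your workers actually consume. A minimal sketch (find_orphaned_queues is a hypothetical helper; it assumes per-entry routing via options={'queue': ...}, which is how beat_schedule entries set a queue):

```python
def find_orphaned_queues(beat_schedule, worker_queues, default_queue="celery"):
    """Return {entry_name: queue} for Beat entries routed to a queue that
    no worker consumes. Illustrative sketch -- beat_schedule is the plain
    dict from app.conf.beat_schedule, worker_queues is the set of queue
    names your workers are started with (-Q flags)."""
    orphaned = {}
    for name, entry in beat_schedule.items():
        queue = entry.get("options", {}).get("queue", default_queue)
        if queue not in worker_queues:
            orphaned[name] = queue
    return orphaned
```

Running this in CI against your deployment manifests catches the "Beat dispatches to a queue nobody reads" mismatch before it ships.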

4. Schedule Drift

System clock changes, NTP drift, and DST transitions can cause Beat's internal schedule to misalign with wall-clock time. Crontab-based schedules are particularly vulnerable -- a DST transition can cause a task to fire twice or not at all during the boundary hour. django-celery-beat with very short intervals (under one minute) may also drift under load, though the specific mechanism isn't well-documented.


The Dead Man's Switch Pattern

The most reliable way to monitor Celery Beat isn't to check whether the process is running -- it's to verify that the entire pipeline is working by proving that tasks dispatched by Beat actually execute successfully.

The pattern is called a "dead man's switch," and it works by adding a lightweight heartbeat task to your Beat schedule:

# In your Celery config
app.conf.beat_schedule = {
    'beat-heartbeat': {
        'task': 'myapp.tasks.beat_heartbeat',
        'schedule': 300.0,  # Every 5 minutes
    },
    # ... your other periodic tasks
}

# In myapp/tasks.py
import requests

@app.task(soft_time_limit=10)
def beat_heartbeat():
    """Prove that Beat is alive and tasks are executing."""
    requests.post(
        'https://your-monitor.example.com/heartbeat/beat',
        timeout=5,
    )

An external service -- Cronitor, Better Stack, Sentry Crons, or even a simple endpoint you build yourself -- expects a ping every five minutes. If the ping stops arriving, something in the pipeline is broken: Beat is dead, the broker is unreachable, the workers are down, or the heartbeat task is failing. The external service fires an alert, and you investigate.

This pattern is elegant because it monitors the entire chain -- scheduler, broker, and worker -- with a single signal. A process-level health check can only confirm that Beat's PID exists. The dead man's switch confirms that Beat is scheduling, the broker is accepting, and a worker is executing. That's a meaningfully stronger guarantee.

The downside is that it can't tell you which link in the chain broke. If the heartbeat stops, you still need to diagnose whether it was Beat, the broker, or the workers. But at least you know something is wrong within five minutes, rather than finding out from a customer three days later.


What's Available Today

Here's an honest look at the existing tools and what they actually cover:

| Tool | What It Monitors | Approach | Limitations |
| --- | --- | --- | --- |
| Sentry Crons | Beat schedule adherence | Auto-discovers Beat tasks via monitor_beat_tasks=True on the Sentry Celery integration | Crons is schedule-focused -- Sentry's broader Celery integration does capture task errors and performance, but queue depth and worker health are out of scope |
| Cronitor | Task execution windows | cronitor.celery.initialize() auto-discovers and wraps Beat tasks | Monitors execution timeframes but not queue depth or worker health; usage-based pricing from $2/monitor/mo |
| Custom healthcheck | Process liveness | HTTP endpoint on the Beat process, or PID file check via systemd | Only proves the process is alive, not that it's dispatching correctly |
| django-celery-beat admin | Schedule configuration | Database-backed schedule visible in Django admin | No monitoring at all -- just CRUD for schedule definitions |

Sentry Crons is the closest thing to a turnkey Beat monitoring solution today. It hooks into Celery's signal system to detect when Beat dispatches periodic tasks and compares the actual dispatch cadence against the expected schedule. If a task doesn't fire within its expected window, Sentry flags it as a missed check-in. The limitation is scope: Sentry Crons is purely a scheduler monitor, not a Celery monitoring tool. It won't show you queue depth, worker health, task duration trends, or failure details.
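Enabling that auto-discovery is a small config change (sketch; the DSN is a placeholder, and it assumes sentry-sdk with the Celery extra installed):

```python
import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    integrations=[
        # monitor_beat_tasks=True creates a Sentry Crons monitor for
        # every entry in your beat_schedule automatically
        CeleryIntegration(monitor_beat_tasks=True),
    ],
)
```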

The short version: Celery Beat runs as a single process with no built-in high availability -- if it dies, all periodic tasks stop with no notification. None of the general-purpose Celery monitoring tools (Flower, celery-exporter, Leek, or Kanchi) monitor Beat schedule adherence. Sentry Crons and Cronitor are the two third-party tools that auto-discover Beat tasks, though through different mechanisms -- Sentry via Celery signals, Cronitor by wrapping task execution.


Building Your Own Beat Monitoring

If you want to monitor Beat without adopting a new SaaS tool, here's a practical setup that covers the most dangerous gaps:

1. Dead man's switch (see above). Add the heartbeat task to your Beat schedule and wire it to whatever alerting system you already use -- PagerDuty, Opsgenie, or even a Slack webhook that posts to an on-call channel. This is the highest-value, lowest-effort step.

2. Process supervision with auto-restart. Run Beat under systemd or Supervisor with autorestart=true. In Kubernetes, use a liveness probe that checks Beat's PID file or log output. This handles the "Beat crashed and stayed dead" failure mode, though it won't help with the subtler cases where Beat is alive but not dispatching.

3. Schedule comparison logging. Log every task that Beat dispatches (the task_prerun signal on workers, or a custom scheduler subclass that logs on dispatch, can help here — note that beat_init and beat_embedded_init only fire once at Beat startup, not per dispatch), then compare the dispatch log against the expected schedule. If your send-daily-report task was supposed to fire at 06:00 UTC and the dispatch log has no entry for it by 06:05, that's worth an alert.

4. Queue depth monitoring. Use your broker's native tools (redis-cli LLEN for Redis, rabbitmqctl list_queues for RabbitMQ) to track queue lengths over time. A queue that normally hovers around zero and suddenly starts growing means workers aren't keeping up -- or aren't running at all. A queue that's always at zero when it shouldn't be might mean Beat isn't dispatching to it.
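For step 1, if you'd rather self-host the receiving side than use a SaaS, the core logic is tiny. A minimal in-process sketch (the class name and thresholds are illustrative; a real deployment would put record_ping() behind an HTTP endpoint and page when is_stale() flips to True):

```python
import time

class DeadMansSwitch:
    """Track the last heartbeat ping and flag when it goes quiet."""

    def __init__(self, max_silence_seconds=600):  # 2x the 5-minute heartbeat
        self.max_silence = max_silence_seconds
        self.last_ping = None

    def record_ping(self, now=None):
        """Called by the HTTP handler when the heartbeat task pings in."""
        self.last_ping = time.time() if now is None else now

    def is_stale(self, now=None):
        """True when no ping has arrived within the allowed window."""
        now = time.time() if now is None else now
        if self.last_ping is None:
            return True  # never heard from Beat at all
        return (now - self.last_ping) > self.max_silence
```

Allowing two missed heartbeats (600s for a 300s schedule) avoids paging on a single slow dispatch.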
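Step 3's expected-vs-actual comparison can stay simple. An illustrative sketch (find_missed_dispatches is a hypothetical helper; expected maps each task to the longest acceptable gap between dispatches, and dispatch_log is whatever your signal handler recorded):

```python
from datetime import datetime, timedelta

def find_missed_dispatches(expected, dispatch_log, now,
                           grace=timedelta(minutes=5)):
    """Return names of tasks overdue by more than `grace`.

    expected: {task_name: max interval between dispatches (timedelta)}
    dispatch_log: {task_name: last dispatch time (datetime)}
    """
    missed = []
    for name, interval in expected.items():
        last = dispatch_log.get(name)
        if last is None or now - last > interval + grace:
            missed.append(name)
    return missed
```

Run it every few minutes from a cron job or monitoring loop and alert on a non-empty result.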
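For step 4 with a Redis broker, the check is easy to script because each Celery queue is a Redis list named after the queue. A sketch using redis-py's llen (check_queue_depths and the threshold are illustrative; in production you'd pass a redis.Redis client):

```python
def check_queue_depths(redis_client, queues, max_depth=100):
    """Return {queue: depth} for queues whose backlog exceeds max_depth.

    redis_client: any object with an llen(name) method, e.g.
    redis.Redis(host="localhost") from the redis-py package.
    """
    alerts = {}
    for queue in queues:
        depth = redis_client.llen(queue)
        if depth > max_depth:
            alerts[queue] = depth
    return alerts
```

Track the returned depths over time: sustained growth means workers aren't keeping up, while a queue stuck at zero when Beat should be feeding it points back at the scheduler.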

The dead man's switch pattern is the most reliable way to monitor Celery Beat: a heartbeat task pings an external service every N minutes, and the external service alerts if the ping stops. That single pattern catches the majority of Beat failures -- the other three steps add defense-in-depth for the edge cases.


django-celery-beat vs. Redbeat vs. Static Config

Your choice of schedule storage backend affects both the failure modes you'll encounter and how easy they are to monitor.

| Approach | Strengths | Weaknesses | Monitoring Implications |
| --- | --- | --- | --- |
| Static config (beat_schedule) | Simple, version-controlled, no external dependencies | Requires a deploy to change any schedule | Easiest to monitor -- the schedule is in code, so expected vs. actual comparisons are straightforward |
| django-celery-beat | Dynamic schedules via Django admin, database-backed persistence | Adds a database dependency, lock contention with multiple Beat instances, migration overhead | Schedule lives in the database -- monitoring requires querying the PeriodicTask table to know what should be running |
| Redbeat | Redis-backed, distributed lock for HA-capable deployments, resilient to Beat restarts | Redis dependency, smaller ecosystem, less community tooling | Most resilient to restarts because the lock mechanism prevents duplicate dispatching, but monitoring needs Redis visibility |

For most teams, static config is the right starting point. It's the simplest to reason about, the simplest to monitor, and the only option where your schedule definition lives in version control alongside your code. The constraint of needing a deploy to change a schedule is often a feature, not a bug -- it means schedule changes go through code review.

django-celery-beat makes sense when you genuinely need non-developers (or at least non-deployers) to modify schedules at runtime. Be aware that running multiple Beat instances against the same django-celery-beat database is a known source of duplicate dispatching and lock contention. The django-celery-beat docs are clear about this, but it catches people in Kubernetes deployments where pod scaling might accidentally spin up a second Beat.

Redbeat is purpose-built for environments where Beat needs to survive restarts and node failures without missing or duplicating dispatches. Its Redis-based locking means you can run multiple Beat instances safely, with only one acquiring the lock and acting as the active scheduler. It's the best option for high-availability requirements, though it ties your scheduler's health to your Redis cluster's health.


Where Sluice Fits

We want to be direct about where Sluice stands on Beat monitoring today. Celery Beat schedule monitoring is on Sluice's V1 roadmap -- it is not part of V0.

The planned approach follows the dead man's switch pattern: you define expected schedules in Sluice, and Sluice alerts when tasks don't arrive within their expected windows. The V1 release will include a dedicated schedule view that visualizes expected vs. actual dispatch cadence, making it immediately obvious when a periodic task has gone quiet.

What Sluice can do today, in V0, is monitor the execution side of periodic tasks. Because Sluice tracks every task that runs through your Celery workers -- including tasks dispatched by Beat -- you can search for a specific task name and check its execution frequency. If your generate-daily-report task normally runs once per day and suddenly has no executions in the last 24 hours, that's visible in Sluice's task history even without dedicated Beat support.

That's not a substitute for proper schedule monitoring, but it does close the "we had no idea for three days" gap. Combined with the dead man's switch pattern described above, you can get solid Beat coverage today while we build the purpose-built schedule monitoring for V1.

For more on what Sluice monitors and where it fits in the Celery tooling landscape, see Why We Built Sluice and our comparisons against Flower, Grafana + Prometheus, and running with no monitoring at all.


FAQ

Can I run multiple Celery Beat instances?

With the default file-based scheduler, no -- running multiple instances will cause duplicate task dispatching, because each instance maintains its own schedule state and they'll both fire the same tasks. Redbeat is the standard solution for multi-instance Beat deployments: it uses a Redis-based distributed lock so that only one instance acts as the active scheduler at any time. django-celery-beat uses row-level locking (SELECT FOR UPDATE) during schedule reads, but this isn't a distributed scheduling lock — it won't prevent multiple Beat instances from dispatching independently. Third-party packages like django-celerybeat-lock add cache-based locking for that purpose.
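Switching to Redbeat is a config-plus-flag change. A sketch, assuming the redbeat package is installed (the Redis URL is a placeholder):

```python
# In your Celery config: point RedBeat at Redis for schedule state
# and for the distributed lock that elects the active scheduler
app.conf.redbeat_redis_url = "redis://localhost:6379/1"  # placeholder URL

# Then start Beat with the RedBeat scheduler class:
#   celery -A myapp beat -S redbeat.RedBeatScheduler
```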

How do I restart Beat without missing scheduled tasks?

Beat persists its "last run" timestamps to the schedule store (file, database, or Redis depending on your backend). When it starts back up, it checks what's overdue and dispatches any tasks that should have fired during the downtime. In practice this works well for interval-based schedules but can behave unexpectedly with crontab schedules -- if Beat was down for several hours, it won't retroactively fire every missed cron window, only the most recent one.

What's the difference between crontab and interval schedules?

Interval schedules (schedule(timedelta(minutes=5))) fire at a fixed interval relative to the last dispatch time. With the default PersistentScheduler, the last dispatch timestamp is persisted to the shelve file, so an overdue task fires immediately on restart — the timer doesn't reset. (If the celerybeat-schedule file is missing or you use the non-persistent Scheduler, then the timer does start fresh.) Crontab schedules (crontab(hour=6, minute=0)) fire at absolute wall-clock times. Crontab is better for "this must run at 6am every day" use cases; interval is better for "this should run roughly every five minutes."
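The two schedule types look like this side by side in a Beat schedule (config sketch; the task paths are placeholders):

```python
from datetime import timedelta
from celery.schedules import crontab

app.conf.beat_schedule = {
    # Interval: fire roughly every 5 minutes, relative to the last dispatch
    'poll-upstream': {
        'task': 'myapp.tasks.poll_upstream',  # placeholder task path
        'schedule': timedelta(minutes=5),
    },
    # Crontab: fire at an absolute wall-clock time -- 06:00 every day
    'send-daily-report': {
        'task': 'myapp.tasks.send_daily_report',  # placeholder task path
        'schedule': crontab(hour=6, minute=0),
    },
}
```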

How do I monitor Beat in Kubernetes?

Run Beat as a Deployment with replicas: 1 (never more than one, unless you're using Redbeat). Add a liveness probe that checks for a recent log line or heartbeat file -- Beat doesn't expose an HTTP endpoint by default, so a simple exec probe works if you have your heartbeat task touch a file. Note that find -mmin -5 exits 0 whether or not it matches anything, so test its output instead: test -n "$(find /tmp/celerybeat-heartbeat -mmin -5)". For restart detection, combine the liveness probe with the dead man's switch pattern described above.
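The touch-and-check halves of that probe can also be sketched in Python (the path and five-minute window are examples):

```python
import os
import time

def touch_heartbeat(path="/tmp/celerybeat-heartbeat"):
    """Create the heartbeat file if needed and bump its mtime.
    Call this from the scheduled heartbeat task."""
    with open(path, "a"):
        os.utime(path, None)

def heartbeat_is_fresh(path="/tmp/celerybeat-heartbeat", max_age_seconds=300):
    """Equivalent of the exec probe: file exists and was touched recently."""
    try:
        return (time.time() - os.path.getmtime(path)) < max_age_seconds
    except FileNotFoundError:
        return False
```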

Does Sluice monitor Celery Beat?

Not yet -- Beat schedule monitoring is on the V1 roadmap. Today, Sluice monitors task execution across your Celery workers, which means you can see whether periodic tasks are actually running by checking their execution history. Pair that with the dead man's switch pattern for a solid interim solution. For the latest on V1 progress, check our roadmap.


Running Celery in production and want to understand what's happening inside your workers? Sluice monitors task execution, queue depth, and worker health -- with Beat schedule monitoring coming in V1. For more on Celery's quirks in production, see our deep dives on why PENDING doesn't mean what you think and debugging task failures.